COVID-19 PubSeq Uploading Data (part 3)

1. Uploading Data
2. Step 1: Upload sequence
3. Step 2: Add metadata
4. Step 3: Submit to COVID-19 PubSeq
   - 4.1. Trouble shooting
5. Step 4: Check output
6. Bulk sequence uploader

@@ -290,8 +267,8 @@ for the JavaScript code in this tag. -

1 Uploading Data

The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
@@ -301,8 +278,8 @@ gets triggered on upload. Read the ABOUT page for more inf

2 Step 1: Upload sequence

To upload a sequence in the web upload page hit the browse button and
@@ -330,8 +307,8 @@ an improved pangenome.

3 Step 2: Add metadata

The web upload page contains fields for adding metadata. Metadata is
@@ -357,12 +334,12 @@ the web form. Here we add some extra information.

3.1 Obligatory fields

3.1.1 Sample ID (sample_id)

This is a string field that defines a unique sample identifier by the
@@ -380,8 +357,8 @@ Here we add the GenBank ID MT536190.1.

3.1.2 Collection date

Estimated collection date. The GenBank page says April 6, 2020.
@@ -389,8 +366,8 @@ Estimated collection date. The GenBank page says April 6, 2020.

3.1.3 Collection location

A search on wikidata says Los Angeles is
@@ -399,8 +376,8 @@ A search on wikidata says Los Angeles is

3.1.4 Sequencing technology

GenBank entry says Illumina, so we can fill that in
@@ -408,8 +385,8 @@ GenBank entry says Illumina, so we can fill that in

3.1.5 Authors

GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
@@ -420,16 +397,16 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.

3.2 Optional fields

All other fields are optional. But let's see what we can add.

3.2.1 Host information

Sadly, not much is known about the host from GenBank. A little
@@ -443,8 +420,8 @@ did to the person and what the person was like (say age group).

3.2.2 Collecting institution

We can fill that in.
@@ -452,8 +429,8 @@ We can fill that in.

3.2.3 Specimen source

We have that: nasopharyngeal swab
@@ -461,8 +438,8 @@ We have that: nasopharyngeal swab

3.2.4 Source database accession

Genbank which is http://identifiers.org/insdc/MT536190.1#sequence.
@@ -471,8 +448,8 @@ Note we plug in our own identifier MT536190.1.
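
As a small illustration of plugging in your own identifier, the accession URI can be assembled from a GenBank ID. This is only a sketch: the helper name `insdc_sequence_uri` is hypothetical and not part of the uploader; the URL pattern is the one shown in the example above.

```python
def insdc_sequence_uri(accession: str) -> str:
    """Build the identifiers.org URI for an INSDC (GenBank) accession.

    Hypothetical helper: it only mirrors the URL pattern used in this post.
    """
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    return f"http://identifiers.org/insdc/{accession}#sequence"


print(insdc_sequence_uri("MT536190.1"))
```

The printed value is what goes into the source database accession field of the web form.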

3.2.5 Strain name

SARS-CoV-2/human/USA/LA-BIE-070/2020
@@ -482,8 +459,8 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020

4 Step 3: Submit to COVID-19 PubSeq

Once you have the sequence and the metadata together, hit
@@ -493,8 +470,8 @@ submitted and the workflows should kick in!

4.1 Trouble shooting

We got an error saying: {"stem": "http://www.wikidata.org/entity/",…
@@ -508,8 +485,8 @@ submit button.

5 Step 4: Check output

The current pipeline takes 5.5 hours to complete! Once it completes
@@ -520,8 +497,8 @@ in.

6 Bulk sequence uploader

Above steps require a manual upload of one sequence with metadata.
@@ -584,8 +561,8 @@ submitter:

6.1 Run the uploader (CLI)

Installing with pip you should be
@@ -610,9 +587,28 @@ python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_

after installing dependencies (also described in INSTALL with the GNU
-Guix package manager).
+Guix package manager). The --help shows

Entering sequence uploader
+usage: main.py [-h] [--validate] [--skip-qc] [--trusted] metadata sequence_p1 [sequence_p2]
+
+Upload SARS-CoV-19 sequences for analysis
+
+positional arguments:
+  metadata     sequence metadata json
+  sequence_p1  sequence FASTA/FASTQ
+  sequence_p2  sequence FASTQ pair
+
+optional arguments:
+  -h, --help   show this help message and exit
+  --validate   Dry run, validate only
+  --skip-qc    Skip local qc check
+  --trusted    Trust local validation and add directly to validated project
+

The web interface using this exact same script so it should just work (TM).
@@ -620,8 +616,9 @@ The web interface using this exact same script so it should just work
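
To make the options above concrete, here is a minimal sketch that assembles a dry-run invocation of the uploader. The helper `build_uploader_command` and its `script` default are assumptions for illustration only. The flags and the positional order (metadata first, then the sequence files) follow the `--help` text; the example command earlier in this post passes the FASTA file first, so check the order against your installed version.

```python
import sys
from typing import List, Optional


def build_uploader_command(
    metadata: str,
    sequence_p1: str,
    sequence_p2: Optional[str] = None,
    validate: bool = True,
    script: str = "bh20sequploader/main.py",  # assumed path inside a repository checkout
) -> List[str]:
    """Assemble the argument list for the uploader (hypothetical helper)."""
    cmd = [sys.executable, script]
    if validate:
        # --validate is documented in the help text as "Dry run, validate only".
        cmd.append("--validate")
    cmd += [metadata, sequence_p1]
    if sequence_p2 is not None:
        # Optional second file for paired FASTQ reads.
        cmd.append(sequence_p2)
    return cmd


if __name__ == "__main__":
    cmd = build_uploader_command(
        "example/maximum_metadata_example.yaml",
        "example/sequence.fasta",
    )
    print(" ".join(cmd))
    # To run it for real from a checkout, pass the list to
    # subprocess.run(cmd, check=True).
```

Starting with `--validate` keeps the first attempt local; dropping the flag submits the data.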

6.2 Example: uploading bulk GenBank sequences

We also use above script to bulk upload GenBank sequences with a FASTA
@@ -646,10 +643,32 @@ ls $dir_fasta_and_yaml/*.yaml |

6.3 Example: preparing metadata

+Usually, metadata are available in tabular format, like spreadsheets. As an example, we provide a script
+esr_samples.py to show you how to parse
+your metadata in YAML files ready for the upload. To execute the script, go in the ~bh20-seq-resource/scripts/esr_samples
+and execute

python3 esr_samples.py

+You will find the YAML files in the `yaml` folder which will be created in the same directory.
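
The idea of turning tabular metadata into one YAML file per sample can be sketched as follows. This is not the `esr_samples.py` script itself: the function name and the flat `key: value` layout are assumptions, and only the `sample_id` field name comes from this post. The real uploader schema may name or nest fields differently, so treat the script in the repository as the reference.

```python
import csv
import io
import json
from pathlib import Path
from typing import IO, List


def rows_to_yaml_files(table: IO[str], out_dir: str = "yaml") -> List[Path]:
    """Write one flat YAML file per CSV row, named after its sample_id.

    Hypothetical sketch: assumes a header row containing a `sample_id`
    column whose values are safe to use as file names.
    """
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for row in csv.DictReader(table):
        # json.dumps yields double-quoted strings, which are also valid
        # YAML scalars, so no third-party YAML library is needed here.
        lines = [f"{key}: {json.dumps(value)}" for key, value in row.items()]
        path = out / f"{row['sample_id']}.yaml"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Values taken from the worked example in this post.
    table = io.StringIO("sample_id,collection_date\nMT536190.1,2020-04-06\n")
    for path in rows_to_yaml_files(table):
        print(path)
```

Each generated file can then be paired with its FASTA file and checked with the uploader's `--validate` option before submitting.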

Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
-Modified 2020-08-25 Tue 06:13.
+Modified 2020-10-27 Tue 06:43.

diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index abc260c..fb68251 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -228,7 +228,25 @@ command line
 : python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml
 
 after installing dependencies (also described in [[https://github.com/arvados/bh20-seq-resource/blob/master/doc/INSTALL.md][INSTALL]] with the GNU
-Guix package manager).
+Guix package manager). The ~--help~ shows
+
+#+begin_src sh
+Entering sequence uploader
+usage: main.py [-h] [--validate] [--skip-qc] [--trusted] metadata sequence_p1 [sequence_p2]
+
+Upload SARS-CoV-19 sequences for analysis
+
+positional arguments:
+  metadata     sequence metadata json
+  sequence_p1  sequence FASTA/FASTQ
+  sequence_p2  sequence FASTQ pair
+
+optional arguments:
+  -h, --help   show this help message and exit
+  --validate   Dry run, validate only
+  --skip-qc    Skip local qc check
+  --trusted    Trust local validation and add directly to validated project
+#+end_src
 
 The web interface using this exact same script so it should just work
 (TM).
@@ -265,4 +283,4 @@ and execute
 python3 esr_samples.py
 #+END_SRC
 
-You will find the YAML files in the `yaml` folder which will be created in the same directory.
\ No newline at end of file
+You will find the YAML files in the `yaml` folder which will be created in the same directory.
-- 
cgit 1.4.1

Table of Contents

1 Uploading Data
2 Step 1: Upload sequence
3 Step 2: Add metadata
3.1 Obligatory fields
3.1.1 Sample ID (sample_id)
3.1.2 Collection date
3.1.3 Collection location
3.1.4 Sequencing technology
3.1.5 Authors
3.2 Optional fields
3.2.1 Host information
3.2.2 Collecting institution
3.2.3 Specimen source
3.2.4 Source database accession
3.2.5 Strain name
4 Step 3: Submit to COVID-19 PubSeq
4.1 Trouble shooting
5 Step 4: Check output
6 Bulk sequence uploader
6.1 Run the uploader (CLI)
6.2 Example: uploading bulk GenBank sequences
6.3 Example: preparing metadata