From 7fabc4f9427856600e237c6cacd710f49b88d45d Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Mon, 24 Aug 2020 10:31:24 +0100
Subject: Genbank upload

---
 doc/blog/using-covid-19-pubseq-part3.html | 145 ++++++++++++++++--------------
 doc/blog/using-covid-19-pubseq-part3.org  |  12 +++
 2 files changed, 92 insertions(+), 65 deletions(-)

diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index 80304c3..718b10f 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -3,7 +3,7 @@
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+
COVID-19 PubSeq Uploading Data (part 3)
@@ -248,40 +248,40 @@ for the JavaScript code in this tag.

Table of Contents

@@ -290,8 +290,8 @@ for the JavaScript code in this tag.
-
-

1 Uploading Data

+
+

1 Uploading Data

The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
@@ -301,8 +301,8 @@ gets triggered on upload. Read the ABOUT page for more inf

-
-

2 Step 1: Upload sequence

+
+

2 Step 1: Upload sequence

To upload a sequence in the web upload page hit the browse button and
@@ -330,8 +330,8 @@ an improved pangenome.

-
-

3 Step 2: Add metadata

+
+

3 Step 2: Add metadata

The web upload page contains fields for adding metadata. Metadata is
@@ -357,12 +357,12 @@ the web form. Here we add some extra information.

-
-

3.1 Obligatory fields

+
+

3.1 Obligatory fields

-
-

3.1.1 Sample ID (sampleid)

+
+

3.1.1 Sample ID (sampleid)

This is a string field that defines a unique sample identifier by the
@@ -380,8 +380,8 @@ Here we add the GenBank ID MT536190.1.

-
-

3.1.2 Collection date

+
+

3.1.2 Collection date

Estimated collection date. The GenBank page says April 6, 2020.
@@ -389,8 +389,8 @@ Estimated collection date. The GenBank page says April 6, 2020.

-
-

3.1.3 Collection location

+
+

3.1.3 Collection location

A search on wikidata says Los Angeles is
@@ -399,8 +399,8 @@ A search on wikidata says Los Angeles is

-
-

3.1.4 Sequencing technology

+
+

3.1.4 Sequencing technology

GenBank entry says Illumina, so we can fill that in
@@ -408,8 +408,8 @@ GenBank entry says Illumina, so we can fill that in

-
-

3.1.5 Authors

+
+

3.1.5 Authors

GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
@@ -420,16 +420,16 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.

-
-

3.2 Optional fields

+
+

3.2 Optional fields

All other fields are optional. But let's see what we can add.

-
-

3.2.1 Host information

+
+

3.2.1 Host information

Sadly, not much is known about the host from GenBank. A little
@@ -443,8 +443,8 @@ did to the person and what the person was like (say age group).

-
-

3.2.2 Collecting institution

+
+

3.2.2 Collecting institution

We can fill that in.
@@ -452,8 +452,8 @@ We can fill that in.

-
-

3.2.3 Specimen source

+
+

3.2.3 Specimen source

We have that: nasopharyngeal swab
@@ -461,8 +461,8 @@ We have that: nasopharyngeal swab

-
-

3.2.4 Source database accession

+
+

3.2.4 Source database accession

Genbank which is http://identifiers.org/insdc/MT536190.1#sequence.
@@ -471,8 +471,8 @@ Note we plug in our own identifier MT536190.1.

-
-

3.2.5 Strain name

+
+

3.2.5 Strain name

SARS-CoV-2/human/USA/LA-BIE-070/2020
@@ -482,8 +482,8 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020

-
-

4 Step 3: Submit to COVID-19 PubSeq

+
+

4 Step 3: Submit to COVID-19 PubSeq

Once you have the sequence and the metadata together, hit
@@ -493,8 +493,8 @@ submitted and the workflows should kick in!

-
-

4.1 Trouble shooting

+
+

4.1 Trouble shooting

We got an error saying: {"stem": "http://www.wikidata.org/entity/",…
@@ -508,8 +508,8 @@ submit button.

-
-

5 Step 4: Check output

+
+

5 Step 4: Check output

The current pipeline takes 5.5 hours to complete! Once it completes
@@ -520,8 +520,8 @@ in.

-
-

6 Bulk sequence uploader

+
+

6 Bulk sequence uploader

Above steps require a manual upload of one sequence with metadata.
@@ -584,8 +584,8 @@ submitter:

-
-

6.1 Run the uploader (CLI)

+
+

6.1 Run the uploader (CLI)

Installing with pip you should be
@@ -620,20 +620,35 @@ The web interface using this exact same script so it should just work

-
-

6.2 Example: uploading bulk GenBank sequences

+
+

6.2 Example: uploading bulk GenBank sequences

We also use above script to bulk upload GenBank sequences with a FASTA and YAML extractor specific for GenBank. This means that the steps we took above for uploading a GenBank sequence are already automated.

+ +

+The steps are: from the
+bh20-seq-resource/scripts/download_genbank_data/ directory
+

+ +
+
python3 from_genbank_to_fasta_and_yaml.py
+dir_fasta_and_yaml=~/bh20-seq-resource/scripts/download_genbank_data/fasta_and_yaml
+ls $dir_fasta_and_yaml/*.yaml | while read path_code_yaml; do
+   path_code_fasta=${path_code_yaml%.*}.fasta
+   bh20-seq-uploader --skip-qc $path_code_yaml $path_code_fasta
+done
+
+
-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-08-22 Sat 07:43
.
+
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-08-24 Mon 04:31
.
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index b1ab90d..fda7be8 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -236,3 +236,15 @@ The web interface using this exact same script so it should just work
We also use above script to bulk upload GenBank sequences with a
[[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py][FASTA and YAML]] extractor specific for
GenBank. This means that the steps we took above for uploading a GenBank sequence are already automated.
+
+The steps are: from the
+~bh20-seq-resource/scripts/download_genbank_data/~ directory
+
+#+BEGIN_SRC sh
+python3 from_genbank_to_fasta_and_yaml.py
+dir_fasta_and_yaml=~/bh20-seq-resource/scripts/download_genbank_data/fasta_and_yaml
+ls $dir_fasta_and_yaml/*.yaml | while read path_code_yaml; do
+   path_code_fasta=${path_code_yaml%.*}.fasta
+   bh20-seq-uploader --skip-qc $path_code_yaml $path_code_fasta
+done
+#+END_SRC
--
cgit v1.2.3
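
Review note: the loop added in this commit pipes `ls` into `while read` and leaves its variables unquoted, which works for the generated GenBank file names but would break on paths containing whitespace. A slightly more defensive variant might look like the sketch below. The `bh20-seq-uploader --skip-qc` call is taken from the commit; the function name `upload_pairs` and the `UPLOADER` override variable are hypothetical and only there to allow a dry run.

```sh
#!/bin/sh
# Sketch only: upload every YAML/FASTA pair found in one directory.
# Assumption: each <name>.yaml has a sibling <name>.fasta, as in the commit.
# UPLOADER is a hypothetical override (not part of bh20-seq-resource) for dry runs.
upload_pairs() {
    dir=$1
    for path_code_yaml in "$dir"/*.yaml; do
        # If the glob matched nothing, the pattern is left unexpanded; skip it.
        [ -e "$path_code_yaml" ] || continue
        path_code_fasta="${path_code_yaml%.*}.fasta"
        if [ ! -f "$path_code_fasta" ]; then
            echo "skipping $path_code_yaml: no matching FASTA file" >&2
            continue
        fi
        # Same invocation as in the commit, with the paths quoted.
        ${UPLOADER:-bh20-seq-uploader} --skip-qc "$path_code_yaml" "$path_code_fasta"
    done
}
```

After running `python3 from_genbank_to_fasta_and_yaml.py`, this could be called as `upload_pairs ~/bh20-seq-resource/scripts/download_genbank_data/fasta_and_yaml`. Setting `UPLOADER=echo` first prints the commands instead of uploading.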