From fbbec51e604964d18ab72cbf0ac24b102ecc0376 Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Fri, 6 Nov 2020 07:45:10 +0000 Subject: Working on upload --- doc/blog/using-covid-19-pubseq-part3.html | 261 +++++++++++++++++++----------- 1 file changed, 165 insertions(+), 96 deletions(-) (limited to 'doc/blog/using-covid-19-pubseq-part3.html') diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html index 788c1d2..b49830b 100644 --- a/doc/blog/using-covid-19-pubseq-part3.html +++ b/doc/blog/using-covid-19-pubseq-part3.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> - + COVID-19 PubSeq Uploading Data (part 3) @@ -224,52 +224,66 @@

Table of Contents

+
+

1 Introduction

+
+

+In this document we explain how to upload data into COVID-19 PubSeq. +This can happen through a web page, or through a command line +script. We'll also show how to parametrize uploads by using templates. +The procedure is much easier than with other repositories and can be +fully automated. Once uploaded you can use our export API to prepare +for other repositories. +

+
+
-
-

1 Uploading Data

-
+
+

2 Uploading data

+

The COVID-19 PubSeq allows you to upload your SARS-CoV-2 strains to a public resource for global comparisons. A recompute of the pangenome @@ -278,9 +292,9 @@ gets triggered on upload. Read the ABOUT page for more inf

-
-

2 Step 1: Upload sequence

-
+
+

3 Step 1: Upload sequence

+

To upload a sequence in the web upload page hit the browse button and select the FASTA file on your local hard disk. @@ -307,9 +321,9 @@ an improved pangenome.

-
-

3 Step 2: Add metadata

-
+
+

4 Step 2: Add metadata

+

The web upload page contains fields for adding metadata. Metadata is not only important for attribution, it is also important for @@ -334,13 +348,13 @@ the web form. Here we add some extra information.

-
-

3.1 Obligatory fields

-
+
+

4.1 Obligatory fields

+
-
-

3.1.1 Sample ID (sample_id)

-
+
+

4.1.1 Sample ID (sample_id)

+

This is a string field that defines a unique sample identifier by the submitter. In addition to sample_id we also have host_id, @@ -357,18 +371,18 @@ Here we add the GenBank ID MT536190.1.

-
-

3.1.2 Collection date

-
+
+

4.1.2 Collection date

+

Estimated collection date. The GenBank page says April 6, 2020.

-
-

3.1.3 Collection location

-
+
+

4.1.3 Collection location

+

A search on wikidata says Los Angeles is https://www.wikidata.org/entity/Q65 @@ -376,18 +390,18 @@ A search on wikidata says Los Angeles is

-
-

3.1.4 Sequencing technology

-
+
+

4.1.4 Sequencing technology

+

GenBank entry says Illumina, so we can fill that in

-
-

3.1.5 Authors

-
+
+

4.1.5 Authors

+

GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X., @@ -397,17 +411,17 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.

-
-

3.2 Optional fields

-
+
+

4.2 Optional fields

+

All other fields are optional. But let's see what we can add.

-
-

3.2.1 Host information

-
+
+

4.2.1 Host information

+

Sadly, not much is known about the host from GenBank. A little sleuthing renders an interesting paper by some of the authors titled @@ -420,27 +434,27 @@ did to the person and what the person was like (say age group).

-
-

3.2.2 Collecting institution

-
+
+

4.2.2 Collecting institution

+

We can fill that in.

-
-

3.2.3 Specimen source

-
+
+

4.2.3 Specimen source

+

We have that: nasopharyngeal swab

-
-

3.2.4 Source database accession

-
+
+

4.2.4 Source database accession

+

GenBank, which is http://identifiers.org/insdc/MT536190.1#sequence. Note we plug in our own identifier MT536190.1. @@ -448,9 +462,9 @@ Note we plug in our own identifier MT536190.1.

-
-

3.2.5 Strain name

-
+
+

4.2.5 Strain name

+

SARS-CoV-2/human/USA/LA-BIE-070/2020

@@ -459,9 +473,9 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020
-
-

4 Step 3: Submit to COVID-19 PubSeq

-
+
+

5 Step 3: Submit to COVID-19 PubSeq

+

Once you have the sequence and the metadata together, hit the 'Add to Pangenome' button. The data will be checked, @@ -470,9 +484,9 @@ submitted and the workflows should kick in!

-
-

4.1 Trouble shooting

-
+
+

5.1 Trouble shooting

+

We got an error saying: {"stem": "http://www.wikidata.org/entity/",… which means that our location field was not formed correctly! After @@ -485,9 +499,9 @@ submit button.
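The error message quotes the stem http://www.wikidata.org/entity/, which suggests the location field is expected to start with exactly that string. Below is a minimal sketch of a check you could run before submitting. The function name `check_location` and the rule that an item id is `Q` followed by digits are assumptions for illustration. They are not part of the uploader.

```python
WIKIDATA_STEM = "http://www.wikidata.org/entity/"  # stem quoted in the error message


def check_location(uri: str) -> bool:
    """Return True if uri is the stem followed by a Wikidata item id such as Q65."""
    if not uri.startswith(WIKIDATA_STEM):
        return False
    item = uri[len(WIKIDATA_STEM):]
    # Assumption: item ids look like 'Q' followed by digits only.
    return item.startswith("Q") and item[1:].isdigit()


print(check_location("http://www.wikidata.org/entity/Q65"))
print(check_location("https://www.wikidata.org/entity/Q65"))
```

Under this assumption a URL that differs from the stem in any way, including the scheme, is rejected.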

-
-

5 Step 4: Check output

-
+
+

6 Step 4: Check output

+

The current pipeline takes 5.5 hours to complete! Once it completes the updated data can be checked on the DOWNLOAD page. After completion @@ -497,9 +511,9 @@ in.

-
-

6 Bulk sequence uploader

-
+
+

7 Bulk sequence uploader

+ + +

+A more elaborate example (note that most fields are optional) may look like +

+
id: placeholder
 
@@ -559,11 +606,20 @@ submitter:
     additional_submitter_information: Optional free text field for additional information
 
+ +

+More metadata is yummy. Yummydata is useful to a wider community. Note +that many of the terms in the above example are URIs, such as +host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606. We use +web ontologies for these to make the data less ambiguous and more +FAIR. Check out the optional fields as defined in the schema. If a term is not listed, +a little bit of web searching may be required, or contact us. +

-
-

6.1 Run the uploader (CLI)

-
+
+

7.1 Run the uploader (CLI)

+

Installing with pip you should be able to run @@ -574,7 +630,6 @@ bh20sequploader sequence.fasta metadata.yaml -

Alternatively the script can be installed from GitHub. Run on the command line @@ -617,9 +672,9 @@ The web interface uses this exact same script so it should just work

-
-

6.2 Example: uploading bulk GenBank sequences

-
+
+

7.2 Example: uploading bulk GenBank sequences

+

We also use the above script to bulk upload GenBank sequences with a FASTA and YAML extractor specific for GenBank. This means that the steps we @@ -645,14 +700,15 @@ ls $dir_fasta_and_yaml/*.yaml | -

-

6.3 Example: preparing metadata

-
+
+

7.3 Example: preparing metadata

+

-Usually, metadata are available in tabular format, like spreadsheets. As an example, we provide a script -esr_samples.py to show you how to parse -your metadata in YAML files ready for the upload. To execute the script, go in the ~bh20-seq-resource/scripts/esr_samples -and execute +Usually, metadata are available in a tabular format, such as +spreadsheets. As an example, we provide a script esr_samples.py to +show you how to parse your metadata into YAML files ready for the +upload. To execute the script, go to the +~bh20-seq-resource/scripts/esr_samples and execute

@@ -661,14 +717,27 @@ and execute

-You will find the YAML files in the `yaml` folder which will be created in the same directory. +You will find the YAML files in the `yaml` folder which will be +created in the same directory. +

+ +

+In the example we use Python pandas to read the spreadsheet into a +tabular structure. Next we use a template.yaml file that gets filled +in by esr_samples.py so we get a metadata YAML file for each sample. +
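The template approach can be sketched as follows. This is not the code in esr_samples.py. It swaps pandas for the standard-library csv module and uses string.Template for the substitution. The template text, the column names and the `collection_date` field name are made up for illustration; only `id: placeholder` and `sample_id` appear in the text above.

```python
import csv
import io
from string import Template

# Hypothetical template; a real template.yaml would carry many more fields.
TEMPLATE = Template(
    "id: placeholder\n"
    "sample:\n"
    "    sample_id: $sample_id\n"
    "    collection_date: \"$collection_date\"\n"
)

# Hypothetical spreadsheet exported as CSV, one row per sample.
TABLE = io.StringIO(
    "sample_id,collection_date\n"
    "S1,2020-04-06\n"
    "S2,2020-04-07\n"
)


def render_all(table, template):
    """Return a dict mapping '<sample_id>.yaml' to the filled-in YAML text."""
    out = {}
    for row in csv.DictReader(table):
        # Each CSV column name must match a $placeholder in the template.
        out[row["sample_id"] + ".yaml"] = template.substitute(row)
    return out


yaml_files = render_all(TABLE, TEMPLATE)
for name, text in yaml_files.items():
    print(name)
    print(text)
```

Keeping the template separate from the table means that fields shared by all samples, such as the submitter, only have to be written once.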

+ +

+Next run the earlier CLI uploader for each YAML and FASTA combination. +It can't be much easier than this. For ESR we uploaded a batch of 600 +sequences this way. See example.
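Running the uploader per sample can be sketched like this. The sketch assumes that each FASTA file and its YAML file share a file stem (say S1.fasta and S1.yaml) in one directory; that naming convention and the helper name `upload_commands` are assumptions, not something the uploader requires. The argument order follows the `bh20sequploader sequence.fasta metadata.yaml` invocation shown earlier.

```python
from pathlib import Path


def upload_commands(directory):
    """Build one uploader command per FASTA/YAML pair that shares a file stem."""
    commands = []
    for yaml_path in sorted(Path(directory).glob("*.yaml")):
        fasta_path = yaml_path.with_suffix(".fasta")
        if fasta_path.exists():
            # YAML files without a matching FASTA file are skipped.
            commands.append(["bh20sequploader", str(fasta_path), str(yaml_path)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`, which stops the batch at the first failing upload. Printing the commands first is a cheap dry run.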

-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-10-27 Tue 06:43
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-11-05 Thu 07:27
.