Diffstat (limited to 'doc/blog/using-covid-19-pubseq-part3.html')
-rw-r--r-- | doc/blog/using-covid-19-pubseq-part3.html | 296 |
1 files changed, 204 insertions, 92 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html index 4132784..91879b0 100644 --- a/doc/blog/using-covid-19-pubseq-part3.html +++ b/doc/blog/using-covid-19-pubseq-part3.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> -<!-- 2020-05-30 Sat 10:45 --> +<!-- 2020-05-30 Sat 18:12 --> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>COVID-19 PubSeq Uploading Data (part 3)</title> @@ -248,64 +248,62 @@ for the JavaScript code in this tag. <h2>Table of Contents</h2> <div id="text-table-of-contents"> <ul> -<li><a href="#org7fda7c8">1. Uploading Data</a></li> -<li><a href="#orgb062ac0">2. Introduction</a></li> -<li><a href="#org4061598">3. Step 1: Upload sequence</a></li> -<li><a href="#org51d80f8">4. Step 2: Add metadata</a> +<li><a href="#org193669a">1. Uploading Data</a></li> +<li><a href="#orgc6b3a47">2. Step 1: Upload sequence</a></li> +<li><a href="#org9c08714">3. Step 2: Add metadata</a> <ul> -<li><a href="#orgbb8f0bb">4.1. Obligatory fields</a> +<li><a href="#org4c2e907">3.1. Obligatory fields</a> <ul> -<li><a href="#org0e615dc">4.1.1. Sample ID (sample<sub>id</sub>)</a></li> -<li><a href="#org4d5308a">4.1.2. Collection date</a></li> -<li><a href="#org429f153">4.1.3. Collection location</a></li> -<li><a href="#orgbd7fa51">4.1.4. Sequencing technology</a></li> -<li><a href="#orgc3b424f">4.1.5. Authors</a></li> +<li><a href="#orgdddcb2e">3.1.1. Sample ID (sample<sub>id</sub>)</a></li> +<li><a href="#orge9c2e76">3.1.2. Collection date</a></li> +<li><a href="#org62c55ce">3.1.3. Collection location</a></li> +<li><a href="#org460b377">3.1.4. Sequencing technology</a></li> +<li><a href="#org77b1e14">3.1.5. Authors</a></li> </ul> </li> -<li><a href="#org5c01347">4.2. Optional fields</a> +<li><a href="#org3cb346f">3.2. 
Optional fields</a> <ul> -<li><a href="#org7fc5461">4.2.1. Host information</a></li> -<li><a href="#org140c8b5">4.2.2. Collecting institution</a></li> -<li><a href="#orgf231cf9">4.2.3. Specimen source</a></li> -<li><a href="#org74de839">4.2.4. Source database accession</a></li> -<li><a href="#org8927a67">4.2.5. Strain name</a></li> +<li><a href="#orgb0cffbb">3.2.1. Host information</a></li> +<li><a href="#orgd2a43a6">3.2.2. Collecting institution</a></li> +<li><a href="#org8d5bcf7">3.2.3. Specimen source</a></li> +<li><a href="#org86b21b2">3.2.4. Source database accession</a></li> +<li><a href="#org771ea66">3.2.5. Strain name</a></li> </ul> </li> </ul> </li> -<li><a href="#org38d48d8">5. Step 3: Submit to COVID-19 PubSeq</a></li> -<li><a href="#org5ec1337">6. Step 4: Check output</a> +<li><a href="#org7d281f5">4. Step 3: Submit to COVID-19 PubSeq</a> <ul> -<li><a href="#org070e13e">6.1. Trouble shooting</a></li> +<li><a href="#orgdf0f02d">4.1. Trouble shooting</a></li> +</ul> +</li> +<li><a href="#org29f8a92">5. Step 4: Check output</a></li> +<li><a href="#orgf493854">6. Bulk sequence uploader</a> +<ul> +<li><a href="#org37fadbc">6.1. Run the uploader (CLI)</a></li> +<li><a href="#org39adf09">6.2. 
Example: uploading bulk GenBank sequences</a></li> </ul> </li> </ul> </div> </div> -<div id="outline-container-org7fda7c8" class="outline-2"> -<h2 id="org7fda7c8"><span class="section-number-2">1</span> Uploading Data</h2> -<div class="outline-text-2" id="text-1"> -<p> -<i>Work in progress!</i> -</p> -</div> -</div> -<div id="outline-container-orgb062ac0" class="outline-2"> -<h2 id="orgb062ac0"><span class="section-number-2">2</span> Introduction</h2> -<div class="outline-text-2" id="text-2"> + +<div id="outline-container-org193669a" class="outline-2"> +<h2 id="org193669a"><span class="section-number-2">1</span> Uploading Data</h2> +<div class="outline-text-2" id="text-1"> <p> The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a -public resource for global comparisons. Compute it triggered on -upload. Read the <a href="./about">ABOUT</a> page for more information. +public resource for global comparisons. A recompute of the pangenome +gets triggered on upload. Read the <a href="./about">ABOUT</a> page for more information. </p> </div> </div> -<div id="outline-container-org4061598" class="outline-2"> -<h2 id="org4061598"><span class="section-number-2">3</span> Step 1: Upload sequence</h2> -<div class="outline-text-2" id="text-3"> +<div id="outline-container-orgc6b3a47" class="outline-2"> +<h2 id="orgc6b3a47"><span class="section-number-2">2</span> Step 1: Upload sequence</h2> +<div class="outline-text-2" id="text-2"> <p> To upload a sequence in the <a href="http://covid19.genenetwork.org/">web upload page</a> hit the browse button and select the FASTA file on your local hard disk. @@ -332,9 +330,9 @@ an improved pangenome. 
</div> </div> -<div id="outline-container-org51d80f8" class="outline-2"> -<h2 id="org51d80f8"><span class="section-number-2">4</span> Step 2: Add metadata</h2> -<div class="outline-text-2" id="text-4"> +<div id="outline-container-org9c08714" class="outline-2"> +<h2 id="org9c08714"><span class="section-number-2">3</span> Step 2: Add metadata</h2> +<div class="outline-text-2" id="text-3"> <p> The <a href="./">web upload page</a> contains fields for adding metadata. Metadata is not only important for attribution, is also important for @@ -348,7 +346,7 @@ A number of fields are obligatory: sample id, date, location, technology and authors. The others are optional, but it is valuable to enter them when information is available. Metadata is defined in this <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml">schema</a>. From this schema we generate the input form. Note that -opitional fields have a question mark in the <code>type</code>. You can add +optional fields have a question mark in the <code>type</code>. You can add metadata yourself, btw, because this is a public resource! See also <a href="./blog?id=using-covid-19-pubseq-part5">Modify metadata</a> for more information. </p> @@ -359,13 +357,13 @@ the web form. Here we add some extra information. 
</p> </div> -<div id="outline-container-orgbb8f0bb" class="outline-3"> -<h3 id="orgbb8f0bb"><span class="section-number-3">4.1</span> Obligatory fields</h3> -<div class="outline-text-3" id="text-4-1"> +<div id="outline-container-org4c2e907" class="outline-3"> +<h3 id="org4c2e907"><span class="section-number-3">3.1</span> Obligatory fields</h3> +<div class="outline-text-3" id="text-3-1"> </div> -<div id="outline-container-org0e615dc" class="outline-4"> -<h4 id="org0e615dc"><span class="section-number-4">4.1.1</span> Sample ID (sample<sub>id</sub>)</h4> -<div class="outline-text-4" id="text-4-1-1"> +<div id="outline-container-orgdddcb2e" class="outline-4"> +<h4 id="orgdddcb2e"><span class="section-number-4">3.1.1</span> Sample ID (sample<sub>id</sub>)</h4> +<div class="outline-text-4" id="text-3-1-1"> <p> This is a string field that defines a unique sample identifier by the submitter. In addition to sample<sub>id</sub> we also have host<sub>id</sub>, @@ -382,37 +380,37 @@ Here we add the GenBank ID MT536190.1. </div> </div> -<div id="outline-container-org4d5308a" class="outline-4"> -<h4 id="org4d5308a"><span class="section-number-4">4.1.2</span> Collection date</h4> -<div class="outline-text-4" id="text-4-1-2"> +<div id="outline-container-orge9c2e76" class="outline-4"> +<h4 id="orge9c2e76"><span class="section-number-4">3.1.2</span> Collection date</h4> +<div class="outline-text-4" id="text-3-1-2"> <p> Estimated collection date. The GenBank page says April 6, 2020. 
</p> </div> </div> -<div id="outline-container-org429f153" class="outline-4"> -<h4 id="org429f153"><span class="section-number-4">4.1.3</span> Collection location</h4> -<div class="outline-text-4" id="text-4-1-3"> +<div id="outline-container-org62c55ce" class="outline-4"> +<h4 id="org62c55ce"><span class="section-number-4">3.1.3</span> Collection location</h4> +<div class="outline-text-4" id="text-3-1-3"> <p> -A search on wikidata says Los Angelos is +A search on wikidata says Los Angeles is <a href="https://www.wikidata.org/entity/Q65">https://www.wikidata.org/entity/Q65</a> </p> </div> </div> -<div id="outline-container-orgbd7fa51" class="outline-4"> -<h4 id="orgbd7fa51"><span class="section-number-4">4.1.4</span> Sequencing technology</h4> -<div class="outline-text-4" id="text-4-1-4"> +<div id="outline-container-org460b377" class="outline-4"> +<h4 id="org460b377"><span class="section-number-4">3.1.4</span> Sequencing technology</h4> +<div class="outline-text-4" id="text-3-1-4"> <p> GenBank entry says Illumina, so we can fill that in </p> </div> </div> -<div id="outline-container-orgc3b424f" class="outline-4"> -<h4 id="orgc3b424f"><span class="section-number-4">4.1.5</span> Authors</h4> -<div class="outline-text-4" id="text-4-1-5"> +<div id="outline-container-org77b1e14" class="outline-4"> +<h4 id="org77b1e14"><span class="section-number-4">3.1.5</span> Authors</h4> +<div class="outline-text-4" id="text-3-1-5"> <p> GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X., @@ -422,17 +420,17 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in. 
</div> </div> -<div id="outline-container-org5c01347" class="outline-3"> -<h3 id="org5c01347"><span class="section-number-3">4.2</span> Optional fields</h3> -<div class="outline-text-3" id="text-4-2"> +<div id="outline-container-org3cb346f" class="outline-3"> +<h3 id="org3cb346f"><span class="section-number-3">3.2</span> Optional fields</h3> +<div class="outline-text-3" id="text-3-2"> <p> All other fields are optional. But let's see what we can add. </p> </div> -<div id="outline-container-org7fc5461" class="outline-4"> -<h4 id="org7fc5461"><span class="section-number-4">4.2.1</span> Host information</h4> -<div class="outline-text-4" id="text-4-2-1"> +<div id="outline-container-orgb0cffbb" class="outline-4"> +<h4 id="orgb0cffbb"><span class="section-number-4">3.2.1</span> Host information</h4> +<div class="outline-text-4" id="text-3-2-1"> <p> Sadly, not much is known about the host from GenBank. A little sleuthing renders an interesting paper by some of the authors titled @@ -445,27 +443,27 @@ did to the person and what the person was like (say age group). </div> </div> -<div id="outline-container-org140c8b5" class="outline-4"> -<h4 id="org140c8b5"><span class="section-number-4">4.2.2</span> Collecting institution</h4> -<div class="outline-text-4" id="text-4-2-2"> +<div id="outline-container-orgd2a43a6" class="outline-4"> +<h4 id="orgd2a43a6"><span class="section-number-4">3.2.2</span> Collecting institution</h4> +<div class="outline-text-4" id="text-3-2-2"> <p> We can fill that in. 
</p> </div> </div> -<div id="outline-container-orgf231cf9" class="outline-4"> -<h4 id="orgf231cf9"><span class="section-number-4">4.2.3</span> Specimen source</h4> -<div class="outline-text-4" id="text-4-2-3"> +<div id="outline-container-org8d5bcf7" class="outline-4"> +<h4 id="org8d5bcf7"><span class="section-number-4">3.2.3</span> Specimen source</h4> +<div class="outline-text-4" id="text-3-2-3"> <p> We have that: nasopharyngeal swab </p> </div> </div> -<div id="outline-container-org74de839" class="outline-4"> -<h4 id="org74de839"><span class="section-number-4">4.2.4</span> Source database accession</h4> -<div class="outline-text-4" id="text-4-2-4"> +<div id="outline-container-org86b21b2" class="outline-4"> +<h4 id="org86b21b2"><span class="section-number-4">3.2.4</span> Source database accession</h4> +<div class="outline-text-4" id="text-3-2-4"> <p> Genbank which is <a href="http://identifiers.org/insdc/MT536190.1#sequence">http://identifiers.org/insdc/MT536190.1#sequence</a>. Note we plug in our own identifier MT536190.1. @@ -473,9 +471,9 @@ Note we plug in our own identifier MT536190.1. 
</div> </div> -<div id="outline-container-org8927a67" class="outline-4"> -<h4 id="org8927a67"><span class="section-number-4">4.2.5</span> Strain name</h4> -<div class="outline-text-4" id="text-4-2-5"> +<div id="outline-container-org771ea66" class="outline-4"> +<h4 id="org771ea66"><span class="section-number-4">3.2.5</span> Strain name</h4> +<div class="outline-text-4" id="text-3-2-5"> <p> SARS-CoV-2/human/USA/LA-BIE-070/2020 </p> @@ -484,20 +482,36 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020 </div> </div> -<div id="outline-container-org38d48d8" class="outline-2"> -<h2 id="org38d48d8"><span class="section-number-2">5</span> Step 3: Submit to COVID-19 PubSeq</h2> -<div class="outline-text-2" id="text-5"> +<div id="outline-container-org7d281f5" class="outline-2"> +<h2 id="org7d281f5"><span class="section-number-2">4</span> Step 3: Submit to COVID-19 PubSeq</h2> +<div class="outline-text-2" id="text-4"> <p> Once you have the sequence and the metadata together, hit the 'Add to Pangenome' button. The data will be checked, submitted and the workflows should kick in! </p> </div> + + +<div id="outline-container-orgdf0f02d" class="outline-3"> +<h3 id="orgdf0f02d"><span class="section-number-3">4.1</span> Trouble shooting</h3> +<div class="outline-text-3" id="text-4-1"> +<p> +We got an error saying: {"stem": "<a href="http://www.wikidata.org/entity/">http://www.wikidata.org/entity/</a>",… +which means that our location field was not formed correctly! After +fixing it to look like <a href="http://www.wikidata.org/entity/Q65">http://www.wikidata.org/entity/Q65</a> (note http +instead on https and entity instead of wiki) the submission went +through. Reload the page (it won't empty the fields) to re-enable the +submit button. 
+</p> +</div> +</div> </div> -<div id="outline-container-org5ec1337" class="outline-2"> -<h2 id="org5ec1337"><span class="section-number-2">6</span> Step 4: Check output</h2> -<div class="outline-text-2" id="text-6"> + +<div id="outline-container-org29f8a92" class="outline-2"> +<h2 id="org29f8a92"><span class="section-number-2">5</span> Step 4: Check output</h2> +<div class="outline-text-2" id="text-5"> <p> The current pipeline takes 5.5 hours to complete! Once it completes the updated data can be checked on the <a href="./download">DOWNLOAD</a> page. After completion @@ -505,24 +519,122 @@ of above output this <a href="http://sparql.genenetwork.org/sparql/?default-grap in. </p> </div> +</div> + +<div id="outline-container-orgf493854" class="outline-2"> +<h2 id="orgf493854"><span class="section-number-2">6</span> Bulk sequence uploader</h2> +<div class="outline-text-2" id="text-6"> +<p> +Above steps require a manual upload of one sequence with metadata. +What if you have a number of sequences you want to upload in bulk? +For this we have a command line version of the uploader that can +directly submit to COVID-19 PubSeq. It accepts a FASTA sequence +file an associated metadata in <a href="https://github.com/arvados/bh20-seq-resource/blob/master/example/maximum_metadata_example.yaml">YAML</a> format. The YAML matches +the web form and gets validated from the same <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml">schema</a> looks. 
The YAML +that you need to create/generate for your samples looks like +</p> + +<div class="org-src-container"> +<pre class="src src-json">id: placeholder + +host: + host_id: XX<span style="color: #8bc34a;">1</span> + host_species: http://purl.obolibrary.org/obo/NCBITaxon_<span style="color: #8bc34a;">9606</span> + host_sex: http://purl.obolibrary.org/obo/PATO_<span style="color: #8bc34a;">0000384</span> + host_age: <span style="color: #8bc34a;">20</span> + host_age_unit: http://purl.obolibrary.org/obo/UO_<span style="color: #8bc34a;">0000036</span> + host_health_status: http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">25269</span> + host_treatment: Process in which the act is intended to modify or alter host status <span style="color: #e91e63;">(</span>Compounds<span style="color: #e91e63;">)</span> + host_vaccination: <span style="color: #e91e63;">[</span>vaccines<span style="color: #8bc34a;">1</span>,vaccine<span style="color: #8bc34a;">2</span><span style="color: #e91e63;">]</span> + ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_<span style="color: #8bc34a;">0010</span> + additional_host_information: Optional free text field for additional information + +sample: + sample_id: Id of the sample as defined by the submitter + collector_name: Name of the person that took the sample + collecting_institution: Institute that was responsible of sampling + specimen_source: <span style="color: #e91e63;">[</span>http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155831</span>,http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155835</span>] + collection_date: <span style="color: #9ccc65;">"2020-01-01"</span> + collection_location: http://www.wikidata.org/entity/Q<span style="color: #8bc34a;">148</span> + sample_storage_conditions: frozen specimen + source_database_accession: <span style="color: #2196F3;">[</span>http://identifiers.org/insdc/LC<span style="color: #8bc34a;">522350.1</span>#sequence] + 
additional_collection_information: Optional free text field for additional information + +virus: + virus_species: http://purl.obolibrary.org/obo/NCBITaxon_<span style="color: #8bc34a;">2697049</span> + virus_strain: SARS-CoV-<span style="color: #8bc34a;">2</span>/human/CHN/HS_<span style="color: #8bc34a;">8</span>/<span style="color: #8bc34a;">2020</span> + +technology: + sample_sequencing_technology: <span style="color: #EF6C00;">[</span>http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>,http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>] + sequence_assembly_method: Protocol used for assembly + sequencing_coverage: <span style="color: #B388FF;">[</span><span style="color: #8bc34a;">70.0</span>, <span style="color: #8bc34a;">100.0</span><span style="color: #B388FF;">]</span> + additional_technology_information: Optional free text field for additional information + +submitter: + authors: <span style="color: #B388FF;">[</span>John Doe, Joe Boe, Jonny Oe<span style="color: #B388FF;">]</span> + submitter_name: <span style="color: #B388FF;">[</span>John Doe<span style="color: #B388FF;">]</span> + submitter_address: John Doe's address + originating_lab: John Doe kitchen + lab_address: John Doe's address + provider_sample_id: XXX<span style="color: #8bc34a;">1</span> + submitter_sample_id: XXX<span style="color: #8bc34a;">2</span> + publication: PMID<span style="color: #8bc34a;">00001113</span> + submitter_orcid: <span style="color: #B388FF;">[</span>https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>,https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0001</span>] + additional_submitter_information: Optional free text field for additional information +</pre> +</div> +</div> 
-<div id="outline-container-org070e13e" class="outline-3"> -<h3 id="org070e13e"><span class="section-number-3">6.1</span> Trouble shooting</h3> +<div id="outline-container-org37fadbc" class="outline-3"> +<h3 id="org37fadbc"><span class="section-number-3">6.1</span> Run the uploader (CLI)</h3> <div class="outline-text-3" id="text-6-1"> <p> -We got an error saying: {"stem": "<a href="http://www.wikidata.org/entity/">http://www.wikidata.org/entity/</a>",… -which means that our location field was not formed correctly! After -fixing it to look like <a href="http://www.wikidata.org/entity/Q65">http://www.wikidata.org/entity/Q65</a> (note http -instead on https and entity instead of wiki) the submission went -through. Reload the page (it won't empty the fields) to re-enable the -submit button. +Installing with pip you should be +able to run +</p> + +<pre class="example"> +bh20sequploader sequence.fasta metadata.yaml +</pre> + + + +<p> +Alternatively the script can be installed from <a href="https://github.com/arvados/bh20-seq-resource#installation">github</a>. Run on the +command line +</p> + +<pre class="example"> +python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml +</pre> + + +<p> +after installing dependencies (also described in <a href="https://github.com/arvados/bh20-seq-resource/blob/master/doc/INSTALL.md">INSTALL</a> with the GNU +Guix package manager). +</p> + +<p> +The web interface using this exact same script so it should just work +(TM). +</p> +</div> +</div> + +<div id="outline-container-org39adf09" class="outline-3"> +<h3 id="org39adf09"><span class="section-number-3">6.2</span> Example: uploading bulk GenBank sequences</h3> +<div class="outline-text-3" id="text-6-2"> +<p> +We also use above script to bulk upload GenBank sequences with a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py">FASTA +and YAML</a> extractor specific for GenBank. 
This means that the steps we +took above for uploading a GenBank sequence are already automated. </p> </div> </div> </div> </div> <div id="postamble" class="status"> -<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-05-30 Sat 10:44</small>. +<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-05-30 Sat 18:12</small>. </div> </body> </html> |
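
Note on the new "Bulk sequence uploader" section: the post shows the uploader taking one FASTA file and one YAML file per call (`bh20sequploader sequence.fasta metadata.yaml`). A minimal sketch of how a directory of samples might be looped over is given below. It assumes each sample is stored as `NAME.fasta` next to `NAME.yaml`; that layout and the helper names `pair_sequences_with_metadata`, `build_command` and `upload_all` are hypothetical and not part of bh20-seq-resource. Only the two-positional-argument invocation comes from the post.

```python
import subprocess
from pathlib import Path


def pair_sequences_with_metadata(directory):
    """Return (fasta, yaml) path pairs that share a file stem, sorted by name.

    Assumed layout (not from the post): NAME.fasta sits next to NAME.yaml.
    FASTA files without a matching YAML file are skipped.
    """
    directory = Path(directory)
    pairs = []
    for fasta in sorted(directory.glob("*.fasta")):
        metadata = fasta.with_suffix(".yaml")
        if metadata.exists():
            pairs.append((fasta, metadata))
    return pairs


def build_command(fasta, metadata):
    """Build the invocation shown in the post: sequence first, metadata second."""
    return ["bh20sequploader", str(fasta), str(metadata)]


def upload_all(directory, dry_run=True):
    """Return the commands for every pair; run them only when dry_run is False."""
    commands = [build_command(f, m) for f, m in pair_sequences_with_metadata(directory)]
    if not dry_run:
        for command in commands:
            # check=True stops at the first failed upload instead of continuing.
            subprocess.run(command, check=True)
    return commands
```

Calling `upload_all("samples")` with the default `dry_run=True` only lists the commands, which makes it possible to review them before anything is submitted to the public resource.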