Working on upload

author: Pjotr Prins 2020-11-06 07:45:10 +0000
committer: Pjotr Prins 2020-11-06 07:45:10 +0000
commit: fbbec51e604964d18ab72cbf0ac24b102ecc0376 (patch)
tree: 4e1a231f25cf2cd3eb4c5bfee3b50d4618a5c59f
parent: e921e0e061e6226f185e48e378ce083c149600b4 (diff)
download: bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.tar.gz
bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.tar.lz
bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.zip
3 files changed, 277 insertions, 150 deletions
diff --git a/doc/INSTALL.md b/doc/INSTALL.md
index 0180a4b..96cf1d4 100644
--- a/doc/INSTALL.md
+++ b/doc/INSTALL.md
@@ -68,6 +68,11 @@ penguin2:~/iwrk/opensource/code/vg/bh20-seq-resource$  env GUIX_PACKAGE_PATH=~/i
 
 Note: see above on GUIX_PACKAGE_PATH.
 
+## Run the tests
+
+    guix package -i python-requests python-pandas python-jinja2 python -p ~/opt/python-dev
+    . ~/opt/python-dev/etc/profile
+
 
 ## Run Virtuoso-ose
 
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index 788c1d2..b49830b 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
 <head>
-<!-- 2020-10-27 Tue 06:43 -->
+<!-- 2020-11-05 Thu 07:28 -->
 <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1" />
 <title>COVID-19 PubSeq Uploading Data (part 3)</title>
@@ -224,52 +224,66 @@
 <h2>Table of Contents</h2>
 <div id="text-table-of-contents">
 <ul>
-<li><a href="#orga9eabf3">1. Uploading Data</a></li>
-<li><a href="#org643e745">2. Step 1: Upload sequence</a></li>
-<li><a href="#org0874b9f">3. Step 2: Add metadata</a>
+<li><a href="#org85998fd">1. Introduction</a></li>
+<li><a href="#orge783233">2. Uploading data</a></li>
+<li><a href="#orgc5810d7">3. Step 1: Upload sequence</a></li>
+<li><a href="#org5a4ae99">4. Step 2: Add metadata</a>
 <ul>
-<li><a href="#orgaaa44f2">3.1. Obligatory fields</a>
+<li><a href="#orga9824de">4.1. Obligatory fields</a>
 <ul>
-<li><a href="#orgf38cdbf">3.1.1. Sample ID (sample_id)</a></li>
-<li><a href="#org34b5b06">3.1.2. Collection date</a></li>
-<li><a href="#org221f1cf">3.1.3. Collection location</a></li>
-<li><a href="#org75d1dad">3.1.4. Sequencing technology</a></li>
-<li><a href="#org990e897">3.1.5. Authors</a></li>
+<li><a href="#org407fde2">4.1.1. Sample ID (sample_id)</a></li>
+<li><a href="#orgee3bb35">4.1.2. Collection date</a></li>
+<li><a href="#org123bf0c">4.1.3. Collection location</a></li>
+<li><a href="#org41b1f83">4.1.4. Sequencing technology</a></li>
+<li><a href="#org9bab62e">4.1.5. Authors</a></li>
 </ul>
 </li>
-<li><a href="#org959072e">3.2. Optional fields</a>
+<li><a href="#org7071af8">4.2. Optional fields</a>
 <ul>
-<li><a href="#org561b754">3.2.1. Host information</a></li>
-<li><a href="#org774a993">3.2.2. Collecting institution</a></li>
-<li><a href="#orgcf096cf">3.2.3. Specimen source</a></li>
-<li><a href="#orgeac0fd8">3.2.4. Source database accession</a></li>
-<li><a href="#org3c0aebd">3.2.5. Strain name</a></li>
+<li><a href="#org2a04fdb">4.2.1. Host information</a></li>
+<li><a href="#orgc4084bc">4.2.2. Collecting institution</a></li>
+<li><a href="#orge552325">4.2.3. Specimen source</a></li>
+<li><a href="#org2577e1f">4.2.4. Source database accession</a></li>
+<li><a href="#org0305fb3">4.2.5. Strain name</a></li>
 </ul>
 </li>
 </ul>
 </li>
-<li><a href="#org9f09957">4. Step 3: Submit to COVID-19 PubSeq</a>
+<li><a href="#orgdf67705">5. Step 3: Submit to COVID-19 PubSeq</a>
 <ul>
-<li><a href="#org25372da">4.1. Trouble shooting</a></li>
+<li><a href="#orgd33218c">5.1. Trouble shooting</a></li>
 </ul>
 </li>
-<li><a href="#org8d1b4ad">5. Step 4: Check output</a></li>
-<li><a href="#orgd86b3dc">6. Bulk sequence uploader</a>
+<li><a href="#orgbf4cd0f">6. Step 4: Check output</a></li>
+<li><a href="#orgc8d6fa4">7. Bulk sequence uploader</a>
 <ul>
-<li><a href="#orgc4aa7a1">6.1. Run the uploader (CLI)</a></li>
-<li><a href="#org46687b5">6.2. Example: uploading bulk GenBank sequences</a></li>
-<li><a href="#orgbc228bc">6.3. Example: preparing metadata</a></li>
+<li><a href="#org338ebf7">7.1. Run the uploader (CLI)</a></li>
+<li><a href="#org46d5e2f">7.2. Example: uploading bulk GenBank sequences</a></li>
+<li><a href="#orgbfc3f90">7.3. Example: preparing metadata</a></li>
 </ul>
 </li>
 </ul>
 </div>
 </div>
 
+<div id="outline-container-org85998fd" class="outline-2">
+<h2 id="org85998fd"><span class="section-number-2">1</span> Introduction</h2>
+<div class="outline-text-2" id="text-1">
+<p>
+In this document we explain how to upload data into COVID-19 PubSeq.
+This can happen through a web page, or through a command line
+script. We'll also show how to parametrize uploads by using templates.
+The procedure is much easier than with other repositories and can be
+fully automated. Once uploaded you can use our export API to prepare
+for other repositories.
+</p>
+</div>
+</div>
 
 
-<div id="outline-container-orga9eabf3" class="outline-2">
-<h2 id="orga9eabf3"><span class="section-number-2">1</span> Uploading Data</h2>
-<div class="outline-text-2" id="text-1">
+<div id="outline-container-orge783233" class="outline-2">
+<h2 id="orge783233"><span class="section-number-2">2</span> Uploading data</h2>
+<div class="outline-text-2" id="text-2">
 <p>
 The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
 public resource for global comparisons. A recompute of the pangenome
@@ -278,9 +292,9 @@ gets triggered on upload. Read the <a href="./about">ABOUT</a> page for more inf
 </div>
 </div>
 
-<div id="outline-container-org643e745" class="outline-2">
-<h2 id="org643e745"><span class="section-number-2">2</span> Step 1: Upload sequence</h2>
-<div class="outline-text-2" id="text-2">
+<div id="outline-container-orgc5810d7" class="outline-2">
+<h2 id="orgc5810d7"><span class="section-number-2">3</span> Step 1: Upload sequence</h2>
+<div class="outline-text-2" id="text-3">
 <p>
 To upload a sequence in the <a href="http://covid19.genenetwork.org/">web upload page</a> hit the browse button and
 select the FASTA file on your local hard disk.
@@ -307,9 +321,9 @@ an improved pangenome.
 </div>
 </div>
 
-<div id="outline-container-org0874b9f" class="outline-2">
-<h2 id="org0874b9f"><span class="section-number-2">3</span> Step 2: Add metadata</h2>
-<div class="outline-text-2" id="text-3">
+<div id="outline-container-org5a4ae99" class="outline-2">
+<h2 id="org5a4ae99"><span class="section-number-2">4</span> Step 2: Add metadata</h2>
+<div class="outline-text-2" id="text-4">
 <p>
 The <a href="./">web upload page</a> contains fields for adding metadata. Metadata is
 not only important for attribution, is also important for
@@ -334,13 +348,13 @@ the web form. Here we add some extra information.
 </p>
 </div>
 
-<div id="outline-container-orgaaa44f2" class="outline-3">
-<h3 id="orgaaa44f2"><span class="section-number-3">3.1</span> Obligatory fields</h3>
-<div class="outline-text-3" id="text-3-1">
+<div id="outline-container-orga9824de" class="outline-3">
+<h3 id="orga9824de"><span class="section-number-3">4.1</span> Obligatory fields</h3>
+<div class="outline-text-3" id="text-4-1">
 </div>
-<div id="outline-container-orgf38cdbf" class="outline-4">
-<h4 id="orgf38cdbf"><span class="section-number-4">3.1.1</span> Sample ID (sample_id)</h4>
-<div class="outline-text-4" id="text-3-1-1">
+<div id="outline-container-org407fde2" class="outline-4">
+<h4 id="org407fde2"><span class="section-number-4">4.1.1</span> Sample ID (sample_id)</h4>
+<div class="outline-text-4" id="text-4-1-1">
 <p>
 This is a string field that defines a unique sample identifier by the
 submitter. In addition to sample_id we also have host_id,
@@ -357,18 +371,18 @@ Here we add the GenBank ID MT536190.1.
 </div>
 </div>
 
-<div id="outline-container-org34b5b06" class="outline-4">
-<h4 id="org34b5b06"><span class="section-number-4">3.1.2</span> Collection date</h4>
-<div class="outline-text-4" id="text-3-1-2">
+<div id="outline-container-orgee3bb35" class="outline-4">
+<h4 id="orgee3bb35"><span class="section-number-4">4.1.2</span> Collection date</h4>
+<div class="outline-text-4" id="text-4-1-2">
 <p>
 Estimated collection date. The GenBank page says April 6, 2020.
 </p>
 </div>
 </div>
 
-<div id="outline-container-org221f1cf" class="outline-4">
-<h4 id="org221f1cf"><span class="section-number-4">3.1.3</span> Collection location</h4>
-<div class="outline-text-4" id="text-3-1-3">
+<div id="outline-container-org123bf0c" class="outline-4">
+<h4 id="org123bf0c"><span class="section-number-4">4.1.3</span> Collection location</h4>
+<div class="outline-text-4" id="text-4-1-3">
 <p>
 A search on wikidata says Los Angeles is
 <a href="https://www.wikidata.org/entity/Q65">https://www.wikidata.org/entity/Q65</a>
@@ -376,18 +390,18 @@ A search on wikidata says Los Angeles is
 </div>
 </div>
 
-<div id="outline-container-org75d1dad" class="outline-4">
-<h4 id="org75d1dad"><span class="section-number-4">3.1.4</span> Sequencing technology</h4>
-<div class="outline-text-4" id="text-3-1-4">
+<div id="outline-container-org41b1f83" class="outline-4">
+<h4 id="org41b1f83"><span class="section-number-4">4.1.4</span> Sequencing technology</h4>
+<div class="outline-text-4" id="text-4-1-4">
 <p>
 GenBank entry says Illumina, so we can fill that in
 </p>
 </div>
 </div>
 
-<div id="outline-container-org990e897" class="outline-4">
-<h4 id="org990e897"><span class="section-number-4">3.1.5</span> Authors</h4>
-<div class="outline-text-4" id="text-3-1-5">
+<div id="outline-container-org9bab62e" class="outline-4">
+<h4 id="org9bab62e"><span class="section-number-4">4.1.5</span> Authors</h4>
+<div class="outline-text-4" id="text-4-1-5">
 <p>
 GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
 Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X.,
@@ -397,17 +411,17 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.
 </div>
 </div>
 
-<div id="outline-container-org959072e" class="outline-3">
-<h3 id="org959072e"><span class="section-number-3">3.2</span> Optional fields</h3>
-<div class="outline-text-3" id="text-3-2">
+<div id="outline-container-org7071af8" class="outline-3">
+<h3 id="org7071af8"><span class="section-number-3">4.2</span> Optional fields</h3>
+<div class="outline-text-3" id="text-4-2">
 <p>
 All other fields are optional. But let's see what we can add.
 </p>
 </div>
 
-<div id="outline-container-org561b754" class="outline-4">
-<h4 id="org561b754"><span class="section-number-4">3.2.1</span> Host information</h4>
-<div class="outline-text-4" id="text-3-2-1">
+<div id="outline-container-org2a04fdb" class="outline-4">
+<h4 id="org2a04fdb"><span class="section-number-4">4.2.1</span> Host information</h4>
+<div class="outline-text-4" id="text-4-2-1">
 <p>
 Sadly, not much is known about the host from GenBank. A little
 sleuthing renders an interesting paper by some of the authors titled
@@ -420,27 +434,27 @@ did to the person and what the person was like (say age group).
 </div>
 </div>
 
-<div id="outline-container-org774a993" class="outline-4">
-<h4 id="org774a993"><span class="section-number-4">3.2.2</span> Collecting institution</h4>
-<div class="outline-text-4" id="text-3-2-2">
+<div id="outline-container-orgc4084bc" class="outline-4">
+<h4 id="orgc4084bc"><span class="section-number-4">4.2.2</span> Collecting institution</h4>
+<div class="outline-text-4" id="text-4-2-2">
 <p>
 We can fill that in.
 </p>
 </div>
 </div>
 
-<div id="outline-container-orgcf096cf" class="outline-4">
-<h4 id="orgcf096cf"><span class="section-number-4">3.2.3</span> Specimen source</h4>
-<div class="outline-text-4" id="text-3-2-3">
+<div id="outline-container-orge552325" class="outline-4">
+<h4 id="orge552325"><span class="section-number-4">4.2.3</span> Specimen source</h4>
+<div class="outline-text-4" id="text-4-2-3">
 <p>
 We have that: nasopharyngeal swab
 </p>
 </div>
 </div>
 
-<div id="outline-container-orgeac0fd8" class="outline-4">
-<h4 id="orgeac0fd8"><span class="section-number-4">3.2.4</span> Source database accession</h4>
-<div class="outline-text-4" id="text-3-2-4">
+<div id="outline-container-org2577e1f" class="outline-4">
+<h4 id="org2577e1f"><span class="section-number-4">4.2.4</span> Source database accession</h4>
+<div class="outline-text-4" id="text-4-2-4">
 <p>
 Genbank which is <a href="http://identifiers.org/insdc/MT536190.1#sequence">http://identifiers.org/insdc/MT536190.1#sequence</a>.
 Note we plug in our own identifier MT536190.1.
@@ -448,9 +462,9 @@ Note we plug in our own identifier MT536190.1.
 </div>
 </div>
 
-<div id="outline-container-org3c0aebd" class="outline-4">
-<h4 id="org3c0aebd"><span class="section-number-4">3.2.5</span> Strain name</h4>
-<div class="outline-text-4" id="text-3-2-5">
+<div id="outline-container-org0305fb3" class="outline-4">
+<h4 id="org0305fb3"><span class="section-number-4">4.2.5</span> Strain name</h4>
+<div class="outline-text-4" id="text-4-2-5">
 <p>
 SARS-CoV-2/human/USA/LA-BIE-070/2020
 </p>
@@ -459,9 +473,9 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020
 </div>
 </div>
 
-<div id="outline-container-org9f09957" class="outline-2">
-<h2 id="org9f09957"><span class="section-number-2">4</span> Step 3: Submit to COVID-19 PubSeq</h2>
-<div class="outline-text-2" id="text-4">
+<div id="outline-container-orgdf67705" class="outline-2">
+<h2 id="orgdf67705"><span class="section-number-2">5</span> Step 3: Submit to COVID-19 PubSeq</h2>
+<div class="outline-text-2" id="text-5">
 <p>
 Once you have the sequence and the metadata together, hit
 the 'Add to Pangenome' button. The data will be checked,
@@ -470,9 +484,9 @@ submitted and the workflows should kick in!
 </div>
 
 
-<div id="outline-container-org25372da" class="outline-3">
-<h3 id="org25372da"><span class="section-number-3">4.1</span> Trouble shooting</h3>
-<div class="outline-text-3" id="text-4-1">
+<div id="outline-container-orgd33218c" class="outline-3">
+<h3 id="orgd33218c"><span class="section-number-3">5.1</span> Trouble shooting</h3>
+<div class="outline-text-3" id="text-5-1">
 <p>
 We got an error saying: {"stem": "<a href="http://www.wikidata.org/entity/">http://www.wikidata.org/entity/</a>",&#x2026;
 which means that our location field was not formed correctly!  After
@@ -485,9 +499,9 @@ submit button.
 </div>
 </div>
 
-<div id="outline-container-org8d1b4ad" class="outline-2">
-<h2 id="org8d1b4ad"><span class="section-number-2">5</span> Step 4: Check output</h2>
-<div class="outline-text-2" id="text-5">
+<div id="outline-container-orgbf4cd0f" class="outline-2">
+<h2 id="orgbf4cd0f"><span class="section-number-2">6</span> Step 4: Check output</h2>
+<div class="outline-text-2" id="text-6">
 <p>
 The current pipeline takes 5.5 hours to complete! Once it completes
 the updated data can be checked on the <a href="./download">DOWNLOAD</a> page. After completion
@@ -497,9 +511,9 @@ in.
 </div>
 </div>
 
-<div id="outline-container-orgd86b3dc" class="outline-2">
-<h2 id="orgd86b3dc"><span class="section-number-2">6</span> Bulk sequence uploader</h2>
-<div class="outline-text-2" id="text-6">
+<div id="outline-container-orgc8d6fa4" class="outline-2">
+<h2 id="orgc8d6fa4"><span class="section-number-2">7</span> Bulk sequence uploader</h2>
+<div class="outline-text-2" id="text-7">
 <p>
 Above steps require a manual upload of one sequence with metadata.
 What if you have a number of sequences you want to upload in bulk?
@@ -510,6 +524,39 @@ the web form and gets validated from the same <a href="https://github.com/arvado
 that you need to create/generate for your samples looks like
 </p>
 
+<p>
+A minimal example of metadata looks like
+</p>
+
+<div class="org-src-container">
+<pre class="src src-json">id: placeholder
+
+license:
+    license_type: http://creativecommons.org/licenses/by/<span style="color: #8bc34a;">4.0</span>/
+
+host:
+    host_species: http://purl.obolibrary.org/obo/NCBITaxon_<span style="color: #8bc34a;">9606</span>
+
+sample:
+    sample_id: XX
+    collection_date: <span style="color: #9ccc65;">"2020-01-01"</span>
+    collection_location: http://www.wikidata.org/entity/Q<span style="color: #8bc34a;">148</span>
+
+virus:
+    virus_species: http://purl.obolibrary.org/obo/NCBITaxon_<span style="color: #8bc34a;">2697049</span>
+
+technology:
+    sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0008632</span>]
+
+submitter:
+    authors: [John Doe]
+</pre>
+</div>
+
+<p>
+a more elaborate example (note most fields are optional) may look like
+</p>
+
 <div class="org-src-container">
 <pre class="src src-json">id: placeholder
 
@@ -559,11 +606,20 @@ submitter:
     additional_submitter_information: Optional free text field for additional information
 </pre>
 </div>
+
+<p>
+more metadata is yummy. <a href="https://yummydata.org/">Yummydata</a> is useful to a wider community. Note
+that many of the terms in above example are URIs, such as
+host_species: <a href="http://purl.obolibrary.org/obo/NCBITaxon_9606">http://purl.obolibrary.org/obo/NCBITaxon_9606</a>.  We use
+web ontologies for these to make the data less ambiguous and more
+FAIR. Check out the option fields as defined in the schema. If it is not listed
+a little bit of web searching may be required or <a href="./contact">contact</a> us.
+</p>
 </div>
 
-<div id="outline-container-orgc4aa7a1" class="outline-3">
-<h3 id="orgc4aa7a1"><span class="section-number-3">6.1</span> Run the uploader (CLI)</h3>
-<div class="outline-text-3" id="text-6-1">
+<div id="outline-container-org338ebf7" class="outline-3">
+<h3 id="org338ebf7"><span class="section-number-3">7.1</span> Run the uploader (CLI)</h3>
+<div class="outline-text-3" id="text-7-1">
 <p>
 Installing with pip you should be
 able to run
@@ -574,7 +630,6 @@ bh20sequploader sequence.fasta metadata.yaml
 </pre>
 
 
-
 <p>
 Alternatively the script can be installed from <a href="https://github.com/arvados/bh20-seq-resource#installation">github</a>. Run on the
 command line
@@ -617,9 +672,9 @@ The web interface using this exact same script so it should just work
 </div>
 
 
-<div id="outline-container-org46687b5" class="outline-3">
-<h3 id="org46687b5"><span class="section-number-3">6.2</span> Example: uploading bulk GenBank sequences</h3>
-<div class="outline-text-3" id="text-6-2">
+<div id="outline-container-org46d5e2f" class="outline-3">
+<h3 id="org46d5e2f"><span class="section-number-3">7.2</span> Example: uploading bulk GenBank sequences</h3>
+<div class="outline-text-3" id="text-7-2">
 <p>
 We also use above script to bulk upload GenBank sequences with a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py">FASTA
 and YAML</a> extractor specific for GenBank. This means that the steps we
@@ -645,14 +700,15 @@ ls $<span style="color: #ffcc80;">dir_fasta_and_yaml</span>/*.yaml | <span style
 </div>
 
 
-<div id="outline-container-orgbc228bc" class="outline-3">
-<h3 id="orgbc228bc"><span class="section-number-3">6.3</span> Example: preparing metadata</h3>
-<div class="outline-text-3" id="text-6-3">
+<div id="outline-container-orgbfc3f90" class="outline-3">
+<h3 id="orgbfc3f90"><span class="section-number-3">7.3</span> Example: preparing metadata</h3>
+<div class="outline-text-3" id="text-7-3">
 <p>
-Usually, metadata are available in tabular format, like spreadsheets. As an example, we provide a script
-<a href="https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples">esr_samples.py</a> to show you how to parse
-your metadata in YAML files ready for the upload. To execute the script, go in the ~bh20-seq-resource/scripts/esr_samples
-and execute
+Usually, metadata are available in a tabular format, such as
+spreadsheets. As an example, we provide a script <a href="https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples">esr_samples.py</a> to
+show you how to parse your metadata in YAML files ready for the
+upload. To execute the script, go in the
+~bh20-seq-resource/scripts/esr_samples and execute
 </p>
 
 <div class="org-src-container">
@@ -661,14 +717,27 @@ and execute
 </div>
 
 <p>
-You will find the YAML files in the `yaml` folder which will be created in the same directory.
+You will find the YAML files in the `yaml` folder which will be
+created in the same directory.
+</p>
+
+<p>
+In the example we use Python pandas to read the spreadsheet into a
+tabular structure. Next we use a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/esr_samples/template.yaml">template.yaml</a> file that gets filled
+in by <code>esr_samples.py</code> so we get a metadata YAML file for each sample.
+</p>
+
+<p>
+Next run the earlier CLI uploader for each YAML and FASTA combination.
+It can't be much easier than this. For ESR we uploaded a batch of 600
+sequences this way. See <a href="http://covid19.genenetwork.org/resource/20VR0995">example</a>.
 </p>
 </div>
 </div>
 </div>
 </div>
 <div id="postamble" class="status">
-<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-10-27 Tue 06:43</small>.
+<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-11-05 Thu 07:27</small>.
 </div>
 </body>
 </html>
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index fb68251..f3ba073 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -7,10 +7,19 @@
 #+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
 #+OPTIONS: ^:nil
 
+* Introduction
+
+In this document we explain how to upload data into COVID-19 PubSeq.
+This can happen through a web page, or through a command line
+script. We'll also show how to parametrize uploads by using templates.
+The procedure is much easier than with other repositories and can be
+fully automated. Once uploaded you can use our export API to prepare
+for other repositories.
 
 
 * Table of Contents                                                     :TOC:noexport:
- - [[#uploading-data][Uploading Data]]
+ - [[#introduction][Introduction]]
+ - [[#uploading-data][Uploading data]]
  - [[#step-1-upload-sequence][Step 1: Upload sequence]]
  - [[#step-2-add-metadata][Step 2: Add metadata]]
    - [[#obligatory-fields][Obligatory fields]]
@@ -23,7 +32,7 @@
    - [[#example-uploading-bulk-genbank-sequences][Example: uploading bulk GenBank sequences]]
    - [[#example-preparing-metadata][Example: preparing metadata]]
 
-* Uploading Data
+* Uploading data
 
 The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
 public resource for global comparisons. A recompute of the pangenome
@@ -165,55 +174,90 @@ file an associated metadata in [[https://github.com/arvados/bh20-seq-resource/bl
 the web form and gets validated from the same [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]] looks. The YAML
 that you need to create/generate for your samples looks like
 
+A minimal example of metadata looks like
+
+#+begin_src json
+  id: placeholder
+
+  license:
+      license_type: http://creativecommons.org/licenses/by/4.0/
+
+  host:
+      host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+
+  sample:
+      sample_id: XX
+      collection_date: "2020-01-01"
+      collection_location: http://www.wikidata.org/entity/Q148
+
+  virus:
+      virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+
+  technology:
+      sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0008632]
+
+  submitter:
+      authors: [John Doe]
+#+end_src
+
+a more elaborate example (note most fields are optional) may look like
+
 #+begin_src json
-id: placeholder
-
-host:
-    host_id: XX1
-    host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
-    host_sex: http://purl.obolibrary.org/obo/PATO_0000384
-    host_age: 20
-    host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
-    host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
-    host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
-    host_vaccination: [vaccines1,vaccine2]
-    ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
-    additional_host_information: Optional free text field for additional information
-
-sample:
-    sample_id: Id of the sample as defined by the submitter
-    collector_name: Name of the person that took the sample
-    collecting_institution: Institute that was responsible of sampling
-    specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
-    collection_date: "2020-01-01"
-    collection_location: http://www.wikidata.org/entity/Q148
-    sample_storage_conditions: frozen specimen
-    source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
-    additional_collection_information: Optional free text field for additional information
-
-virus:
-    virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
-    virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
-
-technology:
-    sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
-    sequence_assembly_method: Protocol used for assembly
-    sequencing_coverage: [70.0, 100.0]
-    additional_technology_information: Optional free text field for additional information
-
-submitter:
-    authors: [John Doe, Joe Boe, Jonny Oe]
-    submitter_name: [John Doe]
-    submitter_address: John Doe's address
-    originating_lab: John Doe kitchen
-    lab_address: John Doe's address
-    provider_sample_id: XXX1
-    submitter_sample_id: XXX2
-    publication: PMID00001113
-    submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
-    additional_submitter_information: Optional free text field for additional information
+  id: placeholder
+
+  host:
+      host_id: XX1
+      host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+      host_sex: http://purl.obolibrary.org/obo/PATO_0000384
+      host_age: 20
+      host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
+      host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
+      host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+      host_vaccination: [vaccines1,vaccine2]
+      ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
+      additional_host_information: Optional free text field for additional information
+
+  sample:
+      sample_id: Id of the sample as defined by the submitter
+      collector_name: Name of the person that took the sample
+      collecting_institution: Institute that was responsible of sampling
+      specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
+      collection_date: "2020-01-01"
+      collection_location: http://www.wikidata.org/entity/Q148
+      sample_storage_conditions: frozen specimen
+      source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
+      additional_collection_information: Optional free text field for additional information
+
+  virus:
+      virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+      virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
+
+  technology:
+      sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
+      sequence_assembly_method: Protocol used for assembly
+      sequencing_coverage: [70.0, 100.0]
+      additional_technology_information: Optional free text field for additional information
+
+  submitter:
+      authors: [John Doe, Joe Boe, Jonny Oe]
+      submitter_name: [John Doe]
+      submitter_address: John Doe's address
+      originating_lab: John Doe kitchen
+      lab_address: John Doe's address
+      provider_sample_id: XXX1
+      submitter_sample_id: XXX2
+      publication: PMID00001113
+      submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
+      additional_submitter_information: Optional free text field for additional information
 #+end_src
 
+more metadata is yummy when stored in RDF. [[https://yummydata.org/][Yummydata]] is useful to a wider community. Note
+that many of the terms in above example are URIs, such as
+host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606.  We use
+web ontologies for these to make the data less ambiguous and more
+FAIR. Check out the option fields as defined in the schema. If it is not listed
+a little bit of web searching may be required or [[./contact][contact]] us.
+
 ** Run the uploader (CLI)
 
 Installing with pip you should be
@@ -221,7 +265,6 @@ able to run
 
 : bh20sequploader sequence.fasta metadata.yaml
 
-
 Alternatively the script can be installed from [[https://github.com/arvados/bh20-seq-resource#installation][github]]. Run on the
 command line
 
@@ -274,13 +317,23 @@ done
 
 ** Example: preparing metadata
 
-Usually, metadata are available in tabular format, like spreadsheets. As an example, we provide a script
-[[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples][esr_samples.py]] to show you how to parse
-your metadata in YAML files ready for the upload. To execute the script, go in the ~bh20-seq-resource/scripts/esr_samples
-and execute
+Usually, metadata are available in a tabular format, such as
+spreadsheets. As an example, we provide a script [[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples][esr_samples.py]] to
+show you how to parse your metadata in YAML files ready for the
+upload. To execute the script, go in the
+~bh20-seq-resource/scripts/esr_samples and execute
 
 #+BEGIN_SRC sh
 python3 esr_samples.py
 #+END_SRC
 
-You will find the YAML files in the `yaml` folder which will be created in the same directory.
+You will find the YAML files in the `yaml` folder which will be
+created in the same directory.
+
+In the example we use Python pandas to read the spreadsheet into a
+tabular structure. Next we use a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/esr_samples/template.yaml][template.yaml]] file that gets filled
+in by ~esr_samples.py~ so we get a metadata YAML file for each sample.
+
+Next run the earlier CLI uploader for each YAML and FASTA combination.
+It can't be much easier than this. For ESR we uploaded a batch of 600
+sequences this way writing a few lines of Python [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/esr_samples/esr_samples.py][code]]. See [[http://covid19.genenetwork.org/resource/20VR0995][example]].
author	Pjotr Prins	2020-11-06 07:45:10 +0000
committer	Pjotr Prins	2020-11-06 07:45:10 +0000
commit	fbbec51e604964d18ab72cbf0ac24b102ecc0376 (patch)
tree	4e1a231f25cf2cd3eb4c5bfee3b50d4618a5c59f
parent	e921e0e061e6226f185e48e378ce083c149600b4 (diff)
download	bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.tar.gz bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.tar.lz bh20-seq-resource-fbbec51e604964d18ab72cbf0ac24b102ecc0376.zip