Diffstat (limited to 'doc/blog')
-rw-r--r-- doc/blog/using-covid-19-pubseq-part3.html | 164
-rw-r--r-- doc/blog/using-covid-19-pubseq-part3.org | 13
-rw-r--r-- doc/blog/using-covid-19-pubseq-part5.html | 130
-rw-r--r-- doc/blog/using-covid-19-pubseq-part5.org | 62
4 files changed, 255 insertions, 114 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index df4a286..718b10f 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -3,7 +3,7 @@
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
-<!-- 2020-05-30 Sat 18:12 -->
+<!-- 2020-08-24 Mon 04:31 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>COVID-19 PubSeq Uploading Data (part 3)</title>
@@ -248,40 +248,40 @@ for the JavaScript code in this tag.
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
-<li><a href="#org193669a">1. Uploading Data</a></li>
-<li><a href="#orgc6b3a47">2. Step 1: Upload sequence</a></li>
-<li><a href="#org9c08714">3. Step 2: Add metadata</a>
+<li><a href="#orgdaec996">1. Uploading Data</a></li>
+<li><a href="#org8472a05">2. Step 1: Upload sequence</a></li>
+<li><a href="#org668a46d">3. Step 2: Add metadata</a>
<ul>
-<li><a href="#org4c2e907">3.1. Obligatory fields</a>
+<li><a href="#orga044bef">3.1. Obligatory fields</a>
<ul>
-<li><a href="#orgdddcb2e">3.1.1. Sample ID (sample<sub>id</sub>)</a></li>
-<li><a href="#orge9c2e76">3.1.2. Collection date</a></li>
-<li><a href="#org62c55ce">3.1.3. Collection location</a></li>
-<li><a href="#org460b377">3.1.4. Sequencing technology</a></li>
-<li><a href="#org77b1e14">3.1.5. Authors</a></li>
+<li><a href="#org8e17492">3.1.1. Sample ID (sample<sub>id</sub>)</a></li>
+<li><a href="#orgd9805db">3.1.2. Collection date</a></li>
+<li><a href="#org3bd4901">3.1.3. Collection location</a></li>
+<li><a href="#org921de27">3.1.4. Sequencing technology</a></li>
+<li><a href="#org39fa678">3.1.5. Authors</a></li>
</ul>
</li>
-<li><a href="#org3cb346f">3.2. Optional fields</a>
+<li><a href="#org5315804">3.2. Optional fields</a>
<ul>
-<li><a href="#orgb0cffbb">3.2.1. Host information</a></li>
-<li><a href="#orgd2a43a6">3.2.2. Collecting institution</a></li>
-<li><a href="#org8d5bcf7">3.2.3. Specimen source</a></li>
-<li><a href="#org86b21b2">3.2.4. Source database accession</a></li>
-<li><a href="#org771ea66">3.2.5. Strain name</a></li>
+<li><a href="#orgf2b82d9">3.2.1. Host information</a></li>
+<li><a href="#org8986ca7">3.2.2. Collecting institution</a></li>
+<li><a href="#orge03eb0c">3.2.3. Specimen source</a></li>
+<li><a href="#org6815a6e">3.2.4. Source database accession</a></li>
+<li><a href="#org51b37e8">3.2.5. Strain name</a></li>
</ul>
</li>
</ul>
</li>
-<li><a href="#org7d281f5">4. Step 3: Submit to COVID-19 PubSeq</a>
+<li><a href="#org5778da6">4. Step 3: Submit to COVID-19 PubSeq</a>
<ul>
-<li><a href="#orgdf0f02d">4.1. Trouble shooting</a></li>
+<li><a href="#orge803d65">4.1. Trouble shooting</a></li>
</ul>
</li>
-<li><a href="#org29f8a92">5. Step 4: Check output</a></li>
-<li><a href="#orgf493854">6. Bulk sequence uploader</a>
+<li><a href="#org540cfdf">5. Step 4: Check output</a></li>
+<li><a href="#org6c43ab3">6. Bulk sequence uploader</a>
<ul>
-<li><a href="#org37fadbc">6.1. Run the uploader (CLI)</a></li>
-<li><a href="#org39adf09">6.2. Example: uploading bulk GenBank sequences</a></li>
+<li><a href="#org99bb8b7">6.1. Run the uploader (CLI)</a></li>
+<li><a href="#orga88593f">6.2. Example: uploading bulk GenBank sequences</a></li>
</ul>
</li>
</ul>
@@ -290,8 +290,8 @@ for the JavaScript code in this tag.
-<div id="outline-container-org193669a" class="outline-2">
-<h2 id="org193669a"><span class="section-number-2">1</span> Uploading Data</h2>
+<div id="outline-container-orgdaec996" class="outline-2">
+<h2 id="orgdaec996"><span class="section-number-2">1</span> Uploading Data</h2>
<div class="outline-text-2" id="text-1">
<p>
The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
@@ -301,8 +301,8 @@ gets triggered on upload. Read the <a href="./about">ABOUT</a> page for more inf
</div>
</div>
-<div id="outline-container-orgc6b3a47" class="outline-2">
-<h2 id="orgc6b3a47"><span class="section-number-2">2</span> Step 1: Upload sequence</h2>
+<div id="outline-container-org8472a05" class="outline-2">
+<h2 id="org8472a05"><span class="section-number-2">2</span> Step 1: Upload sequence</h2>
<div class="outline-text-2" id="text-2">
<p>
To upload a sequence in the <a href="http://covid19.genenetwork.org/">web upload page</a> hit the browse button and
@@ -330,8 +330,8 @@ an improved pangenome.
</div>
</div>
-<div id="outline-container-org9c08714" class="outline-2">
-<h2 id="org9c08714"><span class="section-number-2">3</span> Step 2: Add metadata</h2>
+<div id="outline-container-org668a46d" class="outline-2">
+<h2 id="org668a46d"><span class="section-number-2">3</span> Step 2: Add metadata</h2>
<div class="outline-text-2" id="text-3">
<p>
The <a href="./">web upload page</a> contains fields for adding metadata. Metadata is
@@ -357,12 +357,12 @@ the web form. Here we add some extra information.
</p>
</div>
-<div id="outline-container-org4c2e907" class="outline-3">
-<h3 id="org4c2e907"><span class="section-number-3">3.1</span> Obligatory fields</h3>
+<div id="outline-container-orga044bef" class="outline-3">
+<h3 id="orga044bef"><span class="section-number-3">3.1</span> Obligatory fields</h3>
<div class="outline-text-3" id="text-3-1">
</div>
-<div id="outline-container-orgdddcb2e" class="outline-4">
-<h4 id="orgdddcb2e"><span class="section-number-4">3.1.1</span> Sample ID (sample<sub>id</sub>)</h4>
+<div id="outline-container-org8e17492" class="outline-4">
+<h4 id="org8e17492"><span class="section-number-4">3.1.1</span> Sample ID (sample<sub>id</sub>)</h4>
<div class="outline-text-4" id="text-3-1-1">
<p>
This is a string field that defines a unique sample identifier by the
@@ -380,8 +380,8 @@ Here we add the GenBank ID MT536190.1.
</div>
</div>
-<div id="outline-container-orge9c2e76" class="outline-4">
-<h4 id="orge9c2e76"><span class="section-number-4">3.1.2</span> Collection date</h4>
+<div id="outline-container-orgd9805db" class="outline-4">
+<h4 id="orgd9805db"><span class="section-number-4">3.1.2</span> Collection date</h4>
<div class="outline-text-4" id="text-3-1-2">
<p>
Estimated collection date. The GenBank page says April 6, 2020.
@@ -389,8 +389,8 @@ Estimated collection date. The GenBank page says April 6, 2020.
</div>
</div>
-<div id="outline-container-org62c55ce" class="outline-4">
-<h4 id="org62c55ce"><span class="section-number-4">3.1.3</span> Collection location</h4>
+<div id="outline-container-org3bd4901" class="outline-4">
+<h4 id="org3bd4901"><span class="section-number-4">3.1.3</span> Collection location</h4>
<div class="outline-text-4" id="text-3-1-3">
<p>
A search on wikidata says Los Angeles is
@@ -399,8 +399,8 @@ A search on wikidata says Los Angeles is
</div>
</div>
-<div id="outline-container-org460b377" class="outline-4">
-<h4 id="org460b377"><span class="section-number-4">3.1.4</span> Sequencing technology</h4>
+<div id="outline-container-org921de27" class="outline-4">
+<h4 id="org921de27"><span class="section-number-4">3.1.4</span> Sequencing technology</h4>
<div class="outline-text-4" id="text-3-1-4">
<p>
GenBank entry says Illumina, so we can fill that in
@@ -408,8 +408,8 @@ GenBank entry says Illumina, so we can fill that in
</div>
</div>
-<div id="outline-container-org77b1e14" class="outline-4">
-<h4 id="org77b1e14"><span class="section-number-4">3.1.5</span> Authors</h4>
+<div id="outline-container-org39fa678" class="outline-4">
+<h4 id="org39fa678"><span class="section-number-4">3.1.5</span> Authors</h4>
<div class="outline-text-4" id="text-3-1-5">
<p>
GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
@@ -420,16 +420,16 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.
</div>
</div>
-<div id="outline-container-org3cb346f" class="outline-3">
-<h3 id="org3cb346f"><span class="section-number-3">3.2</span> Optional fields</h3>
+<div id="outline-container-org5315804" class="outline-3">
+<h3 id="org5315804"><span class="section-number-3">3.2</span> Optional fields</h3>
<div class="outline-text-3" id="text-3-2">
<p>
All other fields are optional. But let's see what we can add.
</p>
</div>
-<div id="outline-container-orgb0cffbb" class="outline-4">
-<h4 id="orgb0cffbb"><span class="section-number-4">3.2.1</span> Host information</h4>
+<div id="outline-container-orgf2b82d9" class="outline-4">
+<h4 id="orgf2b82d9"><span class="section-number-4">3.2.1</span> Host information</h4>
<div class="outline-text-4" id="text-3-2-1">
<p>
Sadly, not much is known about the host from GenBank. A little
@@ -443,8 +443,8 @@ did to the person and what the person was like (say age group).
</div>
</div>
-<div id="outline-container-orgd2a43a6" class="outline-4">
-<h4 id="orgd2a43a6"><span class="section-number-4">3.2.2</span> Collecting institution</h4>
+<div id="outline-container-org8986ca7" class="outline-4">
+<h4 id="org8986ca7"><span class="section-number-4">3.2.2</span> Collecting institution</h4>
<div class="outline-text-4" id="text-3-2-2">
<p>
We can fill that in.
@@ -452,8 +452,8 @@ We can fill that in.
</div>
</div>
-<div id="outline-container-org8d5bcf7" class="outline-4">
-<h4 id="org8d5bcf7"><span class="section-number-4">3.2.3</span> Specimen source</h4>
+<div id="outline-container-orge03eb0c" class="outline-4">
+<h4 id="orge03eb0c"><span class="section-number-4">3.2.3</span> Specimen source</h4>
<div class="outline-text-4" id="text-3-2-3">
<p>
We have that: nasopharyngeal swab
@@ -461,8 +461,8 @@ We have that: nasopharyngeal swab
</div>
</div>
-<div id="outline-container-org86b21b2" class="outline-4">
-<h4 id="org86b21b2"><span class="section-number-4">3.2.4</span> Source database accession</h4>
+<div id="outline-container-org6815a6e" class="outline-4">
+<h4 id="org6815a6e"><span class="section-number-4">3.2.4</span> Source database accession</h4>
<div class="outline-text-4" id="text-3-2-4">
<p>
Genbank which is <a href="http://identifiers.org/insdc/MT536190.1#sequence">http://identifiers.org/insdc/MT536190.1#sequence</a>.
@@ -471,8 +471,8 @@ Note we plug in our own identifier MT536190.1.
</div>
</div>
-<div id="outline-container-org771ea66" class="outline-4">
-<h4 id="org771ea66"><span class="section-number-4">3.2.5</span> Strain name</h4>
+<div id="outline-container-org51b37e8" class="outline-4">
+<h4 id="org51b37e8"><span class="section-number-4">3.2.5</span> Strain name</h4>
<div class="outline-text-4" id="text-3-2-5">
<p>
SARS-CoV-2/human/USA/LA-BIE-070/2020
@@ -482,8 +482,8 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020
</div>
</div>
-<div id="outline-container-org7d281f5" class="outline-2">
-<h2 id="org7d281f5"><span class="section-number-2">4</span> Step 3: Submit to COVID-19 PubSeq</h2>
+<div id="outline-container-org5778da6" class="outline-2">
+<h2 id="org5778da6"><span class="section-number-2">4</span> Step 3: Submit to COVID-19 PubSeq</h2>
<div class="outline-text-2" id="text-4">
<p>
Once you have the sequence and the metadata together, hit
@@ -493,8 +493,8 @@ submitted and the workflows should kick in!
</div>
-<div id="outline-container-orgdf0f02d" class="outline-3">
-<h3 id="orgdf0f02d"><span class="section-number-3">4.1</span> Trouble shooting</h3>
+<div id="outline-container-orge803d65" class="outline-3">
+<h3 id="orge803d65"><span class="section-number-3">4.1</span> Trouble shooting</h3>
<div class="outline-text-3" id="text-4-1">
<p>
We got an error saying: {"stem": "<a href="http://www.wikidata.org/entity/">http://www.wikidata.org/entity/</a>",&#x2026;
@@ -508,9 +508,8 @@ submit button.
</div>
</div>
-
-<div id="outline-container-org29f8a92" class="outline-2">
-<h2 id="org29f8a92"><span class="section-number-2">5</span> Step 4: Check output</h2>
+<div id="outline-container-org540cfdf" class="outline-2">
+<h2 id="org540cfdf"><span class="section-number-2">5</span> Step 4: Check output</h2>
<div class="outline-text-2" id="text-5">
<p>
The current pipeline takes 5.5 hours to complete! Once it completes
@@ -521,8 +520,8 @@ in.
</div>
</div>
-<div id="outline-container-orgf493854" class="outline-2">
-<h2 id="orgf493854"><span class="section-number-2">6</span> Bulk sequence uploader</h2>
+<div id="outline-container-org6c43ab3" class="outline-2">
+<h2 id="org6c43ab3"><span class="section-number-2">6</span> Bulk sequence uploader</h2>
<div class="outline-text-2" id="text-6">
<p>
Above steps require a manual upload of one sequence with metadata.
@@ -544,8 +543,8 @@ host:
host_age: <span style="color: #8bc34a;">20</span>
host_age_unit: http://purl.obolibrary.org/obo/UO_<span style="color: #8bc34a;">0000036</span>
host_health_status: http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">25269</span>
- host_treatment: Process in which the act is intended to modify or alter host status <span style="color: #e91e63;">(</span>Compounds<span style="color: #e91e63;">)</span>
- host_vaccination: <span style="color: #e91e63;">[</span>vaccines<span style="color: #8bc34a;">1</span>,vaccine<span style="color: #8bc34a;">2</span><span style="color: #e91e63;">]</span>
+ host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+ host_vaccination: [vaccines<span style="color: #8bc34a;">1</span>,vaccine<span style="color: #8bc34a;">2</span>]
ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_<span style="color: #8bc34a;">0010</span>
additional_host_information: Optional free text field for additional information
@@ -553,11 +552,11 @@ sample:
sample_id: Id of the sample as defined by the submitter
collector_name: Name of the person that took the sample
collecting_institution: Institute that was responsible of sampling
- specimen_source: <span style="color: #e91e63;">[</span>http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155831</span>,http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155835</span>]
+ specimen_source: [http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155831</span>,http://purl.obolibrary.org/obo/NCIT_C<span style="color: #8bc34a;">155835</span>]
collection_date: <span style="color: #9ccc65;">"2020-01-01"</span>
collection_location: http://www.wikidata.org/entity/Q<span style="color: #8bc34a;">148</span>
sample_storage_conditions: frozen specimen
- source_database_accession: <span style="color: #2196F3;">[</span>http://identifiers.org/insdc/LC<span style="color: #8bc34a;">522350.1</span>#sequence]
+ source_database_accession: [http://identifiers.org/insdc/LC<span style="color: #8bc34a;">522350.1</span>#sequence]
additional_collection_information: Optional free text field for additional information
virus:
@@ -565,28 +564,28 @@ virus:
virus_strain: SARS-CoV-<span style="color: #8bc34a;">2</span>/human/CHN/HS_<span style="color: #8bc34a;">8</span>/<span style="color: #8bc34a;">2020</span>
technology:
- sample_sequencing_technology: <span style="color: #EF6C00;">[</span>http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>,http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>]
+ sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>,http://www.ebi.ac.uk/efo/EFO_<span style="color: #8bc34a;">0009173</span>]
sequence_assembly_method: Protocol used for assembly
- sequencing_coverage: <span style="color: #B388FF;">[</span><span style="color: #8bc34a;">70.0</span>, <span style="color: #8bc34a;">100.0</span><span style="color: #B388FF;">]</span>
+ sequencing_coverage: [<span style="color: #8bc34a;">70.0</span>, <span style="color: #8bc34a;">100.0</span>]
additional_technology_information: Optional free text field for additional information
submitter:
- authors: <span style="color: #B388FF;">[</span>John Doe, Joe Boe, Jonny Oe<span style="color: #B388FF;">]</span>
- submitter_name: <span style="color: #B388FF;">[</span>John Doe<span style="color: #B388FF;">]</span>
+ authors: [John Doe, Joe Boe, Jonny Oe]
+ submitter_name: [John Doe]
submitter_address: John Doe's address
originating_lab: John Doe kitchen
lab_address: John Doe's address
provider_sample_id: XXX<span style="color: #8bc34a;">1</span>
submitter_sample_id: XXX<span style="color: #8bc34a;">2</span>
publication: PMID<span style="color: #8bc34a;">00001113</span>
- submitter_orcid: <span style="color: #B388FF;">[</span>https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>,https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0001</span>]
+ submitter_orcid: [https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>,https://orcid.org/<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0000</span>-<span style="color: #8bc34a;">0001</span>]
additional_submitter_information: Optional free text field for additional information
</pre>
</div>
</div>
-<div id="outline-container-org37fadbc" class="outline-3">
-<h3 id="org37fadbc"><span class="section-number-3">6.1</span> Run the uploader (CLI)</h3>
+<div id="outline-container-org99bb8b7" class="outline-3">
+<h3 id="org99bb8b7"><span class="section-number-3">6.1</span> Run the uploader (CLI)</h3>
<div class="outline-text-3" id="text-6-1">
<p>
Installing with pip you should be
@@ -621,20 +620,35 @@ The web interface using this exact same script so it should just work
</div>
</div>
-<div id="outline-container-org39adf09" class="outline-3">
-<h3 id="org39adf09"><span class="section-number-3">6.2</span> Example: uploading bulk GenBank sequences</h3>
+<div id="outline-container-orga88593f" class="outline-3">
+<h3 id="orga88593f"><span class="section-number-3">6.2</span> Example: uploading bulk GenBank sequences</h3>
<div class="outline-text-3" id="text-6-2">
<p>
We also use above script to bulk upload GenBank sequences with a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py">FASTA
and YAML</a> extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
</p>
+
+<p>
+The steps are as follows. From the
+<code>bh20-seq-resource/scripts/download_genbank_data/</code> directory, run:
+</p>
+
+<div class="org-src-container">
+<pre class="src src-sh">python3 from_genbank_to_fasta_and_yaml.py
+<span style="color: #ffcc80;">dir_fasta_and_yaml</span>=~/bh20-seq-resource/scripts/download_genbank_data/fasta_and_yaml
+ls $<span style="color: #ffcc80;">dir_fasta_and_yaml</span>/*.yaml | <span style="color: #fff59d;">while </span><span style="color: #ff8A65;">read</span> path_code_yaml; <span style="color: #fff59d;">do</span>
+ <span style="color: #ffcc80;">path_code_fasta</span>=${<span style="color: #ffcc80;">path_code_yaml</span>%.*}.fasta
+ bh20-seq-uploader --skip-qc $<span style="color: #ffcc80;">path_code_yaml</span> $<span style="color: #ffcc80;">path_code_fasta</span>
+<span style="color: #fff59d;">done</span>
+</pre>
+</div>
</div>
</div>
</div>
</div>
<div id="postamble" class="status">
-<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-05-30 Sat 18:12</small>.
+<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-08-24 Mon 04:31</small>.
</div>
</body>
</html>
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index e8fee36..fda7be8 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -146,7 +146,6 @@ instead on https and entity instead of wiki) the submission went
through. Reload the page (it won't empty the fields) to re-enable the
submit button.
-
* Step 4: Check output
The current pipeline takes 5.5 hours to complete! Once it completes
@@ -237,3 +236,15 @@ The web interface using this exact same script so it should just work
We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py][FASTA
and YAML]] extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
+
+The steps are as follows. From the
+~bh20-seq-resource/scripts/download_genbank_data/~ directory, run:
+
+#+BEGIN_SRC sh
+python3 from_genbank_to_fasta_and_yaml.py
+dir_fasta_and_yaml=~/bh20-seq-resource/scripts/download_genbank_data/fasta_and_yaml
+ls $dir_fasta_and_yaml/*.yaml | while read path_code_yaml; do
+ path_code_fasta=${path_code_yaml%.*}.fasta
+ bh20-seq-uploader --skip-qc $path_code_yaml $path_code_fasta
+done
+#+END_SRC
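Note on the loop added above: it pairs each metadata YAML file with the FASTA file of the same base name and hands both to `bh20-seq-uploader --skip-qc`. The following is a minimal Python sketch of that pairing, not the project's own code. The helper names `fasta_for`, `upload_command` and `upload_commands` are hypothetical; only the command name, the `--skip-qc` flag and the YAML-then-FASTA argument order are taken from the diff.

```python
from pathlib import Path


def fasta_for(yaml_path):
    """Return the FASTA path that sits next to a metadata YAML file.

    Mirrors the shell expansion ${path_code_yaml%.*}.fasta: only the last
    extension is replaced, so an accession such as MT536190.1 keeps its
    version suffix.
    """
    return Path(yaml_path).with_suffix(".fasta")


def upload_command(yaml_path):
    """Build the argument list for one upload (YAML first, then FASTA)."""
    yaml_path = Path(yaml_path)
    return [
        "bh20-seq-uploader",
        "--skip-qc",
        str(yaml_path),
        str(fasta_for(yaml_path)),
    ]


def upload_commands(directory):
    """Yield one command per *.yaml file that has a matching FASTA file."""
    for yaml_path in sorted(Path(directory).glob("*.yaml")):
        if fasta_for(yaml_path).is_file():
            yield upload_command(yaml_path)
```

Each yielded list could be passed to `subprocess.run(cmd, check=True)`. Building the arguments as a list avoids the word splitting that unquoted shell variables are subject to when a path contains spaces.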
diff --git a/doc/blog/using-covid-19-pubseq-part5.html b/doc/blog/using-covid-19-pubseq-part5.html
index 4caa5ac..5d640f9 100644
--- a/doc/blog/using-covid-19-pubseq-part5.html
+++ b/doc/blog/using-covid-19-pubseq-part5.html
@@ -3,7 +3,7 @@
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
-<!-- 2020-07-17 Fri 05:03 -->
+<!-- 2020-08-22 Sat 07:43 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>COVID-19 PubSeq (part 4)</title>
@@ -248,19 +248,28 @@ for the JavaScript code in this tag.
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
-<li><a href="#org758b923">1. Modify Metadata</a></li>
-<li><a href="#orgec32c13">2. What is the schema?</a></li>
-<li><a href="#org2e487b2">3. How is the website generated?</a></li>
-<li><a href="#orge4dfe84">4. Modifying the schema</a></li>
-<li><a href="#org564a7a8">5. Adding fields to the form</a></li>
-<li><a href="#org633781a">6. <span class="todo TODO">TODO</span> Testing the license fields</a></li>
+<li><a href="#org935d151">1. Modify Metadata</a></li>
+<li><a href="#orgfb70872">2. What is the schema?</a></li>
+<li><a href="#orga76b489">3. How is the website generated?</a></li>
+<li><a href="#org80bb905">4. Changing the license field</a>
+<ul>
+<li><a href="#org3689b60">4.1. Modifying the schema</a></li>
+<li><a href="#org07e0c66">4.2. Adding fields to the form</a></li>
+<li><a href="#org1cfb94a">4.3. <span class="todo TODO">TODO</span> Testing the license fields</a></li>
+</ul>
+</li>
+<li><a href="#org88d4555">5. Changing GEO or location field</a>
+<ul>
+<li><a href="#org063bcfa">5.1. Relaxing the shex constraint</a></li>
+</ul>
+</li>
</ul>
</div>
</div>
-<div id="outline-container-org758b923" class="outline-2">
-<h2 id="org758b923"><span class="section-number-2">1</span> Modify Metadata</h2>
+<div id="outline-container-org935d151" class="outline-2">
+<h2 id="org935d151"><span class="section-number-2">1</span> Modify Metadata</h2>
<div class="outline-text-2" id="text-1">
<p>
The public sequence resource uses multiple data formats listed on the
@@ -268,8 +277,8 @@ The public sequence resource uses multiple data formats listed on the
for RDF and semantic web/linked data ontologies. This technology
allows for querying data in unprescribed ways - that is, you can
formulate your own queries without dealing with a preset model of that
-data (so typical of CSV files and SQL tables). Examples of exploring
-data are listed <a href="http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1">here</a>.
+data (which is how one has to approach CSV files and SQL
+tables). Examples of exploring data are listed <a href="http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1">here</a>.
</p>
<p>
@@ -280,8 +289,8 @@ understand that anyone, including you, can change that information!
</div>
</div>
-<div id="outline-container-orgec32c13" class="outline-2">
-<h2 id="orgec32c13"><span class="section-number-2">2</span> What is the schema?</h2>
+<div id="outline-container-orgfb70872" class="outline-2">
+<h2 id="orgfb70872"><span class="section-number-2">2</span> What is the schema?</h2>
<div class="outline-text-2" id="text-2">
<p>
The default metadata schema is listed <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml">here</a>.
@@ -289,8 +298,8 @@ The default metadata schema is listed <a href="https://github.com/arvados/bh20-s
</div>
</div>
-<div id="outline-container-org2e487b2" class="outline-2">
-<h2 id="org2e487b2"><span class="section-number-2">3</span> How is the website generated?</h2>
+<div id="outline-container-orga76b489" class="outline-2">
+<h2 id="orga76b489"><span class="section-number-2">3</span> How is the website generated?</h2>
<div class="outline-text-2" id="text-3">
<p>
Using the schema we use <a href="https://pypi.org/project/PyShEx/">pyshex</a> shex expressions and <a href="https://github.com/common-workflow-language/schema_salad">schema salad</a> to
@@ -300,9 +309,13 @@ All from that one metadata schema.
</div>
</div>
-<div id="outline-container-orge4dfe84" class="outline-2">
-<h2 id="orge4dfe84"><span class="section-number-2">4</span> Modifying the schema</h2>
+<div id="outline-container-org80bb905" class="outline-2">
+<h2 id="org80bb905"><span class="section-number-2">4</span> Changing the license field</h2>
<div class="outline-text-2" id="text-4">
+</div>
+<div id="outline-container-org3689b60" class="outline-3">
+<h3 id="org3689b60"><span class="section-number-3">4.1</span> Modifying the schema</h3>
+<div class="outline-text-3" id="text-4-1">
<p>
One of the first things we want to do is to add a field for the data
license. Initially we only supported CC-4.0 as a license, but
@@ -380,25 +393,25 @@ So, we'll add it simply as a title field. Now the draft schema is
type: record
fields:
license_type:
- doc: License types as refined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
+ doc: License types as refined <span style="color: #fff59d;">in</span> https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf</span>
type: string?
jsonldPredicate:
- _id: https://creativecommons.org/ns#License
+ _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#License</span>
title:
doc: Attribution title related to license
type: string?
jsonldPredicate:
- _id: http://semanticscience.org/resource/SIO_001167
+ _id: http:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">semanticscience.org/resource/SIO_001167</span>
attribution_url:
doc: Attribution URL related to license
type: string?
jsonldPredicate:
- _id: https://creativecommons.org/ns#Work
+ _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#Work</span>
attribution_source:
doc: Attribution source URL
type: string?
jsonldPredicate:
- _id: https://creativecommons.org/ns#Work
+ _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#Work</span>
</pre>
</div>
@@ -411,13 +424,13 @@ gitter channel and I merged it.
</div>
</div>
-<div id="outline-container-org564a7a8" class="outline-2">
-<h2 id="org564a7a8"><span class="section-number-2">5</span> Adding fields to the form</h2>
-<div class="outline-text-2" id="text-5">
+<div id="outline-container-org07e0c66" class="outline-3">
+<h3 id="org07e0c66"><span class="section-number-3">4.2</span> Adding fields to the form</h3>
+<div class="outline-text-3" id="text-4-2">
<p>
To add the new fields to the form we have to modify it a little. If we
go to the upload form we need to add the license box. The schema is
-loaded in <a href="https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229">main.py</a> in the 'generate<sub>form</sub>' function.
+loaded in <a href="https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229">main.py</a> in the 'generate-form' function.
</p>
<p>
@@ -453,12 +466,71 @@ field to be optional - a missing license assumes it is CC-BY-4.0.
</div>
</div>
-<div id="outline-container-org633781a" class="outline-2">
-<h2 id="org633781a"><span class="section-number-2">6</span> <span class="todo TODO">TODO</span> Testing the license fields</h2>
+<div id="outline-container-org1cfb94a" class="outline-3">
+<h3 id="org1cfb94a"><span class="section-number-3">4.3</span> <span class="todo TODO">TODO</span> Testing the license fields</h3>
+</div>
+</div>
+
+<div id="outline-container-org88d4555" class="outline-2">
+<h2 id="org88d4555"><span class="section-number-2">5</span> Changing GEO or location field</h2>
+<div class="outline-text-2" id="text-5">
+<p>
+When fetching information from GenBank and EBI/ENA we also translate
+the location into an unambiguous identifier. We opted for the wikidata
+tag. E.g. for New York city it is <a href="https://www.wikidata.org/wiki/Q60">https://www.wikidata.org/wiki/Q60</a>
+and for New York state it is <a href="https://www.wikidata.org/wiki/Q1384">https://www.wikidata.org/wiki/Q1384</a>. If
+everyone uses these metadata URIs it is easy to group when making
+queries. Note that we should be using
+<a href="http://www.wikidata.org/entity/Q60">http://www.wikidata.org/entity/Q60</a> in the dataset (http instead of
+https and entity instead of wiki).
+</p>
+
+<p>
+Unfortunately the main repositories of SARS-CoV-2 have variable
+strings of text for location and/or GPS coordinates. For us to support
+our schema we had to translate all options and this proves expensive.
+</p>
+</div>
+
+<div id="outline-container-org063bcfa" class="outline-3">
+<h3 id="org063bcfa"><span class="section-number-3">5.1</span> Relaxing the shex constraint</h3>
+<div class="outline-text-3" id="text-5-1">
+<p>
+So we decide to relax the enforcement of this type of metadata and to
+allow for a free form string.
+</p>
+
+<p>
+The schema already used <a href="http://purl.obolibrary.org/obo/GAZ_00000448">http://purl.obolibrary.org/obo/GAZ_00000448</a>
+which states:
+</p>
+
+<div class="org-src-container">
+<pre class="src src-js">Class: geographic
+ location
+ Term IRI: http:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">purl.obolibrary.org/obo/GAZ_00000448</span>
+Definition: A reference to a place on
+ the Earth, by its name or by its geographical location.
+</pre>
+</div>
+
+<p>
+and when you check count by location in the <a href="./demo">DEMO</a> it lists a free
+format.
+</p>
+
+<p>
+So, why does the validation step balk when importing GenBank?
+The problem was in the <a href="https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39">shex check</a> for RDF generation.
+Removing the wikidata requirement relaxed the imports with this
+<a href="https://github.com/arvados/bh20-seq-resource/commit/f776816ee2b1af7ccc84afb494f68a81a51f5a76">patch</a>.
+</p>
+</div>
+</div>
</div>
</div>
<div id="postamble" class="status">
-<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-07-16 Thu 03:27</small>.
+<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-08-22 Sat 07:42</small>.
</div>
</body>
</html>
diff --git a/doc/blog/using-covid-19-pubseq-part5.org b/doc/blog/using-covid-19-pubseq-part5.org
index 78eea66..e260078 100644
--- a/doc/blog/using-covid-19-pubseq-part5.org
+++ b/doc/blog/using-covid-19-pubseq-part5.org
@@ -12,9 +12,12 @@
- [[#modify-metadata][Modify Metadata]]
- [[#what-is-the-schema][What is the schema?]]
- [[#how-is-the-website-generated][How is the website generated?]]
- - [[#modifying-the-schema][Modifying the schema]]
- - [[#adding-fields-to-the-form][Adding fields to the form]]
- - [[#testing-the-license-fields][Testing the license fields]]
+ - [[#changing-the-license-field][Changing the license field]]
+ - [[#modifying-the-schema][Modifying the schema]]
+ - [[#adding-fields-to-the-form][Adding fields to the form]]
+ - [[#testing-the-license-fields][Testing the license fields]]
+ - [[#changing-geo-or-location-field][Changing GEO or location field]]
+ - [[#relaxing-the-shex-constraint][Relaxing the shex constraint]]
* Modify Metadata
@@ -23,8 +26,8 @@ The public sequence resource uses multiple data formats listed on the
for RDF and semantic web/linked data ontologies. This technology
allows for querying data in unprescribed ways - that is, you can
formulate your own queries without dealing with a preset model of that
-data (so typical of CSV files and SQL tables). Examples of exploring
-data are listed [[http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1][here]].
+data (which is how one has to approach CSV files and SQL
+tables). Examples of exploring data are listed [[http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1][here]].
In this BLOG we are going to look at the metadata entered on the
COVID-19 PubSeq website (or command line client). It is important to
@@ -40,7 +43,9 @@ Using the schema we use [[https://pypi.org/project/PyShEx/][pyshex]] shex expres
generate the [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47][input form]], [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13][validate]] the user input and to build [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24][RDF]]!
All from that one metadata schema.
-* Modifying the schema
+* Changing the license field
+
+** Modifying the schema
One of the first things we want to do is to add a field for the data
license. Initially we only supported CC-4.0 as a license, but
@@ -120,11 +125,11 @@ our source tree and ask for feedback before wiring it up in the data
entry form. The pull request was submitted [[https://github.com/arvados/bh20-seq-resource/pull/97][here]] and reviewed on the
gitter channel and I merged it.
-* Adding fields to the form
+** Adding fields to the form
To add the new fields to the form we have to modify it a little. If we
go to the upload form we need to add the license box. The schema is
-loaded in [[https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229][main.py]] in the 'generate_form' function.
+loaded in [[https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229][main.py]] in the 'generate-form' function.
With this [[https://github.com/arvados/bh20-seq-resource/commit/b9691c7deae30bd6422fb7b0681572b7b6f78ae3][patch]] the website adds the license input fields on the form.
@@ -148,4 +153,43 @@ When pushing the license info we discovered the workflow broke because
the existing data had no licensing info. So we changed the license
field to be optional - a missing license assumes it is CC-BY-4.0.
-* TODO Testing the license fields
+** TODO Testing the license fields
+
+* Changing GEO or location field
+
+When fetching information from GenBank and EBI/ENA we also translate
+the location into an unambiguous identifier. We opted for the wikidata
+tag. E.g. for New York City it is https://www.wikidata.org/wiki/Q60
+and for New York state it is https://www.wikidata.org/wiki/Q1384. If
+everyone uses these metadata URIs it is easy to group when making
+queries. Note that we should be using
+http://www.wikidata.org/entity/Q60 in the dataset (http instead of
+https and entity instead of wiki).
+
+Unfortunately the main repositories of SARS-CoV-2 record location as
+variable strings of text and/or GPS coordinates. To support our schema
+we had to translate every variant, and this proved expensive.
+
+** Relaxing the shex constraint
+
+So we decided to relax the enforcement of this type of metadata and to
+allow for a free-form string.
+
+The schema already used http://purl.obolibrary.org/obo/GAZ_00000448
+which states:
+
+#+BEGIN_SRC js
+Class: geographic
+ location
+ Term IRI: http://purl.obolibrary.org/obo/GAZ_00000448
+Definition: A reference to a place on
+ the Earth, by its name or by its geographical location.
+#+END_SRC
+
+and when you check the count by location in the [[./demo][DEMO]], it lists
+free-form text.
+
+So, why did the validation step balk when importing from GenBank?
+The problem was in the [[https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39][shex check]] for RDF generation.
+Removing the wikidata requirement there relaxed the imports; see this
+[[https://github.com/arvados/bh20-seq-resource/commit/f776816ee2b1af7ccc84afb494f68a81a51f5a76][patch]].