From 246c516e4a8c98394c695dcb446995319d557e01 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 22 Aug 2020 14:07:53 +0100
Subject: Generated

---
 doc/blog/using-covid-19-pubseq-part3.html | 149 +++++++++++++++---------------
 doc/blog/using-covid-19-pubseq-part3.org  |   1 -
 doc/blog/using-covid-19-pubseq-part5.html | 130 ++++++++++++++++++++------
 doc/blog/using-covid-19-pubseq-part5.org  |   6 +-
 4 files changed, 180 insertions(+), 106 deletions(-)

diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index df4a286..80304c3 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+
 COVID-19 PubSeq Uploading Data (part 3)
@@ -248,40 +248,40 @@ for the JavaScript code in this tag.

Table of Contents

@@ -290,8 +290,8 @@ for the JavaScript code in this tag. -
-

1 Uploading Data

+
+

1 Uploading Data

The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a @@ -301,8 +301,8 @@ gets triggered on upload. Read the ABOUT page for more inf

-
-

2 Step 1: Upload sequence

+
+

2 Step 1: Upload sequence

To upload a sequence in the web upload page hit the browse button and @@ -330,8 +330,8 @@ an improved pangenome.

-
-

3 Step 2: Add metadata

+
+

3 Step 2: Add metadata

The web upload page contains fields for adding metadata. Metadata is @@ -357,12 +357,12 @@ the web form. Here we add some extra information.

-
-

3.1 Obligatory fields

+
+

3.1 Obligatory fields

-
-

3.1.1 Sample ID (sampleid)

+
+

3.1.1 Sample ID (sampleid)

This is a string field that defines a unique sample identifier by the @@ -380,8 +380,8 @@ Here we add the GenBank ID MT536190.1.

-
-

3.1.2 Collection date

+
+

3.1.2 Collection date

Estimated collection date. The GenBank page says April 6, 2020. @@ -389,8 +389,8 @@ Estimated collection date. The GenBank page says April 6, 2020.
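The bulk metadata example in this post writes dates in ISO form (collection_date: "2020-01-01"), so a GenBank-style date such as the one above needs converting first. A minimal sketch, assuming an English month name; the helper name is hypothetical and not part of the uploader:

```python
from datetime import datetime


def genbank_date_to_iso(text: str) -> str:
    """Convert a date like 'April 6, 2020' to ISO form (YYYY-MM-DD)."""
    # %B parses the full month name; this assumes an English/C locale.
    parsed = datetime.strptime(text.strip(), "%B %d, %Y")
    return parsed.date().isoformat()


if __name__ == "__main__":
    print(genbank_date_to_iso("April 6, 2020"))
```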

-
-

3.1.3 Collection location

+
+

3.1.3 Collection location

A search on wikidata says Los Angeles is @@ -399,8 +399,8 @@ A search on wikidata says Los Angeles is

-
-

3.1.4 Sequencing technology

+
+

3.1.4 Sequencing technology

GenBank entry says Illumina, so we can fill that in @@ -408,8 +408,8 @@ GenBank entry says Illumina, so we can fill that in

-
-

3.1.5 Authors

+
+

3.1.5 Authors

GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga @@ -420,16 +420,16 @@ Freehan,A. and Garcia-Diaz,J.', so we can fill that in.

-
-

3.2 Optional fields

+
+

3.2 Optional fields

All other fields are optional. But let's see what we can add.

-
-

3.2.1 Host information

+
+

3.2.1 Host information

Sadly, not much is known about the host from GenBank. A little @@ -443,8 +443,8 @@ did to the person and what the person was like (say age group).

-
-

3.2.2 Collecting institution

+
+

3.2.2 Collecting institution

We can fill that in. @@ -452,8 +452,8 @@ We can fill that in.

-
-

3.2.3 Specimen source

+
+

3.2.3 Specimen source

We have that: nasopharyngeal swab @@ -461,8 +461,8 @@ We have that: nasopharyngeal swab

-
-

3.2.4 Source database accession

+
+

3.2.4 Source database accession

Genbank which is http://identifiers.org/insdc/MT536190.1#sequence. @@ -471,8 +471,8 @@ Note we plug in our own identifier MT536190.1.

-
-

3.2.5 Strain name

+
+

3.2.5 Strain name

SARS-CoV-2/human/USA/LA-BIE-070/2020 @@ -482,8 +482,8 @@ SARS-CoV-2/human/USA/LA-BIE-070/2020

-
-

4 Step 3: Submit to COVID-19 PubSeq

+
+

4 Step 3: Submit to COVID-19 PubSeq

Once you have the sequence and the metadata together, hit @@ -493,8 +493,8 @@ submitted and the workflows should kick in!

-
-

4.1 Trouble shooting

+
+

4.1 Trouble shooting

We got an error saying: {"stem": "http://www.wikidata.org/entity/",… @@ -508,9 +508,8 @@ submit button.

- -
-

5 Step 4: Check output

+
+

5 Step 4: Check output

The current pipeline takes 5.5 hours to complete! Once it completes @@ -521,8 +520,8 @@ in.

-
-

6 Bulk sequence uploader

+
+

6 Bulk sequence uploader

 Above steps require a manual upload of one sequence with metadata.
@@ -544,8 +543,8 @@ host:
     host_age: 20
     host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
     host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
-    host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
-    host_vaccination: [vaccines1,vaccine2]
+    host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+    host_vaccination: [vaccines1,vaccine2]
     ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
     additional_host_information: Optional free text field for additional information
 
@@ -553,11 +552,11 @@ sample:
     sample_id: Id of the sample as defined by the submitter
     collector_name: Name of the person that took the sample
     collecting_institution: Institute that was responsible of sampling
-    specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
+    specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
     collection_date: "2020-01-01"
     collection_location: http://www.wikidata.org/entity/Q148
     sample_storage_conditions: frozen specimen
-    source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
+    source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
     additional_collection_information: Optional free text field for additional information
 
 virus:
@@ -565,28 +564,28 @@ virus:
     virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
 
 technology:
-    sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
+    sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
     sequence_assembly_method: Protocol used for assembly
-    sequencing_coverage: [70.0, 100.0]
+    sequencing_coverage: [70.0, 100.0]
     additional_technology_information: Optional free text field for additional information
 
 submitter:
-    authors: [John Doe, Joe Boe, Jonny Oe]
-    submitter_name: [John Doe]
+    authors: [John Doe, Joe Boe, Jonny Oe]
+    submitter_name: [John Doe]
     submitter_address: John Doe's address
     originating_lab: John Doe kitchen
     lab_address: John Doe's address
     provider_sample_id: XXX1
     submitter_sample_id: XXX2
     publication: PMID00001113
-    submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
+    submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
     additional_submitter_information: Optional free text field for additional information
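Before a bulk upload it can help to check that the obligatory fields from the web form are present in each metadata record. The sketch below works on an already-parsed record (a plain dict). The mapping of the five obligatory form fields to section/key pairs is an assumption made for illustration, and the function name is hypothetical; the real uploader validates against its own schema.

```python
# Obligatory form fields mapped to (section, key) pairs in the bulk
# metadata layout. This mapping is an assumption for illustration only.
REQUIRED = [
    ("sample", "sample_id"),
    ("sample", "collection_date"),
    ("sample", "collection_location"),
    ("technology", "sample_sequencing_technology"),
    ("submitter", "authors"),
]


def missing_fields(metadata: dict) -> list:
    """Return 'section.key' names that are absent or empty in the record."""
    missing = []
    for section, key in REQUIRED:
        value = metadata.get(section, {}).get(key)
        if value in (None, "", []):
            missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    example = {
        "sample": {
            "sample_id": "XXX1",
            "collection_date": "2020-01-01",
            "collection_location": "http://www.wikidata.org/entity/Q148",
        },
        "technology": {
            "sample_sequencing_technology": ["http://www.ebi.ac.uk/efo/EFO_0009173"],
        },
        "submitter": {},  # authors left out on purpose
    }
    print(missing_fields(example))
```

Returning the list of missing names, instead of raising on the first one, lets a bulk run report every problem in a record at once.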

-
-

6.1 Run the uploader (CLI)

+
+

6.1 Run the uploader (CLI)

Installing with pip you should be @@ -621,8 +620,8 @@ The web interface using this exact same script so it should just work

-
-

6.2 Example: uploading bulk GenBank sequences

+
+

6.2 Example: uploading bulk GenBank sequences

We also use above script to bulk upload GenBank sequences with a FASTA @@ -634,7 +633,7 @@ took above for uploading a GenBank sequence are already automated.

-
Created by
Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-30 Sat 18:12
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-08-22 Sat 07:43
.
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org index e8fee36..b1ab90d 100644 --- a/doc/blog/using-covid-19-pubseq-part3.org +++ b/doc/blog/using-covid-19-pubseq-part3.org @@ -146,7 +146,6 @@ instead on https and entity instead of wiki) the submission went through. Reload the page (it won't empty the fields) to re-enable the submit button. - * Step 4: Check output The current pipeline takes 5.5 hours to complete! Once it completes diff --git a/doc/blog/using-covid-19-pubseq-part5.html b/doc/blog/using-covid-19-pubseq-part5.html index 4caa5ac..5d640f9 100644 --- a/doc/blog/using-covid-19-pubseq-part5.html +++ b/doc/blog/using-covid-19-pubseq-part5.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> - + COVID-19 PubSeq (part 4) @@ -248,19 +248,28 @@ for the JavaScript code in this tag.

Table of Contents

-
-

1 Modify Metadata

+
+

1 Modify Metadata

The public sequence resource uses multiple data formats listed on the @@ -268,8 +277,8 @@ The public sequence resource uses multiple data formats listed on the for RDF and semantic web/linked data ontologies. This technology allows for querying data in unprescribed ways - that is, you can formulate your own queries without dealing with a preset model of that -data (so typical of CSV files and SQL tables). Examples of exploring -data are listed here. +data (which is how one has to approach CSV files and SQL +tables). Examples of exploring data are listed here.

@@ -280,8 +289,8 @@ understand that anyone, including you, can change that information!

-
-

2 What is the schema?

+
+

2 What is the schema?

The default metadata schema is listed here. @@ -289,8 +298,8 @@ The default metadata schema is listed -

3 How is the website generated?

+
+

3 How is the website generated?

Using the schema we use pyshex shex expressions and schema salad to @@ -300,9 +309,13 @@ All from that one metadata schema.

-
-

4 Modifying the schema

+
+

4 Changing the license field

+
+
+

4.1 Modifying the schema

+

One of the first things we want to do is to add a field for the data license. Initially we only supported CC-4.0 as a license, but @@ -380,25 +393,25 @@ So, we'll add it simply as a title field. Now the draft schema is type: record fields: license_type: - doc: License types as refined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf + doc: License types as refined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf type: string? jsonldPredicate: - _id: https://creativecommons.org/ns#License + _id: https://creativecommons.org/ns#License title: doc: Attribution title related to license type: string? jsonldPredicate: - _id: http://semanticscience.org/resource/SIO_001167 + _id: http://semanticscience.org/resource/SIO_001167 attribution_url: doc: Attribution URL related to license type: string? jsonldPredicate: - _id: https://creativecommons.org/ns#Work + _id: https://creativecommons.org/ns#Work attribution_source: doc: Attribution source URL type: string? jsonldPredicate: - _id: https://creativecommons.org/ns#Work + _id: https://creativecommons.org/ns#Work

@@ -411,13 +424,13 @@ gitter channel and I merged it.
-
-

5 Adding fields to the form

-
+
+

4.2 Adding fields to the form

+

To add the new fields to the form we have to modify it a little. If we go to the upload form we need to add the license box. The schema is -loaded in main.py in the 'generateform' function. +loaded in main.py in the 'generate-form' function.

@@ -453,12 +466,71 @@ field to be optional - a missing license assumes it is CC-BY-4.0.
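The optional-license rule stated here (a missing license is taken to be CC-BY-4.0) can be sketched as a small lookup. The `license` section name and the function name are assumptions for illustration; only the `license_type` field and the CC-BY-4.0 default come from the text.

```python
DEFAULT_LICENSE = "CC-BY-4.0"


def effective_license(metadata: dict) -> str:
    """Return the declared license type, or the default when none is given."""
    # "license" as the enclosing section name is an assumption.
    license_section = metadata.get("license") or {}
    return license_section.get("license_type") or DEFAULT_LICENSE


if __name__ == "__main__":
    print(effective_license({}))
    # "CC0" is only a placeholder value for a declared license.
    print(effective_license({"license": {"license_type": "CC0"}}))
```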

-
-

6 TODO Testing the license fields

+
+

4.3 TODO Testing the license fields

+
+
+ +
+

5 Changing GEO or location field

+
+

+When fetching information from GenBank and EBI/ENA we also translate
+the location into an unambiguous identifier. We opted for the wikidata
+tag. E.g. for New York city it is https://www.wikidata.org/wiki/Q60
+and for New York state it is https://www.wikidata.org/wiki/Q1384. If
+everyone uses these metadata URIs it is easy to group when making
+queries. Note that we should be using
+http://www.wikidata.org/entity/Q60 in the dataset (http instead of
+https and entity instead of wiki).
+
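The rewrite from the browser URL to the dataset form (http instead of https, entity instead of wiki) is mechanical. A minimal sketch, assuming only item identifiers of the form Q followed by digits need handling; the function name is hypothetical and this is not the code used by the uploader:

```python
import re

# Matches both the browser page URL and the entity URI for a wikidata item.
_WIKIDATA_ITEM = re.compile(r"^https?://www\.wikidata\.org/(?:wiki|entity)/(Q\d+)$")


def to_entity_uri(url: str) -> str:
    """Rewrite a wikidata item URL to http://www.wikidata.org/entity/Q..."""
    match = _WIKIDATA_ITEM.match(url.strip())
    if match is None:
        raise ValueError(f"not a wikidata item URL: {url!r}")
    return "http://www.wikidata.org/entity/" + match.group(1)


if __name__ == "__main__":
    print(to_entity_uri("https://www.wikidata.org/wiki/Q60"))
    print(to_entity_uri("https://www.wikidata.org/wiki/Q1384"))
```

Free-form location strings raise `ValueError` here, so a caller can decide whether to reject them or keep them as plain text.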

+ +

+Unfortunately the main repositories of SARS-CoV-2 have variable
+strings of text for location and/or GPS coordinates. For us to support
+our schema we had to translate all options and this proves expensive.
+

+
+ +
+

5.1 Relaxing the shex constraint

+
+

+So we decide to relax the enforcement of this type of metadata and to
+allow for a free form string.
+

+ +

+The schema already used http://purl.obolibrary.org/obo/GAZ_00000448
+which states:
+

+ +
+
Class: geographic
+  location
+  Term IRI: http://purl.obolibrary.org/obo/GAZ_00000448
+Definition: A reference to a place on
+  the Earth, by its name or by its geographical location.
+
+
+ +

+and when you check count by location in the DEMO it lists a free
+format.
+

+ +

+So, why does the validation step balk when importing GenBank?
+The problem was in the shex check for RDF generation.
+Removing the wikidata requirement relaxed the imports with this
+patch.
+

+
+
-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-07-16 Thu 03:27
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-08-22 Sat 07:42
.
diff --git a/doc/blog/using-covid-19-pubseq-part5.org b/doc/blog/using-covid-19-pubseq-part5.org
index ec768ed..e260078 100644
--- a/doc/blog/using-covid-19-pubseq-part5.org
+++ b/doc/blog/using-covid-19-pubseq-part5.org
@@ -17,6 +17,7 @@
  - [[#adding-fields-to-the-form][Adding fields to the form]]
  - [[#testing-the-license-fields][Testing the license fields]]
  - [[#changing-geo-or-location-field][Changing GEO or location field]]
+ - [[#relaxing-the-shex-constraint][Relaxing the shex constraint]]
 
 * Modify Metadata
 
@@ -169,6 +170,8 @@ Unfortunately the main repositories of SARS-CoV-2 have variable
 strings of text for location and/or GPS coordinates. For us to support
 our schema we had to translate all options and this proves expensive.
 
+** Relaxing the shex constraint
+
 So we decide to relax the enforcement of this type of metadata and to
 allow for a free form string.
 
@@ -188,4 +191,5 @@ format.
 
 So, why does the validation step balk when importing GenBank?
 The problem was in the [[https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39][shex check]] for RDF generation.
-Removing the wikidata requirement relaxed the imports.
+Removing the wikidata requirement relaxed the imports with this
+[[https://github.com/arvados/bh20-seq-resource/commit/f776816ee2b1af7ccc84afb494f68a81a51f5a76][patch]].
-- 
cgit v1.2.3