From bef48abab5e8596703dd825b2d920ea25314d868 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Mon, 26 Oct 2020 10:23:00 +0000
Subject: Update blog

---
 doc/blog/using-covid-19-pubseq-part1.html | 257 ++++++++++++++++++++----------
 doc/blog/using-covid-19-pubseq-part1.org  |  57 +++++--
 2 files changed, 224 insertions(+), 90 deletions(-)

diff --git a/doc/blog/using-covid-19-pubseq-part1.html b/doc/blog/using-covid-19-pubseq-part1.html
index deeb749..454eeb5 100644
--- a/doc/blog/using-covid-19-pubseq-part1.html
+++ b/doc/blog/using-covid-19-pubseq-part1.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+This means that when someone uploads a SARS-CoV-2 sequence using one
@@ -313,11 +289,11 @@ initiative!
-The latest run of the pipeline can be viewed here. Each of these
+The latest run of the pipeline can be viewed here. Each of these
 generated files can just be downloaded for your own use and sharing!
 Data is published under a Creative Commons 4.0 attribution license
 (CC-BY-4.0). This means that, unlike some other 'public' resources,
@@ -338,8 +314,8 @@ these identifiers throughout.
To explore an RDF dataset, the first query we can do is open and gets
@@ -452,8 +428,8 @@ Run this
-
To get datasets with submitters we can do the above
@@ -558,26 +534,94 @@ PREFIX sio: <http://semanticscience.org/resource/"> sio: <http://semantics
This query tells us the sample was submitted "2020-03-21" and
originates from http://www.wikidata.org/entity/Q30, i.e., the USA and
is a biospecimen collected from the back of the throat by swabbing.
-We can track it back to the original GenBank submission using the
-http://identifiers.org/insdc/MT326090.1 link.
+We have also added country and label data to make it a bit easier to
+view/query the database and place the sequence on the map. We use
+Wikidata entities for disambiguation. By using 'Q30' for the USA we
+don't have to figure out the different ways people spell the name. To
+get from the Wikidata entity to a human-readable form we provide a
+country name translation for convenience. For example, when the
+predicate is http://purl.obolibrary.org/obo/GAZ_00000448 we can do
+
+Which will show the geoname spelled out as 'United States'.
-We have also added country and label data to make it a bit easier
-to view/query the database and place the sequence on the map.
+For this sample we can also track it back to the original GenBank
+submission using the listed http://identifiers.org/insdc/MT326090.1
+link.
4 Fetch submitter info and other metadata
+4 Fetch submitter info and other metadata
PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+PREFIX sio: <http://semanticscience.org/resource/>
+select distinct ?sample ?geoname
+{
+ ?sample sio:SIO_000115 "MT326090.1" .
+ ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> ?geo .
+ ?geo rdfs:label ?geoname .
+}
+
+
Now we know how to get at the origin we can do it the other way round
@@ -585,19 +629,72 @@ and fetch all sequences referring to Washington state
-select ?seq ?sample
+select ?date ?name ?identifier ?seq
 {
     ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
-    ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q1223>
-}
+
+    ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q1223> .
+    ?sample <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25164> ?date .
+    ?sample <http://semanticscience.org/resource/SIO_000115> ?name .
+    ?sample <http://edamontology.org/data_2091> ?identifier .
+} order by ?date
-which lists 300 sequences originating from Washington state! Which in
+Run query
+
+
+
+Which shows the date and links to NCBI and raw sequence data in FASTA format,
+e.g.
+
+
+
+"date" "name" "identifier" "seq"
+"2020-01-15" "MT252760.1" "http://identifiers.org/insdc/MT252760.1#sequence" "http://collections.lugli.arvadosapi.com/c=0164784cba5e3e39b7ba8d83fdc92649+126/sequence.fasta"
+"2020-01-15" "MT252720.1" "http://identifiers.org/insdc/MT252720.1#sequence" "http://collections.lugli.arvadosapi.com/c=0387a3e47dd8a0c9ea0a4a21931f6308+126/sequence.fasta"
+(...)
+
+
+
+
+The query lists 300 sequences originating from Washington state! Which in
 April was almost half of the set coming out of GenBank.
@@ -624,8 +721,8 @@ Run
-The public sequence uploader collects sequences, raw data and
@@ -636,8 +733,8 @@ referenced in publications and origins are citeable.
The overall effort was due to magnificent freely donated input by a
@@ -652,7 +749,7 @@ Garrison this initiative would not have existed!