Diffstat (limited to 'doc/blog')
-rw-r--r-- doc/blog/using-covid-19-pubseq-part1.org 159
1 files changed, 132 insertions, 27 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org
index 647165d..617a01d 100644
--- a/doc/blog/using-covid-19-pubseq-part1.org
+++ b/doc/blog/using-covid-19-pubseq-part1.org
@@ -68,53 +68,151 @@ If you download the GFA or FASTA sequences you'll find sequences are
named something like
*keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta* which
refers to an internal Arvados Keep representation of the FASTA
-sequence. Keep is content-addressable which means that
+sequence. Keep is content-addressable which means that the value
e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its
-contents. If the contents change, the identifier would change! We use
+contents. If the contents change, the identifier changes! We use
these identifiers throughout.
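To make the idea of content addressing concrete, here is a minimal Python sketch. It assumes an identifier of the form MD5 hex digest plus "+" plus byte length, which mirrors the shape of the value above; the function name =content_id= and the two toy FASTA records are made up for illustration. Keep computes its identifiers over its own internal representation, so this will not reproduce the IDs in the dataset.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical helper: derive an identifier purely from the bytes."""
    # Assumption: MD5 hex digest, a plus sign, then the size in bytes.
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


# Two toy FASTA records that differ in a single base.
original = b">seq1\nACGT\n"
edited = b">seq1\nACGA\n"

print(content_id(original))
print(content_id(edited))  # a different identifier, because the content differs
```

The same bytes always give the same identifier, and any change to the bytes gives a new one, which is why the identifier can stand in for the file itself.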
-* Fetch submitter info and other metadata
+* Predicates
+
+Let's look at all the predicates in the dataset by pasting the
+following into the SPARQL endpoint http://sparql.genenetwork.org/sparql/
+
+#+begin_src sql
+select distinct ?p
+{
+ ?o ?p ?s
+}
+#+end_src
+
+You can ignore the openlink and w3 ones. To restrict the results to a
+named graph, set the default graph to
+http://covid-19.genenetwork.org/graph/metadata.ttl in the top input
+box. There you can find a predicate for submitter that looks like
+http://biohackathon.org/bh20-seq-schema#MainSchema/submitter.
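Instead of pasting into the web form, the same request can be expressed as a URL. The sketch below only builds that URL and does not contact the server. The parameter names =default-graph-uri=, =query= and =format= are the ones visible in the endpoint's own query links; the helper name =build_query_url= is hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://sparql.genenetwork.org/sparql/"
METADATA_GRAPH = "http://covid-19.genenetwork.org/graph/metadata.ttl"


def build_query_url(query, graph=METADATA_GRAPH, fmt="text/html"):
    """Hypothetical helper: encode a SPARQL query as an endpoint URL."""
    params = {
        "default-graph-uri": graph,  # same role as the top input box
        "query": query,
        "format": fmt,
    }
    return ENDPOINT + "?" + urlencode(params)


PREDICATES = "select distinct ?p { ?o ?p ?s }"
url = build_query_url(PREDICATES)
print(url)
```

Opening the printed URL in a browser should show the same result page as the form, assuming the endpoint still accepts these parameters.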
-We are interested in e17abc8a0269875ed4cfbff5d9897c6c and now we
-want to get some metadata. We can use a SPARQL end point hosted at
-http://sparql.genenetwork.org/sparql/. Paste in a query like
+To list all submitters, try
#+begin_src sql
-select ?p ?s
+select distinct ?s
{
- <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> ?p ?s
+ ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?s
}
#+end_src
-which will tell you that original FASTA ID is "MT293175.1". It also
-says the submitter is nodeID://b31228.
+Oh wait, it returns things like nodeID://b76150! That is not helpful:
+these are anonymous nodes in the graph. Each one points to further
+triples, and by following that link with
#+begin_src sql
-select distinct ?id ?p ?s
+select distinct ?s
{
- <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id .
+ ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id .
?id ?p ?s
}
#+end_src
-Tells you the submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K."
-with [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] explaining "The individual who is responsible for the
-content of a document." Welcome to the power of the semantic web.
+you get a list of all submitters including "University of Washington,
+Seattle, WA 98109, USA".
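The two-step lookup through an anonymous node can be mimicked on a toy list of triples. Everything in =TRIPLES= below is made up for illustration (dataset names, node IDs, the =ex:= predicates and the values); only the submitter predicate comes from the real schema. It is a sketch of the join the query performs, not of how the triple store works internally.

```python
SUBMITTER = "http://biohackathon.org/bh20-seq-schema#MainSchema/submitter"

# Toy (subject, predicate, object) triples; all values are illustrative.
TRIPLES = [
    ("keep:dataset-a", SUBMITTER, "nodeID://b1"),
    ("nodeID://b1", "ex:name", "Submitter A"),
    ("nodeID://b1", "ex:address", "Address A"),
    ("keep:dataset-b", SUBMITTER, "nodeID://b2"),
    ("nodeID://b2", "ex:name", "Submitter B"),
]


def submitter_nodes(triples):
    """First hop: '?o submitter ?id' yields only anonymous node IDs."""
    return sorted({o for _, p, o in triples if p == SUBMITTER})


def submitter_values(triples):
    """Second hop: '?id ?p ?s' yields the values attached to those nodes."""
    ids = set(submitter_nodes(triples))
    return sorted({o for s, _, o in triples if s in ids})


print(submitter_nodes(TRIPLES))   # ['nodeID://b1', 'nodeID://b2']
print(submitter_values(TRIPLES))  # ['Address A', 'Submitter A', 'Submitter B']
```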
+
+To lift the full URL out of the query you can use a header like
+
+#+begin_src sql
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+select distinct ?dataset ?submitter
+{
+ ?dataset pubseq:submitter ?id .
+ ?id ?p ?submitter
+}
+#+end_src
+
+which reads a bit better. We can also see the datasets. One of them,
+submitted by University of Washington, is
+http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta
+(note the ID may have changed, so pick one with the above query).
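The PREFIX header is only shorthand: each prefixed name expands to the full URI by plain string concatenation. A small Python sketch of that expansion, using the =pubseq= prefix from the query above (the helper name =expand= is hypothetical):

```python
PREFIXES = {
    "pubseq": "http://biohackathon.org/bh20-seq-schema#MainSchema/",
}


def expand(name, prefixes=PREFIXES):
    """Hypothetical helper: turn 'prefix:local' into a full URI."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    # Variables such as ?dataset and unknown prefixes are left untouched.
    return name


print(expand("pubseq:submitter"))
print(expand("?dataset"))
```

So =pubseq:submitter= in the shortened query refers to exactly the same predicate as the long URI used in the earlier queries.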
+
+
+* Fetch submitter info and other metadata
-To get more information about the relevant sample
+To get datasets with submitters we can reuse the query above, this
+time also selecting the predicate
#+begin_src sql
-select ?sample ?p ?o
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+select distinct ?dataset ?p ?submitter
{
- <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
- ?sample ?p ?o
+ ?dataset pubseq:submitter ?id .
+ ?id ?p ?submitter
}
#+end_src
-we find it originates from Washington state (object
-https://www.wikidata.org/wiki/Q1223) , dated "30-Mar-2020". The
-sequencing was executed with Illumina and pipeline "custom pipeline
-v. 2020-03" which is arguably not that descriptive.
+This tells you one submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K."
+with a URL [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] (http://purl.obolibrary.org/obo/NCIT_C42781)
+explaining "The individual who is responsible for the content of a
+document." Well-formed URIs point to real information about the URI
+itself. Welcome to the power of the semantic web.
+
+Let's narrow this down to one submitter with
+
+#+begin_src sql
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+select distinct ?dataset ?submitter
+{
+ ?dataset pubseq:submitter ?id .
+ ?id ?p ?submitter .
+ FILTER(CONTAINS(?submitter,"Roychoudhury")) .
+}
+#+end_src
+
+That is a lot of samples! We just want to pick one, so let's
+see if we can get a sample ID by listing sample predicates
+
+#+begin_src sql
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+select distinct ?p
+{
+ ?dataset ?p ?o .
+ ?dataset pubseq:submitter ?id .
+}
+#+end_src
+
+which lists a predicate named
+http://biohackathon.org/bh20-seq-schema#MainSchema/sample.
+Let's zoom in on those of Roychoudhury with
+
+
+#+begin_src sql
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+select distinct ?sid ?sample ?p1 ?dataset ?submitter
+{
+ ?dataset pubseq:submitter ?id .
+ ?id ?p ?submitter .
+ FILTER(CONTAINS(?submitter,"Roychoudhury")) .
+ ?dataset pubseq:sample ?sid .
+ ?sid ?p1 ?sample
+}
+#+end_src
+
+which shows pretty much [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][everything known]] about their submissions in
+this database. Let's focus on one sample "MT326090.1" with predicate
+http://semanticscience.org/resource/SIO_000115.
+
+#+begin_src sql
+PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
+PREFIX sio: <http://semanticscience.org/resource/>
+select distinct ?sample ?p ?o
+{
+ ?sample sio:SIO_000115 "MT326090.1" .
+ ?sample ?p ?o .
+}
+#+end_src
+
+This [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT326090.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]] tells us the sample was submitted "2020-03-21" and
+originates from http://www.wikidata.org/entity/Q30 (the USA), and that it
+is a biospecimen collected from the back of the throat by swabbing.
+We can track it back to the original GenBank [[http://identifiers.org/insdc/MT326090.1#sequence][submission]].
+
+We have also added country and label data to make it a bit easier
+to view/query the database.
* Fetch all sequences from Washington state
@@ -133,12 +231,19 @@ select ?seq ?sample
which lists 300 sequences originating from Washington state! Which is almost
half of the set coming out of GenBank.
+* Discussion
+
+The public sequence uploader collects sequences, raw data and
+(machine) queryable metadata. Not only that: data gets analyzed in the
+pangenome and results are presented immediately. The data can be
+referenced in publications and origins are citable.
+
* Acknowledgements
The overall effort was due to magnificent freely donated input by a
great number of people. I particularly want to thank Thomas Liener for
the great effort he made with the ontology group in getting ontologies
-and schema sorted! Peter Amstutz and Curii helped build the on-demand
-compute and back-ends. Thanks also to Michael Crusoe for supporting
-the CWL initiative. And without Erik Garrison this initiative would
-not have existed!
+and schema sorted! Peter Amstutz and [[https://arvados.org/][Arvados/Curii]] helped build the
+on-demand compute and back-ends. Thanks also to Michael Crusoe for
+supporting the [[https://www.commonwl.org/][Common Workflow Language]] initiative. And without Erik
+Garrison this initiative would not have existed!