COVID-19 PubSeq (part 1)

1. What does this mean?
2. Fetch sequence data
3. Predicates
4. Fetch submitter info and other metadata
5. Fetch all sequences from Washington state
6. Discussion
7. Acknowledgements

As part of the COVID-19 Biohackathon 2020 we formed a working group
to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
coronavirus sequences. The general idea is to create a repository
that has a low barrier to entry for uploading sequence data using best
practices: data is published under a Creative Commons 4.0 attribution
(CC-BY-4.0) license, with metadata that follows state-of-the-art
standards and, perhaps most importantly, with standardised workflows
that get triggered on upload, so that results are immediately
available in standardised data formats.

1 What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one
of our tools (CLI or web-based) they add some metadata, which is
expressed in a schema that looks like

- name: hostSchema
  type: record
  fields:
    host_species:
        doc: Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens
        type: string
        jsonldPredicate:
          _id: http://www.ebi.ac.uk/efo/EFO_0000532
          _type: "@id"
          noLinkCheck: true
    host_sex:
        doc: Sex of the host as defined in PATO, expect male () or female ()
        type: string?
        jsonldPredicate:
          _id: http://purl.obolibrary.org/obo/PATO_0000047
          _type: "@id"
          noLinkCheck: true
    host_age:
        doc: Age of the host as number (e.g. 50)
        type: int?
        jsonldPredicate:
          _id: http://purl.obolibrary.org/obo/PATO_0000011

This metadata gets transformed into an RDF database, which means that
information related to uploaded sequences can easily be fetched.
We'll show an example below where we query a live database.

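To make the link between the schema and RDF a little more concrete,
here is a minimal Python sketch of how one host record could be turned
into triples. It is hand-rolled for illustration only and is not the
conversion code used by PubSeq. The predicate URIs are copied from the
schema above; the function name host_triples, the blank node label
_:host and the example values are assumptions made for this example.

```python
# Predicates copied from the jsonldPredicate entries in the schema above.
PREDICATES = {
    "host_species": "http://www.ebi.ac.uk/efo/EFO_0000532",
    "host_sex": "http://purl.obolibrary.org/obo/PATO_0000047",
    "host_age": "http://purl.obolibrary.org/obo/PATO_0000011",
}
# Fields declared with _type: "@id" hold URIs rather than literals.
ID_FIELDS = {"host_species", "host_sex"}
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def host_triples(subject, host):
    """Return (subject, predicate, object) tuples in N-Triples notation."""
    triples = []
    for field, value in host.items():
        if value is None:
            continue  # optional fields (type ending in "?") may be absent
        predicate = f"<{PREDICATES[field]}>"
        if field in ID_FIELDS:
            obj = f"<{value}>"  # URI object
        else:
            obj = f'"{value}"^^<{XSD_INTEGER}>'  # host_age is an int
        triples.append((subject, predicate, obj))
    return triples


# Hypothetical record: a human host aged 50, attached to a blank node.
example = {
    "host_species": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
    "host_age": 50,
}
for s, p, o in host_triples("_:host", example):
    print(f"{s} {p} {o} .")
```

Each printed line is one triple: the blank node as subject, the
ontology term from the schema as predicate, and the submitted value as
object.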

There is more: when a new sequence gets uploaded COVID-19 PubSeq kicks
in with a number of workflows running in the cloud. These workflows
generate a fresh variation graph (GFA) containing all sequences, an
RDF file containing metadata, and an RDF file containing the variation
graph in triples. Soon we will add multiple sequence alignments (MSA)
and more. Anyone can contribute data, tools and workflows to this
initiative!

2 Fetch sequence data

The latest run of the pipeline can be viewed here. Each of these
generated files can just be downloaded for your own use and sharing!
Data is published under a Creative Commons 4.0 attribution license
(CC-BY-4.0). This means that, unlike some other 'public' resources,
you can use this data in any way you want, provided the submitter gets
attributed.

If you download the GFA or FASTA sequences you'll find sequences are
named something like
keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta which
refers to an internal Arvados Keep representation of the FASTA
sequence. Keep is content-addressable, which means that the value
e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its
contents. If the contents change, the identifier changes! We use
these identifiers throughout.

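The idea of content addressing can be illustrated with a few lines of
Python. This is only a sketch of the principle: it assumes an
identifier made of an MD5 digest followed by a byte count, which is
what the identifier above looks like, and it hashes the raw bytes
directly. What exactly Arvados Keep hashes internally is not described
here, so do not expect this to reproduce real Keep identifiers. The
function name content_address and the two tiny FASTA records are made
up for the example.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return '<md5 hex digest>+<size in bytes>' for the given contents."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


# Two hypothetical FASTA records that differ in a single base.
original = b">seq1\nACGT\n"
changed = b">seq1\nACGA\n"

print(content_address(original))
print(content_address(changed))  # a different digest, same size
```

Identical contents always give the same identifier, while changing a
single base gives a different one.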

3 Predicates

To explore an RDF dataset, the first query we can run is an open one
that simply gets us a list. Let's look at all the predicates in the
dataset by pasting the following into the SPARQL endpoint at
http://sparql.genenetwork.org/sparql/

select distinct ?p
{
   ?o ?p ?s
}

You can ignore the openlink and w3 ones. To reduce the results to a
named graph, set the default graph.
To get a list of graphs in the dataset, first do

select distinct ?g
{
    GRAPH ?g {?s ?p ?o}
}

To limit the search to metadata, add
http://covid-19.genenetwork.org/graph/metadata.ttl in the top input
box. Now you can find a predicate for submitter that looks like
http://biohackathon.org/bh20-seq-schema#MainSchema/submitter.

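If you would rather script this than paste queries into the web form,
the following Python sketch builds a request URL for the same endpoint
and default graph. The parameter names query and default-graph-uri
come from the SPARQL 1.1 Protocol; that this particular endpoint
accepts GET requests in this form is an assumption, and the function
name sparql_url is made up for the example. The sketch only builds the
URL and does not contact the server.

```python
from urllib.parse import urlencode

# Endpoint and graph as given in the text above.
ENDPOINT = "http://sparql.genenetwork.org/sparql/"
METADATA_GRAPH = "http://covid-19.genenetwork.org/graph/metadata.ttl"


def sparql_url(query: str, default_graph: str = METADATA_GRAPH) -> str:
    """Build a GET URL that sends `query` restricted to `default_graph`."""
    params = {"default-graph-uri": default_graph, "query": query}
    return f"{ENDPOINT}?{urlencode(params)}"


query = "select distinct ?p { ?s ?p ?o }"
print(sparql_url(query))
```

The resulting URL can be opened in a browser or fetched with any HTTP
client.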

To list all submitters, try

select distinct ?s
{
   ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?s
}

Oh wait, it returns things like nodeID://b76150! That is not helpful:
these are anonymous nodes in the graph. They point to another triple,
and by following them with

select distinct ?s
{
   ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id .
   ?id ?p ?s
}

you get a list of all submitters including "University of Washington,
Seattle, WA 98109, USA".

To lift the full URL out of the query you can use a PREFIX header like

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?dataset ?submitter
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter
}

which reads a bit better. We can also see the submitted sequences. One
of them, submitted by the University of Washington, is
http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/sequence.fasta
(note that the ID may have changed, so pick one using the query above).
To see the submitted metadata, replace sequence.fasta with metadata.yaml:
http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/metadata.yaml

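When you process many query results, this substitution is easy to
script. The Python sketch below does nothing more than the replacement
described above; the function name metadata_url is made up for the
example, and it assumes every dataset URL ends in /sequence.fasta as
in the one shown.

```python
def metadata_url(sequence_url: str) -> str:
    """Swap the trailing sequence.fasta for metadata.yaml."""
    suffix = "/sequence.fasta"
    if not sequence_url.endswith(suffix):
        raise ValueError(f"not a sequence URL: {sequence_url}")
    return sequence_url[: -len(suffix)] + "/metadata.yaml"


# The dataset URL quoted above; the ID may have changed since.
seq = (
    "http://collections.lugli.arvadosapi.com/"
    "c=030bcb8fda7f19743157359f5855f7a6+126/sequence.fasta"
)
print(metadata_url(seq))
```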

Now that we have got this far, let's count the submitted datasets with

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select (COUNT(distinct ?dataset) as ?num)
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter
}

4 Fetch submitter info and other metadata

To get datasets with their submitters we can reuse the query above,
this time also selecting the predicate ?p

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?dataset ?p ?submitter
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter
}

This tells you that one submitter is
"Roychoudhury,P.;Greninger,A.;Jerome,K.", with a URL predicate
(http://purl.obolibrary.org/obo/NCIT_C42781) that is defined as "The
individual who is responsible for the content of a document."
Well-formed URIs point to real information about the URI itself.
Welcome to the power of the semantic web.

Let's focus on one sample with

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?dataset ?submitter
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter .
   FILTER(CONTAINS(?submitter,"Roychoudhury")) .
}

That is a lot of samples! We just want to pick one, so let's
see if we can get a sample ID by listing the sample predicates

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?p
{
   ?dataset ?p ?o .
   ?dataset pubseq:submitter ?id .
}

which lists a predicate named
http://biohackathon.org/bh20-seq-schema#MainSchema/sample.
Let's zoom in on those of Roychoudhury with

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?sid ?sample ?p1 ?dataset ?submitter
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter .
   FILTER(CONTAINS(?submitter,"Roychoudhury")) .
   ?dataset pubseq:sample ?sid .
   ?sid ?p1 ?sample
}

which shows pretty much everything known about their submissions in
this database. Let's focus on one sample "MT326090.1" with predicate
http://semanticscience.org/resource/SIO_000115.

PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
PREFIX sio: <http://semanticscience.org/resource/>
select distinct ?sample ?p ?o
{
   ?sample sio:SIO_000115 "MT326090.1" .
   ?sample ?p ?o .
}

This query tells us that the sample was submitted on "2020-03-21",
originates from http://www.wikidata.org/entity/Q30, i.e., the USA, and
is a biospecimen collected from the back of the throat by swabbing.
We can track it back to the original GenBank submission.

We have also added country and label data to make it a bit easier
to view/query the database.

5 Fetch all sequences from Washington state

Now that we know how to get at the origin, we can do it the other way
round and fetch all sequences referring to Washington state

select ?seq ?sample
{
    ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
    ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q1223>
}

which lists 300 sequences originating from Washington state, almost
half of the set coming out of GenBank!

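The same query can be generated for other locations. The Python sketch
below fills a Wikidata entity ID into the query shown above. The two
predicate URIs are copied from that query; the function name
sequences_from is made up for the example, and whether a given
location is actually present in the dataset under this predicate is
something you would have to check by running the query.

```python
import re

# Predicates copied from the Washington state query above.
SAMPLE = "http://biohackathon.org/bh20-seq-schema#MainSchema/sample"
LOCATION = "http://purl.obolibrary.org/obo/GAZ_00000448"


def sequences_from(wikidata_id: str) -> str:
    """Return a SPARQL query listing sequences sampled at `wikidata_id`."""
    # Accept only plain IDs such as Q1223, so nothing else can end up
    # inside the query text.
    if not re.fullmatch(r"Q[1-9][0-9]*", wikidata_id):
        raise ValueError(f"not a Wikidata entity ID: {wikidata_id!r}")
    return (
        "select ?seq ?sample\n"
        "{\n"
        f"    ?seq <{SAMPLE}> ?sample .\n"
        f"    ?sample <{LOCATION}> <http://www.wikidata.org/entity/{wikidata_id}>\n"
        "}\n"
    )


print(sequences_from("Q1223"))  # Q1223 is Washington state, as above
```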

6 Discussion

The public sequence uploader collects sequences, raw data and
(machine) queryable metadata. Not only that: data gets analyzed in the
pangenome and results are presented immediately. The data can be
referenced in publications and origins are citable.

7 Acknowledgements

The overall effort was due to magnificent freely donated input by a
great number of people. I particularly want to thank Thomas Liener for
the great effort he made with the ontology group in getting ontologies
and schema sorted! Peter Amstutz and Arvados/Curii helped build the
on-demand compute and back-ends. Thanks also to Michael Crusoe for
supporting the Common Workflow Language initiative. And without Erik
Garrison this initiative would not have existed!
