COVID-19 PubSeq (part 1)

As part of the COVID-19 Biohackathon 2020 we formed a working group
to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
coronavirus sequences. The general idea is to create a repository
that has a low barrier to entry for uploading sequence data using best
practices: data is published under a Creative Commons 4.0 (CC-4.0)
license, with metadata that follows state-of-the-art standards and,
perhaps most importantly, with standardised workflows that get
triggered on upload, so that results are immediately available in
standardised data formats.

1 What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one
of our tools (CLI or web-based) they add some metadata, which is
expressed in a schema that looks like

    - name: hostSchema
      type: record
      fields:
        host_species:
          doc: Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens
          type: string
          jsonldPredicate:
            _id: http://www.ebi.ac.uk/efo/EFO_0000532
            _type: "@id"
            noLinkCheck: true
        host_sex:
          doc: Sex of the host as defined in PATO, expect male () or female ()
          type: string?
          jsonldPredicate:
            _id: http://purl.obolibrary.org/obo/PATO_0000047
            _type: "@id"
            noLinkCheck: true
        host_age:
          doc: Age of the host as number (e.g. 50)
          type: int?
          jsonldPredicate:
            _id: http://purl.obolibrary.org/obo/PATO_0000011

This metadata gets transformed into an RDF database, which means
information related to uploaded sequences can easily be fetched.
We'll show an example below where we query a live database.
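
To make the schema a little more concrete, here is a minimal Python sketch of a host record with the same three fields. The `Host` class, its `validate` method and the checks it performs are assumptions made for illustration; they are not part of the PubSeq tools, which validate uploads against the schema itself.

```python
from dataclasses import dataclass, asdict
from typing import Optional

NCBITAXON_PREFIX = "http://purl.obolibrary.org/obo/NCBITaxon_"


@dataclass
class Host:
    """Hypothetical in-memory mirror of the hostSchema fields."""
    host_species: str                # required: "type: string"
    host_sex: Optional[str] = None   # optional: "type: string?"
    host_age: Optional[int] = None   # optional: "type: int?"

    def validate(self) -> None:
        # Illustrative checks only; the real schema validation is richer.
        if not self.host_species.startswith(NCBITAXON_PREFIX):
            raise ValueError("host_species should be an NCBITaxon URI")
        if self.host_age is not None and not isinstance(self.host_age, int):
            raise ValueError("host_age should be an integer")


# Homo sapiens, age 50, sex not recorded.
host = Host(host_species=NCBITAXON_PREFIX + "9606", host_age=50)
host.validate()
record = asdict(host)
```

The optional fields default to `None`, matching the trailing `?` on their types in the schema.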

There is more: when a new sequence gets uploaded, COVID-19 PubSeq kicks
in with a number of workflows running in the cloud. These workflows
generate a fresh variation graph (GFA) containing all sequences, an
RDF file containing metadata, and an RDF file containing the variation
graph in triples. Soon we will add multiple sequence alignments (MSA)
and more. Anyone can contribute data, tools and workflows to this
initiative!

2 Fetch sequence data

The latest run of the pipeline can be viewed here. Each of these
generated files can just be downloaded for your own use and sharing!
Data is published under a Creative Commons 4.0 attribution license
(CC-BY-4.0). This means that, unlike some other 'public' resources,
you can use this data in any way you want, provided the submitter gets
attributed.

If you download the GFA or FASTA sequences you'll find sequences are
named something like
keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta, which
refers to an internal Arvados Keep representation of the FASTA
sequence. Keep is content-addressable, which means that the value
e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its
contents. If the contents change, the identifier changes! We use
these identifiers throughout.
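
As a rough sketch of what such a name contains, the Python below splits a keep-style name into its parts and illustrates the content-addressing idea with a plain MD5 digest. The helper names `parse_keep_name` and `content_id` are made up here, the number after `+` is simply treated as a size field, and hashing the raw bytes is an assumption for illustration; the text above does not spell out how Arvados derives the identifier.

```python
import hashlib
import re

# keep:<32 hex digits>+<number>/<file name>
KEEP_NAME = re.compile(r"^keep:(?P<digest>[0-9a-f]{32})\+(?P<size>\d+)/(?P<path>.+)$")


def parse_keep_name(name):
    """Split a keep-style sequence name into (digest, size, path)."""
    match = KEEP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a keep-style name: {name!r}")
    return match.group("digest"), int(match.group("size")), match.group("path")


def content_id(data: bytes) -> str:
    """Illustrative content address: MD5 hex digest plus byte length."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


digest, size, path = parse_keep_name(
    "keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta"
)
```

Two byte strings that differ in a single base get different values from `content_id`, while identical contents always map to the same value, which is the property the identifiers rely on.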

3 Predicates

To explore an RDF dataset, the first query we can run is an open one
that simply gets us a list. Let's look at all the predicates in the
dataset by pasting the following into a SPARQL endpoint,
http://sparql.genenetwork.org/sparql/

    select distinct ?p
    {
        ?o ?p ?s
    }

You can ignore the openlink and w3 ones. To reduce results to a named
graph, set the default graph.
To get a list of graphs in the dataset, first do

    select distinct ?g
    {
        GRAPH ?g {?s ?p ?o}
    }

To limit the search to metadata, add
http://covid-19.genenetwork.org/graph/metadata.ttl in the top input
box. Now you can find a predicate for submitter that looks like
http://biohackathon.org/bh20-seq-schema#MainSchema/submitter.
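
The same queries can be sent from a script instead of the web form. The sketch below uses only the Python standard library and the `query` and `default-graph-uri` parameters defined by the SPARQL protocol. The helper names are made up for this example, and it assumes the endpoint is reachable and honours a request for JSON results; neither has been verified here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://sparql.genenetwork.org/sparql/"
METADATA_GRAPH = "http://covid-19.genenetwork.org/graph/metadata.ttl"


def build_url(query, graph=None):
    """Encode a query, and optionally a default graph, into a GET URL."""
    params = {"query": query}
    if graph:
        params["default-graph-uri"] = graph
    return ENDPOINT + "?" + urlencode(params)


def bindings_to_rows(result):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_query(query, graph=None):
    """Send the query over HTTP and return the flattened rows."""
    request = Request(
        build_url(query, graph),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return bindings_to_rows(json.load(response))


# Example (needs network access):
# for row in run_query("select distinct ?p { ?o ?p ?s }", METADATA_GRAPH):
#     print(row["p"])
```

Passing `METADATA_GRAPH` as the second argument plays the same role as filling in the top input box of the web form.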

To list all submitters, try

    select distinct ?s
    {
        ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?s
    }

Oh wait, it returns things like nodeID://b76150! That is not helpful:
these are anonymous nodes in the graph. They point to another triple,
and by following that link with

    select distinct ?s
    {
        ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id .
        ?id ?p ?s
    }

you get a list of all submitters including "University of Washington,
Seattle, WA 98109, USA".
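
Why the second triple pattern is needed can be seen on a toy graph. The Python below is only an illustration of the two-step lookup: the dataset names and the `_:b1` / `_:b2` blank-node labels are invented, and a real triple store evaluates the patterns very differently.

```python
SUBMITTER = "http://biohackathon.org/bh20-seq-schema#MainSchema/submitter"
NCIT_AUTHOR = "http://purl.obolibrary.org/obo/NCIT_C42781"

# Toy graph as (subject, predicate, object) tuples.
TRIPLES = [
    ("dataset1", SUBMITTER, "_:b1"),
    ("_:b1", NCIT_AUTHOR, "Roychoudhury,P.;Greninger,A.;Jerome,K."),
    ("dataset2", SUBMITTER, "_:b2"),
    ("_:b2", NCIT_AUTHOR, "Roychoudhury,P.;Greninger,A.;Jerome,K."),
]


def anonymous_ids(triples):
    """Pattern 1: ?o <submitter> ?id, which yields only the anonymous nodes."""
    return sorted({obj for (_, pred, obj) in triples if pred == SUBMITTER})


def submitter_values(triples):
    """Pattern 2: ?id ?p ?s, following each anonymous node to its values."""
    ids = set(anonymous_ids(triples))
    return sorted({obj for (subj, _, obj) in triples if subj in ids})


print(anonymous_ids(TRIPLES))     # the unhelpful node identifiers
print(submitter_values(TRIPLES))  # the readable submitter strings
```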

To lift the full URL out of the query you can use a header like

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select distinct ?dataset ?submitter
    {
        ?dataset pubseq:submitter ?id .
        ?id ?p ?submitter
    }

which reads a bit better. We can also see the submitted sequences. One
of them, submitted by the University of Washington, is
http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/sequence.fasta
(note the ID may have changed, so pick one with the query above).
To see the submitted metadata, replace sequence.fasta with metadata.yaml:
http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/metadata.yaml

Now that we have got this far, let's count the datasets submitted with

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select (COUNT(distinct ?dataset) as ?num)
    {
        ?dataset pubseq:submitter ?id .
        ?id ?p ?submitter
    }

4 Fetch submitter info and other metadata

To get datasets with submitters we can reuse the query above, this
time also returning the predicate

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select distinct ?dataset ?p ?submitter
    {
        ?dataset pubseq:submitter ?id .
        ?id ?p ?submitter
    }

This tells you that one submitter is
"Roychoudhury,P.;Greninger,A.;Jerome,K." with a URL predicate
(http://purl.obolibrary.org/obo/NCIT_C42781) explaining "The
individual who is responsible for the content of a document."
Well-formed URIs resolve to real information about the term they
identify. Welcome to the power of the semantic web.

Let's focus on one sample with

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select distinct ?dataset ?submitter
    {
        ?dataset pubseq:submitter ?id .
        ?id ?p ?submitter .
        FILTER(CONTAINS(?submitter,"Roychoudhury")) .
    }

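
When the name to search for comes from a script rather than being typed by hand, it helps to build the FILTER clause with proper string escaping. The sketch below is one way to do that in Python; `sparql_string` and `submitter_query` are hypothetical helper names, and the escaping covers only backslashes, double quotes and newlines.

```python
PUBSEQ = "http://biohackathon.org/bh20-seq-schema#MainSchema/"


def sparql_string(text):
    """Quote text as a SPARQL string literal (basic escaping only)."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def submitter_query(name_fragment):
    """Build the submitter query above for any name fragment."""
    return (
        f"PREFIX pubseq: <{PUBSEQ}>\n"
        "select distinct ?dataset ?submitter\n"
        "{\n"
        "    ?dataset pubseq:submitter ?id .\n"
        "    ?id ?p ?submitter .\n"
        f"    FILTER(CONTAINS(?submitter, {sparql_string(name_fragment)})) .\n"
        "}"
    )


print(submitter_query("Roychoudhury"))
```

The printed text can be pasted into the endpoint form as before.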
That is a lot of samples! We just want to pick one, so let's
see if we can get a sample ID by listing sample predicates

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select distinct ?p
    {
        ?dataset ?p ?o .
        ?dataset pubseq:submitter ?id .
    }

which lists a predicate named
http://biohackathon.org/bh20-seq-schema#MainSchema/sample.
Let's zoom in on those of Roychoudhury with

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    select distinct ?sid ?sample ?p1 ?dataset ?submitter
    {
        ?dataset pubseq:submitter ?id .
        ?id ?p ?submitter .
        FILTER(CONTAINS(?submitter,"Roychoudhury")) .
        ?dataset pubseq:sample ?sid .
        ?sid ?p1 ?sample
    }

which shows pretty much everything known about their submissions in
this database. Let's focus on one sample "MT326090.1" with predicate
http://semanticscience.org/resource/SIO_000115.

    PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
    PREFIX sio: <http://semanticscience.org/resource/>
    select distinct ?sample ?p ?o
    {
        ?sample sio:SIO_000115 "MT326090.1" .
        ?sample ?p ?o .
    }

This query tells us the sample was submitted "2020-03-21", originates
from http://www.wikidata.org/entity/Q30 (i.e., the USA), and is a
biospecimen collected from the back of the throat by swabbing.
We can track it back to the original GenBank submission.
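
Because the query returns one row per predicate and object, it can be handy to fold the rows into a single record per sample. A minimal Python sketch follows; the rows are hand-written stand-ins for real results, and the `http://example.org/collection_location` predicate is a placeholder, since the actual predicate is not named above.

```python
from collections import defaultdict


def group_by_predicate(rows):
    """Fold ?p/?o result rows into {predicate: [objects]}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["p"]].append(row["o"])
    return dict(grouped)


# Hand-written rows in the shape produced by "select ?sample ?p ?o".
rows = [
    {"sample": "s1",
     "p": "http://semanticscience.org/resource/SIO_000115",
     "o": "MT326090.1"},
    {"sample": "s1",
     "p": "http://example.org/collection_location",  # placeholder predicate
     "o": "http://www.wikidata.org/entity/Q30"},
]

record = group_by_predicate(rows)
```

A predicate that occurs more than once for the same sample simply ends up with several entries in its list.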

We have also added country and label data to make it a bit easier
to view/query the database.

5 Fetch all sequences from Washington state

Now that we know how to get at the origin, we can do it the other way
round and fetch all sequences referring to Washington state

    select ?seq ?sample
    {
        ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
        ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q1223>
    }

which lists 300 sequences originating from Washington state! That is
almost half of the set coming out of GenBank.

6 Discussion

The public sequence uploader collects sequences, raw data and
(machine) queryable metadata. Not only that: data gets analyzed in the
pangenome and results are presented immediately. The data can be
referenced in publications and origins are citable.

7 Acknowledgements

The overall effort was due to magnificent freely donated input by a
great number of people. I particularly want to thank Thomas Liener for
the great effort he made with the ontology group in getting ontologies
and schema sorted! Peter Amstutz and Arvados/Curii helped build the
on-demand compute and back-ends. Thanks also to Michael Crusoe for
supporting the Common Workflow Language initiative. And without Erik
Garrison this initiative would not have existed!