#+TITLE: COVID-19 PubSeq - query metadata (part 1) #+AUTHOR: Pjotr Prins # C-c C-e h h publish # C-c ! insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time) # C-c C-t task rotate # RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png #+HTML_HEAD: * Table of Contents :TOC:noexport: - [[#what-does-this-mean][What does this mean?]] - [[#fetch-sequence-data][Fetch sequence data]] - [[#predicates][Predicates]] - [[#fetch-submitter-info-and-other-metadata][Fetch submitter info and other metadata]] - [[#fetch-all-sequences-from-washington-state][Fetch all sequences from Washington state]] - [[#discussion][Discussion]] - [[#acknowledgements][Acknowledgements]] * What does this mean? This means that when someone uploads a SARS-CoV-2 sequence using one of our tools (CLI or web-based) they add some metadata which is expressed in a [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]] that looks like #+begin_src json - name: hostSchema type: record fields: host_species: doc: Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens type: string jsonldPredicate: _id: http://www.ebi.ac.uk/efo/EFO_0000532 _type: "@id" noLinkCheck: true host_sex: doc: Sex of the host as defined in PATO, expect male () or female () type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/PATO_0000047 _type: "@id" noLinkCheck: true host_age: doc: Age of the host as number (e.g. 50) type: int? jsonldPredicate: _id: http://purl.obolibrary.org/obo/PATO_0000011 #+end_src this metadata gets transformed into an RDF database which means information can easily be fetched related to uploaded sequences. We'll show an example below where we query a live database. There is more: when a new sequence gets uploaded COVID-19 PubSeq kicks in with a number of workflows running in the cloud. These workflows generate a fresh variation graph (GFA) containing all sequences, an RDF file containing metadata, and an RDF file containing the variation graph in triples. Soon we will at multi sequence alignments (MSA) and more. Anyone can contribute data, tools and workflows to this initiative! * Fetch sequence data The latest run of the pipeline can be viewed [[http://covid19.genenetwork.org/status][here]]. Each of these generated files can just be downloaded for your own use and sharing! Data is published under a [[https://creativecommons.org/licenses/by/4.0/][Creative Commons 4.0 attribution license]] (CC-BY-4.0). This means that, unlike some other 'public' resources, you can use this data in any way you want, provided the submitter gets attributed. If you download the GFA or FASTA sequences you'll find sequences are named something like *keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta* which refers to an internal Arvados Keep representation of the FASTA sequence. Keep is content-addressable which means that the value e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its contents. If the contents change, the identifier changes! We use these identifiers throughout. * Predicates To explore an RDF dataset, the first query we can do is open and gets us a list. Lets look at all the predicates in the dataset by pasting the following in a SPARQL end point http://sparql.genenetwork.org/sparql/ #+begin_src sql select distinct ?p { ?o ?p ?s } #+end_src you can ignore the openlink and w3 ones. To reduce results to a named graph set the default graph. To get a [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=select+distinct+%3Fg%0D%0A%7B%0D%0A++++GRAPH+%3Fg+%7B%3Fs+%3Fp+%3Fo%7D%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][list of graphs]] in the dataset, first do #+begin_src sql select distinct ?g { GRAPH ?g {?s ?p ?o} } #+end_src Limiting search to metadata add http://covid-19.genenetwork.org/graph/metadata.ttl in the top input box. Now you can find a [[http://sparql.genenetwork.org/sparql/?default-graph-uri=http%3A%2F%2Fcovid-19.genenetwork.org%2Fgraph%2Fmetadata.ttl&query=select+distinct+%3Fp%0D%0A%7B%0D%0A+++%3Fo+%3Fp+%3Fs%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][predicate]] for submitter that looks like http://biohackathon.org/bh20-seq-schema#MainSchema/submitter. To list all submitters, try #+begin_src sql select distinct ?s { ?o ?s } #+end_src Oh wait, it returns things like nodeID://b76150! That is not helpful, these are anonymous nodes in the graph. These point to another triple and by #+begin_src sql select distinct ?s { ?o ?id . ?id ?p ?s } #+end_src you get a list of all submitters including "University of Washington, Seattle, WA 98109, USA". To lift the full URL out of the query you can use a header like #+begin_src sql PREFIX pubseq: select distinct ?dataset ?submitter { ?dataset pubseq:submitter ?id . ?id ?p ?submitter } #+end_src which reads a bit better. We can also see the [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][submitted sequences]]. One of them submitted by University of Washington is http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/sequence.fasta (note the ID may have changed so pick one with above query). To see the submitted metadata replace sequence.fasta with metadata.yaml http://collections.lugli.arvadosapi.com/c=030bcb8fda7f19743157359f5855f7a6+126/metadata.yaml Now we got this far, lets [[http://sparql.genenetwork.org/sparql/?default-graph-uri=http%3A%2F%2Fcovid-19.genenetwork.org%2Fgraph%2Fmetadata.ttl&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+%28COUNT%28distinct+%3Fdataset%29+as+%3Fnum%29%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D+&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][count the datasets]] submitted with #+begin_src sql PREFIX pubseq: select (COUNT(distinct ?dataset) as ?num) { ?dataset pubseq:submitter ?id . ?id ?p ?submitter } #+end_src Run this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+%28COUNT%28distinct+%3Fdataset%29+as+%3Fnum%29%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]]. * Fetch submitter info and other metadata To get datasets with submitters we can do the above #+begin_src sql PREFIX pubseq: select distinct ?dataset ?p ?submitter { ?dataset pubseq:submitter ?id . ?id ?p ?submitter } #+end_src Run this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fdataset+%3Fp+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]]. Tells you one submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K." with a URL [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] (http://purl.obolibrary.org/obo/NCIT_C42781) explaining "The individual who is responsible for the content of a document." Well formed URIs point to real information about the URI itself. Welcome to the power of the semantic web. Let's focus on one sample with #+begin_src sql PREFIX pubseq: select distinct ?dataset ?submitter { ?dataset pubseq:submitter ?id . ?id ?p ?submitter . FILTER(CONTAINS(?submitter,"Roychoudhury")) . } #+end_src That is a lot of samples! We just want to pick one, so let's see if we can get a sample ID by listing sample predicates #+begin_src sql PREFIX pubseq: select distinct ?p { ?dataset ?p ?o . ?dataset pubseq:submitter ?id . } #+end_src which lists a predicate named http://biohackathon.org/bh20-seq-schema#MainSchema/sample. Let's zoom in on those of Roychoudhury with #+begin_src sql PREFIX pubseq: select distinct ?sid ?sample ?p1 ?dataset ?submitter { ?dataset pubseq:submitter ?id . ?id ?p ?submitter . FILTER(CONTAINS(?submitter,"Roychoudhury")) . ?dataset pubseq:sample ?sid . ?sid ?p1 ?sample } #+end_src Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0APREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]]. which shows pretty much [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][everything known]] about their submissions in this database. Let's focus on one sample "MT326090.1" with predicate http://semanticscience.org/resource/SIO_000115. #+begin_src sql PREFIX pubseq: PREFIX sio: select distinct ?sample ?p ?o { ?sample sio:SIO_000115 "MT326090.1" . ?sample ?p ?o . } #+end_src Run this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0APREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT326090.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]]. This query tells us the sample was submitted "2020-03-21" and originates from http://www.wikidata.org/entity/Q30, i.e., the USA and is a biospecimen collected from the back of the throat by swabbing. We have also added country and label data to make it a bit easier to view/query the database and place the sequence on the [[http://covid19.genenetwork.org/][map]]. We use wikidata entities for disambiguation. By using 'Q30' for the USA we don't have to figure out the different ways people spell the name. To get from the wikidata entity to a human readable form we provide a country name [[https://github.com/arvados/bh20-seq-resource/blob/72369b2e2e3cd881be2bd648a61e1449ffe34875/semantic_enrichment/countries.ttl#L306][translation]] for convenience. For example when the predicate is http://purl.obolibrary.org/obo/GAZ_00000448 we can do #+begin_src sql PREFIX pubseq: PREFIX sio: select distinct ?sample ?geoname { ?sample sio:SIO_000115 "MT326090.1" . ?sample ?geo . ?geo rdfs:label ?geoname . } #+end_src Which will show the geoname spelled out as 'United States'. For this sample we can also track it back to the original GenBank [[http://identifiers.org/insdc/MT326090.1#sequence][submission]] using the listed http://identifiers.org/insdc/MT326090.1 link. * Fetch all sequences from Washington state Now we know how to get at the origin we can do it the other way round and fetch all sequences referring to Washington state #+begin_src sql select ?date ?name ?identifier ?seq { ?seq ?sample . ?sample . ?sample ?date . ?sample ?name . ?sample ?identifier . } order by ?date #+end_src Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=select+%3Fdate+%3Fname+%3Fidentifier+%3Fseq%0D%0A%7B%0D%0A++++%3Fseq+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2Fsample%3E+%3Fsample+.%0D%0A%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGAZ_00000448%3E+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ1223%3E+.%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fncicb.nci.nih.gov%2Fxml%2Fowl%2FEVS%2FThesaurus.owl%23C25164%3E+%3Fdate+.%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2FSIO_000115%3E+%3Fname+.%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fedamontology.org%2Fdata_2091%3E+%3Fidentifier+.%0D%0A%7D+order+by+%3Fdate&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]] Which shows the date and links to NCBI and raw sequence data in FASTA format, e.g. #+begin_example "date" "name" "identifier" "seq" "2020-01-15" "MT252760.1" "http://identifiers.org/insdc/MT252760.1#sequence" "http://collections.lugli.arvadosapi.com/c=0164784cba5e3e39b7ba8d83fdc92649+126/sequence.fasta" "2020-01-15" "MT252720.1" "http://identifiers.org/insdc/MT252720.1#sequence" "http://collections.lugli.arvadosapi.com/c=0387a3e47dd8a0c9ea0a4a21931f6308+126/sequence.fasta" (...) #+end_example The query lists 300 sequences originating from Washington state! Which in April was almost half of the set coming out of GenBank. Likewise to list all sequences from Turkey we can find the wikidata entity is [[https://www.wikidata.org/wiki/Q43][Q43]]: #+begin_src sql select ?seq ?sample { ?seq ?sample . ?sample } #+end_src Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0Aselect+%3Fseq+%3Fsample%0D%0A%7B%0D%0A++++%3Fseq+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2Fsample%3E+%3Fsample+.%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGAZ_00000448%3E+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ43%3E%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]]. * Discussion The public sequence uploader collects sequences, raw data and (machine) queriable metadata. Not only that: data gets analyzed in the pangenome and results are presented immediately. The data can be referenced in publications and origins are citeable. * Acknowledgements The overall effort was due to magnificent freely donated input by a great number of people. I particularly want to thank Thomas Liener for the great effort he made with the ontology group in getting ontology's and schema sorted! Peter Amstutz and [[https://arvados.org/][Arvados/Curii]] helped build the on-demand compute and back-ends. Thanks also to Michael Crusoe for supporting the [[https://www.commonwl.org/][Common Workflow Language]] initiative. And without Erik Garrison this initiative would not have existed!