* COVID-19 PubSeq (part 1)

/by Pjotr Prins/

As part of the COVID-19 Biohackathon 2020 we formed a working group to create
a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for coronavirus
sequences. The general idea is to create a repository with a low barrier to
entry for uploading sequence data using best practices. That is, data is
published under a Creative Commons 4.0 (CC-BY-4.0) license, metadata follows
state-of-the-art standards and, perhaps most importantly, standardised
workflows get triggered on upload, so that results are immediately available
in standardised data formats.

** What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one of our
tools (CLI or web-based) they add some metadata which is expressed in a
[[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]
that looks like

#+begin_src yaml
- name: hostSchema
  type: record
  fields:
    host_species:
      doc: Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens
      type: string
      jsonldPredicate:
        _id: http://www.ebi.ac.uk/efo/EFO_0000532
        _type: "@id"
        noLinkCheck: true
    host_sex:
      doc: Sex of the host as defined in PATO, expect male () or female ()
      type: string?
      jsonldPredicate:
        _id: http://purl.obolibrary.org/obo/PATO_0000047
        _type: "@id"
        noLinkCheck: true
    host_age:
      doc: Age of the host as number (e.g. 50)
      type: int?
      jsonldPredicate:
        _id: http://purl.obolibrary.org/obo/PATO_0000011
#+end_src

This metadata gets transformed into an RDF database, which means information
related to uploaded sequences can easily be fetched. We'll show an example
below where we query a live database.

There is more: when a new sequence gets uploaded, COVID-19 PubSeq kicks in
with a number of workflows running in the cloud. These workflows generate a
fresh variation graph (GFA) containing all sequences, an RDF file containing
metadata, and an RDF file containing the variation graph in triples. Soon we
will add multiple sequence alignments (MSA) and more. Anyone can contribute
data, tools and workflows to this initiative!

* Fetch sequence data

The latest run of the pipeline can be viewed
[[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][here]].
Each of these generated files can simply be downloaded for your own use and
sharing! Data is published under a
[[https://creativecommons.org/licenses/by/4.0/][Creative Commons 4.0 attribution license]]
(CC-BY-4.0). This means that, unlike some other 'public' resources, you can
use this data in any way you want, provided the submitter gets attributed.

If you download the GFA or FASTA sequences you'll find sequences are named
something like *keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta*,
which refers to an internal Arvados Keep representation of the FASTA
sequence. Keep is content-addressable, which means that the value
e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its
contents. If the contents change, the identifier changes! We use these
identifiers throughout.

* Predicates

Let's look at all the predicates in the dataset by pasting the following into
the SPARQL endpoint at http://sparql.genenetwork.org/sparql/

#+begin_src sql
select distinct ?p { ?o ?p ?s }
#+end_src

You can ignore the openlink and w3 ones. To reduce results to a named graph,
set the default graph to http://covid-19.genenetwork.org/graph/metadata.ttl
in the top input box.
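Alternatively, you can name the graph inside the query itself with a standard
SPARQL FROM clause; this minimal variant should return the same list of
predicates without touching the default graph box:

#+begin_src sql
select distinct ?p
from <http://covid-19.genenetwork.org/graph/metadata.ttl>
{ ?o ?p ?s }
#+end_src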
Among the results you can find a predicate for submitter that looks like
http://biohackathon.org/bh20-seq-schema#MainSchema/submitter. To list all
submitters, try

#+begin_src sql
select distinct ?s { ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?s }
#+end_src

Oh wait, it returns things like nodeID://b76150! That is not very helpful:
these are anonymous (blank) nodes in the graph which point to further
triples. With

#+begin_src sql
select distinct ?s { ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id . ?id ?p ?s }
#+end_src

you get a list of all submitters, including "University of Washington,
Seattle, WA 98109, USA". To lift the full URL out of the query you can use a
header like

#+begin_src sql
PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?dataset ?submitter { ?dataset pubseq:submitter ?id . ?id ?p ?submitter }
#+end_src

which reads a bit better. We can also see the datasets. One of them,
submitted by the University of Washington, is
http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta
(note the ID may have changed, so pick one using the query above).

* Fetch submitter info and other metadata

Using that dataset URL as the subject, run

#+begin_src sql
select ?p ?s { <http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta> ?p ?s }
#+end_src

which will tell you that the original FASTA ID is "MT293175.1". It also says
the submitter is nodeID://b31228.

#+begin_src sql
select distinct ?id ?p ?s { <http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id . ?id ?p ?s }
#+end_src

tells you the submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K." with a
[[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] explaining "The
individual who is responsible for the content of a document." Welcome to the
power of the semantic web. To get more information about the relevant sample

#+begin_src sql
select ?sample ?p ?o { <http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample . ?sample ?p ?o }
#+end_src

we find it originates from Washington state (object
https://www.wikidata.org/wiki/Q1223), dated "30-Mar-2020". The sequencing was
executed with Illumina and the pipeline "custom pipeline v. 2020-03", which
is arguably not that descriptive.

* Fetch all sequences from Washington state

Now that we know how to get at the origin, we can go the other way round and
fetch all sequences referring to Washington state

#+begin_src sql
select ?seq ?sample { ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample . ?sample ?p <https://www.wikidata.org/wiki/Q1223> }
#+end_src

which lists 300 sequences originating from Washington state! That is almost
half of the set coming out of GenBank. A sketch that combines this with the
submitter query appears at the end of this post.

* Acknowledgements

The overall effort was due to magnificent, freely donated input by a great
number of people. I particularly want to thank Thomas Liener for the great
effort he made with the ontology group in getting the ontologies and schema
sorted! Peter Amstutz and [[https://arvados.org/][Arvados/Curii]] helped
build the on-demand compute and back-ends. Thanks also to Michael Crusoe for
supporting the [[https://www.commonwl.org/][Common Workflow Language]]
initiative. And without Erik Garrison this initiative would not have existed!
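* Bonus: Washington state sequences with their submitters

As promised above, here is a sketch that combines the earlier steps into a
single query. It reuses the pubseq:submitter predicate shown above and
assumes datasets link to their samples via pubseq:sample, as in the
Washington state query; the location predicate is left as a variable because
only the object (https://www.wikidata.org/wiki/Q1223) is known for certain.

#+begin_src sql
PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?seq ?submitter {
  # datasets point to their sample metadata (assumed pubseq:sample, see above)
  ?seq pubseq:sample ?sample .
  # keep only samples collected in Washington state (wikidata Q1223)
  ?sample ?location <https://www.wikidata.org/wiki/Q1223> .
  # resolve the submitter blank node to its literal values
  ?seq pubseq:submitter ?id .
  ?id ?p ?submitter
}
#+end_src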