-rw-r--r--  README.md                                   4
-rw-r--r--  doc/blog/using-covid-19-pubseq-part1.org  144
-rw-r--r--  paper/paper.md                             26
3 files changed, 159 insertions, 15 deletions
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
-# Sequence uploader
+# COVID-19 PubSeq: Public Sequence uploader
 
 This repository provides a sequence uploader for the COVID-19 Virtual
 Biohackathon's Public Sequence Resource project. There are two versions,
 one that runs on the command line and another that acts as web interface.
 
 You can use it to upload the genomes of SARS-CoV-2 samples to make them
 publicly and freely available to other
-researchers.
+researchers. For more information see the [paper](./paper/paper.md).
 
 ![alt text](./image/website.png "Website")

diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org
new file mode 100644
index 0000000..647165d
--- /dev/null
+++ b/doc/blog/using-covid-19-pubseq-part1.org
@@ -0,0 +1,144 @@
* COVID-19 PubSeq (part 1)

/by Pjotr Prins/

As part of the COVID-19 Biohackathon 2020 we formed a working group
to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
Corona virus sequences. The general idea is to create a repository
with a low barrier to entry for uploading sequence data using best
practices: data is published under a Creative Commons 4.0 (CC-4.0)
license, metadata uses state-of-the-art standards and, perhaps most
importantly, standardised workflows are triggered on upload, so that
results are immediately available in standardised data formats.

** What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one
of our tools (CLI or web-based), they add some metadata which is
expressed in a [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]] that looks like

#+begin_src yaml
- name: hostSchema
  type: record
  fields:
    host_species:
      doc: Host species as defined in NCBITaxon, e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens
      type: string
      jsonldPredicate:
        _id: http://www.ebi.ac.uk/efo/EFO_0000532
        _type: "@id"
        noLinkCheck: true
    host_sex:
      doc: Sex of the host as defined in PATO, expect male () or female ()
      type: string?
      jsonldPredicate:
        _id: http://purl.obolibrary.org/obo/PATO_0000047
        _type: "@id"
        noLinkCheck: true
    host_age:
      doc: Age of the host as number (e.g. 50)
      type: int?
      jsonldPredicate:
        _id: http://purl.obolibrary.org/obo/PATO_0000011
#+end_src

This metadata gets transformed into an RDF database, which means
information related to uploaded sequences can easily be fetched.
We'll show an example below where we query a live database.

There is more: when a new sequence gets uploaded, COVID-19 PubSeq kicks
in with a number of workflows running in the cloud. These workflows
generate a fresh variation graph (GFA) containing all sequences, an
RDF file containing metadata, and an RDF file containing the variation
graph in triples. Soon we will add multiple sequence alignments (MSA)
and more. Anyone can contribute data, tools and workflows to this
initiative!

* Fetch sequence data

The latest run of the pipeline can be viewed [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][here]]. Each of these
generated files can simply be downloaded for your own use and sharing!
Data is published under a [[https://creativecommons.org/licenses/by/4.0/][Creative Commons 4.0 attribution license]]
(CC-BY-4.0). This means that, unlike some other 'public' resources,
you can use this data in any way you want, provided the submitter gets
attributed.

If you download the GFA or FASTA sequences you'll find the sequences
are named something like
*keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta*, which
refers to an internal Arvados Keep representation of the FASTA
sequence. Keep is content-addressable, which means that
e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its
contents; if the contents change, the identifier changes. We use
these identifiers throughout.

* Fetch submitter info and other metadata

We are interested in e17abc8a0269875ed4cfbff5d9897c6c and now we
want to get some metadata. We can use the SPARQL endpoint hosted at
http://sparql.genenetwork.org/sparql/. Paste in a query like

#+begin_src sql
select ?p ?s
{
  <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> ?p ?s
}
#+end_src

which will tell you that the original FASTA ID is "MT293175.1". It
also says the submitter is nodeID://b31228.
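You don't have to paste queries into the web form; they can also be run
programmatically. The following is a minimal sketch in Python, assuming
the endpoint speaks the standard SPARQL 1.1 HTTP protocol, can return
JSON result sets, and that the =requests= library is installed; it is
illustrative rather than part of the PubSeq tooling.

#+begin_src python
# Illustrative sketch: run the query above against the public SPARQL
# endpoint and print the predicate/object pairs from the JSON results.
import requests

ENDPOINT = "http://sparql.genenetwork.org/sparql/"  # endpoint from the text above

QUERY = """
select ?p ?s
{
  <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> ?p ?s
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# The standard SPARQL JSON results format puts one entry per solution
# under results/bindings.
for row in response.json()["results"]["bindings"]:
    print(row["p"]["value"], "->", row["s"]["value"])
#+end_src

If the endpoint behaves as assumed, this prints the same
predicate/object pairs the web form returns, including the FASTA ID
and the submitter node.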
To expand that submitter node we can follow the submitter predicate:

#+begin_src sql
select distinct ?id ?p ?s
{
  <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id .
  ?id ?p ?s
}
#+end_src

This tells you the submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K.",
with a [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] explaining "The individual who is responsible for the
content of a document." Welcome to the power of the semantic web.

To get more information about the relevant sample we can run

#+begin_src sql
select ?sample ?p ?o
{
  <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
  ?sample ?p ?o
}
#+end_src

and we find it originates from Washington state (object
https://www.wikidata.org/wiki/Q1223), dated "30-Mar-2020". The
sequencing was executed with Illumina and the pipeline "custom pipeline
v. 2020-03", which is arguably not that descriptive.

* Fetch all sequences from Washington state

Now that we know how to get at the origin, we can go the other way
round and fetch all sequences referring to Washington state:

#+begin_src sql
select ?seq ?sample
{
  ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
  ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q1223>
}
#+end_src

This lists 300 sequences originating from Washington state, almost
half of the set coming out of GenBank!
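The same two predicates can also be used to let the SPARQL engine do
the counting. As an illustrative variation on the query above (an
editor's sketch, not a query from the original post), we can group
sequences by originating location instead of filtering on Washington
state:

#+begin_src sql
select ?location (count(?seq) as ?num)
{
  ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .
  ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> ?location
}
group by ?location
order by desc(?num)
#+end_src

which should give a quick overview of where the uploaded sequences
come from.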
* Acknowledgements

The overall effort was made possible by the magnificent, freely donated
input of a great number of people. I particularly want to thank Thomas
Liener for the great effort he made with the ontology group in getting
the ontologies and schema sorted! Peter Amstutz and Curii helped build
the on-demand compute and back-ends. Thanks also to Michael Crusoe for
supporting the CWL initiative. And without Erik Garrison this
initiative would not have existed!

diff --git a/paper/paper.md b/paper/paper.md
index 05eb581..6a5d624 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -1,6 +1,6 @@
 ---
-title: 'CPSR: COVID-19 Public Sequence Resource'
-title_short: 'CPSR: COVID-19 Public Sequence Resource'
+title: 'COVID-19 PubSeq: COVID-19 Public Sequence Resource'
+title_short: 'COVID-19 PubSeq'
 tags:
   - Sequencing
   - COVID-19
@@ -84,9 +84,9 @@ Note that author order will change!
 
 # Introduction
 
-As part of the COVID-19 Biohackathion 2020 we formed a working
-group to create a COVID-19 Public Sequence Resource (CPSR) for
-Corona virus sequences. The general idea was to create a
+As part of the COVID-19 Biohackathon 2020 we formed a working
+group to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
+Corona virus sequences. The general idea is to create a
 repository that has a low barrier to entry for uploading
 sequence data using best practices. I.e., data published with a
 creative commons 4.0 (CC-4.0) license with metadata using state-of-the art
@@ -137,7 +137,7 @@ on CWL hub (FIXME).
 
 ## Cloud computing backend
 
-The development of CPSR was accelerated by using the Arvados
+The development of COVID-19 PubSeq was accelerated by using the Arvados
 Cloud platform. Arvados is an open source platform for managing,
 processing, and sharing genomic and other large scientific and
 biomedical data. The Arvados instance was deployed on Amazon AWS
@@ -186,24 +186,24 @@ WIP
 
 # Discussion
 
-CPSR is a data repository with computational pipelines that will
+COVID-19 PubSeq is a data repository with computational pipelines that will
 persist during pandemics. Unlike other data repositories for
 Sars-COV-2 we created a repository that immediately computes the
 pangenome of all available data and presents that in useful
 formats for futher analysis, including visualisations, GFA and
 RDF. Code and data are available and written using best practises
-and state-of-the-art standards. CPSR can be deployed by anyone,
+and state-of-the-art standards. COVID-19 PubSeq can be deployed by anyone,
 anywhere.
 
-CPSR is designed to abide by FAIR data principles (expand...)
+COVID-19 PubSeq is designed to abide by FAIR data principles (expand...)
 
-CPSR is primed with viral data coming from repositories that have
+COVID-19 PubSeq is primed with viral data coming from repositories that have
 no sharing restrictions. The metadata includes relevant
 attribution to uploaders. Some institutes have already committed
-to uploading their data to CPSR first so as to warrant sharing
+to uploading their data to COVID-19 PubSeq first so as to warrant sharing
 for computation.
 
-CPSR is currently running on an Arvados cluster in the cloud. To
+COVID-19 PubSeq is currently running on an Arvados cluster in the cloud. To
 ascertain the service remains running we will source money from
 project during pandemics. The workflows are written in CWL which
 means they can be deployed on any infrastructure that runs
@@ -214,7 +214,7 @@ party. This guarantees the data will live on.
 
 <!-- Future work... -->
 
-We aim to add more workflows to CPSR, for example to prepare
+We aim to add more workflows to COVID-19 PubSeq, for example to prepare
 sequence data for submitting in other public repositories, such
 as EBI ENA and GISAID. This will allow researchers to share data
 in multiple systems without pain, circumventing current sharing