-rw-r--r-- bh20sequploader/bh20seq-schema.yml | 45
-rw-r--r-- bh20sequploader/rdf-mappings.ttl | 0
-rw-r--r-- example/metadata.yaml | 5
-rw-r--r-- example/minimal_example.yaml | 6
-rw-r--r-- paper/paper.bib | 16
-rw-r--r-- paper/paper.md | 174
6 files changed, 206 insertions, 40 deletions
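
Review note: the schema hunks in this commit make the top-level `id` mandatory (`string?` becomes `string`), drop `submitter_date`, and rename `submitter_id` to `submitter_orchid`; the example files are updated to match. A minimal sketch of a pre-upload check for an already-parsed metadata document follows. It is not the uploader's own validation. The `REQUIRED_FIELDS` table is an assumption assembled from the non-optional fields visible in this diff and the section names used in the example files, so it may not be exhaustive, and `check_metadata` is a hypothetical helper name.

```python
# Required fields as they appear in the schema hunks of this diff
# (typed `string` rather than `string?`). Illustrative assumption:
# fields outside the diff context are not listed here.
REQUIRED_FIELDS = {
    "host": ["host_id", "host_species", "host_sex"],
    "sample": ["sample_id", "collector_name", "collecting_institution"],
    "technology": ["sample_sequencing_technology"],
    "submitter": ["submitter_name", "originating_lab"],
}

# Fields that this commit removes or renames.
OBSOLETE_FIELDS = {
    "submitter": ["submitter_id", "submitter_date"],
}


def check_metadata(doc):
    """Return a list of human-readable problems found in a metadata mapping."""
    problems = []
    # `id` is now a mandatory string at the top level.
    if not isinstance(doc.get("id"), str):
        problems.append("missing top-level field: id")
    for section, fields in REQUIRED_FIELDS.items():
        record = doc.get(section)
        if not isinstance(record, dict):
            problems.append(f"missing section: {section}")
            continue
        for field in fields:
            if field not in record:
                problems.append(f"missing field: {section}.{field}")
    for section, fields in OBSOLETE_FIELDS.items():
        record = doc.get(section)
        if isinstance(record, dict):
            for field in fields:
                if field in record:
                    problems.append(f"obsolete field: {section}.{field}")
    return problems
```

A document written against the previous schema (no `id`, still carrying `submitter_date`) would be reported with one entry per problem, which makes it easy to see what has to change when migrating older metadata files.
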
diff --git a/bh20sequploader/bh20seq-schema.yml b/bh20sequploader/bh20seq-schema.yml index 8a22db1..81a7f22 100644 --- a/bh20sequploader/bh20seq-schema.yml +++ b/bh20sequploader/bh20seq-schema.yml @@ -13,41 +13,52 @@ $graph: type: record fields: host_species: + ## autocomplete # NCBITAXON + doc: Host species as defined in NCBITaxon (e.g. http://purl.obolibrary.org/obo/NCBITaxon_9606 for Homo sapiens) type: string jsonldPredicate: _id: http://www.ebi.ac.uk/efo/EFO_0000532 host_id: + doc: Identifer for the host. If you submit multiple samples from the same host, use the same host_id for those samples type: string jsonldPredicate: _id: http://semanticscience.org/resource/SIO_000115 host_common_name: + doc: Text label for the host species (e.g. homo sapiens) type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/NOMEN_0000037 host_sex: + doc: Sex of the host as define in NCIT, IRI expected (http://purl.obolibrary.org/obo/C20197 (Male), http://purl.obolibrary.org/obo/NCIT_C27993 (Female) or unkown (http://purl.obolibrary.org/obo/NCIT_C17998)) type: string jsonldPredicate: _id: http://purl.obolibrary.org/obo/PATO_0000047 host_age: + doc: Age of the host as number (e.g. 50) type: int? jsonldPredicate: _id: http://purl.obolibrary.org/obo/PATO_0000011 host_age_unit: + doc: Unit of host age.... this field is unstable as of now (might be removed) type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/UO_0000036 host_health_status: + doc: A condition or state at a particular time type: string? jsonldPredicate: http://purl.obolibrary.org/obo/NCIT_C25688 host_treatment: + doc: Process in which the act is intended to modify or alter type: string? jsonldPredicate: _id: http://www.ebi.ac.uk/efo/EFO_0000727 host_vaccination: + doc: Field is unstable type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/VO_0000001 additional_host_information: + doc: Field for additional host information type: string? 
jsonldPredicate: _id: http://semanticscience.org/resource/SIO_001167 @@ -56,38 +67,47 @@ $graph: type: record fields: collector_name: + doc: Name of the person that took the sample type: string jsonldPredicate: _id: http://purl.obolibrary.org/obo/OBI_0001895 collecting_institution: + doc: Institute that was responsible of sampeling type: string jsonldPredicate: _id: http://semanticscience.org/resource/SIO_001167 specimen_source: + doc: A specimen that derives from an anatomical part or substance arising from an organism, e.g. tissue, organ type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/OBI_0001479 collection_date: + doc: Date when the sample was taken type: string? jsonldPredicate: _id: http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C25164 collection_location: + doc: Geographical location where the sample was collected as Gazetteer (https://www.ebi.ac.uk/ols/ontologies/gaz) reference, e.g. http://purl.obolibrary.org/obo/GAZ_00002845 (China) type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/GAZ_00000448 sample_storage_conditions: + doc: Information aboout storage of a specified type, e.g. frozen specimen, paraffin, fresh .... type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/OBI_0001472 additional_collection_information: + doc: Add additional comment about the circumstances that a sample was taken type: string? jsonldPredicate: _id: http://semanticscience.org/resource/SIO_001167 sample_id: + doc: Id of the sample as defined by the submitter type: string jsonldPredicate: _id: http://semanticscience.org/resource/SIO_000115 source_database_accession: + doc: If data is deposit at a public resource (e.g. Genbank, ENA) enter the Accession Id here type: string? jsonldPredicate: _id: http://edamontology.org/data_2091 @@ -96,10 +116,12 @@ $graph: type: record fields: virus_species: + doc: The name of a taxon from the NCBI taxonomy database type: string? 
jsonldPredicate: _id: http://edamontology.org/data_1875 virus_strain: + doc: Name of the virus strain type: string? jsonldPredicate: _id: http://semanticscience.org/resource/SIO_010055 @@ -108,14 +130,17 @@ $graph: type: record fields: sample_sequencing_technology: + doc: Technology that was used to sequence this sample (e.g Sanger, Nanopor MiniION) type: string jsonldPredicate: _id: http://purl.obolibrary.org/obo/OBI_0600047 sequence_assembly_method: + doc: Protocol which provides instructions on the alignment of sequencing reads to reference genome type: string? jsonldPredicate: _id: http://www.ebi.ac.uk/efo/EFO_0002699 sequencing_coverage: + doc: Sequence coverage defined as the average number of reads representing a given nucleotide (e.g. 100x) type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/FLU_0000848 @@ -124,22 +149,22 @@ $graph: type: record fields: submitter_name: + doc: Name of the submitter type: string jsonldPredicate: _id: http://semanticscience.org/resource/SIO_000116 - submitter_date: - type: string - jsonldPredicate: - _id: http://purl.obolibrary.org/obo/NCIT_C94162 submitter_address: + doc: Address of the submitter type: string? jsonldPredicate: _id: http://semanticscience.org/resource/SIO_000172 originating_lab: + doc: Name of the laboratory that took the sample type: string jsonldPredicate: _id: http://purl.obolibrary.org/obo/NCIT_C37984 lab_address: + doc: Address of the laboratory where the sample was taken type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/OBI_0600047 @@ -152,10 +177,17 @@ $graph: jsonldPredicate: _id: http://www.ebi.ac.uk/efo/EFO_0001741 authors: + doc: Name of the author(s) type: string? jsonldPredicate: _id: http://purl.obolibrary.org/obo/NCIT_C42781 - submitter_id: + publication: + doc: Reference to publication of this sample (e.g. DOI, pubmed ID, ...) + type: string? 
+ jsonldPredicate: + _id: http://purl.obolibrary.org/obo/NCIT_C19026 + submitter_orchid: + doc: ORCHID of the submitter type: string? jsonldPredicate: _id: http://semanticscience.org/resource/SIO_000115 @@ -171,7 +203,8 @@ $graph: submitter: submitterSchema id: doc: The subject (eg the fasta/fastq file) that the metadata describes - type: string? + type: string jsonldPredicate: _id: "@id" _type: "@id" + noLinkCheck: true diff --git a/bh20sequploader/rdf-mappings.ttl b/bh20sequploader/rdf-mappings.ttl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/bh20sequploader/rdf-mappings.ttl diff --git a/example/metadata.yaml b/example/metadata.yaml index c780921..d9e8e92 100644 --- a/example/metadata.yaml +++ b/example/metadata.yaml @@ -1,3 +1,5 @@ +id: placeholder + host: host_id: XX1 host_species: string @@ -36,5 +38,4 @@ submitter: provider_sample_id: string submitter_sample_id: string authors: testAuthor - submitter_id: X12 - submitter_date: Subdate + submitter_orchid: X12 diff --git a/example/minimal_example.yaml b/example/minimal_example.yaml index f312ab7..160d1d4 100644 --- a/example/minimal_example.yaml +++ b/example/minimal_example.yaml @@ -1,8 +1,9 @@ -submission: publicSequenceResource +id: placeholder host: host_id: XX host_species: string + host_sex: string sample: sample_id: XXX @@ -14,5 +15,4 @@ technology: submitter: submitter_name: tester - originating_lab: testLab - submitter_date: Subdate
\ No newline at end of file + originating_lab: testLab
\ No newline at end of file diff --git a/paper/paper.bib b/paper/paper.bib index e69de29..bcb9c0b 100644 --- a/paper/paper.bib +++ b/paper/paper.bib @@ -0,0 +1,16 @@ +@book{CWL, +title = "Common Workflow Language, v1.0", +abstract = "The Common Workflow Language (CWL) is an informal, multi-vendor working group consisting of various organizations and individuals that have an interest in portability of data analysis workflows. Our goal is to create specifications that enable data scientists to describe analysis tools and workflows that are powerful, easy to use, portable, and support reproducibility.CWL builds on technologies such as JSON-LD and Avro for data modeling and Docker for portable runtime environments. CWL is designed to express workflows for data-intensive science, such as Bioinformatics, Medical Imaging, Chemistry, Physics, and Astronomy.This is v1.0 of the CWL tool and workflow specification, released on 2016-07-08", +keywords = "cwl, workflow, specification", +author = "Brad Chapman and John Chilton and Michael Heuer and Andrey Kartashov and Dan Leehr and Herv{\'e} M{\'e}nager and Maya Nedeljkovich and Matt Scales and Stian Soiland-Reyes and Luka Stojanovic", +editor = "Peter Amstutz and Crusoe, {Michael R.} and Nebojša Tijanić", +note = "Specification, product of the Common Workflow Language working group. http://www.commonwl.org/v1.0/", +year = "2016", +month = "7", +day = "8", +doi = "10.6084/m9.figshare.3115156.v2", +language = "English", +publisher = "figshare", +address = "United States", + +}
\ No newline at end of file diff --git a/paper/paper.md b/paper/paper.md index caa9903..7bd18c8 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -1,8 +1,9 @@ --- -title: 'Public Sequence Resource for COVID-19' +title: 'CPSR: COVID-19 Public Sequence Resource' +title_short: 'CPSR: COVID-19 Public Sequence Resource' tags: - Sequencing - - COVID + - COVID-19 authors: - name: Pjotr Prins orcid: 0000-0002-8021-9162 @@ -19,22 +20,42 @@ authors: - name: Erik Garrison orcid: 0000 affiliation: 5 - - name: Michael Crusoe - orcid: 0000 - affiliation: 6 + - name: Michael R. Crusoe + orcid: 0000-0002-2961-9670 + affiliation: 6, 2 - name: Rutger Vos orcid: 0000 affiliation: 7 - - Michael Heuer - orcid: 0000 + - name: Michael Heuer + orcid: 0000-0002-9052-6000 affiliation: 8 - + - name: Adam M Novak + orcid: 0000-0001-5828-047X + affiliation: 5 + - name: Alex Kanitz + orcid: 0000 + affiliation: 10 + - name: Jerven Bolleman + orcid: 0000 + affiliation: 11 + - name: Joep de Ligt + orcid: 0000 + affiliation: 12 affiliations: - name: Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA. index: 1 - name: Curii, Boston, USA index: 2 + - name: UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA. + index: 5 + - name: Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, The Netherlands + index: 6 + - name: RISE Lab, University of California Berkeley, Berkeley, CA, USA. + index: 8 date: 11 April 2020 +event: COVID2020 +group: Public Sequence Uploader +authors_short: Pjotr Prins & Peter Amstutz \emph{et al.} bibliography: paper.bib --- @@ -49,13 +70,48 @@ pasting above link (or yours) with https://github.com/biohackrxiv/bhxiv-gen-pdf +Note that author order will change! + --> # Introduction -As part of the one week COVID-19 Biohackathion 2020, we formed a -working group on creating a public sequence resource for Corona virus. 
- +As part of the COVID-19 Biohackathion 2020 we formed a working +group to create a COVID-19 Public Sequence Resource (CPSR) for +Corona virus sequences. The general idea was to create a +repository that has a low barrier to entry for uploading sequence +data using best practices. I.e., data published with a creative +commons 4.0 (CC-4.0) license with metadata using state-of-the art +standards and, perhaps most importantly, providing standardized +workflows that get triggered on upload, so that results are +immediately available in standardized data formats. + +Existing data repositories for viral data include GISAID, EBI ENA +and NCBI. These repositories allow for free sharing of data, but +do not add value in terms of running immediate +computations. Also, GISAID, at this point, has the most complete +collection of genetic sequence data of influenza viruses and +related clinical and epidemiological data through its +database. But, due to a restricted license, data submitted to +GISAID can not be used for online web services and on-the-fly +computation. In addition GISAID registration which can take weeks +and, painfully, forces users to download sequences one at a time +to do any type of analysis. In our opinion this does not fit a +pandemic scenario where fast turnaround times are key and data +analysis has to be agile. + +We managed to create a useful sequence uploader utility within +one week by leveraging existing technologies, such as the Arvados +Cloud platform [@Arvados], the Common Workflow Langauge (CWL) +[@CWL], Docker images built with Debian packages, and the many +free and open source software packages that are available for +bioinformatics. + +The source code for the CLI uploader and web uploader can be +found [here](https://github.com/arvados/bh20-seq-resource) +(FIXME: we'll have a full page). The CWL workflow definitions can +be found [here](https://github.com/hpobio-lab/viral-analysis) and +on CWL hub (FIXME). 
<!-- @@ -73,38 +129,98 @@ working group on creating a public sequence resource for Corona virus. ## Cloud computing backend -Peter, Pjotr, MichaelC +The development of CPSR was accelerated by using the Arvados +Cloud platform. Arvados is an open source platform for managing, +processing, and sharing genomic and other large scientific and +biomedical data. The Arvados instance was deployed on Amazon AWS +for testing and development and a project was created that +allows for uploading data. -## A command-line sequence uploader +## Sequence uploader -Peter, Pjotr +We wrote a Python-based uploader that authenticates with Arvados +using a token. Data gets validated for being a FASTA sequence, +FASTQ raw data and/or metadata in the form of JSON LD that gets +validated against a schema. The uploader can be used +from a command line or using a simple web interface. -## Metadata uploader +## Creating a Pangenome -With Thomas +### FASTA to GFA workflow -## FASTA to GFA workflow +The first workflow (1) we implemented was a FASTA to Graphical +Fragment Assembly (GFA) Format conversion. When someone uploads a +sequence in FASTA format it gets combined with all known viral +sequences in our storage to generate a pangenome or variation +graph (VG). The full pangenome is made available as a +downloadable GFA file together with a visualisation (Figure 1). -Michael Heuer +### FASTQ to GFA workflow -## BAM to GFA workflow +In the next step we introduced a workflow (2) that takes raw +sequence data in fastq format and converts that into FASTA. +This FASTA file, in turn, gets fed to workflow (1) to generate +the pangenome. -Tazro & Erik +## Creating linked data workflow -## Phylogeny app +We created a workflow (3) that takes GFA and turns that into +RDF. Together with the metadata at upload time a single RDF +resource is compiled that can be linked against external +resources such as Uniprot and Wikidata. 
The generated RDF file +can be hosted in any triple store and queried using SPARQL. -With Rutger +## Creating a Phylogeny workflow -## RDF app +WIP -Jerven? - -## EBI app - -? +## Other workflows? # Discussion -Future work... +CPSR is a data repository with computational pipelines that will +persist during pandemics. Unlike other data repositories for +Sars-COV-2 we created a repository that immediately computes the +pangenome of all available data and presents that in useful +formats for futher analysis, including visualisations, GFA and +RDF. Code and data are available and written using best practises +and state-of-the-art standards. CPSR can be deployed by anyone, +anywhere. + +CPSR is designed to abide by FAIR data principles (expand...) + +CPSR is primed with viral data coming from repositories that have +no sharing restrictions. The metadata includes relevant +attribution to uploaders. Some institutes have already committed +to uploading their data to CPSR first so as to warrant sharing +for computation. + +CPSR is currently running on an Arvados cluster in the cloud. To +ascertain the service remains running we will source money from +project during pandemics. The workflows are written in CWL which +means they can be deployed on any infrastructure that runs +CWL. One of the advantages of the CC-4.0 license is that we make +available all uploaded sequence and meta data, as well as +results, online to anyone. So the data can be mirrored by any +party. This guarantees the data will live on. + +<!-- Future work... --> + +We aim to add more workflows to CPSR, for example to prepare +sequence data for submitting in other public repositories, such +as EBI ENA and GISAID. This will allow researchers to share data +in multiple systems without pain, circumventing current sharing +restrictions. + +# Acknowledgements + +We thank the COVID-19 BioHackathon 2020 and ELIXIR for creating a +unique event that triggered many collaborations. 
We thank Curii +Corporation for their financial support for creating and running +Arvados instances. We thank Amazon AWS for their financial +support to run COVID-19 workflows. We also want to thank the +other working groups in the BioHackathon who generously +contributed onthologies, workflows and software. + # References
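
Review note on the "Sequence uploader" paragraph added to paper/paper.md: it states that uploads are validated as FASTA before being accepted. A minimal sketch of what such a format check could look like is given here, purely for illustration. The function name `looks_like_fasta` and the allowed alphabet (IUPAC nucleotide codes plus the gap character) are assumptions, not taken from the uploader's source.

```python
# IUPAC nucleotide codes plus the gap character (assumed alphabet).
ALLOWED = set("ACGTURYKMSWBDHVN-")


def looks_like_fasta(text):
    """Return True if text is one or more '>' headers, each followed by sequence."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    have_sequence = False  # has the current record seen a sequence line yet?
    for line in lines[1:]:
        if line.startswith(">"):
            if not have_sequence:
                return False  # previous header had no sequence
            have_sequence = False
        else:
            if not set(line.upper()) <= ALLOWED:
                return False  # character outside the assumed alphabet
            have_sequence = True
    return have_sequence
```

A FASTQ record starts with `@` rather than `>`, so it is rejected by this check and would need its own validator.
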