---
title: 'COVID-19 PubSeq: COVID-19 Public Sequence Resource'
title_short: 'COVID-19 PubSeq'
tags:
  - Sequencing
  - COVID-19
authors:
  - name: Pjotr Prins
    orcid: 0000-0002-8021-9162
    affiliation: 1
  - name: Peter Amstutz
    orcid: 0000
    affiliation: 2
  - name: Tazro Ohta
    orcid: 0000
    affiliation: 3
  - name: Thomas Liener
    orcid: https://orcid.org/0000-0003-3257-9937
    affiliation: 4
  - name: Erik Garrison
    orcid: 0000
    affiliation: 5
  - name: Michael R. Crusoe
    orcid: 0000-0002-2961-9670
    affiliation: 6, 2
  - name: Rutger Vos
    orcid: 0000
    affiliation: 7
  - name: Michael Heuer
    orcid: 0000-0002-9052-6000
    affiliation: 8
  - name: Adam M Novak
    orcid: 0000-0001-5828-047X
    affiliation: 5
  - name: Alex Kanitz
    orcid: 0000
    affiliation: 10
  - name: Jerven Bolleman
    orcid: 0000
    affiliation: 11
  - name: Joep de Ligt
    orcid: 0000
    affiliation: 12
  - name: Bonface Munyoki
    orcid: 0000
    affiliation: 13
  - name: Andrea Guarracino
    orcid: https://orcid.org/0000-0001-9744-131X
    affiliation: 14
affiliations:
  - name: Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA.
    index: 1
  - name: Curii, Boston, USA
    index: 2
  - name: Thomas Liener Consultancy
    index: 4
  - name: UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
    index: 5
  - name: Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, The Netherlands
    index: 6
  - name: RISE Lab, University of California Berkeley, Berkeley, CA, USA.
    index: 8
  - name: Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata, Rome, Italy
    index: 14
date: 11 April 2020
event: COVID2020
group: Public Sequence Uploader
authors_short: Pjotr Prins & Peter Amstutz \emph{et al.}
bibliography: paper.bib
---
<!--
The paper.md, bibtex and figure file can be found in this repo:
https://github.com/arvados/bh20-seq-resource
To modify, please clone the repo. You can generate PDF of the paper by
pasting above link (or yours) with
https://github.com/biohackrxiv/bhxiv-gen-pdf
Note that author order will change!
-->
# Introduction
As part of the COVID-19 Biohackathon 2020 we formed a working
group to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
coronavirus sequences. The general idea is to create a
repository with a low barrier to entry for uploading sequence
data using best practices: data are published under a Creative
Commons 4.0 (CC-4.0) license, metadata follow state-of-the-art
standards and, perhaps most importantly, standardized
workflows are triggered on upload, so that results are
immediately available in standardized data formats.

Existing data repositories for viral data include GISAID, EBI ENA
and NCBI. These repositories allow for free sharing of data, but
do not add value by running computations immediately on
submission. GISAID, at this point, has the most complete
collection of genetic sequence data of influenza viruses and
related clinical and epidemiological data through its
database. However, due to a restrictive license, data submitted to
GISAID cannot be used for online web services and on-the-fly
computation. In addition, GISAID registration can take weeks
and, painfully, users are forced to download sequences one at a time
to do any type of analysis. In our opinion this does not fit a
pandemic scenario where fast turnaround times are key and data
analysis has to be agile.

We managed to create a useful sequence uploader utility within
one week by leveraging existing technologies, such as the Arvados
Cloud platform [@Arvados], the Common Workflow Language (CWL)
[@CWL], Docker images built with Debian packages, and the many
free and open source software packages that are available for
bioinformatics.
The source code for the CLI uploader and web uploader can be
found [here](https://github.com/arvados/bh20-seq-resource)
(FIXME: we'll have a full page). The CWL workflow definitions can
be found [here](https://github.com/hpobio-lab/viral-analysis) and
on CWL hub (FIXME).
<!--
RESULTS!
For each section below
State the problem you worked on
Give the state-of-the art/plan
Describe what you have done/results starting with The working group created...
Write a conclusion
Write up any future work
-->
## Cloud computing backend
The development of COVID-19 PubSeq was accelerated by using the Arvados
Cloud platform. Arvados is an open source platform for managing,
processing, and sharing genomic and other large scientific and
biomedical data. The Arvados instance was deployed on Amazon AWS
for testing and development and a project was created that
allows for uploading data.
## Sequence uploader
We wrote a Python-based uploader that authenticates with Arvados
using a token. Each upload is validated: sequence data must be a
FASTA sequence or FASTQ raw data, and any metadata, supplied as
JSON-LD, is checked against a schema. The uploader can be used
from the command line or through a simple web interface.
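
The kind of check applied to an uploaded sequence can be sketched as follows. This is a minimal illustration and not the uploader's actual code: the function name `is_valid_fasta` and the accepted alphabet (IUPAC nucleotide codes plus the gap character) are assumptions made for this example.

```python
import io

# Assumed alphabet: IUPAC nucleotide codes plus the gap character.
VALID_BASES = set("ACGTUNRYKMSWBDHV-")


def is_valid_fasta(handle):
    """Return True if the text stream looks like nucleotide FASTA."""
    seen_header = False
    seen_sequence = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # Reject an empty header, or a header whose predecessor had no sequence.
            if len(line) == 1 or (seen_header and not seen_sequence):
                return False
            seen_header = True
            seen_sequence = False
        else:
            # Sequence lines must follow a header and use only known symbols.
            if not seen_header or not set(line.upper()) <= VALID_BASES:
                return False
            seen_sequence = True
    # A valid file ends with at least one complete record.
    return seen_header and seen_sequence


if __name__ == "__main__":
    print(is_valid_fasta(io.StringIO(">sample1\nACGTNN\n")))
```

Rejecting a file before it reaches the cloud backend keeps malformed input from triggering a workflow run.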
## Creating a Pangenome
### FASTA to GFA workflow
The first workflow (1) we implemented was a FASTA to Graphical
Fragment Assembly (GFA) Format conversion. When someone uploads a
sequence in FASTA format it gets combined with all known viral
sequences in our storage to generate a pangenome or variation
graph (VG). The full pangenome is made available as a
downloadable GFA file together with a visualisation (Figure 1).
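
To give an idea of what the downloadable GFA file contains, the sketch below summarises a GFA 1 document by counting its segment (`S`) and link (`L`) records. It is only an illustration of the file format; the function name `summarize_gfa` is hypothetical and the code is not part of the workflow itself.

```python
def summarize_gfa(text):
    """Count segments, links and total segment length in GFA 1 text."""
    segments = {}
    links = 0
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # Segment record: S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # Link record: L <from> <orient> <to> <orient> <overlap>
            links += 1
    # "*" marks a segment whose sequence is not stored in the file.
    bases = sum(len(seq) for seq in segments.values() if seq != "*")
    return {"segments": len(segments), "links": links, "bases": bases}


if __name__ == "__main__":
    example = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGG\nL\t1\t+\t2\t+\t0M\n"
    print(summarize_gfa(example))
```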
### FASTQ to GFA workflow
In the next step we introduced a workflow (2) that takes raw
sequence data in FASTQ format and converts that into FASTA.
This FASTA file, in turn, gets fed to workflow (1) to generate
the pangenome.
## Creating linked data workflow
We created a workflow (3) that takes GFA and turns that into
RDF. Together with the metadata supplied at upload time, a single RDF
resource is compiled that can be linked against external
resources such as UniProt and Wikidata. The generated RDF file
can be hosted in any triple store and queried using SPARQL.
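
The idea of turning a graph into triples can be sketched as below, emitting N-Triples for GFA segments and links. This is a simplified illustration under stated assumptions and not workflow (3) itself: the `http://example.org/pubseq/` namespace and the `sequence` and `linksTo` predicates are placeholders invented for this example, and upload metadata is left out.

```python
# Hypothetical namespace for illustration only; not the project's vocabulary.
BASE = "http://example.org/pubseq/"


def gfa_to_ntriples(gfa_text):
    """Convert GFA 1 segment and link records into N-Triples lines."""
    triples = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # One triple per segment: node --sequence--> literal.
            # Nucleotide strings need no escaping inside the literal.
            node = f"<{BASE}node/{fields[1]}>"
            triples.append(f'{node} <{BASE}sequence> "{fields[2]}" .')
        elif fields[0] == "L" and len(fields) >= 4:
            # One triple per link: source node --linksTo--> target node.
            src = f"<{BASE}node/{fields[1]}>"
            dst = f"<{BASE}node/{fields[3]}>"
            triples.append(f"{src} <{BASE}linksTo> {dst} .")
    return "\n".join(triples)


if __name__ == "__main__":
    print(gfa_to_ntriples("S\t1\tACGT\nS\t2\tGG\nL\t1\t+\t2\t+\t0M\n"))
```

Output in this form can be loaded into a triple store and queried with SPARQL, for example to list every node that links to a given node.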
## Creating a Phylogeny workflow
WIP
## Other workflows?
# Discussion
COVID-19 PubSeq is a data repository with computational pipelines that will
persist during pandemics. Unlike other data repositories for
SARS-CoV-2, it immediately computes the
pangenome of all available data and presents that in useful
formats for further analysis, including visualisations, GFA and
RDF. Code and data are available and written using best practices
and state-of-the-art standards. COVID-19 PubSeq can be deployed by anyone,
anywhere.

COVID-19 PubSeq is designed to abide by FAIR data principles (expand...)

COVID-19 PubSeq is primed with viral data coming from repositories that have
no sharing restrictions. The metadata includes relevant
attribution to uploaders. Some institutes have already committed
to uploading their data to COVID-19 PubSeq first, to ensure that it can
be shared for computation.

COVID-19 PubSeq is currently running on an Arvados cluster in the cloud. To
ensure the service keeps running, we will source funding from
projects during pandemics. The workflows are written in CWL, which
means they can be deployed on any infrastructure that runs
CWL. One of the advantages of the CC-4.0 license is that we can make
all uploaded sequences and metadata, as well as
results, available online to anyone, so the data can be mirrored by any
party. This helps ensure the data will live on.

<!-- Future work... -->
We aim to add more workflows to COVID-19 PubSeq, for example to prepare
sequence data for submission to other public repositories, such
as EBI ENA and GISAID. This will allow researchers to share data
in multiple systems without pain, circumventing current sharing
restrictions.
# Acknowledgements
We thank the COVID-19 BioHackathon 2020 and ELIXIR for creating a
unique event that triggered many collaborations. We thank Curii
Corporation for their financial support for creating and running
Arvados instances. We thank Amazon AWS for their financial
support to run COVID-19 workflows. We also want to thank the
other working groups in the BioHackathon who generously
contributed ontologies, workflows and software.
# References