---
title: 'COVID-19 PubSeq: COVID-19 Public Sequence Resource'
title_short: 'COVID-19 PubSeq'
tags:
  - Sequencing
  - COVID-19
authors:
  - name: Pjotr Prins
    orcid: 0000-0002-8021-9162
    affiliation: 1
  - name: Peter Amstutz
    orcid: 0000
    affiliation: 2
  - name: Tazro Ohta
    orcid: 0000
    affiliation: 3
  - name: Thomas Liener
    orcid: 0000-0003-3257-9937
    affiliation: 4
  - name: Erik Garrison
    orcid: 0000
    affiliation: 5
  - name: Michael R. Crusoe
    orcid: 0000-0002-2961-9670
    affiliation: 6, 2
  - name: Rutger Vos
    orcid: 0000
    affiliation: 7
  - name: Michael Heuer
    orcid: 0000-0002-9052-6000
    affiliation: 8
  - name: Adam M Novak
    orcid: 0000-0001-5828-047X
    affiliation: 5
  - name: Alex Kanitz
    orcid: 0000
    affiliation: 10
  - name: Jerven Bolleman
    orcid: 0000
    affiliation: 11
  - name: Joep de Ligt
    orcid: 0000
    affiliation: 12
  - name: Bonface Munyoki
    orcid: 0000
    affiliation: 13
  - name: Andrea Guarracino
    orcid: 0000-0001-9744-131X
    affiliation: 14
affiliations:
  - name: Department of Genetics, Genomics and Informatics, The University of Tennessee Health Science Center, Memphis, TN, USA.
    index: 1
  - name: Curii, Boston, USA
    index: 2
  - name: Thomas Liener Consultancy
    index: 4
  - name: UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA 95064, USA
    index: 5
  - name: Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam, The Netherlands
    index: 6
  - name: RISE Lab, University of California Berkeley, Berkeley, CA, USA.
    index: 8
  - name: Centre for Molecular Bioinformatics, Department of Biology, University Of Rome Tor Vergata, Rome, Italy
    index: 14
date: 11 April 2020
event: COVID2020
group: Public Sequence Uploader
authors_short: Pjotr Prins & Peter Amstutz \emph{et al.}
bibliography: paper.bib
---

<!--

The paper.md, bibtex and figure file can be found in this repo:

  https://github.com/arvados/bh20-seq-resource

To modify, please clone the repo. You can generate PDF of the paper by
pasting above link (or yours) with

  https://github.com/biohackrxiv/bhxiv-gen-pdf

Note that author order will change!

-->

# Introduction

As part of the COVID-19 Biohackathon 2020 we formed a working
group to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
coronavirus sequences. The general idea is to create a
repository that has a low barrier to entry for uploading sequence
data using best practices: data are published under a Creative
Commons 4.0 (CC-4.0) license, metadata follow state-of-the-art
standards and, perhaps most importantly, standardized
workflows are triggered on upload, so that results are
immediately available in standardized data formats.

Existing data repositories for viral data include GISAID, EBI ENA
and NCBI. These repositories allow for free sharing of data, but
do not add value in terms of running immediate
computations. GISAID, at this point, has the most complete
collection of genetic sequence data of influenza viruses and
related clinical and epidemiological data. However, due to a
restricted license, data submitted to GISAID cannot be used for
online web services and on-the-fly computation. In addition,
GISAID registration can take weeks and, painfully, users are
forced to download sequences one at a time to do any type of
analysis. In our opinion this does not fit a pandemic scenario
where fast turnaround times are key and data analysis has to be
agile.

We managed to create a useful sequence uploader utility within
one week by leveraging existing technologies, such as the Arvados
Cloud platform [@Arvados], the Common Workflow Language (CWL)
[@CWL], Docker images built with Debian packages, and the many
free and open source software packages that are available for
bioinformatics.

The source code for the CLI uploader and web uploader can be
found [here](https://github.com/arvados/bh20-seq-resource)
(FIXME: we'll have a full page). The CWL workflow definitions can
be found [here](https://github.com/hpobio-lab/viral-analysis) and
on CWL hub (FIXME).

<!--

    RESULTS!

    For each section below

    State the problem you worked on
    Give the state-of-the art/plan
    Describe what you have done/results starting with The working group created...
    Write a conclusion
    Write up any future work

-->

## Cloud computing backend

The development of COVID-19 PubSeq was accelerated by using the Arvados
Cloud platform. Arvados is an open source platform for managing,
processing, and sharing genomic and other large scientific and
biomedical data. An Arvados instance was deployed on Amazon Web
Services (AWS) for testing and development, and a project was
created on it that accepts data uploads.

## Sequence uploader

We wrote a Python-based uploader that authenticates with Arvados
using a token. Uploads are validated as FASTA sequence, FASTQ raw
data and/or JSON-LD metadata, the latter being checked against a
schema. The uploader can be used from the command line or through
a simple web interface.
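
The kind of check performed on a FASTA upload can be illustrated with a minimal sketch. This is not the uploader's actual code: the function name `is_fasta` and the accepted alphabet (IUPAC nucleotide codes plus the gap character) are assumptions made for illustration, and the schema validation of the metadata is not covered.

```python
import io

# Assumption: nucleotide FASTA using IUPAC codes plus '-' for gaps.
VALID_BASES = set("ACGTUNRYKMSWBDHV-")


def is_fasta(handle):
    """Return True if the text stream looks like nucleotide FASTA."""
    seen_header = False
    seen_sequence = False
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Every header must be followed by sequence before the next one.
            if seen_header and not seen_sequence:
                return False
            seen_header = True
            seen_sequence = False
        else:
            # Sequence data may not appear before the first header.
            if not seen_header:
                return False
            if not set(line.upper()) <= VALID_BASES:
                return False
            seen_sequence = True
    return seen_header and seen_sequence


print(is_fasta(io.StringIO(">seq1\nACGTN\n")))
```

A check like this only rejects obviously malformed input; it makes no statement about whether the sequence is a plausible viral genome.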

## Creating a Pangenome

### FASTA to GFA workflow

The first workflow (1) we implemented was a FASTA to Graphical
Fragment Assembly (GFA) Format conversion. When someone uploads a
sequence in FASTA format it gets combined with all known viral
sequences in our storage to generate a pangenome or variation
graph (VG). The full pangenome is made available as a
downloadable GFA file together with a visualisation (Figure 1).
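
For readers unfamiliar with GFA, the following minimal sketch shows how the segments and links of a downloaded GFA version 1 file could be read. GFA1 is a tab-separated text format in which `S` lines carry a segment name and sequence and `L` lines connect two oriented segments. The function name `read_gfa` and the tiny example graph are made up for illustration and are not part of the workflow.

```python
def read_gfa(lines):
    """Collect segments and links from GFA1 lines.

    Returns (segments, links): segments maps segment name to sequence,
    links is a list of (from, from_orient, to, to_orient) tuples.
    Other record types (H, P, ...) are ignored in this sketch.
    """
    segments = {}
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 5:
            links.append((fields[1], fields[2], fields[3], fields[4]))
    return segments, links


# Hypothetical two-node graph, not real pangenome data.
example = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "S\t2\tTTG\n",
    "L\t1\t+\t2\t+\t0M\n",
]
segments, links = read_gfa(example)
print(len(segments), len(links))
```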

### FASTQ to GFA workflow

In the next step we introduced a workflow (2) that takes raw
sequence data in FASTQ format and converts it into FASTA.
This FASTA file, in turn, gets fed to workflow (1) to generate
the pangenome.

## Creating linked data workflow

We created a workflow (3) that takes GFA and turns it into
RDF. Together with the metadata supplied at upload time, a single
RDF resource is compiled that can be linked against external
resources such as UniProt and Wikidata. The generated RDF file
can be hosted in any triple store and queried using SPARQL.
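
To give an impression of what turning a graph into RDF involves, the sketch below serialises graph nodes and edges as N-Triples, a line-based RDF format that triple stores can load. It is not the conversion used by workflow (3): the `http://example.org/pubseq/` namespace, the `sequence` and `linksTo` predicates and the function name `segments_to_ntriples` are all hypothetical.

```python
# Hypothetical namespace; a real deployment would use an established vocabulary.
BASE = "http://example.org/pubseq/"


def segments_to_ntriples(segments, links):
    """Serialise graph segments and links as a list of N-Triples lines.

    segments maps node name to sequence; links is a list of
    (source, target) node-name pairs.
    """
    triples = []
    for name, seq in sorted(segments.items()):
        # One literal-valued triple per node holding its sequence.
        triples.append(f'<{BASE}node/{name}> <{BASE}vocab#sequence> "{seq}" .')
    for src, dst in links:
        # One resource-valued triple per edge.
        triples.append(
            f"<{BASE}node/{src}> <{BASE}vocab#linksTo> <{BASE}node/{dst}> ."
        )
    return triples


for line in segments_to_ntriples({"1": "ACGT", "2": "TTG"}, [("1", "2")]):
    print(line)
```

Because every node gets a stable IRI, sample metadata can be attached to the same IRIs and the combined resource queried in one place.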

## Creating a Phylogeny workflow

WIP

## Other workflows?

# Discussion

COVID-19 PubSeq is a data repository with computational pipelines that will
persist during pandemics. Unlike other data repositories for
SARS-CoV-2, we created a repository that immediately computes the
pangenome of all available data and presents it in useful
formats for further analysis, including visualisations, GFA and
RDF. Code and data are available and written using best practices
and state-of-the-art standards. COVID-19 PubSeq can be deployed by anyone,
anywhere.

COVID-19 PubSeq is designed to abide by FAIR data principles (expand...)

COVID-19 PubSeq is primed with viral data coming from repositories that have
no sharing restrictions. The metadata includes relevant
attribution to uploaders. Some institutes have already committed
to uploading their data to COVID-19 PubSeq first so as to warrant sharing
for computation.

COVID-19 PubSeq is currently running on an Arvados cluster in the cloud. To
ensure the service remains running we will source funding from
projects during pandemics. The workflows are written in CWL, which
means they can be deployed on any infrastructure that runs
CWL. One of the advantages of the CC-4.0 license is that we make
all uploaded sequences and metadata, as well as results,
available online to anyone, so the data can be mirrored by any
party. This guarantees the data will live on.

<!-- Future work... -->

We aim to add more workflows to COVID-19 PubSeq, for example to prepare
sequence data for submission to other public repositories, such
as EBI ENA and GISAID. This will allow researchers to share data
in multiple systems without pain, circumventing current sharing
restrictions.

# Acknowledgements

We thank the COVID-19 BioHackathon 2020 and ELIXIR for creating a
unique event that triggered many collaborations. We thank Curii
Corporation for their financial support for creating and running
Arvados instances.  We thank Amazon AWS for their financial
support to run COVID-19 workflows. We also want to thank the
other working groups in the BioHackathon who generously
contributed ontologies, workflows and software.


# References