1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
|
#+TITLE: COVID-19 PubSeq (part 2)
#+AUTHOR: Pjotr Prins
# C-c C-e h h publish
# C-c ! insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
# C-c C-t task rotate
# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png
#+HTML_LINK_HOME: http://covid19.genenetwork.org
#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
As part of the COVID-19 Biohackathon 2020 we formed a working group to
create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
Corona virus sequences. The general idea is to create a repository
that has a low barrier to entry for uploading sequence data using best
practices. I.e., data published with a creative commons 4.0 (CC-4.0)
license with metadata using state-of-the art standards and, perhaps
most importantly, providing standardised workflows that get triggered
on upload, so that results are immediately available in standardised
data formats.
* Table of Contents :TOC:noexport:
- [[#finding-output-of-workflows][Finding output of workflows]]
- [[#introduction][Introduction]]
- [[#the-arvados-file-interface][The Arvados file interface]]
- [[#using-the-arvados-api][Using the Arvados API]]
* Finding output of workflows
As part of the COVID-19 Biohackathon 2020 we formed a working group to
create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
Corona virus sequences. The general idea is to create a repository
that has a low barrier to entry for uploading sequence data using best
practices. I.e., data published with a creative commons 4.0 (CC-4.0)
license with metadata using state-of-the art standards and, perhaps
most importantly, providing standardised workflows that get triggered
on upload, so that results are immediately available in standardised
data formats.
* Introduction
We are using Arvados to run common workflow language (CWL) pipelines.
The most recent output is on display on a [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][web page]] (with time stamp)
and a full list is generated [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. It is nice to start up, but for
most users we need a dedicated and themed results page. People don't
want to wade through thousands of output files!
* The Arvados file interface
Arvados has the web server, but it also has a REST API and associated
command line tools. We are already using the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L27][API]] to upload data. If
you follow the pip or [[../INSTALL.md]] GNU Guix instructions for
installing Arvados API you'll find the following command line tools
(also documented [[https://doc.arvados.org/v2.0/sdk/cli/subcommands.html][here]]):
| Command | Description |
|---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| arv-ls | list files in Arvados |
| arv-put | upload a file to Arvados |
| arv-get | get a textual representation of Arvados objects from the command line. The output can be limited to a subset of the object’s fields. This command can be used with only the knowledge of an object’s UUID |
Now, this is a public instance so we can use the tokens from
the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16][uploader]].
#+BEGIN_SOURCE sh
export ARVADOS_API_HOST='lugli.arvadosapi.com'
export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
arv-ls lugli-4zz18-z513nlpqm03hpca
#+END_SOURCE
will list all files (the UUID we got from the Arvados results page). To
get the UUID of the files
#+BEGIN_SOURCE sh
curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
arv-get lugli-4zz18-z513nlpqm03hpca
#+END_SOURCE
and fetch one listed JSON file ~chunk001_bin4000.schematic.json~ with
its listed UUID:
: arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
* Using the Arvados API
|