COVID-19 PubSeq (part 2)
1 Finding output of workflows
As part of the COVID-19 Biohackathon 2020 we formed a working group to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for coronavirus sequences. The general idea is to create a repository with a low barrier to entry for uploading sequence data using best practices: data are published under a Creative Commons 4.0 (CC-4.0) license, with metadata that follow state-of-the-art standards and, perhaps most importantly, with standardised workflows that are triggered on upload, so that results are immediately available in standardised data formats.
2 Introduction
We are using Arvados to run Common Workflow Language (CWL) pipelines. The most recent output is on display on a web page (with time stamp) and a full list is generated here. That is a nice start, but most users need a dedicated and themed results page. People don't want to wade through thousands of output files!
3 The Arvados file interface
Arvados has a web server, but it also has a REST API and associated command line tools. We are already using the API to upload data. If you follow the pip or GNU Guix (../INSTALL.md) instructions for installing the Arvados API, you will find the following command line tools (also documented here):
| Command | Description |
|---|---|
| arv-ls | list files in Arvados |
| arv-put | upload a file to Arvados |
| arv-get | download a collection, or a single file in it, from Arvados to a local file or pipe. This command can be used with only the knowledge of an object's UUID or locator |
Now, this is a public instance so we can use the tokens from the uploader.
    export ARVADOS_API_HOST='lugli.arvadosapi.com'
    export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
    arv-ls lugli-4zz18-z513nlpqm03hpca
will list all files in the collection (we got the collection UUID from the Arvados results page). To get the UUIDs of the individual files, look up the anonymous user token and pass it to arv-get:
    curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
    env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
        arv-get lugli-4zz18-z513nlpqm03hpca
and fetch one listed JSON file, chunk001_bin4000.schematic.json, with its listed UUID:
    arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
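The manual steps above (look up the anonymous token, list the collection, pick a JSON file, fetch it) could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling. The helper names anonymous_token, json_files, file_locator and fetch_json_files are hypothetical. It also assumes that arv-ls prints one path per line with a leading "./", and that arv-get accepts a locator of the form collection-UUID/path followed by a destination.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not part of Arvados.

# Print the anonymous user token from an Arvados cluster config (JSON on stdin).
# `jq -r` emits the raw string, without the surrounding quotes.
anonymous_token() {
    jq -r '.Users.AnonymousUserToken'
}

# Keep only the JSON files from an `arv-ls` listing on stdin.
# Assumption: one path per line with a leading "./", which is stripped.
json_files() {
    grep '\.json$' | sed 's|^\./||'
}

# Build a "<collection>/<path>" locator for a single file in a collection.
file_locator() {
    printf '%s/%s\n' "$1" "$2"
}

# Download every JSON file of collection $1 into directory $2.
# Needs network access plus curl, jq, arv-ls and arv-get on the PATH.
fetch_json_files() {
    ARVADOS_API_HOST='lugli.arvadosapi.com'
    ARVADOS_API_TOKEN=$(curl -s "https://$ARVADOS_API_HOST/arvados/v1/config" | anonymous_token)
    export ARVADOS_API_HOST ARVADOS_API_TOKEN
    mkdir -p "$2"
    arv-ls "$1" | json_files | while read -r path; do
        arv-get "$(file_locator "$1" "$path")" "$2/$(basename "$path")"
    done
}
```

Nothing runs when the file is sourced. A call such as `fetch_json_files lugli-4zz18-z513nlpqm03hpca ./results` would then try to download the JSON outputs of the collection used above into ./results, provided the public instance is reachable.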