As part of the COVID-19 Biohackathon 2020 we formed a working group to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for coronavirus sequences. The general idea is to create a repository with a low barrier to entry for uploading sequence data using best practices: data are published under a Creative Commons 4.0 (CC-4.0) license, with metadata that follows state-of-the-art standards, and, perhaps most importantly, standardised workflows are triggered on upload so that results are immediately available in standardised data formats.
Finding output of workflows
Introduction
We are using Arvados to run Common Workflow Language (CWL) pipelines. The most recent output is displayed on a web page (with a time stamp), and a full list of outputs is also generated. This is a good start, but most users need a dedicated, themed results page. People don't want to wade through thousands of output files!
The Arvados file interface
Arvados has a web server, but it also has a REST API and associated command line tools. We already use the API to upload data. If you follow the pip or GNU Guix instructions (see ../INSTALL.md) for installing the Arvados API, you will find the following command line tools (also described in the Arvados documentation):
Command | Description |
---|---|
arv-ls | list files in Arvados |
arv-put | upload a file to Arvados |
arv-get | get a textual representation of Arvados objects from the command line. The output can be limited to a subset of the object’s fields. This command can be used with only the knowledge of an object’s UUID |
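Before running any of these commands it can help to confirm that the installation put them on your PATH. The following is a minimal sketch, not part of the PubSeq tooling: `check_tools` is a hypothetical helper name, and the tool names are simply the ones from the table above.

```shell
#!/bin/sh
# Report which command line tools are available on PATH.
# check_tools is a hypothetical helper, not an Arvados command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every tool was found
    [ "$missing" -eq 0 ]
}

# Check the three tools from the table above
check_tools arv-ls arv-put arv-get ||
    echo "Some tools are missing; see the installation instructions" >&2
```

If any tool is reported as missing, revisit the pip or GNU Guix installation steps before continuing.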
Now, this is a public instance so we can use the tokens from the uploader.
    export ARVADOS_API_HOST='lugli.arvadosapi.com'
    export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
    arv-ls lugli-4zz18-z513nlpqm03hpca
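This lists all files in the collection (we got the collection UUID from the Arvados results page). To get the UUIDs of the individual files, look up the anonymous user token and pass it to arv-get for the same collection: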
    curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
    env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
        arv-get lugli-4zz18-z513nlpqm03hpca
Then fetch one of the listed JSON files, chunk001_bin4000.schematic.json, using the UUID listed for it:

    arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
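The identifier passed to arv-get above is made of several `+`-separated parts. As a rough sketch, assuming it follows the usual Arvados Keep layout of `<md5 hash>+<size in bytes>+A<signature>@<expiry as a hexadecimal Unix timestamp>`, it can be taken apart with plain shell parameter expansion. `parse_locator` and the output labels are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Split an identifier of the assumed form <md5>+<size>+A<signature>@<expiry>.
# parse_locator is a hypothetical helper, not an Arvados command.
parse_locator() {
    locator=$1
    hash=${locator%%+*}     # text before the first '+'
    rest=${locator#*+}      # everything after the first '+'
    size=${rest%%+*}        # text up to the next '+'
    echo "hash=$hash"
    echo "size=$size"
    case $locator in
        *@*)
            # Text after the last '@', assumed to be hexadecimal Unix time
            expiry_hex=${locator##*@}
            echo "expires=$((0x$expiry_hex))"
            ;;
    esac
}

parse_locator '2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5'
```

If the assumption about the expiry field holds, the signed identifier is only valid for a limited time, so a copy pasted from an old listing may need to be refreshed by listing the collection again.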