From 1ba6c90b5ec49be6c6424916ac70dfcc11026853 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Thu, 21 May 2020 05:43:14 -0500
Subject: BLOG: Arvados

---
 doc/blog/using-covid-19-pubseq-part2.org | 68 ++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 doc/blog/using-covid-19-pubseq-part2.org

diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org
new file mode 100644
index 0000000..d61bf42
--- /dev/null
+++ b/doc/blog/using-covid-19-pubseq-part2.org
@@ -0,0 +1,68 @@
+#+TITLE: COVID-19 PubSeq (part 2)
+#+AUTHOR: Pjotr Prins
+# C-c C-e h h publish
+# C-c ! insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
+# C-c C-t task rotate
+# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png
+
+#+HTML_LINK_HOME: http://covid19.genenetwork.org
+#+HTML_HEAD:
+
+* Finding output of workflows
+
+As part of the COVID-19 Biohackathon 2020 we formed a working group to
+create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
+coronavirus sequences. The general idea is to create a repository
+that has a low barrier to entry for uploading sequence data using best
+practices. That is, data published with a Creative Commons 4.0 (CC-4.0)
+license, with metadata using state-of-the-art standards and, perhaps
+most importantly, providing standardised workflows that get triggered
+on upload, so that results are immediately available in standardised
+data formats.
+
+* Introduction
+
+We are using Arvados to run Common Workflow Language (CWL) pipelines.
+The most recent output is on display on a [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][web page]] (with time stamp)
+and a full list is generated [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. This is a nice start, but for
+most users we need a dedicated and themed results page. People don't
+want to wade through thousands of output files!
+
+* The Arvados file interface
+
+Arvados has a web server, but it also has a REST API and associated
+command line tools. We are already using the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L27][API]] to upload data. If
+you follow the pip or [[../INSTALL.md]] GNU Guix instructions for
+installing the Arvados API, you'll find the following command line tools
+(also documented [[https://doc.arvados.org/v2.0/sdk/cli/subcommands.html][here]]):
+
+| Command | Description |
+|---------+-------------|
+| arv-ls | list files in Arvados |
+| arv-put | upload a file to Arvados |
+| arv-get | get a textual representation of Arvados objects from the command line. The output can be limited to a subset of the object’s fields. This command can be used with only the knowledge of an object’s UUID |
+
+Now, this is a public instance, so we can use the tokens from
+the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16][uploader]].
+
+#+BEGIN_SRC sh
+export ARVADOS_API_HOST='lugli.arvadosapi.com'
+export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
+arv-ls lugli-4zz18-z513nlpqm03hpca
+#+END_SRC
+
+will list all files (the UUID we got from the Arvados results page). To
+get the UUIDs of the files
+
+#+BEGIN_SRC sh
+curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
+env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
+ arv-get lugli-4zz18-z513nlpqm03hpca
+#+END_SRC
+
+and fetch one listed JSON file ~chunk001_bin4000.schematic.json~ with
+its listed UUID:
+
+: arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
+
+* Using the Arvados API
--
cgit v1.2.3