author | Pjotr Prins | 2020-08-25 11:55:11 +0100 |
---|---|---|
committer | Pjotr Prins | 2020-08-25 11:56:12 +0100 |
commit | 1b7199abd2d7f410a46158f9c66b8b373d3574f9 (patch) | |
tree | e951f66a88ac79967fc95746c87139fd3d3ced4a | |
parent | 2baa88b766ec540bd34b96599014dd16e393af39 (diff) | |
Using Arvados
-rw-r--r-- | doc/blog/using-covid-19-pubseq-part2.html | 161 |
-rw-r--r-- | doc/blog/using-covid-19-pubseq-part2.org | 78 |
2 files changed, 207 insertions, 32 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part2.html b/doc/blog/using-covid-19-pubseq-part2.html index c041ebe..b124c89 100644 --- a/doc/blog/using-covid-19-pubseq-part2.html +++ b/doc/blog/using-covid-19-pubseq-part2.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> -<!-- 2020-05-30 Sat 11:50 --> +<!-- 2020-08-25 Tue 05:55 --> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>COVID-19 PubSeq (part 2)</title> @@ -252,19 +252,19 @@ for the JavaScript code in this tag. <h2>Table of Contents</h2> <div id="text-table-of-contents"> <ul> -<li><a href="#org7942167">1. Finding output of workflows</a></li> -<li><a href="#org0022bbe">2. Introduction</a></li> -<li><a href="#org3929710">3. The Arvados file interface</a></li> -<li><a href="#orgc4dba6e">4. Using the Arvados API</a></li> +<li><a href="#orgd3ae0e5">1. Finding output of workflows</a></li> +<li><a href="#orgce95d40">2. The Arvados file interface</a></li> +<li><a href="#org95f2c67">3. The PubSeq Arvados shell</a></li> +<li><a href="#orgfba95f0">4. Wiring up CWL</a></li> +<li><a href="#orgdf910f1">5. Using the Arvados API</a></li> </ul> </div> </div> -<div id="outline-container-org7942167" class="outline-2"> -<h2 id="org7942167"><span class="section-number-2">1</span> Finding output of workflows</h2> +<div id="outline-container-orgd3ae0e5" class="outline-2"> +<h2 id="orgd3ae0e5"><span class="section-number-2">1</span> Finding output of workflows</h2> <div class="outline-text-2" id="text-1"> - - <p> +<p> We are using Arvados to run common workflow language (CWL) pipelines. 
The most recent output is on display on a <a href="https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca">web page</a> (with time stamp) and a full list is generated <a href="https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/">here</a>. It is nice to start up, but for @@ -274,9 +274,9 @@ want to wade through thousands of output files! </div> </div> -<div id="outline-container-org3929710" class="outline-2"> -<h2 id="org3929710"><span class="section-number-2">2</span> The Arvados file interface</h2> -<div class="outline-text-2" id="text-3"> +<div id="outline-container-orgce95d40" class="outline-2"> +<h2 id="orgce95d40"><span class="section-number-2">2</span> The Arvados file interface</h2> +<div class="outline-text-2" id="text-2"> <p> Arvados has the web server, but it also has a REST API and associated command line tools. We are already using the <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L27">API</a> to upload data. If @@ -322,13 +322,11 @@ Now, this is a public instance so we can use the tokens from the <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16">uploader</a>. </p> -<div class="SOURCE"> -<p> -export ARVADOS<sub>API</sub><sub>HOST</sub>='lugli.arvadosapi.com' -export ARVADOS<sub>API</sub><sub>TOKEN</sub>='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462' +<div class="org-src-container"> +<pre class="src src-sh"><span style="color: #ff8A65;">export</span> <span style="color: #ffcc80;">ARVADOS_API_HOST</span>=<span style="color: #9ccc65;">'lugli.arvadosapi.com'</span> +<span style="color: #ff8A65;">export</span> <span style="color: #ffcc80;">ARVADOS_API_TOKEN</span>=<span style="color: #9ccc65;">'2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'</span> arv-ls lugli-4zz18-z513nlpqm03hpca -</p> - +</pre> </div> <p> @@ -336,13 +334,11 @@ will list all files (the UUID we got from the Arvados results page). 
To get the UUID of the files </p> -<div class="SOURCE"> -<p> -curl <a href="https://lugli.arvadosapi.com/arvados/v1/config">https://lugli.arvadosapi.com/arvados/v1/config</a> | jq .Users.AnonymousUserToken -env ARVADOS<sub>API</sub><sub>TOKEN</sub>=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \ +<div class="org-src-container"> +<pre class="src src-sh">curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken +env <span style="color: #ffcc80;">ARVADOS_API_TOKEN</span>=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh <span style="color: #9ccc65;">\</span> arv-get lugli-4zz18-z513nlpqm03hpca -</p> - +</pre> </div> <p> @@ -356,12 +352,123 @@ arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839d </div> </div> -<div id="outline-container-orgc4dba6e" class="outline-2"> -<h2 id="orgc4dba6e"><span class="section-number-2">3</span> TODO Using the Arvados API</h2> +<div id="outline-container-org95f2c67" class="outline-2"> +<h2 id="org95f2c67"><span class="section-number-2">3</span> The PubSeq Arvados shell</h2> +<div class="outline-text-2" id="text-3"> +<p> +When you login to Arvados (you can request permission from us) it is +possible to upload an ssh key in your profile and get an shell prompt +with +</p> + +<pre class="example"> +ssh pjotrpbl@shell.lugli.arvadosapi.com +Linux ip-10-255-0-202 4.19.0-9-cloud-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 +</pre> + + + +<p> +It is a small Debian VM hosted on AWS somewhere. The PubSeq material +is mounted on <code>/data/pubseq</code>. The log is in <code>nohup.out</code>. Update/edit +the code (bh20-seq-resource git checkout) and restart the service (the +run script). The log says +</p> + +<pre class="example"> +you should have permission to read the log (nohup.out) update / edit the code (bh20-seq-resource git checkout) and restart the service (the run script) +</pre> + + +<p> +which means it will trigger the run on upload. 
The service is running as a +Python virtualenv: +</p> + +<pre class="example"> +/data/pubseq/bh20-seq-resource/venv3/bin/python3 /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer --no-start-analysis +</pre> + + +<p> +and is restarted by a <code>run</code> script: +</p> + +<pre class="example"> +/data/pubseq/run [options] +</pre> + + +<p> +The run script kills the old process, sets up the API tokens, pulls +the git repo and starts a new run calling into +/data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer which is +essentially <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L354">monitoring</a> for uploads. +</p> +</div> +</div> + +<div id="outline-container-orgfba95f0" class="outline-2"> +<h2 id="orgfba95f0"><span class="section-number-2">4</span> Wiring up CWL</h2> +<div class="outline-text-2" id="text-4"> +<p> +In above script <code>bh20-seq-analyzer</code> you can see that the <a href="https://www.commonwl.org/">Common +Workflow Language</a> (CWL) gets <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L233">triggered</a>; for example <a href="https://github.com/arvados/bh20-seq-resource/tree/master/workflows/fastq2fasta">fastq2fasta</a> which +is part of the main repo. The actual script is in <a href="https://github.com/arvados/bh20-seq-resource/blob/master/workflows/fastq2fasta/fastq2fasta.cwl">fastq2fasta.cwl</a> and +runs the following tools in sequence: bwa-mem, samtools-view, +samtools-sort, and bam2fasta. +</p> + +<p> +It probably pays to familiarize yourself with CWL and its concepts. We +believe it has a lot going for it though CWL is some steps removed +from traditional shell scripts for running work flows. 
Main thing to +understand is that CWL is a separation of concerns, i.e., +</p> + +<ol class="org-ol"> +<li>Data</li> +<li>Tools</li> +<li>Flow</li> +</ol> + +<p> +and each of these are described separately. This contrasts largely +with shell scripts (though you can invoke shell scripts from CWL). +Also, CWL is written in JSON/YAML, which means everything can be parsed +as a tree and you can easily get visualisations such as +</p> + +<p> + <a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/"> +<img src="https://hpc.guix.info/static/images/blog/cwl-provenance-graph.png" /></a> +</p> + +<p> +For more see <a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/">Creating a reproducible workflow with CWL</a> by Pjotr Prins. +</p> +</div> +</div> + +<div id="outline-container-orgdf910f1" class="outline-2"> +<h2 id="orgdf910f1"><span class="section-number-2">5</span> Using the Arvados API</h2> +<div class="outline-text-2" id="text-5"> +<p> +Arvados provides a rich API for accessing internals of the Cloud +infrastructure. +</p> + +<p> +In above script <code>bh20-seq-analyzer</code> there are examples of querying the +<a href="https://doc.arvados.org/api/index.html">Arvados API</a> using the <a href="https://pypi.org/project/arvados-python-client/">Python Arvados client and libraries</a>. For example +get a list of <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L228">projects</a> in Arvados. Main thing is to get the +<code>ARVADOS-API-HOST</code> and <code>ARVADOS-API-TOKEN</code> right as is shown above. +</p> +</div> </div> </div> <div id="postamble" class="status"> -<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-05-30 Sat 11:50</small>. 
+<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-08-25 Tue 04:32</small>. </div> </body> </html> diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org index 349fd06..c44b5c7 100644 --- a/doc/blog/using-covid-19-pubseq-part2.org +++ b/doc/blog/using-covid-19-pubseq-part2.org @@ -11,6 +11,8 @@ * Table of Contents :TOC:noexport: - [[#finding-output-of-workflows][Finding output of workflows]] - [[#the-arvados-file-interface][The Arvados file interface]] + - [[#the-pubseq-arvados-shell][The PubSeq Arvados shell]] + - [[#wiring-up-cwl][Wiring up CWL]] - [[#using-the-arvados-api][Using the Arvados API]] * Finding output of workflows @@ -38,24 +40,90 @@ installing Arvados API you'll find the following command line tools Now, this is a public instance so we can use the tokens from the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16][uploader]]. -#+BEGIN_SOURCE sh +#+BEGIN_SRC sh export ARVADOS_API_HOST='lugli.arvadosapi.com' export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462' arv-ls lugli-4zz18-z513nlpqm03hpca -#+END_SOURCE +#+END_SRC will list all files (the UUID we got from the Arvados results page). 
To get the UUID of the files -#+BEGIN_SOURCE sh +#+BEGIN_SRC sh curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \ arv-get lugli-4zz18-z513nlpqm03hpca -#+END_SOURCE +#+END_SRC and fetch one listed JSON file ~chunk001_bin4000.schematic.json~ with its listed UUID: : arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5 -* TODO Using the Arvados API +* The PubSeq Arvados shell + +When you login to Arvados (you can request permission from us) it is +possible to upload an ssh key in your profile and get an shell prompt +with + +: ssh pjotrpbl@shell.lugli.arvadosapi.com +: Linux ip-10-255-0-202 4.19.0-9-cloud-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 + + +It is a small Debian VM hosted on AWS somewhere. The PubSeq material +is mounted on ~/data/pubseq~. The log is in ~nohup.out~. Update/edit +the code (bh20-seq-resource git checkout) and restart the service (the +run script). The log says + +: you should have permission to read the log (nohup.out) update / edit the code (bh20-seq-resource git checkout) and restart the service (the run script) + +which means it will trigger the run on upload. The service is running as a +Python virtualenv: + +: /data/pubseq/bh20-seq-resource/venv3/bin/python3 /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer --no-start-analysis + +and is restarted by a ~run~ script: + +: /data/pubseq/run [options] + +The run script kills the old process, sets up the API tokens, pulls +the git repo and starts a new run calling into +/data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer which is +essentially [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L354][monitoring]] for uploads. 
+ +* Wiring up CWL + +In above script ~bh20-seq-analyzer~ you can see that the [[https://www.commonwl.org/][Common +Workflow Language]] (CWL) gets [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L233][triggered]]; for example [[https://github.com/arvados/bh20-seq-resource/tree/master/workflows/fastq2fasta][fastq2fasta]] which +is part of the main repo. The actual script is in [[https://github.com/arvados/bh20-seq-resource/blob/master/workflows/fastq2fasta/fastq2fasta.cwl][fastq2fasta.cwl]] and +runs the following tools in sequence: bwa-mem, samtools-view, +samtools-sort, and bam2fasta. + +It probably pays to familiarize yourself with CWL and its concepts. We +believe it has a lot going for it though CWL is some steps removed +from traditional shell scripts for running work flows. Main thing to +understand is that CWL is a separation of concerns, i.e., + +1. Data +2. Tools +3. Flow + +and each of these are described separately. This contrasts largely +with shell scripts (though you can invoke shell scripts from CWL). +Also, CWL is written in JSON/YAML, which means everything can be parsed +as a tree and you can easily get visualisations such as + +@@html: <a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/"> +<img src="https://hpc.guix.info/static/images/blog/cwl-provenance-graph.png" /></a>@@ + +For more see [[https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/][Creating a reproducible workflow with CWL]] by Pjotr Prins. + +* Using the Arvados API + +Arvados provides a rich API for accessing internals of the Cloud +infrastructure. + +In above script ~bh20-seq-analyzer~ there are examples of querying the +[[https://doc.arvados.org/api/index.html][Arvados API]] using the [[https://pypi.org/project/arvados-python-client/][Python Arvados client and libraries]]. 
For example +get a list of [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L228][projects]] in Arvados. Main thing is to get the +~ARVADOS-API-HOST~ and ~ARVADOS-API-TOKEN~ right as is shown above. |