<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2020-11-10 Tue 05:08 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>COVID-19 PubSeq - Arvados</title>
<meta name="generator" content="Org mode" />
<meta name="author" content="Pjotr Prins" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center;
margin-bottom: .2em; }
.subtitle { text-align: center;
font-size: medium;
font-weight: bold;
margin-top:0; }
.todo { font-family: monospace; color: red; }
.done { font-family: monospace; color: green; }
.priority { font-family: monospace; color: orange; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.org-right { margin-left: auto; margin-right: 0px; text-align: right; }
.org-left { margin-left: 0px; margin-right: auto; text-align: left; }
.org-center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: auto;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline;}
/* Languages per Org manual */
pre.src-asymptote:before { content: 'Asymptote'; }
pre.src-awk:before { content: 'Awk'; }
pre.src-C:before { content: 'C'; }
/* pre.src-C++ doesn't work in CSS */
pre.src-clojure:before { content: 'Clojure'; }
pre.src-css:before { content: 'CSS'; }
pre.src-D:before { content: 'D'; }
pre.src-ditaa:before { content: 'ditaa'; }
pre.src-dot:before { content: 'Graphviz'; }
pre.src-calc:before { content: 'Emacs Calc'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-fortran:before { content: 'Fortran'; }
pre.src-gnuplot:before { content: 'gnuplot'; }
pre.src-haskell:before { content: 'Haskell'; }
pre.src-hledger:before { content: 'hledger'; }
pre.src-java:before { content: 'Java'; }
pre.src-js:before { content: 'Javascript'; }
pre.src-latex:before { content: 'LaTeX'; }
pre.src-ledger:before { content: 'Ledger'; }
pre.src-lisp:before { content: 'Lisp'; }
pre.src-lilypond:before { content: 'Lilypond'; }
pre.src-lua:before { content: 'Lua'; }
pre.src-matlab:before { content: 'MATLAB'; }
pre.src-mscgen:before { content: 'Mscgen'; }
pre.src-ocaml:before { content: 'Objective Caml'; }
pre.src-octave:before { content: 'Octave'; }
pre.src-org:before { content: 'Org mode'; }
pre.src-oz:before { content: 'OZ'; }
pre.src-plantuml:before { content: 'Plantuml'; }
pre.src-processing:before { content: 'Processing.js'; }
pre.src-python:before { content: 'Python'; }
pre.src-R:before { content: 'R'; }
pre.src-ruby:before { content: 'Ruby'; }
pre.src-sass:before { content: 'Sass'; }
pre.src-scheme:before { content: 'Scheme'; }
pre.src-screen:before { content: 'Gnu Screen'; }
pre.src-sed:before { content: 'Sed'; }
pre.src-sh:before { content: 'shell'; }
pre.src-sql:before { content: 'SQL'; }
pre.src-sqlite:before { content: 'SQLite'; }
/* additional languages in org.el's org-babel-load-languages alist */
pre.src-forth:before { content: 'Forth'; }
pre.src-io:before { content: 'IO'; }
pre.src-J:before { content: 'J'; }
pre.src-makefile:before { content: 'Makefile'; }
pre.src-maxima:before { content: 'Maxima'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-picolisp:before { content: 'Pico Lisp'; }
pre.src-scala:before { content: 'Scala'; }
pre.src-shell:before { content: 'Shell Script'; }
pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
/* additional language identifiers per "defun org-babel-execute"
in ob-*.el */
pre.src-cpp:before { content: 'C++'; }
pre.src-abc:before { content: 'ABC'; }
pre.src-coq:before { content: 'Coq'; }
pre.src-groovy:before { content: 'Groovy'; }
/* additional language identifiers from org-babel-shell-names in
ob-shell.el: ob-shell is the only babel language using a lambda to put
the execution function name together. */
pre.src-bash:before { content: 'bash'; }
pre.src-csh:before { content: 'csh'; }
pre.src-ash:before { content: 'ash'; }
pre.src-dash:before { content: 'dash'; }
pre.src-ksh:before { content: 'ksh'; }
pre.src-mksh:before { content: 'mksh'; }
pre.src-posh:before { content: 'posh'; }
/* Additional Emacs modes also supported by the LaTeX listings package */
pre.src-ada:before { content: 'Ada'; }
pre.src-asm:before { content: 'Assembler'; }
pre.src-caml:before { content: 'Caml'; }
pre.src-delphi:before { content: 'Delphi'; }
pre.src-html:before { content: 'HTML'; }
pre.src-idl:before { content: 'IDL'; }
pre.src-mercury:before { content: 'Mercury'; }
pre.src-metapost:before { content: 'MetaPost'; }
pre.src-modula-2:before { content: 'Modula-2'; }
pre.src-pascal:before { content: 'Pascal'; }
pre.src-ps:before { content: 'PostScript'; }
pre.src-prolog:before { content: 'Prolog'; }
pre.src-simula:before { content: 'Simula'; }
pre.src-tcl:before { content: 'tcl'; }
pre.src-tex:before { content: 'TeX'; }
pre.src-plain-tex:before { content: 'Plain TeX'; }
pre.src-verilog:before { content: 'Verilog'; }
pre.src-vhdl:before { content: 'VHDL'; }
pre.src-xml:before { content: 'XML'; }
pre.src-nxml:before { content: 'XML'; }
/* add a generic configuration mode; LaTeX export needs an additional
(add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
pre.src-conf:before { content: 'Configuration File'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.org-right { text-align: center; }
th.org-left { text-align: center; }
th.org-center { text-align: center; }
td.org-right { text-align: right; }
td.org-left { text-align: left; }
td.org-center { text-align: center; }
dt { font-weight: bold; }
.footpara { display: inline; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.equation-container {
display: table;
text-align: center;
width: 100%;
}
.equation {
vertical-align: middle;
}
.equation-label {
display: table-cell;
text-align: right;
vertical-align: middle;
}
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
.org-svg { width: 90%; }
/*]]>*/-->
</style>
<link rel="Blog stylesheet" type="text/css" href="blog.css" />
<script type="text/javascript">
// @license magnet:?xt=urn:btih:e95b018ef3580986a04669f1b5879592219e2a7a&dn=public-domain.txt Public Domain
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.classList.add("code-highlighted");
target.classList.add("code-highlighted");
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.classList.remove("code-highlighted");
target.classList.remove("code-highlighted");
}
}
/*]]>*///-->
// @license-end
</script>
</head>
<body>
<div id="org-div-home-and-up">
<a accesskey="h" href=""> UP </a>
|
<a accesskey="H" href="http://covid19.genenetwork.org"> HOME </a>
</div><div id="content">
<h1 class="title">COVID-19 PubSeq - Arvados</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#org10ef830">1. The Arvados Web Server</a></li>
<li><a href="#orgb6a7a42">2. The Arvados file interface</a></li>
<li><a href="#org0c7b94e">3. The PubSeq Arvados shell</a></li>
<li><a href="#org756005d">4. Wiring up CWL</a></li>
<li><a href="#orgf30b46f">5. Using the Arvados API</a></li>
<li><a href="#org3af3122">6. Troubleshooting</a></li>
</ul>
</div>
</div>
<div id="outline-container-org10ef830" class="outline-2">
<h2 id="org10ef830"><span class="section-number-2">1</span> The Arvados Web Server</h2>
<div class="outline-text-2" id="text-1">
<p>
We are using Arvados to run Common Workflow Language (CWL) pipelines.
The most recent output is on display on a <a href="https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca">web page</a> (with time stamp)
and a full output list is generated <a href="https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/">here</a>.
</p>
<p>
Arvados has a web front end that allows navigation through input and output data,
workflows and the output of analysis pipelines (here CWL workflows).
</p>
<p>
<img src="static/image/arvados-workflow-output.png" alt="Arvados workflow output" />
</p>
</div>
</div>
<div id="outline-container-orgb6a7a42" class="outline-2">
<h2 id="orgb6a7a42"><span class="section-number-2">2</span> The Arvados file interface</h2>
<div class="outline-text-2" id="text-2">
<p>
Besides the web server, Arvados has a REST API and associated
command line tools. We already use the <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L27">API</a> to upload data. If
you install the Arvados API with pip or by following the GNU Guix
instructions in <a href="../INSTALL.md">../INSTALL.md</a>, you will find the following command
line tools (also documented <a href="https://doc.arvados.org/v2.0/sdk/cli/subcommands.html">here</a>):
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<colgroup>
<col class="org-left" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Command</th>
<th scope="col" class="org-left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">arv-ls</td>
<td class="org-left">list files in Arvados</td>
</tr>
<tr>
<td class="org-left">arv-put</td>
<td class="org-left">upload a file to Arvados</td>
</tr>
<tr>
<td class="org-left">arv-get</td>
<td class="org-left">fetch data from Arvados Keep by collection UUID or locator; when given only a collection UUID it prints the collection manifest, which lists the files and their block locators</td>
</tr>
</tbody>
</table>
<p>
Because this is a public instance, we can use the tokens from
the <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16">uploader</a>:
</p>
<div class="org-src-container">
<pre class="src src-sh"><span style="color: #ff8A65;">export</span> <span style="color: #ffcc80;">ARVADOS_API_HOST</span>=<span style="color: #9ccc65;">'lugli.arvadosapi.com'</span>
<span style="color: #ff8A65;">export</span> <span style="color: #ffcc80;">ARVADOS_API_TOKEN</span>=<span style="color: #9ccc65;">'2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'</span>
arv-ls lugli-4zz18-z513nlpqm03hpca
</pre>
</div>
<p>
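This lists all files in the collection (we got the UUID from the Arvados
results page). To get the UUIDs of the individual files, fetch the
anonymous user token and pass it to <code>arv-get</code>: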
</p>
<div class="org-src-container">
<pre class="src src-sh">curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
env <span style="color: #ffcc80;">ARVADOS_API_TOKEN</span>=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh <span style="color: #9ccc65;">\</span>
arv-get lugli-4zz18-z513nlpqm03hpca
</pre>
</div>
<p>
Then fetch one of the listed JSON files, <code>chunk001_bin4000.schematic.json</code>,
using its listed UUID:
</p>
<pre class="example">
arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
</pre>
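<p>
The two data-handling steps above can also be sketched in Python. This
is only an illustration: the function and class names are made up for
this page, and the sketch assumes that the identifier passed to
<code>arv-get</code> follows the Keep locator layout
<code>hash+size+Asignature@expiry</code>, with the expiry written as a
hexadecimal Unix timestamp.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
import json
from dataclasses import dataclass, field


def anonymous_token(config_text: str) -> str:
    """Return Users.AnonymousUserToken from the cluster config JSON text."""
    config = json.loads(config_text)
    return config["Users"]["AnonymousUserToken"]


@dataclass
class KeepLocator:
    """Illustrative container for the parts of a locator (assumed layout)."""
    md5: str                 # hex digest of the block contents
    size: int                # block size in bytes
    signature: str = ""      # permission signature taken from the +A hint
    expires: int = 0         # signature expiry, seconds since the epoch
    other_hints: list = field(default_factory=list)


def parse_locator(locator: str) -> KeepLocator:
    """Split 'md5+size+Asignature@expiry' into its parts."""
    md5, size, *hints = locator.split("+")
    parsed = KeepLocator(md5=md5, size=int(size))
    for hint in hints:
        if hint.startswith("A") and "@" in hint:
            signature, expiry_hex = hint[1:].split("@", 1)
            parsed.signature = signature
            parsed.expires = int(expiry_hex, 16)  # expiry is hexadecimal
        else:
            parsed.other_hints.append(hint)
    return parsed


# The locator used in the arv-get example above.
locator = parse_locator(
    "2be6af7b4741f2a5c5f8ff2bc6152d73+1955623"
    "+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5"
)
print(locator.size, locator.expires)
```
</pre>
</div>
<p>
The size field (1955623 bytes) matches the size shown in the locator, and
the expiry explains why a signed locator copied from an old listing may
stop working after some time.
</p>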
</div>
</div>
<div id="outline-container-org0c7b94e" class="outline-2">
<h2 id="org0c7b94e"><span class="section-number-2">3</span> The PubSeq Arvados shell</h2>
<div class="outline-text-2" id="text-3">
<p>
When you log in to Arvados (you can request permission from us) you can
upload an SSH key in your profile and get a shell prompt
with
</p>
<pre class="example">
ssh pjotrpbl@shell.lugli.arvadosapi.com
Linux ip-10-255-0-202 4.19.0-9-cloud-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
</pre>
<p>
It is a small Debian VM hosted on AWS. The PubSeq material
is mounted on <code>/data/pubseq</code> and the log is in <code>nohup.out</code>. You can
update or edit the code (the bh20-seq-resource git checkout) and restart
the service (with the run script). The log says
</p>
<pre class="example">
you should have permission to read the log (nohup.out) update / edit the code (bh20-seq-resource git checkout) and restart the service (the run script)
</pre>
<p>
which means it will trigger the run on upload. The service runs inside a
Python virtualenv:
</p>
<pre class="example">
/data/pubseq/bh20-seq-resource/venv3/bin/python3 /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer --no-start-analysis
</pre>
<p>
and is restarted by a <code>run</code> script:
</p>
<pre class="example">
/data/pubseq/run [options]
</pre>
<p>
The run script kills the old process, sets up the API tokens, pulls
the git repo and starts a new run by calling
<code>/data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer</code>, which is
essentially <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L354">monitoring</a> for uploads.
</p>
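<p>
To make "monitoring for uploads" concrete, here is a minimal sketch of
a polling loop. It is not the code from <code>bh20-seq-analyzer</code>:
the names <code>watch_uploads</code>, <code>list_pending</code> and
<code>handle</code>, as well as the 15 second interval, are assumptions
chosen for illustration, and the callables stand in for real Arvados
queries.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
import time
from typing import Callable, Iterable


def watch_uploads(
    list_pending: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
    interval: float = 15.0,
    once: bool = False,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll for pending uploads and hand each one to `handle`."""
    while True:
        for upload in list_pending():
            handle(upload)
        if once:
            return  # single pass, similar in spirit to a --once switch
        sleep(interval)


# Stand-in callables; a real service would query Arvados here.
queue = ["upload-1", "upload-2"]
seen = []


def fake_list_pending():
    """Return the queued names and empty the queue."""
    pending = list(queue)
    queue.clear()
    return pending


watch_uploads(fake_list_pending, seen.append, once=True)
print(seen)
```
</pre>
</div>
<p>
Passing the listing and handling steps in as callables keeps the loop
itself easy to test without touching the server.
</p>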
<p>
Running <code>run --help</code> prints:
</p>
<pre class="example" id="org93c3a8a">
optional arguments:
-h, --help show this help message and exit
--uploader-project UPLOADER_PROJECT
--pangenome-analysis-project PANGENOME_ANALYSIS_PROJECT
--fastq-project FASTQ_PROJECT
--validated-project VALIDATED_PROJECT
--workflow-def-project WORKFLOW_DEF_PROJECT
--pangenome-workflow-uuid PANGENOME_WORKFLOW_UUID
--fastq-workflow-uuid FASTQ_WORKFLOW_UUID
--exclude-list EXCLUDE_LIST
--latest-result-collection LATEST_RESULT_COLLECTION
--kickoff
--no-start-analysis
--once
--print-status PRINT_STATUS
--revalidate
</pre>
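<p>
The help text has the shape produced by Python's <code>argparse</code>
module. As a rough sketch (not the actual parser from the repository),
the options above could be declared as follows. The assumption here is
that options shown with an upper-case placeholder take a string value
and that the others are on/off switches; the program name is also an
assumption.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
import argparse

# Options listed with a placeholder such as UPLOADER_PROJECT.
VALUE_OPTIONS = (
    "--uploader-project",
    "--pangenome-analysis-project",
    "--fastq-project",
    "--validated-project",
    "--workflow-def-project",
    "--pangenome-workflow-uuid",
    "--fastq-workflow-uuid",
    "--exclude-list",
    "--latest-result-collection",
    "--print-status",
)

# Options listed without a placeholder, treated here as switches.
SWITCH_OPTIONS = ("--kickoff", "--no-start-analysis", "--once", "--revalidate")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the option names shown in the help text."""
    parser = argparse.ArgumentParser(prog="bh20-seq-analyzer")
    for name in VALUE_OPTIONS:
        parser.add_argument(name, type=str, default=None)
    for name in SWITCH_OPTIONS:
        parser.add_argument(name, action="store_true")
    return parser


# argparse turns dashes into underscores for attribute names.
args = build_parser().parse_args(["--no-start-analysis", "--once"])
print(args.no_start_analysis, args.once, args.kickoff)
```
</pre>
</div>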
</div>
</div>
<div id="outline-container-org756005d" class="outline-2">
<h2 id="org756005d"><span class="section-number-2">4</span> Wiring up CWL</h2>
<div class="outline-text-2" id="text-4">
<p>
In the <code>bh20-seq-analyzer</code> script above you can see that the <a href="https://www.commonwl.org/">Common
Workflow Language</a> (CWL) gets <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L233">triggered</a>; for example <a href="https://github.com/arvados/bh20-seq-resource/tree/master/workflows/fastq2fasta">fastq2fasta</a>, which
is part of the main repo. The actual workflow is in <a href="https://github.com/arvados/bh20-seq-resource/blob/master/workflows/fastq2fasta/fastq2fasta.cwl">fastq2fasta.cwl</a> and
runs the following tools in sequence: bwa-mem, samtools-view,
samtools-sort, and bam2fasta.
</p>
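<p>
A workflow engine derives that sequence from the dependencies between
steps rather than from the order in which they are written down. The
sketch below shows the idea with Python's standard
<code>graphlib</code> module (Python 3.9 or later). The dependency
table is written by hand from the tool list above and is an assumption,
not something read from <code>fastq2fasta.cwl</code>.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps whose output it consumes.
steps = {
    "bam2fasta": {"samtools-sort"},
    "samtools-sort": {"samtools-view"},
    "samtools-view": {"bwa-mem"},
    "bwa-mem": set(),
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(steps).static_order())
print(" -> ".join(order))
```
</pre>
</div>
<p>
Even though the table lists the steps in reverse, the computed order
starts with bwa-mem and ends with bam2fasta.
</p>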
<p>
It probably pays to familiarize yourself with CWL and its concepts. We
believe it has a lot going for it, though CWL is some steps removed
from traditional shell scripts for running workflows. The main thing to
understand is that CWL enforces a separation of concerns, i.e.,
</p>
<ol class="org-ol">
<li>Data</li>
<li>Tools</li>
<li>Flow</li>
</ol>
<p>
and each of these is described separately. This contrasts sharply
with shell scripts (though you can invoke shell scripts from CWL).
Also, CWL is written in JSON/YAML, which means everything can be parsed
as a tree and you can easily get visualisations such as
</p>
<p>
<a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/">
<img src="https://hpc.guix.info/static/images/blog/cwl-provenance-graph.png" alt="CWL provenance graph" /></a>
</p>
<p>
For more see <a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/">Creating a reproducible workflow with CWL</a> by Pjotr Prins.
</p>
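<p>
As a small illustration of "everything can be parsed as a tree", the
sketch below loads a CWL-like workflow written in JSON and turns the
links between steps into Graphviz DOT text. The workflow document, its
input and output names and the <code>.cwl</code> file names are
hypothetical and only loosely modelled on the fastq2fasta steps; the
real workflow is written in YAML and is more detailed.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
import json

# Hypothetical, simplified workflow description (not the real fastq2fasta.cwl).
WORKFLOW_JSON = """
{
  "cwlVersion": "v1.0",
  "class": "Workflow",
  "inputs": {"reads": "File", "reference": "File"},
  "outputs": {"fasta": {"type": "File", "outputSource": "bam2fasta/fasta"}},
  "steps": {
    "bwa-mem": {
      "run": "bwa-mem.cwl",
      "in": {"reads": "reads", "reference": "reference"},
      "out": ["sam"]
    },
    "samtools-view": {
      "run": "samtools-view.cwl",
      "in": {"sam": "bwa-mem/sam"},
      "out": ["bam"]
    },
    "samtools-sort": {
      "run": "samtools-sort.cwl",
      "in": {"bam": "samtools-view/bam"},
      "out": ["sorted"]
    },
    "bam2fasta": {
      "run": "bam2fasta.cwl",
      "in": {"bam": "samtools-sort/sorted"},
      "out": ["fasta"]
    }
  }
}
"""


def step_edges(workflow: dict) -> list:
    """Return (producer, consumer) pairs for step-to-step links."""
    edges = []
    for name, step in workflow["steps"].items():
        for source in step["in"].values():
            if "/" in source:  # "step/output" refers to another step
                producer = source.split("/", 1)[0]
                edges.append((producer, name))
    return edges


def to_dot(edges: list) -> str:
    """Render the edges as a Graphviz digraph."""
    lines = ["digraph workflow {"]
    lines += [f'  "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


edges = step_edges(json.loads(WORKFLOW_JSON))
print(to_dot(edges))
```
</pre>
</div>
<p>
The printed text can be fed to Graphviz <code>dot</code> to draw the graph.
</p>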
</div>
</div>
<div id="outline-container-orgf30b46f" class="outline-2">
<h2 id="orgf30b46f"><span class="section-number-2">5</span> Using the Arvados API</h2>
<div class="outline-text-2" id="text-5">
<p>
Arvados provides a rich API for accessing the internals of the cloud
infrastructure.
</p>
<p>
In the <code>bh20-seq-analyzer</code> script above there are examples of querying the
<a href="https://doc.arvados.org/api/index.html">Arvados API</a> using the <a href="https://pypi.org/project/arvados-python-client/">Python Arvados client and libraries</a>, for example
getting a list of <a href="https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L228">projects</a> in Arvados. The main thing is to get
<code>ARVADOS_API_HOST</code> and <code>ARVADOS_API_TOKEN</code> right, as shown above.
</p>
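<p>
A minimal sketch of such a query with the Python client might look like
this. It is not copied from <code>bh20-seq-analyzer</code>; the helper
names are made up, and the network call is kept inside a function that
is not executed here because it needs the
<code>arvados-python-client</code> package plus a valid host and token.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
def project_filters(owner_uuid=None) -> list:
    """Build an Arvados filter list that selects project groups."""
    filters = [["group_class", "=", "project"]]
    if owner_uuid:
        filters.append(["owner_uuid", "=", owner_uuid])
    return filters


def list_projects(owner_uuid=None) -> list:
    """Return (uuid, name) pairs for the projects visible to the token."""
    import arvados  # third-party: pip install arvados-python-client

    # arvados.api() picks up ARVADOS_API_HOST and ARVADOS_API_TOKEN
    # from the environment.
    api = arvados.api("v1")
    response = api.groups().list(filters=project_filters(owner_uuid)).execute()
    return [(group["uuid"], group["name"]) for group in response["items"]]


# Only the filter is built here; call list_projects() yourself once the
# environment variables are set, for example:
#   for uuid, name in list_projects("lugli-j7d0g-825x3r5vcs41dus"): ...
print(project_filters("lugli-j7d0g-825x3r5vcs41dus"))
```
</pre>
</div>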
</div>
</div>
<div id="outline-container-org3af3122" class="outline-2">
<h2 id="org3af3122"><span class="section-number-2">6</span> Troubleshooting</h2>
<div class="outline-text-2" id="text-6">
<p>
When workflows have errors, check the logs in Arvados.
</p>
<p>
Go to the <a href="https://workbench.lugli.arvadosapi.com/projects/lugli-j7d0g-825x3r5vcs41dus">project</a> page for 'COVID-19-BH20 Shared Project' -&gt; 'Public
Sequence Resource'. Click on analysis runs
(<a href="https://workbench.lugli.arvadosapi.com/projects/lugli-j7d0g-y4k4uswcqi3ku56">https://workbench.lugli.arvadosapi.com/projects/lugli-j7d0g-y4k4uswcqi3ku56</a>)
and then on 'Subprojects'. Click one of the runs and then on 'Processes' and you'll
see which parts failed.
</p>
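<p>
The same check can be scripted once you have container records from the
API. The sketch below is an assumption-laden illustration: it treats a
container as failed when its <code>state</code> is
<code>Cancelled</code>, or when it is <code>Complete</code> with a
non-zero <code>exit_code</code>, and it runs on hand-written sample
records rather than on real API output.
</p>
<div class="org-src-container">
<pre class="src src-python">
```python
def failed_containers(containers: list) -> list:
    """Pick containers that were cancelled or exited with a non-zero code."""
    failed = []
    for container in containers:
        cancelled = container.get("state") == "Cancelled"
        bad_exit = (
            container.get("state") == "Complete"
            and container.get("exit_code") not in (0, None)
        )
        if cancelled or bad_exit:
            failed.append(container["uuid"])
    return failed


# Hand-written sample records; real ones would come from the API.
sample = [
    {"uuid": "example-1", "state": "Complete", "exit_code": 0},
    {"uuid": "example-2", "state": "Complete", "exit_code": 1},
    {"uuid": "example-3", "state": "Cancelled", "exit_code": None},
    {"uuid": "example-4", "state": "Running", "exit_code": None},
]
print(failed_containers(sample))
```
</pre>
</div>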
</div>
</div>
</div>
<div id="postamble" class="status">
<hr /><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-11-09 Mon 01:20.</small>
</div>
</body>
</html>