path: root/doc/blog/using-covid-19-pubseq-part2.org
Diffstat (limited to 'doc/blog/using-covid-19-pubseq-part2.org')
-rw-r--r--  doc/blog/using-covid-19-pubseq-part2.org | 78
1 file changed, 73 insertions(+), 5 deletions(-)
diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org
index 349fd06..c44b5c7 100644
--- a/doc/blog/using-covid-19-pubseq-part2.org
+++ b/doc/blog/using-covid-19-pubseq-part2.org
@@ -11,6 +11,8 @@
* Table of Contents :TOC:noexport:
- [[#finding-output-of-workflows][Finding output of workflows]]
- [[#the-arvados-file-interface][The Arvados file interface]]
+ - [[#the-pubseq-arvados-shell][The PubSeq Arvados shell]]
+ - [[#wiring-up-cwl][Wiring up CWL]]
- [[#using-the-arvados-api][Using the Arvados API]]
* Finding output of workflows
@@ -38,24 +40,90 @@ installing Arvados API you'll find the following command line tools
Now, this is a public instance so we can use the tokens from
the [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/main.py#L16][uploader]].
-#+BEGIN_SOURCE sh
+#+BEGIN_SRC sh
export ARVADOS_API_HOST='lugli.arvadosapi.com'
export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
arv-ls lugli-4zz18-z513nlpqm03hpca
-#+END_SOURCE
+#+END_SRC
will list all files (the UUID we got from the Arvados results page). To
get the UUID of the files
-#+BEGIN_SOURCE sh
+#+BEGIN_SRC sh
curl https://lugli.arvadosapi.com/arvados/v1/config | jq .Users.AnonymousUserToken
env ARVADOS_API_TOKEN=5o42qdxpxp5cj15jqjf7vnxx5xduhm4ret703suuoa3ivfglfh \
arv-get lugli-4zz18-z513nlpqm03hpca
-#+END_SOURCE
+#+END_SRC
and fetch one listed JSON file ~chunk001_bin4000.schematic.json~ with
its listed UUID:
: arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
-* TODO Using the Arvados API
+* The PubSeq Arvados shell
+
+When you log in to Arvados (you can request access from us) you can
+upload an ssh key in your profile and get a shell prompt with
+
+: ssh pjotrpbl@shell.lugli.arvadosapi.com
+: Linux ip-10-255-0-202 4.19.0-9-cloud-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
+
+
+It is a small Debian VM hosted on AWS. The PubSeq material is mounted
+on ~/data/pubseq~ and the service log is written to ~nohup.out~. To
+deploy a change, update/edit the code (the bh20-seq-resource git
+checkout) and restart the service (the run script). As the log puts it:
+
+: you should have permission to read the log (nohup.out) update / edit the code (bh20-seq-resource git checkout) and restart the service (the run script)
+
+Once restarted, the service again triggers a run on every new upload.
+It runs from a Python virtualenv:
+
+: /data/pubseq/bh20-seq-resource/venv3/bin/python3 /data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer --no-start-analysis
+
+and is restarted by a ~run~ script:
+
+: /data/pubseq/run [options]
+
+The run script kills the old process, sets up the API tokens, pulls
+the git repo and starts a new run by calling into
+/data/pubseq/bh20-seq-resource/venv3/bin/bh20-seq-analyzer, which
+essentially is a loop [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L354][monitoring]] for new uploads.
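+
+A minimal sketch of what such a run script amounts to (the actual
+~/data/pubseq/run~ script may differ in its details, and the token
+below is a placeholder):
+
+#+BEGIN_SRC sh
+#!/bin/sh
+# Stop the analyzer that is currently running (if any)
+pkill -f bh20-seq-analyzer || true
+
+# Credentials for the lugli instance (placeholder token)
+export ARVADOS_API_HOST='lugli.arvadosapi.com'
+export ARVADOS_API_TOKEN='...'
+
+# Update the checkout and restart the monitor in the background;
+# nohup appends its output to nohup.out, the log mentioned above
+cd /data/pubseq/bh20-seq-resource
+git pull
+nohup venv3/bin/python3 venv3/bin/bh20-seq-analyzer --no-start-analysis &
+#+END_SRC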
+
+* Wiring up CWL
+
+In the ~bh20-seq-analyzer~ script above you can see that [[https://www.commonwl.org/][Common
+Workflow Language]] (CWL) workflows get [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L233][triggered]], for example
+[[https://github.com/arvados/bh20-seq-resource/tree/master/workflows/fastq2fasta][fastq2fasta]], which is part of the main repo. The actual workflow is
+defined in [[https://github.com/arvados/bh20-seq-resource/blob/master/workflows/fastq2fasta/fastq2fasta.cwl][fastq2fasta.cwl]] and runs the following tools in sequence:
+bwa-mem, samtools-view, samtools-sort, and bam2fasta.
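+
+If you want to poke at the workflow outside of Arvados you can use
+the reference CWL runner. A small sketch (assuming ~cwltool~ is
+installed from PyPI; the workflow's input names are listed by the
+~--help~ call rather than spelled out here):
+
+#+BEGIN_SRC sh
+# Install the reference CWL runner
+pip install cwltool
+
+# Check the workflow is well-formed and see which inputs it expects
+cwltool --validate workflows/fastq2fasta/fastq2fasta.cwl
+cwltool workflows/fastq2fasta/fastq2fasta.cwl --help
+#+END_SRC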
+
+It pays to familiarize yourself with CWL and its concepts. We believe
+it has a lot going for it, even though CWL is some steps removed from
+traditional shell scripts for running workflows. The main thing to
+understand is that CWL enforces a separation of concerns between
+
+1. Data
+2. Tools
+3. Flow
+
+and each of these is described separately. This contrasts sharply
+with shell scripts (though you can invoke shell scripts from CWL).
+Also, CWL is written in JSON/YAML, which means everything can be
+parsed as a tree and you can easily generate visualisations such as
+
+@@html: <a href="https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/">
+<img src="https://hpc.guix.info/static/images/blog/cwl-provenance-graph.png" /></a>@@
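+
+One way to produce such a graph yourself is to ask the CWL runner for
+a Graphviz rendering (a sketch, assuming ~cwltool~ and Graphviz's
+~dot~ are installed):
+
+#+BEGIN_SRC sh
+# Emit the workflow as a Graphviz graph and render it to PNG
+cwltool --print-dot workflows/fastq2fasta/fastq2fasta.cwl | dot -Tpng > fastq2fasta.png
+#+END_SRC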
+
+For more see [[https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/][Creating a reproducible workflow with CWL]] by Pjotr Prins.
+
+* Using the Arvados API
+
+Arvados provides a rich API for accessing the internals of the cloud
+infrastructure.
+
+The ~bh20-seq-analyzer~ script above contains examples of querying the
+[[https://doc.arvados.org/api/index.html][Arvados API]] using the [[https://pypi.org/project/arvados-python-client/][Python Arvados client and libraries]], for example
+to get a list of [[https://github.com/arvados/bh20-seq-resource/blob/2baa88b766ec540bd34b96599014dd16e393af39/bh20seqanalyzer/main.py#L228][projects]] in Arvados. The main thing is to get
+~ARVADOS_API_HOST~ and ~ARVADOS_API_TOKEN~ right, as shown above.
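+
+The same kind of project query can also be done from the command line
+against the REST API; a sketch using ~curl~ and ~jq~ with the tokens
+from above (the Python client in ~bh20-seq-analyzer~ does the
+equivalent through its library calls):
+
+#+BEGIN_SRC sh
+export ARVADOS_API_HOST='lugli.arvadosapi.com'
+export ARVADOS_API_TOKEN='2fbebpmbo3rw3x05ueu2i6nx70zhrsb1p22ycu3ry34m4x4462'
+
+# List groups of class "project" and show their UUIDs and names
+curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
+  -G "https://$ARVADOS_API_HOST/arvados/v1/groups" \
+  --data-urlencode 'filters=[["group_class","=","project"]]' \
+  | jq '.items[] | {uuid, name}'
+#+END_SRC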