Diffstat (limited to 'doc')
-rw-r--r-- | doc/INSTALL.md | 9
-rw-r--r-- | doc/blog/using-covid-19-pubseq-part1.org | 159
2 files changed, 140 insertions, 28 deletions
diff --git a/doc/INSTALL.md b/doc/INSTALL.md index 8f4b2f0..d9948e6 100644 --- a/doc/INSTALL.md +++ b/doc/INSTALL.md @@ -1,3 +1,5 @@ +#+OPTIONS: ^:nil + # INSTALLATION Other options for running this tool. @@ -29,12 +31,15 @@ arvados-python-client-2.0.1 ciso8601-2.1.3 future-0.18.2 google-api-python-clien 3. Run the tool directly with ```sh -guix environment guix --ad-hoc git python openssl python-pycurl python-magic nss-certs python-pyshex -- python3 bh20sequploader/main.py example/sequence.fasta example/metadata.yaml +guix environment guix --ad-hoc git python openssl python-pycurl python-magic nss-certs python-pyshex -- python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml ``` Note that python-pyshex is packaged in http://git.genenetwork.org/guix-bioinformatics/guix-bioinformatics +so you'll need it to the GUIX_PACKAGE_PATH - see the README in that +repository. + ### Using the Web Uploader To run the web uploader in a GNU Guix environment/container @@ -50,3 +55,5 @@ guix environment guix --ad-hoc git python python-flask python-pyyaml python-pycu ``` WIP: add gunicorn container + +Note: see above on GUIX_PACKAGE_PATH. diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org index 647165d..617a01d 100644 --- a/doc/blog/using-covid-19-pubseq-part1.org +++ b/doc/blog/using-covid-19-pubseq-part1.org @@ -68,53 +68,151 @@ If you download the GFA or FASTA sequences you'll find sequences are named something like *keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta* which refers to an internal Arvados Keep representation of the FASTA -sequence. Keep is content-addressable which means that +sequence. Keep is content-addressable which means that the value e17abc8a0269875ed4cfbff5d9897c6c uniquely identifies the file by its -contents. If the contents change, the identifier would change! We use +contents. If the contents change, the identifier changes! We use these identifiers throughout. 
-* Fetch submitter info and other metadata +* Predicates + +Lets look at all the predicates in the dataset by pasting +the following in a SPARQL end point http://sparql.genenetwork.org/sparql/ + +#+begin_src sql +select distinct ?p +{ + ?o ?p ?s +} +#+end_src + +you can ignore the openlink and w3 ones. To reduce results to a named +graph set the default graph to +http://covid-19.genenetwork.org/graph/metadata.ttl in the top input +box. There you can find a predicate for submitter that looks like +http://biohackathon.org/bh20-seq-schema#MainSchema/submitter. -We are interested in e17abc8a0269875ed4cfbff5d9897c6c and now we -want to get some metadata. We can use a SPARQL end point hosted at -http://sparql.genenetwork.org/sparql/. Paste in a query like +To list all submitters, try #+begin_src sql -select ?p ?s +select distinct ?s { - <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> ?p ?s + ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?s } #+end_src -which will tell you that original FASTA ID is "MT293175.1". It also -says the submitter is nodeID://b31228. +Oh wait, it returns things like nodeID://b76150! That is not helpful, +these are anonymous nodes in the graph. These point to another triple +and by #+begin_src sql -select distinct ?id ?p ?s +select distinct ?s { - <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id . + ?o <http://biohackathon.org/bh20-seq-schema#MainSchema/submitter> ?id . ?id ?p ?s } #+end_src -Tells you the submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K." -with [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] explaining "The individual who is responsible for the -content of a document." Welcome to the power of the semantic web. +you get a list of all submitters including "University of Washington, +Seattle, WA 98109, USA". 
+ +To lift the full URL out of the query you can use a header like + +#+begin_src sql +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +select distinct ?dataset ?submitter +{ + ?dataset pubseq:submitter ?id . + ?id ?p ?submitter +} +#+end_src + +which reads a bit better. We can also see the datasets. One of them submitted +by University of Washington is +is http://arvados.org/keep:00fede2c6f52b053a14edca01cfa02b7+126/sequence.fasta +(note the ID may have changed so pick one with above query). + + +* Fetch submitter info and other metadata -To get more information about the relevant sample +To get dataests with submitters we can do the above #+begin_src sql -select ?sample ?p ?o +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +select distinct ?dataset ?p ?submitter { - <http://arvados.org/keep:e17abc8a0269875ed4cfbff5d9897c6c+123/sequence.fasta> <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample . - ?sample ?p ?o + ?dataset pubseq:submitter ?id . + ?id ?p ?submitter } #+end_src -we find it originates from Washington state (object -https://www.wikidata.org/wiki/Q1223) , dated "30-Mar-2020". The -sequencing was executed with Illumina and pipeline "custom pipeline -v. 2020-03" which is arguably not that descriptive. +Tells you one submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K." +with a URL [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] (http://purl.obolibrary.org/obo/NCIT_C42781) +explaining "The individual who is responsible for the content of a +document." Well formed URIs point to real information about the URI +itself. Welcome to the power of the semantic web. + +Let's focus on one sample with + +#+begin_src sql +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +select distinct ?dataset ?submitter +{ + ?dataset pubseq:submitter ?id . + ?id ?p ?submitter . + FILTER(CONTAINS(?submitter,"Roychoudhury")) . +} +#+end_src + +That is a lot of samples! 
We just want to pick one, so let's +see if we can get a sample ID by listing sample predicates + +#+begin_src sql +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +select distinct ?p +{ + ?dataset ?p ?o . + ?dataset pubseq:submitter ?id . +} +#+end_src + +which lists a predicate named +http://biohackathon.org/bh20-seq-schema#MainSchema/sample. +Let's zoom in on those of Roychoudhury with + + +#+begin_src sql +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +select distinct ?sid ?sample ?p1 ?dataset ?submitter +{ + ?dataset pubseq:submitter ?id . + ?id ?p ?submitter . + FILTER(CONTAINS(?submitter,"Roychoudhury")) . + ?dataset pubseq:sample ?sid . + ?sid ?p1 ?sample +} +#+end_src + +which shows pretty much [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][everything known]] about their submissions in +this database. Let's focus on one sample "MT326090.1" with predicate +http://semanticscience.org/resource/SIO_000115. + +#+begin_src sql +PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/> +PREFIX sio: <http://semanticscience.org/resource/> +select distinct ?sample ?p ?o +{ + ?sample sio:SIO_000115 "MT326090.1" . + ?sample ?p ?o . 
+} +#+end_src + +This [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT326090.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]] tells us the sample was submitted "2020-03-21" and +originates from http://www.wikidata.org/entity/Q30, i.e., the USA and +is a biospecimen collected from the back of the throat by swabbing. +We can track it back to the original GenBank [[http://identifiers.org/insdc/MT326090.1#sequence][submission]]. + +We have also added country and label data to make it a bit easier +to view/query the database. * Fetch all sequences from Washington state @@ -133,12 +231,19 @@ select ?seq ?sample which lists 300 sequences originating from Washington state! Which is almost half of the set coming out of GenBank. +* Discussion + +The public sequence uploader collects sequences, raw data and +(machine) queriable metadata. Not only that: data gets analyzed in the +pangenome and results are presented immediately. The data can be +referenced in publications and origins are citeable. + * Acknowledgements The overall effort was due to magnificent freely donated input by a great number of people. I particularly want to thank Thomas Liener for the great effort he made with the ontology group in getting ontology's -and schema sorted! Peter Amstutz and Curii helped build the on-demand -compute and back-ends. Thanks also to Michael Crusoe for supporting -the CWL initiative. And without Erik Garrison this initiative would -not have existed! +and schema sorted! Peter Amstutz and [[https://arvados.org/][Arvados/Curii]] helped build the +on-demand compute and back-ends. 
Thanks also to Michael Crusoe for +supporting the [[https://www.commonwl.org/][Common Workflow Language]] initiative. And without Erik +Garrison this initiative would not have existed!
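The blog post added in this diff runs every query through the web form at http://sparql.genenetwork.org/sparql/. For readers who would rather script it, here is a minimal sketch in Python. It assumes the endpoint accepts the same `default-graph-uri`, `query` and `format` GET parameters that appear in the post's "Run Query" links, and that it can return results as `application/sparql-results+json` (the post itself only shows `text/html`). The names `build_query_url`, `run_query` and `bindings_to_rows` are hypothetical helpers, not part of the repository.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint named in the blog post.
ENDPOINT = "http://sparql.genenetwork.org/sparql/"

# The dataset/submitter query from the post, unchanged.
SUBMITTER_QUERY = """\
PREFIX pubseq: <http://biohackathon.org/bh20-seq-schema#MainSchema/>
select distinct ?dataset ?submitter
{
   ?dataset pubseq:submitter ?id .
   ?id ?p ?submitter
}
"""


def build_query_url(query, fmt="application/sparql-results+json",
                    endpoint=ENDPOINT):
    """Build a GET URL for the endpoint.

    The parameter names mirror those visible in the post's query links;
    the JSON result format is an assumption about the server.
    """
    params = {"default-graph-uri": "", "query": query, "format": fmt}
    return endpoint + "?" + urlencode(params)


def run_query(query):
    """Send the query and parse the JSON response (needs network access)."""
    with urlopen(build_query_url(query), timeout=30) as response:
        return json.load(response)


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into dicts of variable name -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


# Example use (requires network access, so it is left commented out):
# for row in bindings_to_rows(run_query(SUBMITTER_QUERY)):
#     print(row["dataset"], row["submitter"])
```

Keeping URL construction separate from the network call means the encoding can be checked offline, and the same helper works for any of the other queries in the post.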