From 7b2d388dbed11384c6a388a5437cca0b8f2914fd Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sun, 19 Jul 2020 09:11:41 +0100
Subject: Wiring up export function

---
 doc/blog/using-covid-19-pubseq-part1.html | 82 +++++++++++++++++++------------
 doc/blog/using-covid-19-pubseq-part1.org  | 22 ++++++---
 doc/blog/using-covid-19-pubseq-part6.org  | 19 ++++++-
 3 files changed, 83 insertions(+), 40 deletions(-)

diff --git a/doc/blog/using-covid-19-pubseq-part1.html b/doc/blog/using-covid-19-pubseq-part1.html
index 0e6136c..5fd86d1 100644
--- a/doc/blog/using-covid-19-pubseq-part1.html
+++ b/doc/blog/using-covid-19-pubseq-part1.html
@@ -3,7 +3,7 @@
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+
COVID-19 PubSeq (part 1)
@@ -248,20 +248,20 @@ for the JavaScript code in this tag.

Table of Contents

-
-

1 What does this mean?

+
+

1 What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one
@@ -313,9 +313,8 @@ initiative!

- -
-

2 Fetch sequence data

+
+

2 Fetch sequence data

The latest run of the pipeline can be viewed here. Each of these
@@ -339,8 +338,8 @@ these identifiers throughout.

-
-

3 Predicates

+
+

3 Predicates

To explore an RDF dataset, the first query we can do is open and gets
@@ -446,15 +445,18 @@ select (COUNT(distinct ?dataset) as ?num)
}

+ +

+Run this query. +

- -
-

4 Fetch submitter info and other metadata

+
+

4 Fetch submitter info and other metadata

-To get dataests with submitters we can do the above
+To get datasets with submitters we can do the above

@@ -467,6 +469,10 @@ select distinct ?dataset ?p ?submitter
+

+Run this query. +

+

Tells you one submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K." with a URL predicate (http://purl.obolibrary.org/obo/NCIT_C42781)
@@ -525,6 +531,10 @@ select distinct ?sid ?sample ?p1 ?dataset ?submitter

+

+Run query. +

+

which shows pretty much everything known about their submissions in this database. Let's focus on one sample "MT326090.1" with predicate
@@ -543,21 +553,26 @@ select distinct ?sample ?p ?o

-This query tells us the sample was submitted "2020-03-21" and
+Run query.
+

+ +

+This query tells us the sample was submitted "2020-03-21" and
originates from http://www.wikidata.org/entity/Q30, i.e., the USA and is a biospecimen collected from the back of the throat by swabbing.
-We can track it back to the original GenBank submission.
+We can track it back to the original GenBank submission using the
+http://identifiers.org/insdc/MT326090.1 link.

We have also added country and label data to make it a bit easier
-to view/query the database.
+to view/query the database and place the sequence on the map.

-
-

5 Fetch all sequences from Washington state

+
+

5 Fetch all sequences from Washington state

Now we know how to get at the origin we can do it the other way round
@@ -574,8 +589,8 @@ and fetch all sequences referring to Washington state

-which lists 300 sequences originating from Washington state! Which is almost
-half of the set coming out of GenBank.
+which lists 300 sequences originating from Washington state! Which in
+April was almost half of the set coming out of GenBank.

@@ -591,12 +606,15 @@ entity is Q43:
}

+ +

+Run query. +

- -
-

6 Discussion

+
+

6 Discussion

The public sequence uploader collects sequences, raw data and
@@ -607,8 +625,8 @@ referenced in publications and origins are citeable.

-
-

7 Acknowledgements

+
+

7 Acknowledgements

The overall effort was due to magnificent freely donated input by a
@@ -623,7 +641,7 @@ Garrison this initiative would not have existed!

-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-07-17 Fri 05:02
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-07-19 Sun 02:32
.
diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org
index 0fd5589..9c8a1c0 100644
--- a/doc/blog/using-covid-19-pubseq-part1.org
+++ b/doc/blog/using-covid-19-pubseq-part1.org
@@ -60,7 +60,6 @@ graph in triples. Soon we will at multi sequence alignments (MSA) and more. Anyone can contribute data, tools and workflows to this initiative!
-
 * Fetch sequence data
 The latest run of the pipeline can be viewed [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][here]]. Each of these
@@ -162,10 +161,11 @@ select (COUNT(distinct ?dataset) as ?num)
 }
 #+end_src
+Run this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+%28COUNT%28distinct+%3Fdataset%29+as+%3Fnum%29%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]].
 * Fetch submitter info and other metadata
-To get dataests with submitters we can do the above
+To get datasets with submitters we can do the above
 #+begin_src sql
 PREFIX pubseq:
@@ -176,6 +176,8 @@ select distinct ?dataset ?p ?submitter
 }
 #+end_src
+Run this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fdataset+%3Fp+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]].
+
 Tells you one submitter is "Roychoudhury,P.;Greninger,A.;Jerome,K."
 with a URL [[http://purl.obolibrary.org/obo/NCIT_C42781][predicate]] (http://purl.obolibrary.org/obo/NCIT_C42781) explaining "The individual who is responsible for the content of a
@@ -223,6 +225,8 @@ select distinct ?sid ?sample ?p1 ?dataset ?submitter
 }
 #+end_src
+Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0APREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D%0D%0A&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]].
+
 which shows pretty much [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0Aselect+distinct+%3Fsid+%3Fsample+%3Fp1+%3Fdataset+%3Fsubmitter%0D%0A%7B%0D%0A+++%3Fdataset+pubseq%3Asubmitter+%3Fid+.%0D%0A+++%3Fid+%3Fp+%3Fsubmitter+.%0D%0A+++FILTER%28CONTAINS%28%3Fsubmitter%2C%22Roychoudhury%22%29%29+.%0D%0A+++%3Fdataset+pubseq%3Asample+%3Fsid+.%0D%0A+++%3Fsid+%3Fp1+%3Fsample%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][everything known]] about their submissions in this database. Let's focus on one sample "MT326090.1" with predicate http://semanticscience.org/resource/SIO_000115.
@@ -237,13 +241,16 @@ select distinct ?sample ?p ?o
 }
 #+end_src
-This [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT326090.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]] tells us the sample was submitted "2020-03-21" and
+Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0APREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT326090.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]].
+
+This query tells us the sample was submitted "2020-03-21" and
 originates from http://www.wikidata.org/entity/Q30, i.e., the USA and is a biospecimen collected from the back of the throat by swabbing.
-We can track it back to the original GenBank [[http://identifiers.org/insdc/MT326090.1#sequence][submission]].
+We can track it back to the original GenBank [[http://identifiers.org/insdc/MT326090.1#sequence][submission]] using the
+http://identifiers.org/insdc/MT326090.1 link.
 We have also added country and label data to make it a bit easier
-to view/query the database.
+to view/query the database and place the sequence on the [[http://covid19.genenetwork.org/][map]].
 * Fetch all sequences from Washington state
@@ -258,8 +265,8 @@ select ?seq ?sample
 }
 #+end_src
-which lists 300 sequences originating from Washington state! Which is almost
-half of the set coming out of GenBank.
+which lists 300 sequences originating from Washington state! Which in
+April was almost half of the set coming out of GenBank.
 Likewise to list all sequences from Turkey we can find the wikidata entity is [[https://www.wikidata.org/wiki/Q43][Q43]]:
@@ -272,6 +279,7 @@ select ?seq ?sample
 }
 #+end_src
+Run [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=%0D%0Aselect+%3Fseq+%3Fsample%0D%0A%7B%0D%0A++++%3Fseq+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2Fsample%3E+%3Fsample+.%0D%0A++++%3Fsample+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGAZ_00000448%3E+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ43%3E%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][query]].
 * Discussion
diff --git a/doc/blog/using-covid-19-pubseq-part6.org b/doc/blog/using-covid-19-pubseq-part6.org
index 8964700..6ee68bb 100644
--- a/doc/blog/using-covid-19-pubseq-part6.org
+++ b/doc/blog/using-covid-19-pubseq-part6.org
@@ -9,11 +9,26 @@
 * Table of Contents :TOC:noexport:
+ - [[#short-version][Short version]]
  - [[#generating-output-for-ebi][Generating output for EBI]]
  - [[#defining-the-ebi-study][Defining the EBI study]]
  - [[#define-the-ebi-sample][Define the EBI sample]]
  - [[#define-the-ebi-sequence][Define the EBI sequence]]
+* Short version
+
+PubSeq can export files that can be uploaded to EBI/ENA. This saves
+you work. Steps are:
+
+1. Register and account for EBI/ENA as explained [[https://ena-docs.readthedocs.io/en/latest/submit/general-guide.html][here]].
+2. Register a study online or use XML files discussed below
+3. Export a sample XML and push to EBI/ENA
+4. Zip sequence data and push to EBI/ENA
+
+Because PubSeq's metadata for is richer than the metadata EBI/ENA asks
+for, it is easy to generate and export the forms using the [[http://covid19.genenetwork.org/export][EXPORT]]
+page.
+
 * Generating output for EBI
 Would it not be great an uploader to PubSeq also can export samples
@@ -81,6 +96,8 @@ also a submission 'command' is required looking like
 #+END_SRC
+Working XML examples we tested can be found [[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/submit_ebi/example][here]].
+
 The webin system accepts such sources using a command like
 : curl -u username:password -F "SUBMISSION=@submission.xml" \
@@ -88,7 +105,7 @@ The webin system accepts such sources using a command like
 as described [[https://ena-docs.readthedocs.io/en/latest/submit/study/programmatic.html#submit-the-xmls-using-curl][here]]. Note that this is the test server. For the final version use www.ebi.ac.uk instead of wwwdev.ebi.ac.uk. You may also
-need the --insecure switch to circumvent certificate checking.
+need the =--insecure= switch to circumvent certificate checking.
 /work in progress (WIP)/
--
cgit v1.2.3
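
The "Run query" links this patch adds to part1.org all follow one pattern: the SPARQL text is form-encoded into the `query` parameter of the sparql.genenetwork.org endpoint, followed by a fixed set of display parameters. A minimal Python sketch of building and decoding such a link is below. The helper names `build_run_query_url` and `extract_query` are hypothetical; the endpoint, the parameter names and the Turkey (Q43) query are copied from the links in the patch. Whether the endpoint still accepts these parameters has not been checked here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the links in the patch (assumed to be reachable).
SPARQL_ENDPOINT = "http://sparql.genenetwork.org/sparql/"


def build_run_query_url(query: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Return a link carrying `query` in the parameters the patch's links use."""
    params = [
        ("default-graph-uri", ""),
        ("query", query),
        ("format", "text/html"),
        ("timeout", "0"),
        ("debug", "on"),
        ("run", " Run Query "),
    ]
    # urlencode applies quote_plus by default: spaces become '+', and
    # characters such as '?', '<', '{' and CR/LF are percent-encoded.
    return endpoint + "?" + urlencode(params)


def extract_query(url: str) -> str:
    """Decode the SPARQL text back out of a "Run query" link."""
    fields = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return fields["query"][0]


# Query decoded from the Turkey (Q43) link in the patch; the leading empty
# string reproduces the CRLF that the original link starts with.
TURKEY_QUERY = "\r\n".join([
    "",
    "select ?seq ?sample",
    "{",
    "    ?seq <http://biohackathon.org/bh20-seq-schema#MainSchema/sample> ?sample .",
    "    ?sample <http://purl.obolibrary.org/obo/GAZ_00000448> <http://www.wikidata.org/entity/Q43>",
    "}",
])

if __name__ == "__main__":
    print(build_run_query_url(TURKEY_QUERY))
```

`extract_query` is handy when reading the patch itself: pasting one of the long links into it returns the SPARQL text in readable form, which makes it easier to compare a link against the `#+begin_src` block it sits under.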