author: Pjotr Prins, 2020-05-30 18:13:48 -0500
committer: Pjotr Prins, 2020-05-30 18:13:48 -0500
commit: 264be797c55aaff6eb9639d5a15d9081e2256253
tree: 1ee90ad507d3faec99b50a74536dd9f6d1f094e4 /doc/blog/using-covid-19-pubseq-part3.org
parent: ac7a79bb2aa6480a2ee3e881732ae314e8ccbf7d
BLOG
Diffstat (limited to 'doc/blog/using-covid-19-pubseq-part3.org')
-rw-r--r-- doc/blog/using-covid-19-pubseq-part3.org | 116
1 files changed, 101 insertions, 15 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index 4dd3078..03f37ab 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -6,26 +6,26 @@
#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
-* Uploading Data
-/Work in progress!/
* Table of Contents :TOC:noexport:
- [[#uploading-data][Uploading Data]]
- - [[#introduction][Introduction]]
- [[#step-1-upload-sequence][Step 1: Upload sequence]]
- [[#step-2-add-metadata][Step 2: Add metadata]]
- [[#obligatory-fields][Obligatory fields]]
- [[#optional-fields][Optional fields]]
- [[#step-3-submit-to-covid-19-pubseq][Step 3: Submit to COVID-19 PubSeq]]
- - [[#step-4-check-output][Step 4: Check output]]
- [[#trouble-shooting][Trouble shooting]]
+ - [[#step-4-check-output][Step 4: Check output]]
+ - [[#bulk-sequence-uploader][Bulk sequence uploader]]
+ - [[#run-the-uploader-cli][Run the uploader (CLI)]]
+ - [[#example-uploading-bulk-genbank-sequences][Example: uploading bulk GenBank sequences]]
-* Introduction
+* Uploading Data
The COVID-19 PubSeq allows you to upload your SARS-CoV-2 strains to a
-public resource for global comparisons. Compute it triggered on
-upload. Read the [[./about][ABOUT]] page for more information.
+public resource for global comparisons. A recompute of the pangenome
+gets triggered on upload. Read the [[./about][ABOUT]] page for more information.
* Step 1: Upload sequence
@@ -59,7 +59,7 @@ A number of fields are obligatory: sample id, date, location,
technology and authors. The others are optional, but it is valuable to
enter them when information is available. Metadata is defined in this
[[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. From this schema we generate the input form. Note that
-opitional fields have a question mark in the ~type~. You can add
+optional fields have a question mark in the ~type~. You can add
metadata yourself, btw, because this is a public resource! See also
[[./blog?id=using-covid-19-pubseq-part5][Modify metadata]] for more information.
@@ -86,7 +86,7 @@ Estimated collection date. The GenBank page says April 6, 2020.
*** Collection location
-A search on wikidata says Los Angelos is
+A search on wikidata says Los Angeles is
https://www.wikidata.org/entity/Q65
*** Sequencing technology
@@ -136,12 +136,6 @@ Once you have the sequence and the metadata together, hit
the 'Add to Pangenome' button. The data will be checked,
submitted and the workflows should kick in!
-* Step 4: Check output
-
-The current pipeline takes 5.5 hours to complete! Once it completes
-the updated data can be checked on the [[./download][DOWNLOAD]] page. After completion
-of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put
-in.
** Trouble shooting
@@ -151,3 +145,95 @@ fixing it to look like http://www.wikidata.org/entity/Q65 (note http
instead of https and entity instead of wiki) the submission went
through. Reload the page (it won't empty the fields) to re-enable the
submit button.
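
The URI fix described above can be applied mechanically before submitting. The following is only a sketch: =normalize_wikidata_uri= is a hypothetical helper, not part of the uploader, and it assumes that the plain =http://www.wikidata.org/entity/Q...= form is what the validator accepts, as the example above suggests.

```python
import re

# Matches both the browser-style page URL (/wiki/Q65) and the entity URI
# (/entity/Q65), with either http or https.
_WIKIDATA = re.compile(r"^https?://(?:www\.)?wikidata\.org/(?:wiki|entity)/(Q\d+)$")


def normalize_wikidata_uri(uri):
    """Rewrite a Wikidata URL into the http://www.wikidata.org/entity/ form.

    Hypothetical helper for illustration; raises ValueError when the input
    does not look like a Wikidata item URL.
    """
    match = _WIKIDATA.match(uri.strip())
    if match is None:
        raise ValueError("not a Wikidata item URL: " + uri)
    return "http://www.wikidata.org/entity/" + match.group(1)
```

Calling it on the page URL for Los Angeles, =https://www.wikidata.org/wiki/Q65=, gives the entity form that went through in the example above.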
+
+
+* Step 4: Check output
+
+The current pipeline takes 5.5 hours to complete! Once it completes
+the updated data can be checked on the [[./download][DOWNLOAD]] page. After completion
+of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put
+in.
+
+* Bulk sequence uploader
+
+The steps above require a manual upload of one sequence with metadata.
+What if you have a number of sequences you want to upload in bulk?
+For this we have a command line version of the uploader that can
+directly submit to COVID-19 PubSeq. It accepts a FASTA sequence
+file and associated metadata in [[https://github.com/arvados/bh20-seq-resource/blob/master/example/maximum_metadata_example.yaml][YAML]] format. The YAML matches
+the web form and gets validated against the same [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. The YAML
+that you need to create/generate for your samples looks like
+
+#+begin_src yaml
+id: placeholder
+
+host:
+ host_id: XX1
+ host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+ host_sex: http://purl.obolibrary.org/obo/PATO_0000384
+ host_age: 20
+ host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
+ host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
+ host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+ host_vaccination: [vaccines1,vaccine2]
+ ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
+ additional_host_information: Optional free text field for additional information
+
+sample:
+ sample_id: Id of the sample as defined by the submitter
+ collector_name: Name of the person that took the sample
+ collecting_institution: Institute that was responsible for sampling
+ specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
+ collection_date: "2020-01-01"
+ collection_location: http://www.wikidata.org/entity/Q148
+ sample_storage_conditions: frozen specimen
+ source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
+ additional_collection_information: Optional free text field for additional information
+
+virus:
+ virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+ virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
+
+technology:
+ sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
+ sequence_assembly_method: Protocol used for assembly
+ sequencing_coverage: [70.0, 100.0]
+ additional_technology_information: Optional free text field for additional information
+
+submitter:
+ authors: [John Doe, Joe Boe, Jonny Oe]
+ submitter_name: [John Doe]
+ submitter_address: John Doe's address
+ originating_lab: John Doe kitchen
+ lab_address: John Doe's address
+ provider_sample_id: XXX1
+ submitter_sample_id: XXX2
+ publication: PMID00001113
+ submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
+ additional_submitter_information: Optional free text field for additional information
+#+end_src
+
+** Run the uploader (CLI)
+
+After installing with pip you should be
+able to run
+
+: bh20sequploader sequence.fasta metadata.yaml
+
+
+Alternatively, the script can be installed from [[https://github.com/arvados/bh20-seq-resource#installation][GitHub]]. Run on the
+command line
+
+: python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml
+
+after installing dependencies (also described in [[https://github.com/arvados/bh20-seq-resource/blob/master/doc/INSTALL.md][INSTALL]] with the GNU
+Guix package manager).
+
+The web interface uses this exact same script so it should just work
+(TM).
+
+** Example: uploading bulk GenBank sequences
+
+We also use the above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA
+and YAML]] extractor specific for GenBank. This means that the steps we
+took above for uploading a GenBank sequence are already automated.
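
For your own batch of samples, a small wrapper around the command line uploader may be enough. The following is a minimal sketch, not the project's own tooling: it assumes each sample is stored as a =NAME.fasta= file with a matching =NAME.yaml= next to it (a naming convention chosen here for illustration), and that the =bh20sequploader= command is on your PATH. The function names =pair_inputs= and =upload_all= are hypothetical.

```python
import subprocess
from pathlib import Path


def pair_inputs(directory):
    """Return sorted (fasta, yaml) path pairs that share a file stem.

    FASTA files without a matching YAML file are skipped.
    """
    directory = Path(directory)
    pairs = []
    for fasta in sorted(directory.glob("*.fasta")):
        metadata = fasta.with_suffix(".yaml")
        if metadata.exists():
            pairs.append((fasta, metadata))
    return pairs


def upload_all(directory, dry_run=True):
    """Build one uploader command per pair and return the command list.

    With dry_run=True (the default) nothing is executed, so the commands
    can be inspected first. With dry_run=False each command is run in turn
    and the first failure raises subprocess.CalledProcessError.
    """
    commands = [
        ["bh20sequploader", str(fasta), str(metadata)]
        for fasta, metadata in pair_inputs(directory)
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Running =upload_all("samples")= first and looking at the returned commands is a cheap check before a real run with =dry_run=False=. Since every upload triggers the pipeline, you may prefer to start with a single pair.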