BLOG

author: Pjotr Prins 2020-05-30 18:13:48 -0500
committer: Pjotr Prins 2020-05-30 18:13:48 -0500
commit: 264be797c55aaff6eb9639d5a15d9081e2256253 (patch)
tree: 1ee90ad507d3faec99b50a74536dd9f6d1f094e4 /doc/blog/using-covid-19-pubseq-part3.org
parent: ac7a79bb2aa6480a2ee3e881732ae314e8ccbf7d (diff)
download: bh20-seq-resource-264be797c55aaff6eb9639d5a15d9081e2256253.tar.gz
bh20-seq-resource-264be797c55aaff6eb9639d5a15d9081e2256253.tar.lz
bh20-seq-resource-264be797c55aaff6eb9639d5a15d9081e2256253.zip
1 files changed, 101 insertions, 15 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index 4dd3078..03f37ab 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -6,26 +6,26 @@
 
 #+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
 
-* Uploading Data
 
-/Work in progress!/
 
 * Table of Contents                                                     :TOC:noexport:
  - [[#uploading-data][Uploading Data]]
- - [[#introduction][Introduction]]
  - [[#step-1-upload-sequence][Step 1: Upload sequence]]
  - [[#step-2-add-metadata][Step 2: Add metadata]]
    - [[#obligatory-fields][Obligatory fields]]
    - [[#optional-fields][Optional fields]]
  - [[#step-3-submit-to-covid-19-pubseq][Step 3: Submit to COVID-19 PubSeq]]
- - [[#step-4-check-output][Step 4: Check output]]
    - [[#trouble-shooting][Trouble shooting]]
+ - [[#step-4-check-output][Step 4: Check output]]
+ - [[#bulk-sequence-uploader][Bulk sequence uploader]]
+   - [[#run-the-uploader-cli][Run the uploader (CLI)]]
+   - [[#example-uploading-bulk-genbank-sequences][Example: uploading bulk GenBank sequences]]
 
-* Introduction
+* Uploading Data
 
 The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
-public resource for global comparisons. Compute it triggered on
-upload. Read the [[./about][ABOUT]] page for more information.
+public resource for global comparisons. A recompute of the pangenome
+gets triggered on upload. Read the [[./about][ABOUT]] page for more information.
 
 * Step 1: Upload sequence
 
@@ -59,7 +59,7 @@ A number of fields are obligatory: sample id, date, location,
 technology and authors. The others are optional, but it is valuable to
 enter them when information is available. Metadata is defined in this
 [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. From this schema we generate the input form. Note that
-opitional fields have a question mark in the ~type~. You can add
+optional fields have a question mark in the ~type~. You can add
 metadata yourself, btw, because this is a public resource! See also
 [[./blog?id=using-covid-19-pubseq-part5][Modify metadata]] for more information.
 
@@ -86,7 +86,7 @@ Estimated collection date. The GenBank page says April 6, 2020.
 
 *** Collection location
 
-A search on wikidata says Los Angelos is
+A search on wikidata says Los Angeles is
 https://www.wikidata.org/entity/Q65
 
 *** Sequencing technology
@@ -136,12 +136,6 @@ Once you have the sequence and the metadata together, hit
 the 'Add to Pangenome' button. The data will be checked,
 submitted and the workflows should kick in!
 
-* Step 4: Check output
-
-The current pipeline takes 5.5 hours to complete! Once it completes
-the updated data can be checked on the [[./download][DOWNLOAD]] page. After completion
-of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put
-in.
 
 ** Trouble shooting
 
@@ -151,3 +145,95 @@ fixing it to look like http://www.wikidata.org/entity/Q65 (note http
 instead on https and entity instead of wiki) the submission went
 through. Reload the page (it won't empty the fields) to re-enable the
 submit button.
+
+
+* Step 4: Check output
+
+The current pipeline takes 5.5 hours to complete! Once it completes
+the updated data can be checked on the [[./download][DOWNLOAD]] page. After completion
+of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put
+in.
+
+* Bulk sequence uploader
+
+The steps above require a manual upload of one sequence with metadata.
+What if you have a number of sequences you want to upload in bulk?
+For this we have a command line version of the uploader that can
+directly submit to COVID-19 PubSeq. It accepts a FASTA sequence
+file and associated metadata in [[https://github.com/arvados/bh20-seq-resource/blob/master/example/maximum_metadata_example.yaml][YAML]] format. The YAML matches
+the web form and gets validated against the same [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. The YAML
+that you need to create/generate for your samples looks like
+
+#+begin_src yaml
+id: placeholder
+
+host:
+    host_id: XX1
+    host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+    host_sex: http://purl.obolibrary.org/obo/PATO_0000384
+    host_age: 20
+    host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
+    host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
+    host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+    host_vaccination: [vaccines1,vaccine2]
+    ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
+    additional_host_information: Optional free text field for additional information
+
+sample:
+    sample_id: Id of the sample as defined by the submitter
+    collector_name: Name of the person that took the sample
+    collecting_institution: Institute that was responsible of sampling
+    specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
+    collection_date: "2020-01-01"
+    collection_location: http://www.wikidata.org/entity/Q148
+    sample_storage_conditions: frozen specimen
+    source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
+    additional_collection_information: Optional free text field for additional information
+
+virus:
+    virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+    virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
+
+technology:
+    sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
+    sequence_assembly_method: Protocol used for assembly
+    sequencing_coverage: [70.0, 100.0]
+    additional_technology_information: Optional free text field for additional information
+
+submitter:
+    authors: [John Doe, Joe Boe, Jonny Oe]
+    submitter_name: [John Doe]
+    submitter_address: John Doe's address
+    originating_lab: John Doe kitchen
+    lab_address: John Doe's address
+    provider_sample_id: XXX1
+    submitter_sample_id: XXX2
+    publication: PMID00001113
+    submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
+    additional_submitter_information: Optional free text field for additional information
+#+end_src
+
+** Run the uploader (CLI)
+
+After installing with pip you should be
+able to run
+
+: bh20sequploader sequence.fasta metadata.yaml
+
+
+Alternatively, the script can be installed from [[https://github.com/arvados/bh20-seq-resource#installation][GitHub]]. Run on the
+command line
+
+: python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml
+
+after installing dependencies (also described in [[https://github.com/arvados/bh20-seq-resource/blob/master/doc/INSTALL.md][INSTALL]] with the GNU
+Guix package manager).
+
+The web interface uses this exact same script, so it should just work
+(TM).
+
+** Example: uploading bulk GenBank sequences
+
+We also use the above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA
+and YAML]] extractor specific for GenBank. This means that the steps we
+took above for uploading a GenBank sequence are already automated.
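
The bulk workflow described in the patch above could be driven by a small wrapper along these lines. This is only a sketch under stated assumptions: it assumes each FASTA file sits next to a YAML file with the same stem (for example sample1.fasta and sample1.yaml), and that the bh20sequploader command is on the PATH after installation. The helper names pair_inputs, build_command and upload_all are hypothetical and not part of the repository; the only thing taken from the post is the two-argument command line.

```python
import subprocess
from pathlib import Path


def pair_inputs(directory):
    """Pair each *.fasta file with the *.yaml file sharing its stem.

    The same-stem naming convention is an assumption of this sketch.
    FASTA files without a matching YAML file are skipped.
    """
    pairs = []
    for fasta in sorted(Path(directory).glob("*.fasta")):
        metadata = fasta.with_suffix(".yaml")
        if metadata.exists():
            pairs.append((fasta, metadata))
    return pairs


def build_command(fasta, metadata):
    """Build the command line from the post: bh20sequploader sequence.fasta metadata.yaml"""
    return ["bh20sequploader", str(fasta), str(metadata)]


def upload_all(directory, dry_run=True):
    """Return the upload commands for a directory; run them unless dry_run is set."""
    commands = [build_command(f, m) for f, m in pair_inputs(directory)]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on the first failed upload
            subprocess.run(command, check=True)
    return commands
```

Calling upload_all("my_samples") returns the commands without submitting anything; passing dry_run=False would run them one by one. Since every upload triggers a recompute, a dry run first seems a reasonable default.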