Diffstat (limited to 'doc/blog/using-covid-19-pubseq-part3.org')
-rw-r--r-- | doc/blog/using-covid-19-pubseq-part3.org | 116 |
1 files changed, 101 insertions, 15 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org index 4dd3078..03f37ab 100644 --- a/doc/blog/using-covid-19-pubseq-part3.org +++ b/doc/blog/using-covid-19-pubseq-part3.org @@ -6,26 +6,26 @@ #+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" /> -* Uploading Data -/Work in progress!/ * Table of Contents :TOC:noexport: - [[#uploading-data][Uploading Data]] - - [[#introduction][Introduction]] - [[#step-1-upload-sequence][Step 1: Upload sequence]] - [[#step-2-add-metadata][Step 2: Add metadata]] - [[#obligatory-fields][Obligatory fields]] - [[#optional-fields][Optional fields]] - [[#step-3-submit-to-covid-19-pubseq][Step 3: Submit to COVID-19 PubSeq]] - - [[#step-4-check-output][Step 4: Check output]] - [[#trouble-shooting][Trouble shooting]] + - [[#step-4-check-output][Step 4: Check output]] + - [[#bulk-sequence-uploader][Bulk sequence uploader]] + - [[#run-the-uploader-cli][Run the uploader (CLI)]] + - [[#example-uploading-bulk-genbank-sequences][Example: uploading bulk GenBank sequences]] -* Introduction +* Uploading Data The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a -public resource for global comparisons. Compute it triggered on -upload. Read the [[./about][ABOUT]] page for more information. +public resource for global comparisons. A recompute of the pangenome +gets triggered on upload. Read the [[./about][ABOUT]] page for more information. * Step 1: Upload sequence @@ -59,7 +59,7 @@ A number of fields are obligatory: sample id, date, location, technology and authors. The others are optional, but it is valuable to enter them when information is available. Metadata is defined in this [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. From this schema we generate the input form. Note that -opitional fields have a question mark in the ~type~. You can add +optional fields have a question mark in the ~type~. 
You can add metadata yourself, btw, because this is a public resource! See also [[./blog?id=using-covid-19-pubseq-part5][Modify metadata]] for more information. @@ -86,7 +86,7 @@ Estimated collection date. The GenBank page says April 6, 2020. *** Collection location -A search on wikidata says Los Angelos is +A search on wikidata says Los Angeles is https://www.wikidata.org/entity/Q65 *** Sequencing technology @@ -136,12 +136,6 @@ Once you have the sequence and the metadata together, hit the 'Add to Pangenome' button. The data will be checked, submitted and the workflows should kick in! -* Step 4: Check output - -The current pipeline takes 5.5 hours to complete! Once it completes -the updated data can be checked on the [[./download][DOWNLOAD]] page. After completion -of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put -in. ** Trouble shooting @@ -151,3 +145,95 @@ fixing it to look like http://www.wikidata.org/entity/Q65 (note http instead on https and entity instead of wiki) the submission went through. Reload the page (it won't empty the fields) to re-enable the submit button. + + +* Step 4: Check output + +The current pipeline takes 5.5 hours to complete! Once it completes +the updated data can be checked on the [[./download][DOWNLOAD]] page. 
After completion +of above output this [[http://sparql.genenetwork.org/sparql/?default-graph-uri=&query=PREFIX+pubseq%3A+%3Chttp%3A%2F%2Fbiohackathon.org%2Fbh20-seq-schema%23MainSchema%2F%3E%0D%0APREFIX+sio%3A+%3Chttp%3A%2F%2Fsemanticscience.org%2Fresource%2F%3E%0D%0Aselect+distinct+%3Fsample+%3Fp+%3Fo%0D%0A%7B%0D%0A+++%3Fsample+sio%3ASIO_000115+%22MT536190.1%22+.%0D%0A+++%3Fsample+%3Fp+%3Fo+.%0D%0A%7D&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+][SPARQL query]] shows some of the metadata we put +in. + +* Bulk sequence uploader + +The steps above require a manual upload of one sequence with metadata. +What if you have a number of sequences you want to upload in bulk? +For this we have a command line version of the uploader that can +directly submit to COVID-19 PubSeq. It accepts a FASTA sequence +file and associated metadata in [[https://github.com/arvados/bh20-seq-resource/blob/master/example/maximum_metadata_example.yaml][YAML]] format. The YAML matches +the web form and gets validated against the same [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]].
The YAML +that you need to create/generate for your samples looks like + +#+begin_src yaml +id: placeholder + +host: + host_id: XX1 + host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606 + host_sex: http://purl.obolibrary.org/obo/PATO_0000384 + host_age: 20 + host_age_unit: http://purl.obolibrary.org/obo/UO_0000036 + host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269 + host_treatment: Process in which the act is intended to modify or alter host status (Compounds) + host_vaccination: [vaccines1,vaccine2] + ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010 + additional_host_information: Optional free text field for additional information + +sample: + sample_id: Id of the sample as defined by the submitter + collector_name: Name of the person that took the sample + collecting_institution: Institute that was responsible of sampling + specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835] + collection_date: "2020-01-01" + collection_location: http://www.wikidata.org/entity/Q148 + sample_storage_conditions: frozen specimen + source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence] + additional_collection_information: Optional free text field for additional information + +virus: + virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049 + virus_strain: SARS-CoV-2/human/CHN/HS_8/2020 + +technology: + sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173] + sequence_assembly_method: Protocol used for assembly + sequencing_coverage: [70.0, 100.0] + additional_technology_information: Optional free text field for additional information + +submitter: + authors: [John Doe, Joe Boe, Jonny Oe] + submitter_name: [John Doe] + submitter_address: John Doe's address + originating_lab: John Doe kitchen + lab_address: John Doe's address + provider_sample_id: XXX1 + submitter_sample_id: XXX2 + publication: PMID00001113 + 
submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001] + additional_submitter_information: Optional free text field for additional information +#+end_src + +** Run the uploader (CLI) + +After installing with pip you should be +able to run + +: bh20sequploader sequence.fasta metadata.yaml + + +Alternatively the script can be installed from [[https://github.com/arvados/bh20-seq-resource#installation][GitHub]]. Run on the +command line + +: python3 bh20sequploader/main.py example/sequence.fasta example/maximum_metadata_example.yaml + +after installing dependencies (also described in [[https://github.com/arvados/bh20-seq-resource/blob/master/doc/INSTALL.md][INSTALL]] with the GNU +Guix package manager). + +The web interface uses this exact same script so it should just work +(TM). + +** Example: uploading bulk GenBank sequences + +We also use the above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA +and YAML]] extractor specific for GenBank. This means that the steps we +took above for uploading a GenBank sequence are already automated.
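The command line uploader documented in this patch takes one FASTA file and one YAML file per call. For many samples it could be wrapped in a small loop. The following is a minimal sketch and not part of the repository: it assumes each =NAME.fasta= sits next to a =NAME.yaml= in the same directory, and the helper names =pair_inputs=, =build_commands= and =upload_all= are hypothetical. Only the =bh20sequploader sequence.fasta metadata.yaml= invocation is taken from the text above.

```python
import subprocess
from pathlib import Path


def pair_inputs(directory):
    """Return (fasta, yaml) path pairs that share a file stem.

    Assumption: metadata for NAME.fasta lives in NAME.yaml in the
    same directory. FASTA files without a YAML partner are skipped.
    """
    directory = Path(directory)
    pairs = []
    for fasta in sorted(directory.glob("*.fasta")):
        metadata = fasta.with_suffix(".yaml")
        if metadata.exists():
            pairs.append((fasta, metadata))
    return pairs


def build_commands(directory, uploader="bh20sequploader"):
    """Build one uploader command (as an argument list) per pair."""
    return [[uploader, str(fasta), str(metadata)]
            for fasta, metadata in pair_inputs(directory)]


def upload_all(directory, dry_run=True):
    """Print the commands, or run them one by one when dry_run is False."""
    for command in build_commands(directory):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first upload that exits non-zero
            subprocess.run(command, check=True)
```

Calling =upload_all("samples")= only prints the commands, so the pairing can be checked first. Passing =dry_run=False= actually submits, and since every upload triggers a pipeline run, a dry run is the safer default.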