aboutsummaryrefslogtreecommitdiff
path: root/doc/blog/using-covid-19-pubseq-part3.org
diff options
context:
space:
mode:
Diffstat (limited to 'doc/blog/using-covid-19-pubseq-part3.org')
-rw-r--r--doc/blog/using-covid-19-pubseq-part3.org161
1 files changed, 107 insertions, 54 deletions
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index fb68251..f3ba073 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -7,10 +7,19 @@
#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
#+OPTIONS: ^:nil
+* Introduction
+
+In this document we explain how to upload data into COVID-19 PubSeq.
+This can happen through a web page, or through a command line
+script. We'll also show how to parametrize uploads by using templates.
+The procedure is much easier than with other repositories and can be
+fully automated. Once uploaded you can use our export API to prepare
+for other repositories.
* Table of Contents :TOC:noexport:
- - [[#uploading-data][Uploading Data]]
+ - [[#introduction][Introduction]]
+ - [[#uploading-data][Uploading data]]
- [[#step-1-upload-sequence][Step 1: Upload sequence]]
- [[#step-2-add-metadata][Step 2: Add metadata]]
- [[#obligatory-fields][Obligatory fields]]
@@ -23,7 +32,7 @@
- [[#example-uploading-bulk-genbank-sequences][Example: uploading bulk GenBank sequences]]
- [[#example-preparing-metadata][Example: preparing metadata]]
-* Uploading Data
+* Uploading data
The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a
public resource for global comparisons. A recompute of the pangenome
@@ -165,55 +174,90 @@ file an associated metadata in [[https://github.com/arvados/bh20-seq-resource/bl
the web form and gets validated from the same [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]] looks. The YAML
that you need to create/generate for your samples looks like
+A minimal example of metadata looks like
+
+#+begin_src json
+ id: placeholder
+
+ license:
+ license_type: http://creativecommons.org/licenses/by/4.0/
+
+ host:
+ host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+
+ sample:
+ sample_id: XX
+ collection_date: "2020-01-01"
+ collection_location: http://www.wikidata.org/entity/Q148
+
+ virus:
+ virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+
+ technology:
+ sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0008632]
+
+ submitter:
+ authors: [John Doe]
+#+end_src
+
+a more elaborate example (note most fields are optional) may look like
+
#+begin_src json
-id: placeholder
-
-host:
- host_id: XX1
- host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
- host_sex: http://purl.obolibrary.org/obo/PATO_0000384
- host_age: 20
- host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
- host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
- host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
- host_vaccination: [vaccines1,vaccine2]
- ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
- additional_host_information: Optional free text field for additional information
-
-sample:
- sample_id: Id of the sample as defined by the submitter
- collector_name: Name of the person that took the sample
- collecting_institution: Institute that was responsible of sampling
- specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
- collection_date: "2020-01-01"
- collection_location: http://www.wikidata.org/entity/Q148
- sample_storage_conditions: frozen specimen
- source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
- additional_collection_information: Optional free text field for additional information
-
-virus:
- virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
- virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
-
-technology:
- sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
- sequence_assembly_method: Protocol used for assembly
- sequencing_coverage: [70.0, 100.0]
- additional_technology_information: Optional free text field for additional information
-
-submitter:
- authors: [John Doe, Joe Boe, Jonny Oe]
- submitter_name: [John Doe]
- submitter_address: John Doe's address
- originating_lab: John Doe kitchen
- lab_address: John Doe's address
- provider_sample_id: XXX1
- submitter_sample_id: XXX2
- publication: PMID00001113
- submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
- additional_submitter_information: Optional free text field for additional information
+ id: placeholder
+
+ host:
+ host_id: XX1
+ host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606
+ host_sex: http://purl.obolibrary.org/obo/PATO_0000384
+ host_age: 20
+ host_age_unit: http://purl.obolibrary.org/obo/UO_0000036
+ host_health_status: http://purl.obolibrary.org/obo/NCIT_C25269
+ host_treatment: Process in which the act is intended to modify or alter host status (Compounds)
+ host_vaccination: [vaccines1,vaccine2]
+ ethnicity: http://purl.obolibrary.org/obo/HANCESTRO_0010
+ additional_host_information: Optional free text field for additional information
+
+ sample:
+ sample_id: Id of the sample as defined by the submitter
+ collector_name: Name of the person that took the sample
+ collecting_institution: Institute that was responsible of sampling
+ specimen_source: [http://purl.obolibrary.org/obo/NCIT_C155831,http://purl.obolibrary.org/obo/NCIT_C155835]
+ collection_date: "2020-01-01"
+ collection_location: http://www.wikidata.org/entity/Q148
+ sample_storage_conditions: frozen specimen
+ source_database_accession: [http://identifiers.org/insdc/LC522350.1#sequence]
+ additional_collection_information: Optional free text field for additional information
+
+ virus:
+ virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049
+ virus_strain: SARS-CoV-2/human/CHN/HS_8/2020
+
+ technology:
+ sample_sequencing_technology: [http://www.ebi.ac.uk/efo/EFO_0009173,http://www.ebi.ac.uk/efo/EFO_0009173]
+ sequence_assembly_method: Protocol used for assembly
+ sequencing_coverage: [70.0, 100.0]
+ additional_technology_information: Optional free text field for additional information
+
+ submitter:
+ authors: [John Doe, Joe Boe, Jonny Oe]
+ submitter_name: [John Doe]
+ submitter_address: John Doe's address
+ originating_lab: John Doe kitchen
+ lab_address: John Doe's address
+ provider_sample_id: XXX1
+ submitter_sample_id: XXX2
+ publication: PMID00001113
+ submitter_orcid: [https://orcid.org/0000-0000-0000-0000,https://orcid.org/0000-0000-0000-0001]
+ additional_submitter_information: Optional free text field for additional information
#+end_src
+more metadata is yummy when stored in RDF. [[https://yummydata.org/][Yummydata]] is useful to a wider community. Note
+that many of the terms in above example are URIs, such as
+host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606. We use
+web ontologies for these to make the data less ambiguous and more
+FAIR. Check out the option fields as defined in the schema. If it is not listed
+a little bit of web searching may be required or [[./contact][contact]] us.
+
** Run the uploader (CLI)
Installing with pip you should be
@@ -221,7 +265,6 @@ able to run
: bh20sequploader sequence.fasta metadata.yaml
-
Alternatively the script can be installed from [[https://github.com/arvados/bh20-seq-resource#installation][github]]. Run on the
command line
@@ -274,13 +317,23 @@ done
** Example: preparing metadata
-Usually, metadata are available in tabular format, like spreadsheets. As an example, we provide a script
-[[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples][esr_samples.py]] to show you how to parse
-your metadata in YAML files ready for the upload. To execute the script, go in the ~bh20-seq-resource/scripts/esr_samples
-and execute
+Usually, metadata are available in a tabular format, such as
+spreadsheets. As an example, we provide a script [[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/esr_samples][esr_samples.py]] to
+show you how to parse your metadata in YAML files ready for the
+upload. To execute the script, go in the
+~bh20-seq-resource/scripts/esr_samples and execute
#+BEGIN_SRC sh
python3 esr_samples.py
#+END_SRC
-You will find the YAML files in the `yaml` folder which will be created in the same directory.
+You will find the YAML files in the `yaml` folder which will be
+created in the same directory.
+
+In the example we use Python pandas to read the spreadsheet into a
+tabular structure. Next we use a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/esr_samples/template.yaml][template.yaml]] file that gets filled
+in by ~esr_samples.py~ so we get a metadata YAML file for each sample.
+
+Next run the earlier CLI uploader for each YAML and FASTA combination.
+It can't be much easier than this. For ESR we uploaded a batch of 600
+sequences this way writing a few lines of Python [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/esr_samples/esr_samples.py][code]]. See [[http://covid19.genenetwork.org/resource/20VR0995][example]].