COVID-19 PubSeq (part 1)

1. What does this mean?
2. Fetch sequence data
3. Predicates
4. Fetch submitter info and other metadata
5. Fetch all sequences from Washington state
6. Discussion
7. Acknowledgements

-As part of the COVID-19 Biohackathon 2020 we formed a working group
-to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.

1 What does this mean?


This means that when someone uploads a SARS-CoV-2 sequence using one @@ -328,8 +314,8 @@ initiative!

2 Fetch sequence data

The latest run of the pipeline can be viewed here. Each of these @@ -353,8 +339,8 @@ these identifiers throughout.

3 Predicates

To explore an RDF dataset, the first query we can do is open and gets @@ -464,8 +450,8 @@ Now we got this far, lets -

4 Fetch submitter info and other metadata

To get datasets with submitters we can do the above @@ -575,8 +561,8 @@ to view/query the database.

5 Fetch all sequences from Washington state

Now we know how to get at the origin we can do it the other way round @@ -603,8 +589,8 @@ half of the set coming out of GenBank.

6 Discussion

The public sequence uploader collects sequences, raw data and @@ -615,8 +601,8 @@ referenced in publications and origins are citeable.

7 Acknowledgements

The overall effort was due to magnificent freely donated input by a @@ -631,7 +617,7 @@ Garrison this initiative would not have existed!

Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:12. +

Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 12:06.

diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org
index 5a749d6..0fd5589 100644
--- a/doc/blog/using-covid-19-pubseq-part1.org
+++ b/doc/blog/using-covid-19-pubseq-part1.org
@@ -5,18 +5,8 @@
 # C-c C-t task rotate
 # RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png
 
-#+HTML_LINK_HOME: http://covid19.genenetwork.org
 #+HTML_HEAD:
 
-As part of the COVID-19 Biohackathon 2020 we formed a working group
-to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
 
 * Table of Contents :TOC:noexport:
  - [[#what-does-this-mean][What does this mean?]]
@@ -261,7 +251,6 @@ Now we know how to get at the origin we can do it the other way round
 and fetch all sequences referring to Washington state
 
 #+begin_src sql
-
 select ?seq ?sample
 {
     ?seq ?sample .
@@ -272,6 +261,18 @@ select ?seq ?sample
 which lists 300 sequences originating from Washington state! Which is almost
 half of the set coming out of GenBank.
 
+Likewise to list all sequences from Turkey we can find the wikidata
+entity is [[https://www.wikidata.org/wiki/Q43][Q43]]:
+
+#+begin_src sql
+select ?seq ?sample
+{
+    ?seq ?sample .
+    ?sample
+}
+#+end_src
+
+
 * Discussion
 
 The public sequence uploader collects sequences, raw data and
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index 7903791..6838bc7 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -3,7 +3,7 @@
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
+
 COVID-19 PubSeq Uploading Data (part 3)
@@ -248,16 +248,42 @@ for the JavaScript code in this tag.

1. Uploading Data
2. Introduction
3. Step 1: Sequence
4. Step 2: Metadata
1. Uploading Data
2. Introduction
3. Step 1: Upload sequence
4. Step 2: Add metadata +
- 4.1. Obligatory fields +
  +
- 4.2. Optional fields +
  +
+
5. Step 3: Submit to COVID-19 PubSeq +
- 5.1. Trouble shooting
+

1 Uploading Data

Work in progress! @@ -265,8 +291,8 @@ for the JavaScript code in this tag.

2 Introduction

The COVID-19 PubSeq allows you to upload your SARS-CoV-2 strains to a @@ -276,27 +302,214 @@ upload. Read the ABOUT page for more information.

3 Step 1: Sequence

3 Step 1: Upload sequence

+To upload a sequence in the web upload page hit the browse button and
+select the FASTA file on your local hard disk.

 We start with an assembled or mapped sequence in FASTA format. The
 PubSeq uploader contains a QC step which checks whether it is a likely
 SARS-CoV-2 sequence. While PubSeq deduplicates sequences and never
-overwrites metadata it probably pays to check whether your data
+overwrites metadata, you may still want to check whether your data
 already is in the system by querying some metadata as described in
-Query metadata with SPARQL.
+Query metadata with SPARQL or by simply downloading and checking one
+of the files on the download page. We find GenBank MT536190.1 has not
+been included yet. A FASTA text file can be downloaded to your local
+disk and uploaded through our web upload page. Make sure the file does
+not include any HTML!
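
Before uploading, a quick local check can catch a download that was saved as a web page instead of plain FASTA. The sketch below is our own illustration and not the PubSeq QC step: the function name `looks_like_plain_fasta` and the set of accepted sequence characters are assumptions.

```python
import re

# Assumption: sequence lines hold only letters plus gap (-) and stop (*) symbols.
_SEQUENCE_LINE = re.compile(r"^[A-Za-z*-]+$")


def looks_like_plain_fasta(text: str) -> bool:
    """Return True when text looks like plain FASTA without HTML markup.

    Hypothetical helper for illustration; it does not check whether the
    sequence is a likely SARS-CoV-2 sequence.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # A FASTA file starts with a '>' header and has at least one sequence line.
    if len(lines) < 2 or not lines[0].startswith(">"):
        return False
    has_sequence = False
    for line in lines:
        if line.startswith(">"):
            continue  # header line of a record
        if not _SEQUENCE_LINE.match(line):
            return False  # markup such as <br> or <html> ends up here
        has_sequence = True
    return has_sequence
```

Reading the downloaded file into a string and passing it to this function before hitting the browse button should be enough to spot stray HTML.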

+Note: we currently only allow FASTA uploads. In the near future we'll
+allow for uploading raw sequence files. This is important for creating
+an improved pangenome.

4 Step 2: Add metadata

+The web upload page contains fields for adding metadata. Metadata is
+not only important for attribution, it is also important for
+analysis. The metadata is available for queries, see Query metadata
+with SPARQL, and can be used to annotate variations of the virus in
+different ways.

+A number of fields are obligatory: sample id, date, location,
+technology and authors. The others are optional, but it is valuable to
+enter them when information is available. Metadata is defined in this
+schema. From this schema we generate the input form. Note that
+optional fields have a question mark in the type. You can add
+metadata yourself, btw, because this is a public resource! See also
+Modify metadata for more information.
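
To make the list of obligatory fields concrete, here is a minimal sketch that reports which of them are still empty before submitting. Apart from `sample_id`, the key names are our own assumptions for illustration; the real names are defined in the schema.

```python
# Hypothetical key names for the five obligatory fields; only sample_id
# is named in the text, the others are placeholders for illustration.
OBLIGATORY_FIELDS = (
    "sample_id",
    "collection_date",
    "collection_location",
    "sequencing_technology",
    "authors",
)


def missing_obligatory_fields(metadata: dict) -> list:
    """Return the obligatory field names that are absent or blank."""
    missing = []
    for name in OBLIGATORY_FIELDS:
        value = metadata.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

An empty list means all five obligatory fields have a value; it says nothing about whether the values are well formed.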

+To get more information about a field click on the question mark on
+the web form. Here we add some extra information.

4.1 Obligatory fields

4.1.1 Sample ID (sample_id)

+This is a string field that defines a unique sample identifier by the
+submitter. In addition to sample_id we also have host_id,
+provider_sample_id and submitter_sample_id where host is the host the
+sample came from, provider sample is the institution sample id and
+submitter is the submitting individual id. host_id is important when
+multiple sequences come from the same host. Make sure not to have
+spaces in the sample_id.
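
The no-spaces rule for sample_id is easy to check up front. A minimal sketch, assuming that any whitespace character (not just a plain space) should be rejected; the function name `valid_sample_id` is ours.

```python
def valid_sample_id(sample_id: str) -> bool:
    """Return True for a non-empty sample_id without any whitespace.

    Assumption: tabs and newlines are treated like spaces.
    """
    return bool(sample_id) and not any(ch.isspace() for ch in sample_id)
```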

+Here we add the GenBank ID MT536190.1.

4.1.2 Collection date

+Estimated collection date. The GenBank page says April 6, 2020.

4.1.3 Collection location

+A search on wikidata says Los Angeles is
+https://www.wikidata.org/entity/Q65

4.1.4 Sequencing technology

+GenBank entry says Illumina, so we can fill that in.

4.1.5 Authors

+GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
+Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X.,
+Freehan,A. and Garcia-Diaz,J.', so we can fill that in.

4.2 Optional fields

+All other fields are optional. But let's see what we can add.

4.2.1 Host information

+Sadly, not much is known about the host from GenBank. A little
+sleuthing renders an interesting paper by some of the authors titled
+SARS-CoV-2 is consistent across multiple samples and methodologies
+which dates after the sample, but has no reference other than that the
+raw data came from the SRA database, so it probably does not describe
+this particular sample. We don't know what this strain of SARS-CoV-2
+did to the person and what the person was like (say age group).

4.2.2 Collecting institution

+We can fill that in.

4.2.3 Specimen source

+We have that: nasopharyngeal swab

4.2.4 Source database accession

+GenBank, which is http://identifiers.org/insdc/MT536190.1#sequence.
+Note we plug in our own identifier MT536190.1.
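
The accession URL follows a fixed pattern, so it can be built from the identifier. A small sketch based only on the single example above; the function name `insdc_sequence_url` is our own and other accession formats are not checked.

```python
def insdc_sequence_url(accession: str) -> str:
    """Build the identifiers.org URL for a GenBank/INSDC accession.

    Follows the pattern of the one example in the text; raises
    ValueError for an empty accession or one containing whitespace.
    """
    accession = accession.strip()
    if not accession or any(ch.isspace() for ch in accession):
        raise ValueError("accession must be non-empty and contain no whitespace")
    return "http://identifiers.org/insdc/" + accession + "#sequence"
```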

4 Step 2: Metadata

4.2.5 Strain name

+SARS-CoV-2/human/USA/LA-BIE-070/2020

5 Step 3: Submit to COVID-19 PubSeq

+Once you have the sequence and the metadata together, hit
+the 'Add to Pangenome' button. The data will be checked,
+submitted and the workflows should kick in!

5.1 Trouble shooting

+We got an error saying: {"stem": "http://www.wikidata.org/entity/",…
+which means that our location field was not formed correctly! After
+fixing it to look like http://www.wikidata.org/entity/Q65 (note http
+instead of https and entity instead of wiki) the submission went
+through. Reload the page (it won't empty the fields) to re-enable the
+submit button.
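
The fix described here can be applied before submitting. The sketch below rewrites a Wikidata page or entity URL into the http://www.wikidata.org/entity/ form that the submission accepted. The function name and the set of accepted input URLs are our own assumptions; this is not part of the uploader.

```python
import re

# Accept http or https, with or without www, and either /wiki/ or /entity/.
_WIKIDATA_URL = re.compile(
    r"^https?://(?:www\.)?wikidata\.org/(?:wiki|entity)/(Q[1-9][0-9]*)/?$"
)


def normalize_wikidata_location(url: str) -> str:
    """Rewrite a Wikidata URL to the http://www.wikidata.org/entity/ form.

    Hypothetical helper; raises ValueError when the URL does not look
    like a Wikidata item.
    """
    match = _WIKIDATA_URL.match(url.strip())
    if match is None:
        raise ValueError("not a Wikidata item URL: " + url)
    return "http://www.wikidata.org/entity/" + match.group(1)
```

Both the https entity URL we tried first and a /wiki/ page URL copied from the browser end up in the same accepted form.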

Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:00. +

Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 14:22.

diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index 296bef6..ade902d 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -3,7 +3,6 @@
 # C-c C-e h h publish
 # C-c ! insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
 # C-c C-t task rotate
-# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png
 
 #+HTML_HEAD:
@@ -14,8 +13,12 @@
 * Table of Contents :TOC:noexport:
  - [[#uploading-data][Uploading Data]]
  - [[#introduction][Introduction]]
- - [[#step-1-sequence][Step 1: Sequence]]
- - [[#step-2-metadata][Step 2: Metadata]]
+ - [[#step-1-upload-sequence][Step 1: Upload sequence]]
+ - [[#step-2-add-metadata][Step 2: Add metadata]]
+   - [[#obligatory-fields][Obligatory fields]]
+   - [[#optional-fields][Optional fields]]
+ - [[#step-3-submit-to-covid-19-pubseq][Step 3: Submit to COVID-19 PubSeq]]
+   - [[#trouble-shooting][Trouble shooting]]
 
 * Introduction
 
@@ -23,14 +26,120 @@ The COVID-19 PubSeq allows you to upload your SARS-CoV-2 strains to a
 public resource for global comparisons. Compute is triggered on
 upload. Read the [[./about][ABOUT]] page for more information.
 
-* Step 1: Sequence
+* Step 1: Upload sequence
+
+To upload a sequence in the [[http://covid19.genenetwork.org/][web upload page]] hit the browse button and
+select the FASTA file on your local hard disk.
 
 We start with an assembled or mapped sequence in FASTA format. The
 PubSeq uploader contains a [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/qc_fasta.py][QC step]] which checks whether it is a likely
 SARS-CoV-2 sequence. While PubSeq deduplicates sequences and never
-overwrites metadata it probably pays to check whether your data
+overwrites metadata, you may still want to check whether your data
 already is in the system by querying some metadata as described in
-[[./blog?id=using-covid-19-pubseq-part1][Query metadata with SPARQL]].
+[[./blog?id=using-covid-19-pubseq-part1][Query metadata with SPARQL]] or by simply downloading and checking one
+of the files on the [[./download][download]] page. We find GenBank [[https://www.ncbi.nlm.nih.gov/nuccore/MT536190][MT536190.1]] has not
+been included yet. A FASTA text file can be [[https://www.ncbi.nlm.nih.gov/nuccore/MT536190.1?report=fasta&log$=seqview&format=text][downloaded]] to your local
+disk and uploaded through our [[./][web upload page]]. Make sure the file does
+not include any HTML!
+
+Note: we currently only allow FASTA uploads. In the near future we'll
+allow for uploading raw sequence files. This is important for creating
+an improved pangenome.
+
+* Step 2: Add metadata
+
+The [[./][web upload page]] contains fields for adding metadata. Metadata is
+not only important for attribution, it is also important for
+analysis. The metadata is available for queries, see [[./blog?id=using-covid-19-pubseq-part1][Query metadata
+with SPARQL]], and can be used to annotate variations of the virus in
+different ways.
+
+A number of fields are obligatory: sample id, date, location,
+technology and authors. The others are optional, but it is valuable to
+enter them when information is available. Metadata is defined in this
+[[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. From this schema we generate the input form. Note that
+optional fields have a question mark in the ~type~. You can add
+metadata yourself, btw, because this is a public resource! See also
+[[./blog?id=using-covid-19-pubseq-part5][Modify metadata]] for more information.
+
+To get more information about a field click on the question mark on
+the web form. Here we add some extra information.
+
+** Obligatory fields
+
+*** Sample ID (sample_id)
+
+This is a string field that defines a unique sample identifier by the
+submitter. In addition to sample_id we also have host_id,
+provider_sample_id and submitter_sample_id where host is the host the
+sample came from, provider sample is the institution sample id and
+submitter is the submitting individual id. host_id is important when
+multiple sequences come from the same host. Make sure not to have
+spaces in the sample_id.
+
+Here we add the GenBank ID MT536190.1.
+
+*** Collection date
+
+Estimated collection date. The GenBank page says April 6, 2020.
+
+*** Collection location
+
+A search on wikidata says Los Angeles is
+https://www.wikidata.org/entity/Q65
+
+*** Sequencing technology
+
+GenBank entry says Illumina, so we can fill that in.
+
+*** Authors
+
+GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
+Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X.,
+Freehan,A. and Garcia-Diaz,J.', so we can fill that in.
+
+** Optional fields
+
+All other fields are optional. But let's see what we can add.
+
+*** Host information
+
+Sadly, not much is known about the host from GenBank. A little
+sleuthing renders an interesting paper by some of the authors titled
+[[https://www.medrxiv.org/content/10.1101/2020.04.24.20078691v1][SARS-CoV-2 is consistent across multiple samples and methodologies]]
+which dates after the sample, but has no reference other than that the
+raw data came from the SRA database, so it probably does not describe
+this particular sample. We don't know what this strain of SARS-CoV-2
+did to the person and what the person was like (say age group).
+
+*** Collecting institution
+
+We can fill that in.
+
+*** Specimen source
+
+We have that: nasopharyngeal swab
+
+*** Source database accession
+
+GenBank, which is http://identifiers.org/insdc/MT536190.1#sequence.
+Note we plug in our own identifier MT536190.1.
+
+*** Strain name
+
+SARS-CoV-2/human/USA/LA-BIE-070/2020
+
+* Step 3: Submit to COVID-19 PubSeq
+
+Once you have the sequence and the metadata together, hit
+the 'Add to Pangenome' button. The data will be checked,
+submitted and the workflows should kick in!
 
+** Trouble shooting
 
-* Step 2: Metadata
+We got an error saying: {"stem": "http://www.wikidata.org/entity/",...
+which means that our location field was not formed correctly! After
+fixing it to look like http://www.wikidata.org/entity/Q65 (note http
+instead of https and entity instead of wiki) the submission went
+through. Reload the page (it won't empty the fields) to re-enable the
+submit button.
--
cgit 1.4.1
