From 0495b892fba350096c8b1bd741c55e148e7fc2de Mon Sep 17 00:00:00 2001 From: Pjotr Prins Date: Fri, 29 May 2020 14:23:25 -0500 Subject: Blog info for uploading sequence --- doc/blog/using-covid-19-pubseq-part1.html | 64 +++----- doc/blog/using-covid-19-pubseq-part1.org | 23 +-- doc/blog/using-covid-19-pubseq-part3.html | 245 ++++++++++++++++++++++++++++-- doc/blog/using-covid-19-pubseq-part3.org | 123 ++++++++++++++- 4 files changed, 382 insertions(+), 73 deletions(-) (limited to 'doc/blog') diff --git a/doc/blog/using-covid-19-pubseq-part1.html b/doc/blog/using-covid-19-pubseq-part1.html index 5e52b82..1959fac 100644 --- a/doc/blog/using-covid-19-pubseq-part1.html +++ b/doc/blog/using-covid-19-pubseq-part1.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> - + COVID-19 PubSeq (part 1) @@ -242,40 +242,26 @@ for the JavaScript code in this tag. -
- UP - | - HOME -
+

COVID-19 PubSeq (part 1)

-

-As part of the COVID-19 Biohackathon 2020 we formed a working group -to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. -

-
-

1 What does this mean?

+ +
+

1 What does this mean?

This means that when someone uploads a SARS-CoV-2 sequence using one @@ -328,8 +314,8 @@ initiative!

-
-

2 Fetch sequence data

+
+

2 Fetch sequence data

The latest run of the pipeline can be viewed here. Each of these @@ -353,8 +339,8 @@ these identifiers throughout.

-
-

3 Predicates

+
+

3 Predicates

To explore an RDF dataset, the first query we can do is open and gets @@ -464,8 +450,8 @@ Now we got this far, lets -

4 Fetch submitter info and other metadata

+
+

4 Fetch submitter info and other metadata

To get dataests with submitters we can do the above @@ -575,8 +561,8 @@ to view/query the database.

-
-

5 Fetch all sequences from Washington state

+
+

5 Fetch all sequences from Washington state

Now we know how to get at the origin we can do it the other way round @@ -603,8 +589,8 @@ half of the set coming out of GenBank.

-
-

6 Discussion

+
+

6 Discussion

The public sequence uploader collects sequences, raw data and @@ -615,8 +601,8 @@ referenced in publications and origins are citeable.

-
-

7 Acknowledgements

+
+

7 Acknowledgements

The overall effort was due to magnificent freely donated input by a @@ -631,7 +617,7 @@ Garrison this initiative would not have existed!

-
Created by
Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:12
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 12:06
.
diff --git a/doc/blog/using-covid-19-pubseq-part1.org b/doc/blog/using-covid-19-pubseq-part1.org index 5a749d6..0fd5589 100644 --- a/doc/blog/using-covid-19-pubseq-part1.org +++ b/doc/blog/using-covid-19-pubseq-part1.org @@ -5,18 +5,8 @@ # C-c C-t task rotate # RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png -#+HTML_LINK_HOME: http://covid19.genenetwork.org #+HTML_HEAD: -As part of the COVID-19 Biohackathon 2020 we formed a working group -to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. * Table of Contents :TOC:noexport: - [[#what-does-this-mean][What does this mean?]] @@ -261,7 +251,6 @@ Now we know how to get at the origin we can do it the other way round and fetch all sequences referring to Washington state #+begin_src sql - select ?seq ?sample { ?seq ?sample . @@ -272,6 +261,18 @@ select ?seq ?sample which lists 300 sequences originating from Washington state! Which is almost half of the set coming out of GenBank. +Likewise to list all sequences from Turkey we can find the wikidata +entity is [[https://www.wikidata.org/wiki/Q43][Q43]]: + +#+begin_src sql +select ?seq ?sample +{ + ?seq ?sample . 
+ ?sample +} +#+end_src + + * Discussion The public sequence uploader collects sequences, raw data and diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html index 7903791..6838bc7 100644 --- a/doc/blog/using-covid-19-pubseq-part3.html +++ b/doc/blog/using-covid-19-pubseq-part3.html @@ -3,7 +3,7 @@ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> - + COVID-19 PubSeq Uploading Data (part 3) @@ -248,16 +248,42 @@ for the JavaScript code in this tag.

Table of Contents

-
-

1 Uploading Data

+
+

1 Uploading Data

Work in progress! @@ -265,8 +291,8 @@ for the JavaScript code in this tag.

-
-

2 Introduction

+
+

2 Introduction

The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a @@ -276,27 +302,214 @@ upload. Read the ABOUT page for more information.

-
-

3 Step 1: Sequence

+
+

3 Step 1: Upload sequence

+

+To upload a sequence in the web upload page hit the browse button and +select the FASTA file on your local hard disk. +

+

We start with an assembled or mapped sequence in FASTA format. The PubSeq uploader contains a QC step which checks whether it is a likely SARS-CoV-2 sequence. While PubSeq deduplicates sequences and never -overwrites metadata it probably pays to check whether your data +overwrites metadata, you may still want to check whether your data already is in the system by querying some metadata as described in -Query metadata with SPARQL. +Query metadata with SPARQL or by simply downloading and checking one +of the files on the download page. We find GenBank MT536190.1 has not +been included yet. A FASTA text file can be downloaded to your local +disk and uploaded through our web upload page. Make sure the file does +not include any HTML! +
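The "no HTML" advice above can be checked locally before uploading. The following is a minimal sketch only: it is not the uploader's own QC step, and the function name `looks_like_plain_fasta` and the set of accepted nucleotide codes are assumptions made for illustration.

```python
# Hypothetical local pre-check; this is NOT the PubSeq QC step.
# Assumption: a nucleotide FASTA file has '>' header lines followed by
# lines made up of IUPAC nucleotide codes (plus '-' for gaps).
ALLOWED_BASES = set("ACGTUNRYKMSWBDHV-")


def looks_like_plain_fasta(text: str) -> bool:
    """Return True when text resembles plain FASTA without HTML markup."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first non-empty line must be a FASTA header.
    if not lines or not lines[0].startswith(">"):
        return False
    # A page saved from a browser usually carries one of these markers.
    lowered = text.lower()
    if "<html" in lowered or "<!doctype" in lowered or "</" in lowered:
        return False
    # Every sequence line may only contain nucleotide codes.
    return all(
        line.startswith(">") or set(line.upper()) <= ALLOWED_BASES
        for line in lines
    )
```

Reading the downloaded file and passing its contents to this function before the upload would catch the common case where a browser saved a web page instead of the text record.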

+ +

+Note: we currently only allow FASTA uploads. In the near future we'll +allow for uploading raw sequence files. This is important for creating +an improved pangenome. +

+
+
+ +
+

4 Step 2: Add metadata

+
+

+The web upload page contains fields for adding metadata. Metadata is +not only important for attribution, it is also important for +analysis. The metadata is available for queries, see Query metadata +with SPARQL, and can be used to annotate variations of the virus in +different ways. +

+ +

+A number of fields are obligatory: sample id, date, location, +technology and authors. The others are optional, but it is valuable to +enter them when information is available. Metadata is defined in this +schema. From this schema we generate the input form. Note that +optional fields have a question mark in the type. You can add +metadata yourself, btw, because this is a public resource! See also +Modify metadata for more information. +

+ +

+To get more information about a field click on the question mark on +the web form. Here we add some extra information. +

+
+ +
+

4.1 Obligatory fields

+
+
+
+

4.1.1 Sample ID (sample_id)

+
+

+This is a string field that defines a unique sample identifier by the +submitter. In addition to sample_id we also have host_id, +provider_sample_id and submitter_sample_id where host is the host the +sample came from, provider sample is the institution sample id and +submitter is the submitting individual id. host_id is important when +multiple sequences come from the same host. Make sure not to have +spaces in the sample_id. +

+ +

+Here we add the GenBank ID MT536190.1. +
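The "no spaces" rule for the sample identifier is easy to test before filling in the form. This is a small sketch under the assumption that any whitespace character (not just a plain space) should be rejected; the helper name `is_valid_sample_id` is hypothetical and not part of the uploader.

```python
import re


def is_valid_sample_id(sample_id: str) -> bool:
    """Return True when the id is non-empty and contains no whitespace.

    Hypothetical helper: the real form may apply further rules.
    """
    return bool(sample_id) and re.search(r"\s", sample_id) is None
```

With this check the GenBank ID used here, MT536190.1, passes, while an id such as "MT 536190.1" is rejected.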

+
+
+ +
+

4.1.2 Collection date

+
+

+Estimated collection date. The GenBank page says April 6, 2020. +

+
+
+ +
+

4.1.3 Collection location

+
+

+A search on wikidata says Los Angeles is +https://www.wikidata.org/entity/Q65 +

+
+
+ +
+

4.1.4 Sequencing technology

+
+

+GenBank entry says Illumina, so we can fill that in. +

+
+
+ +
+

4.1.5 Authors

+
+

+GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga +Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X., +Freehan,A. and Garcia-Diaz,J.', so we can fill that in. +

+
+
+
+ +
+

4.2 Optional fields

+
+

+All other fields are optional. But let's see what we can add. +

+
+ +
+

4.2.1 Host information

+
+

+Sadly, not much is known about the host from GenBank. A little +sleuthing renders an interesting paper by some of the authors titled +SARS-CoV-2 is consistent across multiple samples and methodologies +which dates after the sample, but has no reference other than that the +raw data came from the SRA database, so it probably does not describe +this particular sample. We don't know what this strain of SARS-CoV-2 +did to the person and what the person was like (say age group). +

+
+
+ +
+

4.2.2 Collecting institution

+
+

+We can fill that in. +

+
+
+ +
+

4.2.3 Specimen source

+
+

+We have that: nasopharyngeal swab

+
+

4.2.4 Source database accession

+
+

+GenBank, which is http://identifiers.org/insdc/MT536190.1#sequence. +Note we plug in our own identifier MT536190.1. +
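The accession URI follows a fixed pattern, so it can be built from the identifier rather than typed by hand. This is a sketch that assumes the pattern shown above (`http://identifiers.org/insdc/<accession>#sequence`) holds for other INSDC accessions too; the function name `insdc_sequence_uri` is hypothetical.

```python
def insdc_sequence_uri(accession: str) -> str:
    """Build the identifiers.org URI for an INSDC/GenBank accession.

    Assumption: the URI pattern from the MT536190.1 example generalises.
    """
    accession = accession.strip()
    # Reject empty input and identifiers with embedded whitespace.
    if not accession or any(ch.isspace() for ch in accession):
        raise ValueError(f"not a usable accession: {accession!r}")
    return f"http://identifiers.org/insdc/{accession}#sequence"
```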

+
+
-
-

4 Step 2: Metadata

+
+

4.2.5 Strain name

+
+

+SARS-CoV-2/human/USA/LA-BIE-070/2020 +

+
+
+
+
+ +
+

5 Step 3: Submit to COVID-19 PubSeq

+
+

+Once you have the sequence and the metadata together, hit +the 'Add to Pangenome' button. The data will be checked, +submitted and the workflows should kick in! +

+
+ +
+

5.1 Trouble shooting

+
+

+We got an error saying: {"stem": "http://www.wikidata.org/entity/",… +which means that our location field was not formed correctly! After +fixing it to look like http://www.wikidata.org/entity/Q65 (note http +instead of https and entity instead of wiki) the submission went +through. Reload the page (it won't empty the fields) to re-enable the +submit button. +
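The location error above comes down to the exact form of the Wikidata URI. As a rough sketch, a pasted value can be normalised to the `http://www.wikidata.org/entity/` stem quoted in the error message before it goes into the form. The helper name `to_wikidata_entity_uri` is an assumption for illustration and is not part of the uploader.

```python
import re

# Stem taken from the error message quoted above.
ENTITY_STEM = "http://www.wikidata.org/entity/"

# Accepts a bare Q-number, or an http/https URL using /entity/ or /wiki/.
_WIKIDATA_PATTERN = re.compile(
    r"(?:https?://(?:www\.)?wikidata\.org/(?:entity|wiki)/)?(Q\d+)"
)


def to_wikidata_entity_uri(value: str) -> str:
    """Normalise a Wikidata reference to the http .../entity/Qnnn form."""
    match = _WIKIDATA_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a Wikidata entity reference: {value!r}")
    return ENTITY_STEM + match.group(1)
```

Both the https form found by searching Wikidata and the browser address that uses /wiki/ map to the same accepted URI, while free text such as a city name raises an error instead of being silently submitted.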

+
+
-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:00
. +
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 14:22
.
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org index 296bef6..ade902d 100644 --- a/doc/blog/using-covid-19-pubseq-part3.org +++ b/doc/blog/using-covid-19-pubseq-part3.org @@ -3,7 +3,6 @@ # C-c C-e h h publish # C-c ! insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time) # C-c C-t task rotate -# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png #+HTML_HEAD: @@ -14,8 +13,12 @@ * Table of Contents :TOC:noexport: - [[#uploading-data][Uploading Data]] - [[#introduction][Introduction]] - - [[#step-1-sequence][Step 1: Sequence]] - - [[#step-2-metadata][Step 2: Metadata]] + - [[#step-1-upload-sequence][Step 1: Upload sequence]] + - [[#step-2-add-metadata][Step 2: Add metadata]] + - [[#obligatory-fields][Obligatory fields]] + - [[#optional-fields][Optional fields]] + - [[#step-3-submit-to-covid-19-pubseq][Step 3: Submit to COVID-19 PubSeq]] + - [[#trouble-shooting][Trouble shooting]] * Introduction @@ -23,14 +26,120 @@ The COVID-19 PubSeq allows you to upload your SARS-Cov-2 strains to a public resource for global comparisons. Compute it triggered on upload. Read the [[./about][ABOUT]] page for more information. -* Step 1: Sequence +* Step 1: Upload sequence + +To upload a sequence in the [[http://covid19.genenetwork.org/][web upload page]] hit the browse button and +select the FASTA file on your local hard disk. We start with an assembled or mapped sequence in FASTA format. The PubSeq uploader contains a [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/qc_fasta.py][QC step]] which checks whether it is a likely SARS-CoV-2 sequence. While PubSeq deduplicates sequences and never -overwrites metadata it probably pays to check whether your data +overwrites metadata, you may still want to check whether your data already is in the system by querying some metadata as described in -[[./blog?id=using-covid-19-pubseq-part1][Query metadata with SPARQL]]. 
+[[./blog?id=using-covid-19-pubseq-part1][Query metadata with SPARQL]] or by simply downloading and checking one +of the files on the [[./download][download]] page. We find GenBank [[https://www.ncbi.nlm.nih.gov/nuccore/MT536190][MT536190.1]] has not +been included yet. A FASTA text file can be [[https://www.ncbi.nlm.nih.gov/nuccore/MT536190.1?report=fasta&log$=seqview&format=text][downloaded]] to your local +disk and uploaded through our [[./][web upload page]]. Make sure the file does +not include any HTML! + +Note: we currently only allow FASTA uploads. In the near future we'll +allow for uploading raw sequence files. This is important for creating +an improved pangenome. + +* Step 2: Add metadata + +The [[./][web upload page]] contains fields for adding metadata. Metadata is +not only important for attribution, it is also important for +analysis. The metadata is available for queries, see [[./blog?id=using-covid-19-pubseq-part1][Query metadata +with SPARQL]], and can be used to annotate variations of the virus in +different ways. + +A number of fields are obligatory: sample id, date, location, +technology and authors. The others are optional, but it is valuable to +enter them when information is available. Metadata is defined in this +[[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][schema]]. From this schema we generate the input form. Note that +optional fields have a question mark in the ~type~. You can add +metadata yourself, btw, because this is a public resource! See also +[[./blog?id=using-covid-19-pubseq-part5][Modify metadata]] for more information. + +To get more information about a field click on the question mark on +the web form. Here we add some extra information. + +** Obligatory fields + +*** Sample ID (sample_id) + +This is a string field that defines a unique sample identifier by the +submitter.
In addition to sample_id we also have host_id, +provider_sample_id and submitter_sample_id where host is the host the +sample came from, provider sample is the institution sample id and +submitter is the submitting individual id. host_id is important when +multiple sequences come from the same host. Make sure not to have +spaces in the sample_id. + +Here we add the GenBank ID MT536190.1. + +*** Collection date + +Estimated collection date. The GenBank page says April 6, 2020. + +*** Collection location + +A search on wikidata says Los Angeles is +https://www.wikidata.org/entity/Q65 + +*** Sequencing technology + +GenBank entry says Illumina, so we can fill that in. + +*** Authors + +GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga +Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X., +Freehan,A. and Garcia-Diaz,J.', so we can fill that in. + +** Optional fields + +All other fields are optional. But let's see what we can add. + +*** Host information + +Sadly, not much is known about the host from GenBank. A little +sleuthing renders an interesting paper by some of the authors titled +[[https://www.medrxiv.org/content/10.1101/2020.04.24.20078691v1][SARS-CoV-2 is consistent across multiple samples and methodologies]] +which dates after the sample, but has no reference other than that the +raw data came from the SRA database, so it probably does not describe +this particular sample. We don't know what this strain of SARS-CoV-2 +did to the person and what the person was like (say age group). + +*** Collecting institution + +We can fill that in. + +*** Specimen source + +We have that: nasopharyngeal swab + +*** Source database accession + +GenBank, which is http://identifiers.org/insdc/MT536190.1#sequence. +Note we plug in our own identifier MT536190.1.
+ +*** Strain name + +SARS-CoV-2/human/USA/LA-BIE-070/2020 + +* Step 3: Submit to COVID-19 PubSeq + +Once you have the sequence and the metadata together, hit +the 'Add to Pangenome' button. The data will be checked, +submitted and the workflows should kick in! +** Trouble shooting -* Step 2: Metadata +We got an error saying: {"stem": "http://www.wikidata.org/entity/",... +which means that our location field was not formed correctly! After +fixing it to look like http://www.wikidata.org/entity/Q65 (note http +instead of https and entity instead of wiki) the submission went +through. Reload the page (it won't empty the fields) to re-enable the +submit button. -- cgit v1.2.3