COVID-19 PubSeq (part 1)
Table of Contents
-
-
- 1. What does this mean? -
- 2. Fetch sequence data -
- 3. Predicates -
- 4. Fetch submitter info and other metadata -
- 5. Fetch all sequences from Washington state -
- 6. Discussion -
- 7. Acknowledgements +
- 1. What does this mean? +
- 2. Fetch sequence data +
- 3. Predicates +
- 4. Fetch submitter info and other metadata +
- 5. Fetch all sequences from Washington state +
- 6. Discussion +
- 7. Acknowledgements
-As part of the COVID-19 Biohackathon 2020 we formed a working group
-to create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-
-1 What does this mean?
+
+1 What does this mean?
This means that when someone uploads a SARS-CoV-2 sequence using one @@ -328,8 +314,8 @@ initiative!
2 Fetch sequence data
+2 Fetch sequence data
The latest run of the pipeline can be viewed here. Each of these @@ -353,8 +339,8 @@ these identifiers throughout.
3 Predicates
+3 Predicates
To explore an RDF dataset, the first query we can do is open and gets
@@ -464,8 +450,8 @@ Now we got this far, lets
-
To get datasets with submitters we can do the above
@@ -575,8 +561,8 @@ to view/query the database.
Now we know how to get at the origin we can do it the other way round
@@ -603,8 +589,8 @@ half of the set coming out of GenBank.
The public sequence uploader collects sequences, raw data and
@@ -615,8 +601,8 @@ referenced in publications and origins are citeable.
The overall effort was due to magnificent freely donated input by a
@@ -631,7 +617,7 @@ Garrison this initiative would not have existed!
Work in progress!
@@ -265,8 +291,8 @@ for the JavaScript code in this tag.
The COVID-19 PubSeq allows you to upload your SARS-CoV-2 strains to a
@@ -276,27 +302,214 @@ upload. Read the ABOUT page for more information.
+To upload a sequence in the web upload page hit the browse button and
+select the FASTA file on your local hard disk.
+
We start with an assembled or mapped sequence in FASTA format. The
PubSeq uploader contains a QC step which checks whether it is a likely
SARS-CoV-2 sequence. While PubSeq deduplicates sequences and never
-overwrites metadata it probably pays to check whether your data
+overwrites metadata, you may still want to check whether your data
already is in the system by querying some metadata as described in
-Query metadata with SPARQL.
+Query metadata with SPARQL or by simply downloading and checking one
+of the files on the download page. We find GenBank MT536190.1 has not
+been included yet. A FASTA text file can be downloaded to your local
+disk and uploaded through our web upload page. Make sure the file does
+not include any HTML!
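
A quick local check before uploading can catch a file that was saved as a web page rather than as plain FASTA. The following is a minimal sketch and not the PubSeq QC step itself; the function name `looks_like_fasta` and the accepted character set (IUPAC nucleotide codes plus gap characters) are assumptions made for illustration.

```python
import re

# Assumed alphabet: IUPAC nucleotide codes plus '-' and '*' (illustrative only)
SEQ_LINE = re.compile(r"^[ACGTUNRYKMSWBDHVacgtunrykmswbdhv\-\*]+$")
# Anything that looks like an HTML tag, e.g. <html>, <pre>, </div>, <br>
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def looks_like_fasta(text: str) -> bool:
    """Return True if text looks like plain nucleotide FASTA without HTML."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # A FASTA file must start with a '>' header line
    if not lines or not lines[0].startswith(">"):
        return False
    # HTML residue suggests the browser saved the page, not the sequence
    if HTML_TAG.search(text):
        return False
    # Every non-header line should consist of sequence characters only
    return all(ln.startswith(">") or SEQ_LINE.match(ln) for ln in lines)
```

For example, `looks_like_fasta(open("MT536190.1.fasta").read())` would be called on the downloaded file, where the file name is whatever you chose when saving it.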
+
+Note: we currently only allow FASTA uploads. In the near future we'll
+allow for uploading raw sequence files. This is important for creating
+an improved pangenome.
+
+The web upload page contains fields for adding metadata. Metadata is
+not only important for attribution, it is also important for
+analysis. The metadata is available for queries, see Query metadata
+with SPARQL, and can be used to annotate variations of the virus in
+different ways.
+
+A number of fields are obligatory: sample id, date, location,
+technology and authors. The others are optional, but it is valuable to
+enter them when information is available. Metadata is defined in this
+schema. From this schema we generate the input form. Note that
+optional fields have a question mark in the
+To get more information about a field click on the question mark on
+the web form. Here we add some extra information.
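
If you prepare metadata outside the web form, a small check for the five obligatory fields can save a failed submission. This is only a sketch: the dictionary keys below are hypothetical placeholders, since the real field names are defined by the PubSeq schema, and the authors value is shortened for readability.

```python
# Hypothetical keys; the actual names come from the PubSeq metadata schema
REQUIRED = (
    "sample_id",
    "collection_date",
    "collection_location",
    "sequencing_technology",
    "authors",
)


def missing_fields(metadata: dict) -> list:
    """Return the obligatory keys that are absent or empty."""
    missing = []
    for key in REQUIRED:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Values taken from the GenBank MT536190.1 walk-through on this page
example = {
    "sample_id": "MT536190.1",
    "collection_date": "2020-04-06",
    "collection_location": "http://www.wikidata.org/entity/Q65",
    "sequencing_technology": "Illumina",
    "authors": "Lamers,S., Nolan,D.J., Rose,R.",  # shortened for the example
}
print(missing_fields(example))
```

An empty list means all obligatory fields are filled in; any key that is returned still needs a value.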
+
+This is a string field that defines a unique sample identifier by the
+submitter. In addition to sampleid we also have hostid,
+providersampleid and submittersampleid where host is the host the
+sample came from, provider sample is the institution sample id and
+submitter is the submitting individual id. hostid is important when
+multiple sequences come from the same host. Make sure not to have
+spaces in the sampleid.
+
+Here we add the GenBank ID MT536190.1.
+
+Estimated collection date. The GenBank page says April 6, 2020.
+
+A search on Wikidata says Los Angeles is
+https://www.wikidata.org/entity/Q65
+
+GenBank entry says Illumina, so we can fill that in.
+
+GenBank entry says 'Lamers,S., Nolan,D.J., Rose,R., Cross,S., Moraga
+Amador,D., Yang,T., Caruso,L., Navia,W., Von Borstel,L., Hui Zhou,X.,
+Freehan,A. and Garcia-Diaz,J.', so we can fill that in.
+
+All other fields are optional. But let's see what we can add.
+
+Sadly, not much is known about the host from GenBank. A little
+sleuthing renders an interesting paper by some of the authors titled
+SARS-CoV-2 is consistent across multiple samples and methodologies
+which dates after the sample, but has no reference other than that the
+raw data came from the SRA database, so it probably does not describe
+this particular sample. We don't know what this strain of SARS-CoV-2
+did to the person and what the person was like (say age group).
+
+We can fill that in.
+
+We have that: nasopharyngeal swab
+GenBank which is http://identifiers.org/insdc/MT536190.1#sequence.
+Note we plug in our own identifier MT536190.1.
+
+SARS-CoV-2/human/USA/LA-BIE-070/2020
+
+Once you have the sequence and the metadata together, hit
+the 'Add to Pangenome' button. The data will be checked,
+submitted and the workflows should kick in!
+
+We got an error saying: {"stem": "http://www.wikidata.org/entity/",…
+which means that our location field was not formed correctly! After
+fixing it to look like http://www.wikidata.org/entity/Q65 (note http
+instead of https and entity instead of wiki) the submission went
+through. Reload the page (it won't empty the fields) to re-enable the
+submit button.
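
The fix described above can be applied mechanically to a URL copied from the browser. The sketch below rewrites common Wikidata URL variants to the `http://www.wikidata.org/entity/Q...` form shown in the error message; the function name `to_entity_uri` is hypothetical and the set of accepted variants is an assumption, not something the uploader itself provides.

```python
import re

# Stem reported in the uploader's error message
ENTITY_STEM = "http://www.wikidata.org/entity/"

# Assumed variants: http or https, optional www, /wiki/ or /entity/ path
WIKIDATA_URL = re.compile(
    r"https?://(?:www\.)?wikidata\.org/(?:wiki|entity)/(Q[1-9]\d*)/?"
)


def to_entity_uri(url: str) -> str:
    """Rewrite a Wikidata item URL to the http .../entity/Q... form."""
    match = WIKIDATA_URL.fullmatch(url.strip())
    if match is None:
        raise ValueError(f"not a recognisable Wikidata item URL: {url!r}")
    return ENTITY_STEM + match.group(1)


print(to_entity_uri("https://www.wikidata.org/wiki/Q65"))
```

Raising an error for anything that is not a Wikidata item URL is deliberate: a free-text place name cannot be converted reliably and should be looked up by hand.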
+4 Fetch submitter info and other metadata
+4 Fetch submitter info and other metadata
5 Fetch all sequences from Washington state
+5 Fetch all sequences from Washington state
6 Discussion
+6 Discussion
7 Acknowledgements
+7 Acknowledgements
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:12.
+
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 12:06.
Table of Contents
1 Uploading Data
+1 Uploading Data
2 Introduction
+2 Introduction
3 Step 1: Sequence
+3 Step 1: Upload sequence
4 Step 2: Add metadata
+type
. You can add
+metadata yourself, btw, because this is a public resource! See also
+Modify metadata for more information.
+4.1 Obligatory fields
+4.1.1 Sample ID (sampleid)
+4.1.2 Collection date
+4.1.3 Collection location
+4.1.4 Sequencing technology
+4.1.5 Authors
+4.2 Optional fields
+4.2.1 Host information
+4.2.2 Collecting institution
+4.2.3 Specimen source
+4.2.4 Source database accession
+4 Step 2: Metadata
+4.2.5 Strain name
+5 Step 3: Submit to COVID-19 PubSeq
+5.1 Troubleshooting
+
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 10:00.
+
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-05-29 Fri 14:22.