#+TITLE: Download
#+AUTHOR: Pjotr Prins

* Table of Contents                                                     :TOC:noexport:
 - [[#workflow-runs][Workflow runs]]
 - [[#fasta-files][FASTA files]]
 - [[#metadata][Metadata]]
 - [[#pangenome][Pangenome]]
   - [[#pangenome-gfa-format][Pangenome GFA format]]
   - [[#pangenome-in-odgi-format][Pangenome in ODGI format]]
   - [[#pangenome-rdf-format][Pangenome RDF format]]
   - [[#pangenome-browser-format][Pangenome Browser format]]
 - [[#log-of-workflow-output][Log of workflow output]]
 - [[#all-files][All files]]
 - [[#planned][Planned]]
   - [[#raw-sequence-data][Raw sequence data]]
   - [[#multiple-sequence-alignment-msa][Multiple Sequence Alignment (MSA)]]
   - [[#phylogenetic-tree][Phylogenetic tree]]
   - [[#protein-prediction][Protein prediction]]
 - [[#source-code][Source code]]
 - [[#citing-pubseq][Citing PubSeq]]

* Workflow runs

The latest runs can be viewed [[https://workbench.lugli.arvadosapi.com/projects/lugli-j7d0g-y4k4uswcqi3ku56#Subprojects][here]]. If you click on a run you can see
the workflows that ran under ~Processes~. Output (including intermediate
output) is listed under ~Data collections~. All current data is listed
[[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. Note that it takes time for a run to complete and show up.

* FASTA files

The *public sequence resource* provides all uploaded sequences as
FASTA files.  They can be referred to from metadata individually. We
also provide a single file [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/relabeledSeqs_dedup.fasta][FASTA download]].
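
Once the single FASTA file has been downloaded, its records can be
read with a few lines of Python. The following is a minimal sketch
that uses only the standard library; the function name ~read_fasta~
and the two example records are made up for illustration and are not
part of PubSeq.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, standing in for the downloaded file.
example = io.StringIO(">seq1 example\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
```

For the real download, replace the ~io.StringIO~ object with
~open("relabeledSeqs_dedup.fasta")~, assuming the file was saved
under that name in the current directory.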

* Metadata

Metadata can be downloaded as [[https://www.w3.org/TR/turtle/][Turtle RDF]] in the file [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/mergedmetadata.ttl][mergedmetadata.ttl]], which
can be loaded into any RDF triple-store. We also provide a Virtuoso SPARQL
endpoint ourselves which can be queried at
http://sparql.genenetwork.org/sparql/. Query examples can be found in
the [[https://github.com/arvados/bh20-seq-resource/blob/master/doc/blog/using-covid-19-pubseq-part1.org][DOCS]].
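
As a rough sketch of how the endpoint might be queried from a script,
the following Python code builds a standard SPARQL protocol GET
request using only the standard library. The query is a generic one
that lists a few triples and assumes nothing about the PubSeq schema;
the helper names ~build_request~ and ~run_query~ are hypothetical, and
it is an assumption that the endpoint honours the JSON ~Accept~
header.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://sparql.genenetwork.org/sparql/"

# Generic query: fetch any five triples. It assumes nothing about
# the vocabulary used in the PubSeq metadata.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL protocol GET request asking for JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the list of result bindings.

    Needs network access; assumes the server answers in the
    SPARQL 1.1 JSON results format.
    """
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


# Building the request does not contact the server.
request = build_request(ENDPOINT, QUERY)
```

Calling ~run_query(ENDPOINT, QUERY)~ sends the request. Each returned
binding is a dictionary keyed by the variable names ~s~, ~p~ and ~o~.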

The Swiss Institute of Bioinformatics has included this data in
https://covid-19-sparql.expasy.org/ and made it part of [[https://www.uniprot.org/][UniProt]].

An RDF file that includes the sequences themselves in a variation
graph can be downloaded from the Pangenome RDF format section below.

* Pangenome

Pangenome data is made available in multiple guises. Variation graphs
(VG) provide a succinct encoding of the sequences of many genomes.

** Pangenome GFA format

[[https://github.com/GFA-spec/GFA-spec][GFA]] (Graphical Fragment Assembly) is a standard format for assembly
graphs and is consumed by tools such as [[https://github.com/vgteam/vg][vg]].
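
To give an idea of what a GFA file contains, the sketch below reads
the two most common GFA 1 record types: segments (~S~ lines, holding
a name and a sequence) and links (~L~ lines, connecting two oriented
segments). The function name ~parse_gfa~ and the toy graph are made up
for illustration; real pangenome files are much larger and are better
handled by dedicated tools.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str, str, str]


def parse_gfa(lines: Iterable[str]) -> Tuple[Dict[str, str], List[Link]]:
    """Collect segments and links from tab-separated GFA 1 lines."""
    segments: Dict[str, str] = {}
    links: List[Link] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # S <name> <sequence> [optional tags...]
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from orient> <to> <to orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        # Other record types (H, P, ...) are ignored in this sketch.
    return segments, links


# Hypothetical toy graph: two segments joined by one link.
toy = ["H\tVN:Z:1.0", "S\t1\tACGT", "S\t2\tTT", "L\t1\t+\t2\t+\t0M"]
segments, links = parse_gfa(toy)
```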

** Pangenome in ODGI format

[[https://github.com/vgteam/odgi][ODGI]] is a format that supports an optimised dynamic genome/graph
implementation.

** Pangenome RDF format

An RDF file that includes the sequences themselves in a variation
graph can be downloaded from
[[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][relabeledSeqs-dedup-relabeledSeqs-dedup.ttl.xz]].


** Pangenome Browser format

The many JSON files with names such as
[[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][results/1/chunk001200.bin1.schematic.json]] are consumed by the
Pangenome browser.

* Log of workflow output

A log file of the latest workflow runs is included in the link below.

* All files

https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/

* Planned

We are planning to add the following output (see also

** Raw sequence data

See [[https://github.com/arvados/bh20-seq-resource/issues/16][fastq tracker]] and [[https://github.com/arvados/bh20-seq-resource/issues/63][BAM tracker]].

** Multiple Sequence Alignment (MSA)

See [[https://github.com/arvados/bh20-seq-resource/issues/11][MSA tracker]].

** Phylogenetic tree

See [[https://github.com/arvados/bh20-seq-resource/issues/43][Phylo tracker]].

** Protein prediction

We aim to make protein predictions available.

* Source code

All source code for this website and tooling is available
from
https://github.com/arvados/bh20-seq-resource

* Citing PubSeq

See the [[./about][FAQ]].