#+TITLE: COVID-19 PubSeq (part 6)
#+AUTHOR: Pjotr Prins
# C-c C-e h h   publish
# C-c !         insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
# C-c C-t       task rotate
# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png

#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />


* Table of Contents                                                     :TOC:noexport:
 - [[#short-version][Short version]]
 - [[#generating-output-for-ebi][Generating output for EBI]]
 - [[#defining-the-ebi-study][Defining the EBI study]]
 - [[#define-the-ebi-sample][Define the EBI sample]]
 - [[#define-the-ebi-sequence][Define the EBI sequence]]

* Short version

PubSeq can export files that can be uploaded to EBI/ENA. This saves
you work. Steps are:

1. Register an account for EBI/ENA as explained [[https://ena-docs.readthedocs.io/en/latest/submit/general-guide.html][here]].
2. Register a study online or use XML files discussed below
3. Export a sample XML and push to EBI/ENA
4. Zip sequence data and push to EBI/ENA

Because PubSeq's metadata is richer than the metadata EBI/ENA asks
for, it is easy to generate and export the forms using the [[http://covid19.genenetwork.org/export][EXPORT]]
page.

* Generating output for EBI

Would it not be great if an uploader to PubSeq could also export
samples to, say, EBI? That is what we discuss in this section. The
submission process is somewhat laborious, so once you have submitted
to PubSeq, why not export the same data to EBI with the least amount
of effort?

COVID-19 PubSeq is a data source - both sequence data and metadata -
that can be used to push data to other repositories, such as EBI. You
can register [[https://ena-docs.readthedocs.io/en/latest/submit/samples/programmatic.html][samples programmatically]] through a dedicated XML interface.
Note that (at this point) a sequence (FASTA) can only be submitted
through the [[https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html][Webin-CLI]]. Raw data (FASTQ) can go through the XML
interface.

EBI sequence resources are presented through ENA. For example
[[https://www.ebi.ac.uk/ena/browser/view/MT394864][Sequence: MT394864.1]].

EBI has XML formats for

- SUBMISSION
- STUDY
- SAMPLE
- EXPERIMENT
- RUN
- ANALYSIS
- DAC
- POLICY
- DATASET
- PROJECT

with the schemas listed [[ftp://ftp.ebi.ac.uk/pub/databases/ena/doc/xsd/sra_1_5/][here]].  Since we are submitting sequences we
should follow the [[https://ena-docs.readthedocs.io/en/latest/submit/assembly.html][full genome assembly guidelines]] and the
[[https://ena-docs.readthedocs.io/en/latest/submit/general-guide/programmatic.html][ENA guidelines]]. The first step is to define the study, next the sample
and finally the sequence (assembly).

* Defining the EBI study

A study is defined [[https://ena-docs.readthedocs.io/en/latest/submit/study/programmatic.html][here]] and looks like

#+BEGIN_SRC xml
<PROJECT_SET>
   <PROJECT alias="COVID-19 Washington DC">
      <TITLE>Sequencing SARS-CoV-2 in the Washington DC area</TITLE>
      <DESCRIPTION>This study collects samples from COVID-19 patients in the Washington DC area</DESCRIPTION>
      <SUBMISSION_PROJECT>
         <SEQUENCING_PROJECT/>
      </SUBMISSION_PROJECT>
   </PROJECT>
</PROJECT_SET>
#+END_SRC
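
The project XML above can also be generated from metadata rather than
written by hand. The following is a minimal sketch using only the
Python standard library. The function name =build_project_xml= and its
parameters are assumptions made for illustration and are not part of
PubSeq or of the ENA tooling.

#+BEGIN_SRC python
import xml.etree.ElementTree as ET


def build_project_xml(alias, title, description):
    """Return a PROJECT_SET document (as a string) shaped like the example above.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    project_set = ET.Element("PROJECT_SET")
    project = ET.SubElement(project_set, "PROJECT", alias=alias)
    ET.SubElement(project, "TITLE").text = title
    ET.SubElement(project, "DESCRIPTION").text = description
    submission_project = ET.SubElement(project, "SUBMISSION_PROJECT")
    ET.SubElement(submission_project, "SEQUENCING_PROJECT")
    # ET.indent only exists in Python 3.9+; skip pretty-printing otherwise
    if hasattr(ET, "indent"):
        ET.indent(project_set, space="   ")
    return ET.tostring(project_set, encoding="unicode")


if __name__ == "__main__":
    print(build_project_xml(
        "COVID-19 Washington DC",
        "Sequencing SARS-CoV-2 in the Washington DC area",
        "This study collects samples from COVID-19 patients "
        "in the Washington DC area",
    ))
#+END_SRC

Building the tree with =ElementTree= instead of string formatting
means that characters such as =&= or =<= in a title are escaped
automatically.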

A submission 'command' is also required, which looks like

#+BEGIN_SRC xml
<SUBMISSION>
   <ACTIONS>
      <ACTION>
         <ADD/>
      </ACTION>
      <ACTION>
         <HOLD HoldUntilDate="TODO: release date"/>
      </ACTION>
   </ACTIONS>
</SUBMISSION>
#+END_SRC
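
The submission XML can be generated in the same way, filling in the
release date from a variable. This is a sketch only: the helper
=build_submission_xml= is hypothetical, and it assumes that
=HoldUntilDate= takes an ISO date of the form YYYY-MM-DD, which should
be checked against the ENA schema.

#+BEGIN_SRC python
import datetime
import xml.etree.ElementTree as ET


def build_submission_xml(hold_until=None):
    """Return a SUBMISSION document with an ADD action and an optional HOLD.

    Hypothetical helper. hold_until is a datetime.date; it is written as
    YYYY-MM-DD, which is an assumption about the format ENA expects.
    """
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    add_action = ET.SubElement(actions, "ACTION")
    ET.SubElement(add_action, "ADD")
    if hold_until is not None:
        hold_action = ET.SubElement(actions, "ACTION")
        ET.SubElement(hold_action, "HOLD", HoldUntilDate=hold_until.isoformat())
    return ET.tostring(submission, encoding="unicode")


if __name__ == "__main__":
    # The date here is an arbitrary illustration, not a recommended value
    print(build_submission_xml(datetime.date(2021, 1, 31)))
#+END_SRC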

Working XML examples we tested can be found [[https://github.com/arvados/bh20-seq-resource/tree/master/scripts/submit_ebi/example][here]].

The Webin system accepts these XML files with a command like

: curl -u username:password -F "SUBMISSION=@submission.xml" \
:   -F "PROJECT=@project.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"

as described [[https://ena-docs.readthedocs.io/en/latest/submit/study/programmatic.html#submit-the-xmls-using-curl][here]]. Note that this is the test server. For the real
submission use www.ebi.ac.uk instead of wwwdev.ebi.ac.uk.  You may also
need the =--insecure= switch to circumvent certificate checking.
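
To avoid submitting to the production server by accident, the choice
between the two hosts can be made explicit in a small wrapper. The
sketch below only assembles the =curl= argument list shown above and
does not contact EBI. The function =curl_command= and its parameters
are hypothetical names chosen for illustration.

#+BEGIN_SRC python
import shlex

TEST_HOST = "wwwdev.ebi.ac.uk"
PRODUCTION_HOST = "www.ebi.ac.uk"


def curl_command(username, password, submission="submission.xml",
                 project="project.xml", production=False, insecure=False):
    """Return the curl argument list for a Webin XML submission.

    Hypothetical helper: defaults to the test server, so production
    has to be requested explicitly.
    """
    host = PRODUCTION_HOST if production else TEST_HOST
    cmd = [
        "curl", "-u", f"{username}:{password}",
        "-F", f"SUBMISSION=@{submission}",
        "-F", f"PROJECT=@{project}",
    ]
    if insecure:
        # Skips certificate checking; only use when really needed
        cmd.append("--insecure")
    cmd.append(f"https://{host}/ena/submit/drop-box/submit/")
    return cmd


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run
    # to actually execute it
    print(shlex.join(curl_command("username", "password")))
#+END_SRC

Returning a list rather than a single string means the command can be
handed to =subprocess.run= without going through a shell, so passwords
containing special characters need no extra quoting.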

/work in progress (WIP)/

* Define the EBI sample

EBI's sample form (checklist ERC000033) for viruses is defined [[https://www.ebi.ac.uk/ena/browser/view/ERC000033][here]].
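
As a rough idea of what a sample export could look like, the sketch
below builds a SAMPLE_SET document from a dictionary of metadata. It
is not the PubSeq exporter. The element layout (SAMPLE_NAME, TAXON_ID,
SAMPLE_ATTRIBUTES with TAG and VALUE pairs) and the =ENA-CHECKLIST=
attribute follow our reading of the ENA sample documentation and
should be verified against the schema. The function name, the example
attribute names and the example values are assumptions.

#+BEGIN_SRC python
import xml.etree.ElementTree as ET


def build_sample_xml(alias, title, taxon_id, attributes,
                     checklist="ERC000033"):
    """Return a SAMPLE_SET document with a single sample.

    Hypothetical helper. attributes maps checklist field names to
    values; the layout is an assumption to check against the ENA schema.
    """
    sample_set = ET.Element("SAMPLE_SET")
    sample = ET.SubElement(sample_set, "SAMPLE", alias=alias)
    ET.SubElement(sample, "TITLE").text = title
    sample_name = ET.SubElement(sample, "SAMPLE_NAME")
    ET.SubElement(sample_name, "TAXON_ID").text = str(taxon_id)
    sample_attributes = ET.SubElement(sample, "SAMPLE_ATTRIBUTES")
    # The checklist is recorded as the first attribute, then the metadata
    pairs = [("ENA-CHECKLIST", checklist)] + list(attributes.items())
    for tag, value in pairs:
        attribute = ET.SubElement(sample_attributes, "SAMPLE_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = str(value)
    return ET.tostring(sample_set, encoding="unicode")


if __name__ == "__main__":
    # 2697049 is the NCBI taxonomy ID for SARS-CoV-2; all other values
    # are made up for illustration
    print(build_sample_xml(
        alias="example-sample-1",
        title="SARS-CoV-2 example sample",
        taxon_id=2697049,
        attributes={
            "collection date": "2020-04-01",
            "geographic location (country and/or sea)": "USA",
        },
    ))
#+END_SRC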

/work in progress (WIP)/

* Define the EBI sequence

/work in progress (WIP)/