author Peter Amstutz 2020-08-05 16:06:11 -0400
committer GitHub 2020-08-05 16:06:11 -0400
commit fdb1b012fc04ee07f401541e181e28fe442c9454
tree 8486db1087692dffcea9d93814e436d9cf150b47
parent 86f31ef60f65a820bf9ac25c3fc01c88f2a9ebfe
parent 2d20bf90497588a297ca98a78ee0fbbcadf95569
Merge pull request #99 from AndreaGuarracino/patch-2
several fixes in the website, added links to video talk and poster, new pangenome generation workflow
-rw-r--r--  README.md  2
-rw-r--r--  bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf  bin 0 -> 2971149 bytes
-rw-r--r--  bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png  bin 0 -> 160370 bytes
-rw-r--r--  bh20simplewebuploader/static/main.css  2
-rw-r--r--  bh20simplewebuploader/templates/blurb.html  6
-rw-r--r--  bh20simplewebuploader/templates/footer.html  5
-rw-r--r--  doc/blog/using-covid-19-pubseq-part2.html  33
-rw-r--r--  doc/blog/using-covid-19-pubseq-part2.org  25
-rw-r--r--  doc/blog/using-covid-19-pubseq-part3.html  2
-rw-r--r--  doc/blog/using-covid-19-pubseq-part3.org  2
-rw-r--r--  doc/web/about.html  1489
-rw-r--r--  doc/web/about.org  20
-rw-r--r--  image/homepage.png  bin 0 -> 243544 bytes
-rw-r--r--  image/website.png  bin 220860 -> 0 bytes
-rw-r--r--  workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl  29
-rw-r--r--  workflows/pangenome-generate/pangenome-generate_spoa.cwl  122
-rw-r--r--  workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl  18
-rw-r--r--  workflows/pangenome-generate/sort_fasta_by_quality_and_len.py  35
-rw-r--r--  workflows/pangenome-generate/spoa.cwl  27
19 files changed, 1219 insertions, 598 deletions
diff --git a/README.md b/README.md
index 8c3a589..03e4297 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ web interface. You can use it to upload the genomes of SARS-CoV-2
samples to make them publicly and freely available to other
researchers. For more information see the [paper](./paper/paper.md).
-![alt text](./image/website.png "Website")
+![alt text](./image/homepage.png "Website")
To get started, first [install the uploader](#installation), and use the `bh20-seq-uploader` command to [upload your data](#usage).
diff --git a/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf
new file mode 100644
index 0000000..7da8cd6
--- /dev/null
+++ b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf
Binary files differ
diff --git a/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png
new file mode 100644
index 0000000..eae2721
--- /dev/null
+++ b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png
Binary files differ
diff --git a/bh20simplewebuploader/static/main.css b/bh20simplewebuploader/static/main.css
index bdcc0bc..7c33d9c 100644
--- a/bh20simplewebuploader/static/main.css
+++ b/bh20simplewebuploader/static/main.css
@@ -177,7 +177,7 @@ span.dropt:hover {text-decoration: none; background: #ffffff; z-index: 6; }
.about {
display: grid;
- grid-template-columns: 1fr 1fr;
+ grid-template-columns: 1fr 1fr 1fr;
grid-auto-flow: row;
}
diff --git a/bh20simplewebuploader/templates/blurb.html b/bh20simplewebuploader/templates/blurb.html
index 9eef7c2..067cc3b 100644
--- a/bh20simplewebuploader/templates/blurb.html
+++ b/bh20simplewebuploader/templates/blurb.html
@@ -2,12 +2,12 @@
This is the COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
SARS-CoV-2 virus sequences. COVID-19 PubSeq is a repository for
sequences with a low barrier to entry for uploading sequence data
- using best practices, including <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a>. I.e., data published with a creative commons
- CC0 or CC-4.0 license with metadata using state-of-the art standards
+ using best practices, including <a href="https://en.wikipedia.org/wiki/FAIR_data">FAIR data</a>. Data are published with
+ metadata using state-of-the art standards
and, perhaps most importantly, providing standardised workflows that
get triggered on upload, so that results are immediately available
in standardised data formats.
-
+
Your uploaded sequence will automatically be processed and
incorporated into the public pangenome with metadata using workflows
from the High Performance Open Biology Lab
diff --git a/bh20simplewebuploader/templates/footer.html b/bh20simplewebuploader/templates/footer.html
index 26ea82a..abf46c3 100644
--- a/bh20simplewebuploader/templates/footer.html
+++ b/bh20simplewebuploader/templates/footer.html
@@ -15,6 +15,11 @@
</p>
</div>
+ <div>
+ <a href="static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf">
+ <img src="static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png" alt="BCC2020 Andrea Guarracino COVID19 PubSeq Poster"/>
+ </a>
+ </div>
<div class="sponsors">
<div class="sponsorimg">
<a href="https://github.com/virtual-biohackathons/covid-19-bh20">
diff --git a/doc/blog/using-covid-19-pubseq-part2.html b/doc/blog/using-covid-19-pubseq-part2.html
index c047441..c041ebe 100644
--- a/doc/blog/using-covid-19-pubseq-part2.html
+++ b/doc/blog/using-covid-19-pubseq-part2.html
@@ -259,39 +259,12 @@ for the JavaScript code in this tag.
</ul>
</div>
</div>
-<p>
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-</p>
<div id="outline-container-org7942167" class="outline-2">
<h2 id="org7942167"><span class="section-number-2">1</span> Finding output of workflows</h2>
<div class="outline-text-2" id="text-1">
-<p>
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-</p>
-</div>
-</div>
-<div id="outline-container-org0022bbe" class="outline-2">
-<h2 id="org0022bbe"><span class="section-number-2">2</span> Introduction</h2>
-<div class="outline-text-2" id="text-2">
-<p>
+ <p>
We are using Arvados to run common workflow language (CWL) pipelines.
The most recent output is on display on a <a href="https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca">web page</a> (with time stamp)
and a full list is generated <a href="https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/">here</a>. It is nice to start up, but for
@@ -302,7 +275,7 @@ want to wade through thousands of output files!
</div>
<div id="outline-container-org3929710" class="outline-2">
-<h2 id="org3929710"><span class="section-number-2">3</span> The Arvados file interface</h2>
+<h2 id="org3929710"><span class="section-number-2">2</span> The Arvados file interface</h2>
<div class="outline-text-2" id="text-3">
<p>
Arvados has the web server, but it also has a REST API and associated
@@ -384,7 +357,7 @@ arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839d
</div>
<div id="outline-container-orgc4dba6e" class="outline-2">
-<h2 id="orgc4dba6e"><span class="section-number-2">4</span> Using the Arvados API</h2>
+<h2 id="orgc4dba6e"><span class="section-number-2">3</span> TODO Using the Arvados API</h2>
</div>
</div>
<div id="postamble" class="status">
diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org
index d2a1cbc..349fd06 100644
--- a/doc/blog/using-covid-19-pubseq-part2.org
+++ b/doc/blog/using-covid-19-pubseq-part2.org
@@ -8,36 +8,13 @@
#+HTML_LINK_HOME: http://covid19.genenetwork.org
#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-
* Table of Contents :TOC:noexport:
- [[#finding-output-of-workflows][Finding output of workflows]]
- - [[#introduction][Introduction]]
- [[#the-arvados-file-interface][The Arvados file interface]]
- [[#using-the-arvados-api][Using the Arvados API]]
* Finding output of workflows
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-
-* Introduction
-
We are using Arvados to run common workflow language (CWL) pipelines.
The most recent output is on display on a [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][web page]] (with time stamp)
and a full list is generated [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. It is nice to start up, but for
@@ -81,4 +58,4 @@ its listed UUID:
: arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
-* Using the Arvados API
+* TODO Using the Arvados API
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index 91879b0..df4a286 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -625,7 +625,7 @@ The web interface using this exact same script so it should just work
<h3 id="org39adf09"><span class="section-number-3">6.2</span> Example: uploading bulk GenBank sequences</h3>
<div class="outline-text-3" id="text-6-2">
<p>
-We also use above script to bulk upload GenBank sequences with a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py">FASTA
+We also use above script to bulk upload GenBank sequences with a <a href="https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py">FASTA
and YAML</a> extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
</p>
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index 03f37ab..e8fee36 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -234,6 +234,6 @@ The web interface using this exact same script so it should just work
** Example: uploading bulk GenBank sequences
-We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA
+We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py][FASTA
and YAML]] extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
diff --git a/doc/web/about.html b/doc/web/about.html
index dfd4252..c971a4e 100644
--- a/doc/web/about.html
+++ b/doc/web/about.html
@@ -1,549 +1,964 @@
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
-"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
-<!-- 2020-07-18 Sat 03:27 -->
-<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
-<meta name="viewport" content="width=device-width, initial-scale=1" />
-<title>About/FAQ</title>
-<meta name="generator" content="Org mode" />
-<meta name="author" content="Pjotr Prins" />
-<style type="text/css">
- <!--/*--><![CDATA[/*><!--*/
- .title { text-align: center;
- margin-bottom: .2em; }
- .subtitle { text-align: center;
- font-size: medium;
- font-weight: bold;
- margin-top:0; }
- .todo { font-family: monospace; color: red; }
- .done { font-family: monospace; color: green; }
- .priority { font-family: monospace; color: orange; }
- .tag { background-color: #eee; font-family: monospace;
- padding: 2px; font-size: 80%; font-weight: normal; }
- .timestamp { color: #bebebe; }
- .timestamp-kwd { color: #5f9ea0; }
- .org-right { margin-left: auto; margin-right: 0px; text-align: right; }
- .org-left { margin-left: 0px; margin-right: auto; text-align: left; }
- .org-center { margin-left: auto; margin-right: auto; text-align: center; }
- .underline { text-decoration: underline; }
- #postamble p, #preamble p { font-size: 90%; margin: .2em; }
- p.verse { margin-left: 3%; }
- pre {
- border: 1px solid #ccc;
- box-shadow: 3px 3px 3px #eee;
- padding: 8pt;
- font-family: monospace;
- overflow: auto;
- margin: 1.2em;
- }
- pre.src {
- position: relative;
- overflow: visible;
- padding-top: 1.2em;
- }
- pre.src:before {
- display: none;
- position: absolute;
- background-color: white;
- top: -10px;
- right: 10px;
- padding: 3px;
- border: 1px solid black;
- }
- pre.src:hover:before { display: inline;}
- /* Languages per Org manual */
- pre.src-asymptote:before { content: 'Asymptote'; }
- pre.src-awk:before { content: 'Awk'; }
- pre.src-C:before { content: 'C'; }
- /* pre.src-C++ doesn't work in CSS */
- pre.src-clojure:before { content: 'Clojure'; }
- pre.src-css:before { content: 'CSS'; }
- pre.src-D:before { content: 'D'; }
- pre.src-ditaa:before { content: 'ditaa'; }
- pre.src-dot:before { content: 'Graphviz'; }
- pre.src-calc:before { content: 'Emacs Calc'; }
- pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
- pre.src-fortran:before { content: 'Fortran'; }
- pre.src-gnuplot:before { content: 'gnuplot'; }
- pre.src-haskell:before { content: 'Haskell'; }
- pre.src-hledger:before { content: 'hledger'; }
- pre.src-java:before { content: 'Java'; }
- pre.src-js:before { content: 'Javascript'; }
- pre.src-latex:before { content: 'LaTeX'; }
- pre.src-ledger:before { content: 'Ledger'; }
- pre.src-lisp:before { content: 'Lisp'; }
- pre.src-lilypond:before { content: 'Lilypond'; }
- pre.src-lua:before { content: 'Lua'; }
- pre.src-matlab:before { content: 'MATLAB'; }
- pre.src-mscgen:before { content: 'Mscgen'; }
- pre.src-ocaml:before { content: 'Objective Caml'; }
- pre.src-octave:before { content: 'Octave'; }
- pre.src-org:before { content: 'Org mode'; }
- pre.src-oz:before { content: 'OZ'; }
- pre.src-plantuml:before { content: 'Plantuml'; }
- pre.src-processing:before { content: 'Processing.js'; }
- pre.src-python:before { content: 'Python'; }
- pre.src-R:before { content: 'R'; }
- pre.src-ruby:before { content: 'Ruby'; }
- pre.src-sass:before { content: 'Sass'; }
- pre.src-scheme:before { content: 'Scheme'; }
- pre.src-screen:before { content: 'Gnu Screen'; }
- pre.src-sed:before { content: 'Sed'; }
- pre.src-sh:before { content: 'shell'; }
- pre.src-sql:before { content: 'SQL'; }
- pre.src-sqlite:before { content: 'SQLite'; }
- /* additional languages in org.el's org-babel-load-languages alist */
- pre.src-forth:before { content: 'Forth'; }
- pre.src-io:before { content: 'IO'; }
- pre.src-J:before { content: 'J'; }
- pre.src-makefile:before { content: 'Makefile'; }
- pre.src-maxima:before { content: 'Maxima'; }
- pre.src-perl:before { content: 'Perl'; }
- pre.src-picolisp:before { content: 'Pico Lisp'; }
- pre.src-scala:before { content: 'Scala'; }
- pre.src-shell:before { content: 'Shell Script'; }
- pre.src-ebnf2ps:before { content: 'ebfn2ps'; }
- /* additional language identifiers per "defun org-babel-execute"
- in ob-*.el */
- pre.src-cpp:before { content: 'C++'; }
- pre.src-abc:before { content: 'ABC'; }
- pre.src-coq:before { content: 'Coq'; }
- pre.src-groovy:before { content: 'Groovy'; }
- /* additional language identifiers from org-babel-shell-names in
- ob-shell.el: ob-shell is the only babel language using a lambda to put
- the execution function name together. */
- pre.src-bash:before { content: 'bash'; }
- pre.src-csh:before { content: 'csh'; }
- pre.src-ash:before { content: 'ash'; }
- pre.src-dash:before { content: 'dash'; }
- pre.src-ksh:before { content: 'ksh'; }
- pre.src-mksh:before { content: 'mksh'; }
- pre.src-posh:before { content: 'posh'; }
- /* Additional Emacs modes also supported by the LaTeX listings package */
- pre.src-ada:before { content: 'Ada'; }
- pre.src-asm:before { content: 'Assembler'; }
- pre.src-caml:before { content: 'Caml'; }
- pre.src-delphi:before { content: 'Delphi'; }
- pre.src-html:before { content: 'HTML'; }
- pre.src-idl:before { content: 'IDL'; }
- pre.src-mercury:before { content: 'Mercury'; }
- pre.src-metapost:before { content: 'MetaPost'; }
- pre.src-modula-2:before { content: 'Modula-2'; }
- pre.src-pascal:before { content: 'Pascal'; }
- pre.src-ps:before { content: 'PostScript'; }
- pre.src-prolog:before { content: 'Prolog'; }
- pre.src-simula:before { content: 'Simula'; }
- pre.src-tcl:before { content: 'tcl'; }
- pre.src-tex:before { content: 'TeX'; }
- pre.src-plain-tex:before { content: 'Plain TeX'; }
- pre.src-verilog:before { content: 'Verilog'; }
- pre.src-vhdl:before { content: 'VHDL'; }
- pre.src-xml:before { content: 'XML'; }
- pre.src-nxml:before { content: 'XML'; }
- /* add a generic configuration mode; LaTeX export needs an additional
- (add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
- pre.src-conf:before { content: 'Configuration File'; }
-
- table { border-collapse:collapse; }
- caption.t-above { caption-side: top; }
- caption.t-bottom { caption-side: bottom; }
- td, th { vertical-align:top; }
- th.org-right { text-align: center; }
- th.org-left { text-align: center; }
- th.org-center { text-align: center; }
- td.org-right { text-align: right; }
- td.org-left { text-align: left; }
- td.org-center { text-align: center; }
- dt { font-weight: bold; }
- .footpara { display: inline; }
- .footdef { margin-bottom: 1em; }
- .figure { padding: 1em; }
- .figure p { text-align: center; }
- .equation-container {
- display: table;
- text-align: center;
- width: 100%;
- }
- .equation {
- vertical-align: middle;
- }
- .equation-label {
- display: table-cell;
- text-align: right;
- vertical-align: middle;
- }
- .inlinetask {
- padding: 10px;
- border: 2px solid gray;
- margin: 10px;
- background: #ffffcc;
- }
- #org-div-home-and-up
- { text-align: right; font-size: 70%; white-space: nowrap; }
- textarea { overflow-x: auto; }
- .linenr { font-size: smaller }
- .code-highlighted { background-color: #ffff00; }
- .org-info-js_info-navigation { border-style: none; }
- #org-info-js_console-label
- { font-size: 10px; font-weight: bold; white-space: nowrap; }
- .org-info-js_search-highlight
- { background-color: #ffff00; color: #000000; font-weight: bold; }
- .org-svg { width: 90%; }
- /*]]>*/-->
-</style>
-<script type="text/javascript">
-/*
-@licstart The following is the entire license notice for the
-JavaScript code in this tag.
-
-Copyright (C) 2012-2020 Free Software Foundation, Inc.
-
-The JavaScript code in this tag is free software: you can
-redistribute it and/or modify it under the terms of the GNU
-General Public License (GNU GPL) as published by the Free Software
-Foundation, either version 3 of the License, or (at your option)
-any later version. The code is distributed WITHOUT ANY WARRANTY;
-without even the implied warranty of MERCHANTABILITY or FITNESS
-FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
-
-As additional permission under GNU GPL version 3 section 7, you
-may distribute non-source (e.g., minimized or compacted) forms of
-that code without the copy of the GNU GPL normally required by
-section 4, provided you include this license notice and a URL
-through which recipients can access the Corresponding Source.
-
-
-@licend The above is the entire license notice
-for the JavaScript code in this tag.
-*/
-<!--/*--><![CDATA[/*><!--*/
- function CodeHighlightOn(elem, id)
- {
- var target = document.getElementById(id);
- if(null != target) {
- elem.cacheClassElem = elem.className;
- elem.cacheClassTarget = target.className;
- target.className = "code-highlighted";
- elem.className = "code-highlighted";
- }
- }
- function CodeHighlightOff(elem, id)
- {
- var target = document.getElementById(id);
- if(elem.cacheClassElem)
- elem.className = elem.cacheClassElem;
- if(elem.cacheClassTarget)
- target.className = elem.cacheClassTarget;
- }
-/*]]>*///-->
-</script>
+ <!-- 2020-07-18 Sat 03:27 -->
+ <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
+ <meta name="viewport" content="width=device-width, initial-scale=1"/>
+ <title>About/FAQ</title>
+ <meta name="generator" content="Org mode"/>
+ <meta name="author" content="Pjotr Prins"/>
+ <style type="text/css">
+ <!-- /*--><![CDATA[/*><!--*/
+ .title {
+ text-align: center;
+ margin-bottom: .2em;
+ }
+
+ .subtitle {
+ text-align: center;
+ font-size: medium;
+ font-weight: bold;
+ margin-top: 0;
+ }
+
+ .todo {
+ font-family: monospace;
+ color: red;
+ }
+
+ .done {
+ font-family: monospace;
+ color: green;
+ }
+
+ .priority {
+ font-family: monospace;
+ color: orange;
+ }
+
+ .tag {
+ background-color: #eee;
+ font-family: monospace;
+ padding: 2px;
+ font-size: 80%;
+ font-weight: normal;
+ }
+
+ .timestamp {
+ color: #bebebe;
+ }
+
+ .timestamp-kwd {
+ color: #5f9ea0;
+ }
+
+ .org-right {
+ margin-left: auto;
+ margin-right: 0px;
+ text-align: right;
+ }
+
+ .org-left {
+ margin-left: 0px;
+ margin-right: auto;
+ text-align: left;
+ }
+
+ .org-center {
+ margin-left: auto;
+ margin-right: auto;
+ text-align: center;
+ }
+
+ .underline {
+ text-decoration: underline;
+ }
+
+ #postamble p, #preamble p {
+ font-size: 90%;
+ margin: .2em;
+ }
+
+ p.verse {
+ margin-left: 3%;
+ }
+
+ pre {
+ border: 1px solid #ccc;
+ box-shadow: 3px 3px 3px #eee;
+ padding: 8pt;
+ font-family: monospace;
+ overflow: auto;
+ margin: 1.2em;
+ }
+
+ pre.src {
+ position: relative;
+ overflow: visible;
+ padding-top: 1.2em;
+ }
+
+ pre.src:before {
+ display: none;
+ position: absolute;
+ background-color: white;
+ top: -10px;
+ right: 10px;
+ padding: 3px;
+ border: 1px solid black;
+ }
+
+ pre.src:hover:before {
+ display: inline;
+ }
+
+ /* Languages per Org manual */
+ pre.src-asymptote:before {
+ content: 'Asymptote';
+ }
+
+ pre.src-awk:before {
+ content: 'Awk';
+ }
+
+ pre.src-C:before {
+ content: 'C';
+ }
+
+ /* pre.src-C++ doesn't work in CSS */
+ pre.src-clojure:before {
+ content: 'Clojure';
+ }
+
+ pre.src-css:before {
+ content: 'CSS';
+ }
+
+ pre.src-D:before {
+ content: 'D';
+ }
+
+ pre.src-ditaa:before {
+ content: 'ditaa';
+ }
+
+ pre.src-dot:before {
+ content: 'Graphviz';
+ }
+
+ pre.src-calc:before {
+ content: 'Emacs Calc';
+ }
+
+ pre.src-emacs-lisp:before {
+ content: 'Emacs Lisp';
+ }
+
+ pre.src-fortran:before {
+ content: 'Fortran';
+ }
+
+ pre.src-gnuplot:before {
+ content: 'gnuplot';
+ }
+
+ pre.src-haskell:before {
+ content: 'Haskell';
+ }
+
+ pre.src-hledger:before {
+ content: 'hledger';
+ }
+
+ pre.src-java:before {
+ content: 'Java';
+ }
+
+ pre.src-js:before {
+ content: 'Javascript';
+ }
+
+ pre.src-latex:before {
+ content: 'LaTeX';
+ }
+
+ pre.src-ledger:before {
+ content: 'Ledger';
+ }
+
+ pre.src-lisp:before {
+ content: 'Lisp';
+ }
+
+ pre.src-lilypond:before {
+ content: 'Lilypond';
+ }
+
+ pre.src-lua:before {
+ content: 'Lua';
+ }
+
+ pre.src-matlab:before {
+ content: 'MATLAB';
+ }
+
+ pre.src-mscgen:before {
+ content: 'Mscgen';
+ }
+
+ pre.src-ocaml:before {
+ content: 'Objective Caml';
+ }
+
+ pre.src-octave:before {
+ content: 'Octave';
+ }
+
+ pre.src-org:before {
+ content: 'Org mode';
+ }
+
+ pre.src-oz:before {
+ content: 'OZ';
+ }
+
+ pre.src-plantuml:before {
+ content: 'Plantuml';
+ }
+
+ pre.src-processing:before {
+ content: 'Processing.js';
+ }
+
+ pre.src-python:before {
+ content: 'Python';
+ }
+
+ pre.src-R:before {
+ content: 'R';
+ }
+
+ pre.src-ruby:before {
+ content: 'Ruby';
+ }
+
+ pre.src-sass:before {
+ content: 'Sass';
+ }
+
+ pre.src-scheme:before {
+ content: 'Scheme';
+ }
+
+ pre.src-screen:before {
+ content: 'Gnu Screen';
+ }
+
+ pre.src-sed:before {
+ content: 'Sed';
+ }
+
+ pre.src-sh:before {
+ content: 'shell';
+ }
+
+ pre.src-sql:before {
+ content: 'SQL';
+ }
+
+ pre.src-sqlite:before {
+ content: 'SQLite';
+ }
+
+ /* additional languages in org.el's org-babel-load-languages alist */
+ pre.src-forth:before {
+ content: 'Forth';
+ }
+
+ pre.src-io:before {
+ content: 'IO';
+ }
+
+ pre.src-J:before {
+ content: 'J';
+ }
+
+ pre.src-makefile:before {
+ content: 'Makefile';
+ }
+
+ pre.src-maxima:before {
+ content: 'Maxima';
+ }
+
+ pre.src-perl:before {
+ content: 'Perl';
+ }
+
+ pre.src-picolisp:before {
+ content: 'Pico Lisp';
+ }
+
+ pre.src-scala:before {
+ content: 'Scala';
+ }
+
+ pre.src-shell:before {
+ content: 'Shell Script';
+ }
+
+ pre.src-ebnf2ps:before {
+ content: 'ebfn2ps';
+ }
+
+ /* additional language identifiers per "defun org-babel-execute"
+ in ob-*.el */
+ pre.src-cpp:before {
+ content: 'C++';
+ }
+
+ pre.src-abc:before {
+ content: 'ABC';
+ }
+
+ pre.src-coq:before {
+ content: 'Coq';
+ }
+
+ pre.src-groovy:before {
+ content: 'Groovy';
+ }
+
+ /* additional language identifiers from org-babel-shell-names in
+ ob-shell.el: ob-shell is the only babel language using a lambda to put
+ the execution function name together. */
+ pre.src-bash:before {
+ content: 'bash';
+ }
+
+ pre.src-csh:before {
+ content: 'csh';
+ }
+
+ pre.src-ash:before {
+ content: 'ash';
+ }
+
+ pre.src-dash:before {
+ content: 'dash';
+ }
+
+ pre.src-ksh:before {
+ content: 'ksh';
+ }
+
+ pre.src-mksh:before {
+ content: 'mksh';
+ }
+
+ pre.src-posh:before {
+ content: 'posh';
+ }
+
+ /* Additional Emacs modes also supported by the LaTeX listings package */
+ pre.src-ada:before {
+ content: 'Ada';
+ }
+
+ pre.src-asm:before {
+ content: 'Assembler';
+ }
+
+ pre.src-caml:before {
+ content: 'Caml';
+ }
+
+ pre.src-delphi:before {
+ content: 'Delphi';
+ }
+
+ pre.src-html:before {
+ content: 'HTML';
+ }
+
+ pre.src-idl:before {
+ content: 'IDL';
+ }
+
+ pre.src-mercury:before {
+ content: 'Mercury';
+ }
+
+ pre.src-metapost:before {
+ content: 'MetaPost';
+ }
+
+ pre.src-modula-2:before {
+ content: 'Modula-2';
+ }
+
+ pre.src-pascal:before {
+ content: 'Pascal';
+ }
+
+ pre.src-ps:before {
+ content: 'PostScript';
+ }
+
+ pre.src-prolog:before {
+ content: 'Prolog';
+ }
+
+ pre.src-simula:before {
+ content: 'Simula';
+ }
+
+ pre.src-tcl:before {
+ content: 'tcl';
+ }
+
+ pre.src-tex:before {
+ content: 'TeX';
+ }
+
+ pre.src-plain-tex:before {
+ content: 'Plain TeX';
+ }
+
+ pre.src-verilog:before {
+ content: 'Verilog';
+ }
+
+ pre.src-vhdl:before {
+ content: 'VHDL';
+ }
+
+ pre.src-xml:before {
+ content: 'XML';
+ }
+
+ pre.src-nxml:before {
+ content: 'XML';
+ }
+
+ /* add a generic configuration mode; LaTeX export needs an additional
+ (add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */
+ pre.src-conf:before {
+ content: 'Configuration File';
+ }
+
+ table {
+ border-collapse: collapse;
+ }
+
+ caption.t-above {
+ caption-side: top;
+ }
+
+ caption.t-bottom {
+ caption-side: bottom;
+ }
+
+ td, th {
+ vertical-align: top;
+ }
+
+ th.org-right {
+ text-align: center;
+ }
+
+ th.org-left {
+ text-align: center;
+ }
+
+ th.org-center {
+ text-align: center;
+ }
+
+ td.org-right {
+ text-align: right;
+ }
+
+ td.org-left {
+ text-align: left;
+ }
+
+ td.org-center {
+ text-align: center;
+ }
+
+ dt {
+ font-weight: bold;
+ }
+
+ .footpara {
+ display: inline;
+ }
+
+ .footdef {
+ margin-bottom: 1em;
+ }
+
+ .figure {
+ padding: 1em;
+ }
+
+ .figure p {
+ text-align: center;
+ }
+
+ .equation-container {
+ display: table;
+ text-align: center;
+ width: 100%;
+ }
+
+ .equation {
+ vertical-align: middle;
+ }
+
+ .equation-label {
+ display: table-cell;
+ text-align: right;
+ vertical-align: middle;
+ }
+
+ .inlinetask {
+ padding: 10px;
+ border: 2px solid gray;
+ margin: 10px;
+ background: #ffffcc;
+ }
+
+ #org-div-home-and-up {
+ text-align: right;
+ font-size: 70%;
+ white-space: nowrap;
+ }
+
+ textarea {
+ overflow-x: auto;
+ }
+
+ .linenr {
+ font-size: smaller
+ }
+
+ .code-highlighted {
+ background-color: #ffff00;
+ }
+
+ .org-info-js_info-navigation {
+ border-style: none;
+ }
+
+ #org-info-js_console-label {
+ font-size: 10px;
+ font-weight: bold;
+ white-space: nowrap;
+ }
+
+ .org-info-js_search-highlight {
+ background-color: #ffff00;
+ color: #000000;
+ font-weight: bold;
+ }
+
+ .org-svg {
+ width: 90%;
+ }
+
+ /*]]>*/
+ -->
+ </style>
+ <script type="text/javascript">
+ /*
+ @licstart The following is the entire license notice for the
+ JavaScript code in this tag.
+
+ Copyright (C) 2012-2020 Free Software Foundation, Inc.
+
+ The JavaScript code in this tag is free software: you can
+ redistribute it and/or modify it under the terms of the GNU
+ General Public License (GNU GPL) as published by the Free Software
+ Foundation, either version 3 of the License, or (at your option)
+ any later version. The code is distributed WITHOUT ANY WARRANTY;
+ without even the implied warranty of MERCHANTABILITY or FITNESS
+ FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
+
+ As additional permission under GNU GPL version 3 section 7, you
+ may distribute non-source (e.g., minimized or compacted) forms of
+ that code without the copy of the GNU GPL normally required by
+ section 4, provided you include this license notice and a URL
+ through which recipients can access the Corresponding Source.
+
+
+ @licend The above is the entire license notice
+ for the JavaScript code in this tag.
+ */
+ <!--/*--><![CDATA[/*><!--*/
+ function CodeHighlightOn(elem, id) {
+ var target = document.getElementById(id);
+ if (null != target) {
+ elem.cacheClassElem = elem.className;
+ elem.cacheClassTarget = target.className;
+ target.className = "code-highlighted";
+ elem.className = "code-highlighted";
+ }
+ }
+
+ function CodeHighlightOff(elem, id) {
+ var target = document.getElementById(id);
+ if (elem.cacheClassElem)
+ elem.className = elem.cacheClassElem;
+ if (elem.cacheClassTarget)
+ target.className = elem.cacheClassTarget;
+ }
+
+ /*]]>*///-->
+ </script>
</head>
<body>
<div id="content">
-<h1 class="title">About/FAQ</h1>
-<div id="table-of-contents">
-<h2>Table of Contents</h2>
-<div id="text-table-of-contents">
-<ul>
-<li><a href="#org0db9061">1. What is the 'public sequence resource' about?</a></li>
-<li><a href="#org983877d">2. Who created the public sequence resource?</a></li>
-<li><a href="#org83093c3">3. How does the public sequence resource compare to other data resources?</a></li>
-<li><a href="#org9b31fd4">4. Why should I upload my data here?</a></li>
-<li><a href="#org4e92cb5">5. Why should I not upload by data here?</a></li>
-<li><a href="#orgdfe72f6">6. How does the public sequence resource work?</a></li>
-<li><a href="#orgd0c5abb">7. Who uses the public sequence resource?</a></li>
-<li><a href="#org56f4a54">8. How can I contribute?</a></li>
-<li><a href="#org2240ef7">9. Is this about open data?</a></li>
-<li><a href="#orgbb655e0">10. Is this about free software?</a></li>
-<li><a href="#org4e779f4">11. How do I upload raw data?</a></li>
-<li><a href="#org83f6b7b">12. How do I change metadata?</a></li>
-<li><a href="#org1bc6dab">13. How do I change the work flows?</a></li>
-<li><a href="#org1140d62">14. How do I change the source code?</a></li>
-<li><a href="#orge182714">15. Should I choose CC-BY or CC0?</a></li>
-<li><a href="#orgf4a692b">16. How do I deal with private data and privacy?</a></li>
-<li><a href="#org7757574">17. How do I communicate with you?</a></li>
-<li><a href="#org194006f">18. Who are the sponsors?</a></li>
-</ul>
-</div>
-</div>
-
-<div id="outline-container-org0db9061" class="outline-2">
-<h2 id="org0db9061"><span class="section-number-2">1</span> What is the 'public sequence resource' about?</h2>
-<div class="outline-text-2" id="text-1">
-<p>
-The <b>public sequence resource</b> aims to provide a generic and useful
-resource for COVID-19 research. The focus is on providing the best
-possible sequence data with associated metadata that can be used for
-sequence comparison and protein prediction.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org983877d" class="outline-2">
-<h2 id="org983877d"><span class="section-number-2">2</span> Who created the public sequence resource?</h2>
-<div class="outline-text-2" id="text-2">
-<p>
-The <b>public sequence resource</b> is an initiative by <a href="https://github.com/arvados/bh20-seq-resource/graphs/contributors">bioinformatics</a> and
-ontology experts who want to create something agile and useful for the
-wider research community. The initiative started at the COVID-19
-biohackathon in April 2020 and is ongoing. The main project drivers
-are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino
-(University of Rome Tor Vergata), Michael Crusoe (Common Workflow
-Language), Thomas Liener (consultant, formerly EBI), Erik Garrison
-(UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics).
-</p>
-
-<p>
-Notably, as this is a free software initiative, the project represents
-major work by hundreds of software developers and ontology and data
-wrangling experts. Thank you everyone!
-</p>
-</div>
-</div>
-
-<div id="outline-container-org83093c3" class="outline-2">
-<h2 id="org83093c3"><span class="section-number-2">3</span> How does the public sequence resource compare to other data resources?</h2>
-<div class="outline-text-2" id="text-3">
-<p>
-The short version is that we use state-of-the-art practices in
-bioinformatics using agile methods. Unlike the resources from large
-institutes we can improve things on a dime and anyone can contribute
-to building out this resource! Sequences from GenBank, EBI/ENA and
-others are regularly added to PubSeq. We encourage people to everyone
-to submit on PubSeq because of its superior live tooling and metadata
-support (see the next question).
-</p>
-
-<p>
-Importantly: all data is published under either the <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons
-4.0 attribution license</a> or the <a href="https://creativecommons.org/share-your-work/public-domain/cc0/">CC0 “No Rights Reserved” license</a> which
-means it data can be published and workflows can run in public
-environments allowing for improved access for research and
-reproducible results. This contrasts with some other public resources,
-such as GISAID.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org9b31fd4" class="outline-2">
-<h2 id="org9b31fd4"><span class="section-number-2">4</span> Why should I upload my data here?</h2>
-<div class="outline-text-2" id="text-4">
-<ol class="org-ol">
-<li>We champion truly shareable data without licensing restrictions - with proper
-attribution</li>
-<li>We provide full metadata support using state-of-the-art ontology's</li>
-<li>We provide a web-based sequence uploader and a command-line version
-for bulk uploads</li>
-<li>We provide a live SPARQL end-point for all metadata</li>
-<li>We provide free data analysis and sequence comparison triggered on data upload</li>
-<li>We do real work for you, with this <a href="https://workbench.lugli.arvadosapi.com/container_requests/lugli-xvhdp-bhhk4nxx1lch5od">link</a> you can see the last
-run took 5.5 hours!</li>
-<li>We provide free downloads of all computed output</li>
-<li>There is no need to set up pipelines and/or compute clusters</li>
-<li>All workflows get triggered on uploading a new sequence</li>
-<li>When someone (you?) improves the software/workflows and everyone benefits</li>
-<li>Your data gets automatically integrated with the Swiss Institure of
-Bioinformatics COVID-19 knowledge base
-<a href="https://covid-19-sparql.expasy.org/">https://covid-19-sparql.expasy.org/</a> (Elixir Switzerland)</li>
-<li>Your data will be used to develop drug targets</li>
-</ol>
-
-<p>
-Finally, if you upload your data here we have workflows that output
-formatted data suitable for <a href="http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part6">uploading to EBI resources</a> (and soon
-others). Uploading your data here get your data ready for upload to
-multiple resources.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org4e92cb5" class="outline-2">
-<h2 id="org4e92cb5"><span class="section-number-2">5</span> Why should I not upload by data here?</h2>
-<div class="outline-text-2" id="text-5">
-<p>
-Funny question. There are only good reasons to upload your data here
-and make it available to the widest audience possible.
-</p>
-
-<p>
-In fact, you can upload your data here as well as to other
-resources. It is your data after all. No one can prevent you from
-uploading your data to multiple resources.
-</p>
-
-<p>
-We recommend uploading to EBI and NCBI resources using our data
-conversion tools. It means you only enter data once and make the
-process smooth. You can also use our command line data uploader
-for bulk uploads!
-</p>
-</div>
-</div>
-
-<div id="outline-container-orgdfe72f6" class="outline-2">
-<h2 id="orgdfe72f6"><span class="section-number-2">6</span> How does the public sequence resource work?</h2>
-<div class="outline-text-2" id="text-6">
-<p>
-On uploading a sequence with metadata it will automatically be
-processed and incorporated into the public pangenome with metadata
-using workflows from the High Performance Open Biology Lab defined
-<a href="https://github.com/hpobio-lab/viral-analysis/tree/master/cwl/pangenome-generate">here</a>.
-</p>
-</div>
-</div>
-
-<div id="outline-container-orgd0c5abb" class="outline-2">
-<h2 id="orgd0c5abb"><span class="section-number-2">7</span> Who uses the public sequence resource?</h2>
-<div class="outline-text-2" id="text-7">
-<p>
-The Swiss Institute of Bioinformatics has included this data in
-<a href="https://covid-19-sparql.expasy.org/">https://covid-19-sparql.expasy.org/</a> and made it part of <a href="https://www.uniprot.org/">Uniprot</a>.
-</p>
-
-<p>
-The Pantograph <a href="https://graph-genome.github.io/">viewer</a> uses PubSeq data for their visualisations.
-</p>
-
-<p>
-<a href="https://uthsc.edu">UTHSC</a> (USA), <a href="https://www.esr.cri.nz/">ESR</a> (New Zealand) and <a href="https://www.ornl.gov/news/ornl-fight-against-covid-19">ORNL</a> (USA) use COVID-19 PubSeq data
-for monitoring, protein prediction and drug development.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org56f4a54" class="outline-2">
-<h2 id="org56f4a54"><span class="section-number-2">8</span> How can I contribute?</h2>
-<div class="outline-text-2" id="text-8">
-<p>
-You can contribute by submitting sequences, updating metadata, submit
-issues on our issue tracker, and more importantly add functionality.
-See 'How do I change the source code' below. Read through our online
-documentation at <a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a> as a starting
-point.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org2240ef7" class="outline-2">
-<h2 id="org2240ef7"><span class="section-number-2">9</span> Is this about open data?</h2>
-<div class="outline-text-2" id="text-9">
-<p>
-All data is published under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons 4.0 attribution license</a>
-(CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA)
-data and store it for further processing.
-</p>
-</div>
-</div>
-
-<div id="outline-container-orgbb655e0" class="outline-2">
-<h2 id="orgbb655e0"><span class="section-number-2">10</span> Is this about free software?</h2>
-<div class="outline-text-2" id="text-10">
-<p>
-Absolutely. Free software allows for fully reproducible pipelines. You
-can take our workflows and data and run it elsewhere!
-</p>
-</div>
-</div>
-
-<div id="outline-container-org4e779f4" class="outline-2">
-<h2 id="org4e779f4"><span class="section-number-2">11</span> How do I upload raw data?</h2>
-<div class="outline-text-2" id="text-11">
-<p>
-We are preparing raw sequence data pipelines (fastq and BAM). The
-reason is that we want the best data possible for downstream analysis
-(including protein prediction and test development). The current
-approach where people publish final sequences of SARS-CoV-2 is lacking
-because it hides how this sequence was created. For reasons of
-reproducible and improved results we want/need to work with the raw
-sequence reads (both short reads and long reads) and take alternative
-assembly variations into consideration. This is all work in progress.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org83f6b7b" class="outline-2">
-<h2 id="org83f6b7b"><span class="section-number-2">12</span> How do I change metadata?</h2>
-<div class="outline-text-2" id="text-12">
-<p>
-See the <a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a>!
-</p>
-</div>
-</div>
-
-<div id="outline-container-org1bc6dab" class="outline-2">
-<h2 id="org1bc6dab"><span class="section-number-2">13</span> How do I change the work flows?</h2>
-<div class="outline-text-2" id="text-13">
-<p>
-Workflows are on <a href="https://github.com/arvados/bh20-seq-resource/tree/master/workflows">github</a> and can be modified. See also the BLOG
-<a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a> on workflows.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org1140d62" class="outline-2">
-<h2 id="org1140d62"><span class="section-number-2">14</span> How do I change the source code?</h2>
-<div class="outline-text-2" id="text-14">
-<p>
-Go to our <a href="https://github.com/arvados/bh20-seq-resource">source code repositories</a>, fork/clone the repository, change
-something and submit a <a href="https://github.com/arvados/bh20-seq-resource/pulls">pull request</a> (PR). That easy! Check out how
-many PRs we already merged.
-</p>
-</div>
-</div>
-
-<div id="outline-container-orge182714" class="outline-2">
-<h2 id="orge182714"><span class="section-number-2">15</span> Should I choose CC-BY or CC0?</h2>
-<div class="outline-text-2" id="text-15">
-<p>
-Restrictive data licenses are hampering data sharing and reproducible
-research. CC0 is the preferred license because it gives researchers
-the most freedom. Since we provide metadata there is no reason for
-others not to honour your work. We also provide CC-BY as an option
-because we know people like the attribution clause.
-</p>
-
-<p>
-In all honesty: we prefer both data and software to be free.
-</p>
-</div>
-</div>
-
-<div id="outline-container-orgf4a692b" class="outline-2">
-<h2 id="orgf4a692b"><span class="section-number-2">16</span> How do I deal with private data and privacy?</h2>
-<div class="outline-text-2" id="text-16">
-<p>
-A public sequence resource is about public data. Metadata can refer to
-private data. You can use your own (anonymous) identifiers. We also
-plan to combine identifiers with clinical data stored securely at
-<a href="https://redcap-covid19.elixir-luxembourg.org/redcap/">REDCap</a>. See the relevant <a href="https://github.com/arvados/bh20-seq-resource/issues/21">tracker</a> for more information and contributing.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org7757574" class="outline-2">
-<h2 id="org7757574"><span class="section-number-2">17</span> How do I communicate with you?</h2>
-<div class="outline-text-2" id="text-17">
-<p>
-We use a <a href="https://gitter.im/arvados/pubseq?utm_source=share-link&amp;utm_medium=link&amp;utm_campaign=share-link">gitter channel</a> you can join.
-</p>
-</div>
-</div>
-
-<div id="outline-container-org194006f" class="outline-2">
-<h2 id="org194006f"><span class="section-number-2">18</span> Who are the sponsors?</h2>
-<div class="outline-text-2" id="text-18">
-<p>
-The main sponsors are listed in the footer. In addition to the time
-generously donated by many contributors we also acknowledge Amazon AWS
-for donating COVID-19 related compute time.
-</p>
-</div>
-</div>
+ <h1 class="title">About/FAQ</h1>
+ <div id="table-of-contents">
+ <h2>Table of Contents</h2>
+ <div id="text-table-of-contents">
+ <ul>
+ <li><a href="#org0db9061">1. What is the 'public sequence resource' about?</a></li>
+ <li><a href="#org983877d">2. Who created the public sequence resource?</a></li>
+ <li><a href="#org83093c3">3. How does the public sequence resource compare to other data resources?</a>
+ </li>
+ <li><a href="#org9b31fd4">4. Why should I upload my data here?</a></li>
+ <li><a href="#org4e92cb5">5. Why should I not upload my data here?</a></li>
+ <li><a href="#orgdfe72f6">6. How does the public sequence resource work?</a></li>
+ <li><a href="#orgd0c5abb">7. Who uses the public sequence resource?</a></li>
+ <li><a href="#org56f4a54">8. How can I contribute?</a></li>
+ <li><a href="#org2240ef7">9. Is this about open data?</a></li>
+ <li><a href="#orgbb655e0">10. Is this about free software?</a></li>
+ <li><a href="#org4e779f4">11. How do I upload raw data?</a></li>
+ <li><a href="#org83f6b7b">12. How do I change metadata?</a></li>
+ <li><a href="#org1bc6dab">13. How do I change the work flows?</a></li>
+ <li><a href="#org1140d62">14. How do I change the source code?</a></li>
+ <li><a href="#orge182714">15. Should I choose CC-BY or CC0?</a></li>
+ <li><a href="#orgf4a692b">16. How do I deal with private data and privacy?</a></li>
+ <li><a href="#org7757574">17. How do I communicate with you?</a></li>
+ <li><a href="#org194006f">18. Who are the sponsors?</a></li>
+ </ul>
+ </div>
+ </div>
+
+ <div id="outline-container-org0db9061" class="outline-2">
+ <h2 id="org0db9061"><span class="section-number-2">1</span> What is the 'public sequence resource' about?</h2>
+ <div class="outline-text-2" id="text-1">
+ <p>
+ The <b>public sequence resource</b> aims to provide a generic and useful
+ resource for COVID-19 research. The focus is on providing the best
+ possible sequence data with associated metadata that can be used for
+ sequence comparison and protein prediction.
+ </p>
+ <p>
+ We were at the <strong>Bioinformatics Community Conference 2020</strong>! Have a look at the
+ <a href="https://bcc2020.sched.com/event/coLw">video talk</a>
+ (<a href="https://drive.google.com/file/d/1skXHwVKM_gl73-_4giYIOQ1IlC5X5uBo/view?usp=sharing">alternative link</a>)
+ and the <a href="https://drive.google.com/file/d/1vyEgfvSqhM9yIwWZ6Iys-QxhxtVxPSdp/view?usp=sharing">poster</a>.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org983877d" class="outline-2">
+ <h2 id="org983877d"><span class="section-number-2">2</span> Who created the public sequence resource?</h2>
+ <div class="outline-text-2" id="text-2">
+ <p>
+ The <b>public sequence resource</b> is an initiative by <a
+ href="https://github.com/arvados/bh20-seq-resource/graphs/contributors">bioinformatics</a> and
+ ontology experts who want to create something agile and useful for the
+ wider research community. The initiative started at the COVID-19
+ biohackathon in April 2020 and is ongoing. The main project drivers
+ are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino
+ (University of Rome Tor Vergata), Michael Crusoe (Common Workflow
+ Language), Thomas Liener (consultant, formerly EBI), Erik Garrison
+ (UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics).
+ </p>
+
+ <p>
+ Notably, as this is a free software initiative, the project represents
+ major work by hundreds of software developers and ontology and data
+ wrangling experts. Thank you everyone!
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org83093c3" class="outline-2">
+ <h2 id="org83093c3"><span class="section-number-2">3</span> How does the public sequence resource compare to
+ other data resources?</h2>
+ <div class="outline-text-2" id="text-3">
+ <p>
+ The short version is that we use state-of-the-art practices in
+ bioinformatics using agile methods. Unlike the resources from large
+ institutes we can improve things on a dime and anyone can contribute
+ to building out this resource! Sequences from GenBank, EBI/ENA and
+ others are regularly added to PubSeq. We encourage everyone
+ to submit to PubSeq because of its superior live tooling and metadata
+ support (see the next question).
+ </p>
+
+ <p>
+ Importantly: all data is published under either the <a
+ href="https://creativecommons.org/licenses/by/4.0/">Creative Commons
+ 4.0 attribution license</a> or the <a
+ href="https://creativecommons.org/share-your-work/public-domain/cc0/">CC0 “No Rights Reserved”
+ license</a> which
+ means the data can be published and workflows can run in public
+ environments allowing for improved access for research and
+ reproducible results. This contrasts with some other public resources,
+ such as GISAID.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org9b31fd4" class="outline-2">
+ <h2 id="org9b31fd4"><span class="section-number-2">4</span> Why should I upload my data here?</h2>
+ <div class="outline-text-2" id="text-4">
+ <ol class="org-ol">
+ <li>We champion truly shareable data without licensing restrictions - with proper
+ attribution
+ </li>
+ <li>We provide full metadata support using state-of-the-art ontologies</li>
+ <li>We provide a web-based sequence uploader and a command-line version
+ for bulk uploads
+ </li>
+ <li>We provide a live SPARQL end-point for all metadata</li>
+ <li>We provide free data analysis and sequence comparison triggered on data upload</li>
+ <li>We do real work for you, with this <a
+ href="https://workbench.lugli.arvadosapi.com/container_requests/lugli-xvhdp-bhhk4nxx1lch5od">link</a>
+ you can see the last
+ run took 5.5 hours!
+ </li>
+ <li>We provide free downloads of all computed output</li>
+ <li>There is no need to set up pipelines and/or compute clusters</li>
+ <li>All workflows get triggered on uploading a new sequence</li>
+ <li>When someone (you?) improves the software/workflows, everyone benefits</li>
+ <li>Your data gets automatically integrated with the Swiss Institute of
+ Bioinformatics COVID-19 knowledge base
+ <a href="https://covid-19-sparql.expasy.org/">https://covid-19-sparql.expasy.org/</a> (Elixir
+ Switzerland)
+ </li>
+ <li>Your data will be used to develop drug targets</li>
+ </ol>
+
+ <p>
+ Finally, if you upload your data here we have workflows that output
+ formatted data suitable for <a
+ href="http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part6">uploading to EBI
+ resources</a> (and soon
+ others). Uploading your data here gets your data ready for upload to
+ multiple resources.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org4e92cb5" class="outline-2">
+ <h2 id="org4e92cb5"><span class="section-number-2">5</span> Why should I not upload my data here?</h2>
+ <div class="outline-text-2" id="text-5">
+ <p>
+ Funny question. There are only good reasons to upload your data here
+ and make it available to the widest audience possible.
+ </p>
+
+ <p>
+ In fact, you can upload your data here as well as to other
+ resources. It is your data after all. No one can prevent you from
+ uploading your data to multiple resources.
+ </p>
+
+ <p>
+ We recommend uploading to EBI and NCBI resources using our data
+ conversion tools. That way you enter your data only once, which keeps the
+ process smooth. You can also use our command line data uploader
+ for bulk uploads!
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-orgdfe72f6" class="outline-2">
+ <h2 id="orgdfe72f6"><span class="section-number-2">6</span> How does the public sequence resource work?</h2>
+ <div class="outline-text-2" id="text-6">
+ <p>
+ On uploading a sequence with metadata it will automatically be
+ processed and incorporated into the public pangenome with metadata
+ using workflows from the High Performance Open Biology Lab defined
+ <a href="https://github.com/hpobio-lab/viral-analysis/tree/master/cwl/pangenome-generate">here</a>.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-orgd0c5abb" class="outline-2">
+ <h2 id="orgd0c5abb"><span class="section-number-2">7</span> Who uses the public sequence resource?</h2>
+ <div class="outline-text-2" id="text-7">
+ <p>
+ The Swiss Institute of Bioinformatics has included this data in
+ <a href="https://covid-19-sparql.expasy.org/">https://covid-19-sparql.expasy.org/</a> and made it part
+ of <a href="https://www.uniprot.org/">Uniprot</a>.
+ </p>
+
+ <p>
+ The Pantograph <a href="https://graph-genome.github.io/">viewer</a> uses PubSeq data for their
+ visualisations.
+ </p>
+
+ <p>
+ <a href="https://uthsc.edu">UTHSC</a> (USA), <a href="https://www.esr.cri.nz/">ESR</a> (New Zealand) and
+ <a href="https://www.ornl.gov/news/ornl-fight-against-covid-19">ORNL</a> (USA) use COVID-19 PubSeq data
+ for monitoring, protein prediction and drug development.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org56f4a54" class="outline-2">
+ <h2 id="org56f4a54"><span class="section-number-2">8</span> How can I contribute?</h2>
+ <div class="outline-text-2" id="text-8">
+ <p>
+ You can contribute by submitting sequences, updating metadata, submitting
+ issues on our issue tracker and, more importantly, adding functionality.
+ See 'How do I change the source code' below. Read through our online
+ documentation at <a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a>
+ as a starting
+ point.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org2240ef7" class="outline-2">
+ <h2 id="org2240ef7"><span class="section-number-2">9</span> Is this about open data?</h2>
+ <div class="outline-text-2" id="text-9">
+ <p>
+ All data is published under a <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons
+ 4.0 attribution license</a>
+ (CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA)
+ data and store it for further processing.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-orgbb655e0" class="outline-2">
+ <h2 id="orgbb655e0"><span class="section-number-2">10</span> Is this about free software?</h2>
+ <div class="outline-text-2" id="text-10">
+ <p>
+ Absolutely. Free software allows for fully reproducible pipelines. You
+ can take our workflows and data and run them elsewhere!
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org4e779f4" class="outline-2">
+ <h2 id="org4e779f4"><span class="section-number-2">11</span> How do I upload raw data?</h2>
+ <div class="outline-text-2" id="text-11">
+ <p>
+ We are preparing raw sequence data pipelines (fastq and BAM). The
+ reason is that we want the best data possible for downstream analysis
+ (including protein prediction and test development). The current
+ approach where people publish final sequences of SARS-CoV-2 is lacking
+ because it hides how this sequence was created. For reasons of
+ reproducible and improved results we want/need to work with the raw
+ sequence reads (both short reads and long reads) and take alternative
+ assembly variations into consideration. This is all work in progress.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org83f6b7b" class="outline-2">
+ <h2 id="org83f6b7b"><span class="section-number-2">12</span> How do I change metadata?</h2>
+ <div class="outline-text-2" id="text-12">
+ <p>
+ See the <a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a>!
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org1bc6dab" class="outline-2">
+ <h2 id="org1bc6dab"><span class="section-number-2">13</span> How do I change the work flows?</h2>
+ <div class="outline-text-2" id="text-13">
+ <p>
+ Workflows are on <a href="https://github.com/arvados/bh20-seq-resource/tree/master/workflows">github</a>
+ and can be modified. See also the BLOG
+ <a href="http://covid19.genenetwork.org/blog">http://covid19.genenetwork.org/blog</a> on workflows.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org1140d62" class="outline-2">
+ <h2 id="org1140d62"><span class="section-number-2">14</span> How do I change the source code?</h2>
+ <div class="outline-text-2" id="text-14">
+ <p>
+ Go to our <a href="https://github.com/arvados/bh20-seq-resource">source code repositories</a>,
+ fork/clone the repository, change
+ something and submit a <a href="https://github.com/arvados/bh20-seq-resource/pulls">pull request</a>
+ (PR). It is that easy! Check out how
+ many PRs we already merged.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-orge182714" class="outline-2">
+ <h2 id="orge182714"><span class="section-number-2">15</span> Should I choose CC-BY or CC0?</h2>
+ <div class="outline-text-2" id="text-15">
+ <p>
+ Restrictive data licenses are hampering data sharing and reproducible
+ research. CC0 is the preferred license because it gives researchers
+ the most freedom. Since we provide metadata there is no reason for
+ others not to honour your work. We also provide CC-BY as an option
+ because we know people like the attribution clause.
+ </p>
+
+ <p>
+ In all honesty: we prefer both data and software to be free.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-orgf4a692b" class="outline-2">
+ <h2 id="orgf4a692b"><span class="section-number-2">16</span> How do I deal with private data and privacy?</h2>
+ <div class="outline-text-2" id="text-16">
+ <p>
+ A public sequence resource is about public data. Metadata can refer to
+ private data. You can use your own (anonymous) identifiers. We also
+ plan to combine identifiers with clinical data stored securely at
+ <a href="https://redcap-covid19.elixir-luxembourg.org/redcap/">REDCap</a>. See the relevant <a
+ href="https://github.com/arvados/bh20-seq-resource/issues/21">tracker</a> for more information and
+ contributing.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org7757574" class="outline-2">
+ <h2 id="org7757574"><span class="section-number-2">17</span> How do I communicate with you?</h2>
+ <div class="outline-text-2" id="text-17">
+ <p>
+ We use a <a
+ href="https://gitter.im/arvados/pubseq?utm_source=share-link&amp;utm_medium=link&amp;utm_campaign=share-link">gitter
+ channel</a> you can join.
+ </p>
+ </div>
+ </div>
+
+ <div id="outline-container-org194006f" class="outline-2">
+ <h2 id="org194006f"><span class="section-number-2">18</span> Who are the sponsors?</h2>
+ <div class="outline-text-2" id="text-18">
+ <p>
+ The main sponsors are listed in the footer. In addition to the time
+ generously donated by many contributors we also acknowledge Amazon AWS
+ for donating COVID-19 related compute time.
+ </p>
+ </div>
+ </div>
</div>
<div id="postamble" class="status">
-<hr><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-07-18 Sat 03:27</small>.
+ <hr>
+ <small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs
+ org-mode and a healthy dose of Lisp!<br/>Modified 2020-07-18 Sat 03:27</small>.
</div>
</body>
</html>
diff --git a/doc/web/about.org b/doc/web/about.org
index 39fb667..29a80bf 100644
--- a/doc/web/about.org
+++ b/doc/web/about.org
@@ -17,7 +17,10 @@
- [[#how-do-i-change-the-work-flows][How do I change the work flows?]]
- [[#how-do-i-change-the-source-code][How do I change the source code?]]
- [[#should-i-choose-cc-by-or-cc0][Should I choose CC-BY or CC0?]]
+ - [[#are-there-also-variant-in-the-RDF-databases][Are there also variants in the RDF databases?]]
- [[#how-do-i-deal-with-private-data-and-privacy][How do I deal with private data and privacy?]]
+ - [[#do-you-have-any-checks-or-concerns-if-human-sequence-accidentally-submitted-to-your-service-as-part-of-a-fastq][Do you have any checks or concerns if human sequence is accidentally submitted to your service as part of a FASTQ file?]]
+ - [[#does-PubSeq-support-only-SARS-CoV-2-data][Does PubSeq support only SARS-CoV-2 data?]]
- [[#how-do-i-communicate-with-you][How do I communicate with you?]]
- [[#who-are-the-sponsors][Who are the sponsors?]]
@@ -28,6 +31,8 @@ resource for COVID-19 research. The focus is on providing the best
possible sequence data with associated metadata that can be used for
sequence comparison and protein prediction.
+We were at the *Bioinformatics Community Conference 2020*! Have a look at the [[https://bcc2020.sched.com/event/coLw][video talk]] ([[https://drive.google.com/file/d/1skXHwVKM_gl73-_4giYIOQ1IlC5X5uBo/view?usp=sharing][alternative link]]) and the [[https://drive.google.com/file/d/1vyEgfvSqhM9yIwWZ6Iys-QxhxtVxPSdp/view?usp=sharing][poster]].
+
* Who created the public sequence resource?
The *public sequence resource* is an initiative by [[https://github.com/arvados/bh20-seq-resource/graphs/contributors][bioinformatics]] and
@@ -171,6 +176,12 @@ because we know people like the attribution clause.
In all honesty: we prefer both data and software to be free.
+* Are there also variants in the RDF databases?
+
+We output an RDF file with the pangenome built in. The variants are encoded implicitly in that graph, so you can recover them by parsing the file.
+
+We are also writing tools to generate VCF files directly from the pangenome.
+
* How do I deal with private data and privacy?
A public sequence resource is about public data. Metadata can refer to
@@ -178,6 +189,15 @@ private data. You can use your own (anonymous) identifiers. We also
plan to combine identifiers with clinical data stored securely at
[[https://redcap-covid19.elixir-luxembourg.org/redcap/][REDCap]]. See the relevant [[https://github.com/arvados/bh20-seq-resource/issues/21][tracker]] for more information and contributing.
+* Do you have any checks or concerns if human sequence is accidentally submitted to your service as part of a FASTQ file?
+
+We are planning to remove reads that match the human reference.
+
+* Does PubSeq support only SARS-CoV-2 data?
+
+To date, PubSeq is a resource specific to SARS-CoV-2, but we are designing it to be able to support other species in the future.
+
+
* How do I communicate with you?
We use a [[https://gitter.im/arvados/pubseq?utm_source=share-link&utm_medium=link&utm_campaign=share-link][gitter channel]] you can join.
diff --git a/image/homepage.png b/image/homepage.png
new file mode 100644
index 0000000..f66f9fd
--- /dev/null
+++ b/image/homepage.png
Binary files differ
diff --git a/image/website.png b/image/website.png
deleted file mode 100644
index fa57ca5..0000000
--- a/image/website.png
+++ /dev/null
Binary files differ
diff --git a/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
new file mode 100644
index 0000000..2459ce7
--- /dev/null
+++ b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
@@ -0,0 +1,29 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  inputGFA: File
+outputs:
+  odgiGraph:
+    type: File
+    outputBinding:
+      glob: $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+hints:
+  DockerRequirement:
+    dockerPull: "quay.io/biocontainers/odgi:v0.3--py37h8b12597_0"
+  ResourceRequirement:
+    coresMin: 4
+    ramMin: $(7 * 1024)
+    outdirMin: $(Math.ceil((inputs.inputGFA.size/(1024*1024*1024)+1) * 2))
+  InitialWorkDirRequirement:
+    listing:
+      - entry: $(inputs.inputGFA)
+        writable: true
+arguments: [odgi, build, -g, $(inputs.inputGFA), -o, -,
+            {shellQuote: false, valueFrom: "|"},
+            odgi, unchop, -i, -, -o, -,
+            {shellQuote: false, valueFrom: "|"},
+            odgi, sort, -i, -, -p, s, -o, $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+            ]
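
The `outdirMin` expression in the tool above scales the reserved output space with the size of the input GFA. A rough Python equivalent of that JavaScript expression, useful for checking the arithmetic by hand, is sketched below. The function name `odgi_outdir_min` and the sample sizes are assumptions made for illustration. Note that the CWL v1.1 specification documents `outdirMin` in mebibytes, so the resulting number is small in absolute terms.

```python
import math

GIB = 1024 * 1024 * 1024


def odgi_outdir_min(input_size_bytes):
    """Mirror of $(Math.ceil((inputs.inputGFA.size/(1024*1024*1024)+1) * 2)).

    Hypothetical helper: takes the input size in bytes and returns the
    value the CWL expression would evaluate to.
    """
    return math.ceil((input_size_bytes / GIB + 1) * 2)


small = odgi_outdir_min(30 * 1024 * 1024)  # a 30 MiB GFA
large = odgi_outdir_min(3 * GIB)           # a 3 GiB GFA
print(small, large)
```

For inputs well below 1 GiB the size term contributes almost nothing, so the constant `+1` dominates the result.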
diff --git a/workflows/pangenome-generate/pangenome-generate_spoa.cwl b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
new file mode 100644
index 0000000..958ffb6
--- /dev/null
+++ b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
@@ -0,0 +1,122 @@
+#!/usr/bin/env cwl-runner
+cwlVersion: v1.1
+class: Workflow
+requirements:
+  ScatterFeatureRequirement: {}
+  StepInputExpressionRequirement: {}
+inputs:
+  inputReads: File[]
+  metadata: File[]
+  metadataSchema: File
+  subjects: string[]
+  exclude: File?
+  bin_widths:
+    type: int[]
+    default: [ 1, 4, 16, 64, 256, 1000, 4000, 16000]
+    doc: width of each bin in basepairs along the graph vector
+  cells_per_file:
+    type: int
+    default: 100
+    doc: Cells per file on component_segmentation
+outputs:
+  odgiGraph:
+    type: File
+    outputSource: buildGraph/odgiGraph
+  odgiPNG:
+    type: File
+    outputSource: vizGraph/graph_image
+  spoaGFA:
+    type: File
+    outputSource: induceGraph/spoaGFA
+  odgiRDF:
+    type: File
+    outputSource: odgi2rdf/rdf
+  readsMergeDedup:
+    type: File
+    outputSource: dedup/reads_dedup
+  mergedMetadata:
+    type: File
+    outputSource: mergeMetadata/merged
+  indexed_paths:
+    type: File
+    outputSource: index_paths/indexed_paths
+  colinear_components:
+    type: Directory
+    outputSource: segment_components/colinear_components
+steps:
+  relabel:
+    in:
+      readsFA: inputReads
+      subjects: subjects
+      exclude: exclude
+    out: [relabeledSeqs, originalLabels]
+    run: relabel-seqs.cwl
+  dedup:
+    in: {reads: relabel/relabeledSeqs}
+    out: [reads_dedup, dups]
+    run: ../tools/seqkit/seqkit_rmdup.cwl
+  sort_by_quality_and_len:
+    in: {readsFA: dedup/reads_dedup}
+    out: [sortedReadsFA]
+    run: sort_fasta_by_quality_and_len.cwl
+  induceGraph:
+    in:
+      readsFA: sort_by_quality_and_len/sortedReadsFA
+    out: [spoaGFA]
+    run: spoa.cwl
+  buildGraph:
+    in: {inputGFA: induceGraph/spoaGFA}
+    out: [odgiGraph]
+    run: odgi-build-from-spoa-gfa.cwl
+  vizGraph:
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+      width:
+        default: 50000
+      height:
+        default: 500
+      path_per_row:
+        default: true
+      path_height:
+        default: 4
+    out: [graph_image]
+    run: ../tools/odgi/odgi_viz.cwl
+  odgi2rdf:
+    in: {odgi: buildGraph/odgiGraph}
+    out: [rdf]
+    run: odgi_to_rdf.cwl
+  mergeMetadata:
+    in:
+      metadata: metadata
+      metadataSchema: metadataSchema
+      subjects: subjects
+      dups: dedup/dups
+      originalLabels: relabel/originalLabels
+    out: [merged]
+    run: merge-metadata.cwl
+  bin_paths:
+    run: ../tools/odgi/odgi_bin.cwl
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+      bin_width: bin_widths
+    scatter: bin_width
+    out: [ bins, pangenome_sequence ]
+  index_paths:
+    label: Create path index
+    run: ../tools/odgi/odgi_pathindex.cwl
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+    out: [ indexed_paths ]
+  segment_components:
+    label: Run component segmentation
+    run: ../tools/graph-genome-segmentation/component_segmentation.cwl
+    in:
+      bins: bin_paths/bins
+      cells_per_file: cells_per_file
+      pangenome_sequence:
+        source: bin_paths/pangenome_sequence
+        valueFrom: $(self[0])
+        # the bin_paths step is scattered over the bin_width array, but always using the same sparse_graph_index
+        # the pangenome_sequence that is extracted is exactly the same for the same sparse_graph_index
+        # regardless of bin_width, so we take the first pangenome_sequence as input for this step
+    out: [ colinear_components ]
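
The comment in `segment_components` relies on how `scatter` behaves: the `bin_paths` step runs once per element of `bin_widths`, and each of its outputs becomes an array with one entry per run. The small Python model below illustrates that behaviour. The `fake_odgi_bin` function, its file names, and the `scatter` helper are stand-ins invented for this sketch, not the real tool or the CWL runner.

```python
def fake_odgi_bin(graph, bin_width):
    """Stand-in for one run of the scattered bin_paths step (hypothetical)."""
    return {
        "bins": f"{graph}.w{bin_width}.json",
        # Depends only on the graph, not on bin_width.
        "pangenome_sequence": f"{graph}.pangenome.fa",
    }


def scatter(step, graph, bin_widths):
    """Run `step` once per bin width and collect each output into a list."""
    runs = [step(graph, width) for width in bin_widths]
    return {name: [run[name] for run in runs] for name in runs[0]}


outputs = scatter(fake_odgi_bin, "graph.odgi", [1, 4, 16])

# Equivalent of `valueFrom: $(self[0])`: all entries are identical, so take the first.
pangenome_sequence = outputs["pangenome_sequence"][0]
print(outputs["bins"])
print(pangenome_sequence)
```

Under this model `bins` stays an array with one file per bin width, while `pangenome_sequence` collapses to a single file before it reaches the segmentation step.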
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
new file mode 100644
index 0000000..59f027e
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
@@ -0,0 +1,18 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  readsFA:
+    type: File
+    inputBinding: {position: 2}
+  script:
+    type: File
+    inputBinding: {position: 1}
+    default: {class: File, location: sort_fasta_by_quality_and_len.py}
+stdout: $(inputs.readsFA.nameroot).sorted_by_quality_and_len.fasta
+outputs:
+  sortedReadsFA:
+    type: stdout
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+baseCommand: [python]
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
new file mode 100644
index 0000000..e48fd68
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+# Sort the sequences by quality (fraction of called, i.e. non-N, bases, descending) and by length (descending).
+# The best sequence is the longest one, with no uncalled bases.
+
+import os
+import sys
+import gzip
+
+def open_gzipsafe(path_file):
+    if path_file.endswith('.gz'):
+        return gzip.open(path_file, 'rt')
+    else:
+        return open(path_file)
+
+path_fasta = sys.argv[1]
+
+header_to_seq_dict = {}
+header_percCalledBases_seqLength_list = []
+
+with open_gzipsafe(path_fasta) as f:
+    for fasta in f.read().strip('\n>').split('>'):
+        header = fasta.strip('\n').split('\n')[0]
+
+        header_to_seq_dict[
+            header
+        ] = ''.join(fasta.strip('\n').split('\n')[1:])
+
+        seq_len = len(header_to_seq_dict[header])
+        header_percCalledBases_seqLength_list.append([
+            header, header_to_seq_dict[header].count('N'), (seq_len - header_to_seq_dict[header].count('N'))/seq_len if seq_len else 0.0, seq_len
+        ])
+
+for header, _numN, _percCalledBases, _seqLength in sorted(header_percCalledBases_seqLength_list, key=lambda x: (x[-2], x[-1]), reverse = True):
+    sys.stdout.write('>{}\n{}\n'.format(header, header_to_seq_dict[header]))
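
The ordering implemented by the script above can be sketched as a pure function, which is easier to check on a toy input. The names `called_fraction` and `sort_records` and the sample records are illustrative assumptions, not part of the commit. The sort key mirrors the script: fraction of called bases first, then length, both descending.

```python
def called_fraction(seq):
    """Fraction of bases that are not 'N'; 0.0 for an empty sequence."""
    return (len(seq) - seq.count('N')) / len(seq) if seq else 0.0


def sort_records(records):
    """Sort (header, sequence) pairs best-first.

    Hypothetical helper: 'best' means the highest fraction of called
    bases, with ties broken in favour of the longer sequence.
    """
    return sorted(
        records,
        key=lambda rec: (called_fraction(rec[1]), len(rec[1])),
        reverse=True,
    )


records = [
    ("short_clean", "ACGT"),
    ("long_clean", "ACGTACGT"),
    ("long_with_n", "ACGTNNNNACGT"),
]
sorted_headers = [header for header, _ in sort_records(records)]
print(sorted_headers)
```

Because quality comes first in the key, the longest record here still sorts last: its uncalled bases outweigh its length.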
diff --git a/workflows/pangenome-generate/spoa.cwl b/workflows/pangenome-generate/spoa.cwl
new file mode 100644
index 0000000..1e390d8
--- /dev/null
+++ b/workflows/pangenome-generate/spoa.cwl
@@ -0,0 +1,27 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  readsFA: File
+stdout: $(inputs.readsFA.nameroot).g6.gfa
+script:
+  type: File
+  default: {class: File, location: relabel-seqs.py}
+outputs:
+  spoaGFA:
+    type: stdout
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+hints:
+  DockerRequirement:
+    dockerPull: "quay.io/biocontainers/spoa:3.0.2--hc9558a2_0"
+  ResourceRequirement:
+    coresMin: 1
+    ramMin: $(15 * 1024)
+    outdirMin: $(Math.ceil(inputs.readsFA.size/(1024*1024*1024) + 20))
+baseCommand: spoa
+arguments: [
+    $(inputs.readsFA),
+    -G,
+    -g, '-6'
+]