diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org
index d2a1cbc..349fd06 100644
--- a/doc/blog/using-covid-19-pubseq-part2.org
+++ b/doc/blog/using-covid-19-pubseq-part2.org
@@ -8,36 +8,13 @@
#+HTML_LINK_HOME: http://covid19.genenetwork.org
#+HTML_HEAD:
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-
* Table of Contents :TOC:noexport:
- [[#finding-output-of-workflows][Finding output of workflows]]
- - [[#introduction][Introduction]]
- [[#the-arvados-file-interface][The Arvados file interface]]
- [[#using-the-arvados-api][Using the Arvados API]]
* Finding output of workflows
-As part of the COVID-19 Biohackathon 2020 we formed a working group to
-create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for
-Corona virus sequences. The general idea is to create a repository
-that has a low barrier to entry for uploading sequence data using best
-practices. I.e., data published with a creative commons 4.0 (CC-4.0)
-license with metadata using state-of-the art standards and, perhaps
-most importantly, providing standardised workflows that get triggered
-on upload, so that results are immediately available in standardised
-data formats.
-
-* Introduction
-
We are using Arvados to run common workflow language (CWL) pipelines.
The most recent output is on display on a [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][web page]] (with time stamp)
and a full list is generated [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. It is nice to start up, but for
@@ -81,4 +58,4 @@ its listed UUID:
: arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5
-* Using the Arvados API
+* TODO Using the Arvados API
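The `arv-get` invocation above fetches a whole collection by UUID; the same collection can be addressed over the Arvados REST API. A minimal sketch, assuming the `lugli` API host shown in the links above and the conventional Arvados `/arvados/v1/` path (the exact path is an assumption, not taken from this document):

```python
# Hypothetical sketch: build the REST URL for a collection on the
# lugli Arvados instance. The /arvados/v1/ prefix is an assumption
# based on the public Arvados API convention; requests need an
# Authorization token header in practice.
API_HOST = "lugli.arvadosapi.com"

def collection_url(uuid):
    # Collections are addressed by UUID, e.g. lugli-4zz18-z513nlpqm03hpca
    return f"https://{API_HOST}/arvados/v1/collections/{uuid}"

print(collection_url("lugli-4zz18-z513nlpqm03hpca"))
```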
diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html
index 91879b0..df4a286 100644
--- a/doc/blog/using-covid-19-pubseq-part3.html
+++ b/doc/blog/using-covid-19-pubseq-part3.html
@@ -625,7 +625,7 @@ The web interface using this exact same script so it should just work
-We also use above script to bulk upload GenBank sequences with a FASTA
+We also use the above script to bulk upload GenBank sequences with a FASTA
and YAML extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org
index 03f37ab..e8fee36 100644
--- a/doc/blog/using-covid-19-pubseq-part3.org
+++ b/doc/blog/using-covid-19-pubseq-part3.org
@@ -234,6 +234,6 @@ The web interface using this exact same script so it should just work
** Example: uploading bulk GenBank sequences
-We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA
+We also use the above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py][FASTA
and YAML]] extractor specific for GenBank. This means that the steps we
took above for uploading a GenBank sequence are already automated.
diff --git a/doc/web/about.html b/doc/web/about.html
index dfd4252..c971a4e 100644
--- a/doc/web/about.html
+++ b/doc/web/about.html
@@ -1,549 +1,964 @@
+ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-
-
-
-
About/FAQ
-
-
-
-
+
+
+
+
About/FAQ
+
+
+
+
-
About/FAQ
-
-
-
-
1 What is the 'public sequence resource' about?
-
-
-The public sequence resource aims to provide a generic and useful
-resource for COVID-19 research. The focus is on providing the best
-possible sequence data with associated metadata that can be used for
-sequence comparison and protein prediction.
-
-
-
-
-
-
2 Who created the public sequence resource?
-
-
-The public sequence resource is an initiative by bioinformatics and
-ontology experts who want to create something agile and useful for the
-wider research community. The initiative started at the COVID-19
-biohackathon in April 2020 and is ongoing. The main project drivers
-are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino
-(University of Rome Tor Vergata), Michael Crusoe (Common Workflow
-Language), Thomas Liener (consultant, formerly EBI), Erik Garrison
-(UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics).
-
-
-
-Notably, as this is a free software initiative, the project represents
-major work by hundreds of software developers and ontology and data
-wrangling experts. Thank you everyone!
-
-
-
-
-
-
3 How does the public sequence resource compare to other data resources?
-
-
-The short version is that we use state-of-the-art practices in
-bioinformatics using agile methods. Unlike the resources from large
-institutes we can improve things on a dime and anyone can contribute
-to building out this resource! Sequences from GenBank, EBI/ENA and
-others are regularly added to PubSeq. We encourage people to everyone
-to submit on PubSeq because of its superior live tooling and metadata
-support (see the next question).
-
-
-
-Importantly: all data is published under either the Creative Commons
-4.0 attribution license or the CC0 “No Rights Reserved” license which
-means it data can be published and workflows can run in public
-environments allowing for improved access for research and
-reproducible results. This contrasts with some other public resources,
-such as GISAID.
-
-
-
-
-
-
4 Why should I upload my data here?
-
-
-- We champion truly shareable data without licensing restrictions - with proper
-attribution
-- We provide full metadata support using state-of-the-art ontology's
-- We provide a web-based sequence uploader and a command-line version
-for bulk uploads
-- We provide a live SPARQL end-point for all metadata
-- We provide free data analysis and sequence comparison triggered on data upload
-- We do real work for you, with this link you can see the last
-run took 5.5 hours!
-- We provide free downloads of all computed output
-- There is no need to set up pipelines and/or compute clusters
-- All workflows get triggered on uploading a new sequence
-- When someone (you?) improves the software/workflows and everyone benefits
-- Your data gets automatically integrated with the Swiss Institure of
-Bioinformatics COVID-19 knowledge base
-https://covid-19-sparql.expasy.org/ (Elixir Switzerland)
-- Your data will be used to develop drug targets
-
-
-
-Finally, if you upload your data here we have workflows that output
-formatted data suitable for uploading to EBI resources (and soon
-others). Uploading your data here get your data ready for upload to
-multiple resources.
-
-
-
-
-
-
5 Why should I not upload by data here?
-
-
-Funny question. There are only good reasons to upload your data here
-and make it available to the widest audience possible.
-
-
-
-In fact, you can upload your data here as well as to other
-resources. It is your data after all. No one can prevent you from
-uploading your data to multiple resources.
-
-
-
-We recommend uploading to EBI and NCBI resources using our data
-conversion tools. It means you only enter data once and make the
-process smooth. You can also use our command line data uploader
-for bulk uploads!
-
-
-
-
-
-
6 How does the public sequence resource work?
-
-
-On uploading a sequence with metadata it will automatically be
-processed and incorporated into the public pangenome with metadata
-using workflows from the High Performance Open Biology Lab defined
-here.
-
-
-
-
-
-
7 Who uses the public sequence resource?
-
-
-The Swiss Institute of Bioinformatics has included this data in
-https://covid-19-sparql.expasy.org/ and made it part of Uniprot.
-
-
-
-The Pantograph viewer uses PubSeq data for their visualisations.
-
-
-
-UTHSC (USA), ESR (New Zealand) and ORNL (USA) use COVID-19 PubSeq data
-for monitoring, protein prediction and drug development.
-
-
-
-
-
-
8 How can I contribute?
-
-
-You can contribute by submitting sequences, updating metadata, submit
-issues on our issue tracker, and more importantly add functionality.
-See 'How do I change the source code' below. Read through our online
-documentation at http://covid19.genenetwork.org/blog as a starting
-point.
-
-
-
-
-
-
9 Is this about open data?
-
-
-All data is published under a Creative Commons 4.0 attribution license
-(CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA)
-data and store it for further processing.
-
-
-
-
-
-
10 Is this about free software?
-
-
-Absolutely. Free software allows for fully reproducible pipelines. You
-can take our workflows and data and run it elsewhere!
-
-
-
-
-
-
11 How do I upload raw data?
-
-
-We are preparing raw sequence data pipelines (fastq and BAM). The
-reason is that we want the best data possible for downstream analysis
-(including protein prediction and test development). The current
-approach where people publish final sequences of SARS-CoV-2 is lacking
-because it hides how this sequence was created. For reasons of
-reproducible and improved results we want/need to work with the raw
-sequence reads (both short reads and long reads) and take alternative
-assembly variations into consideration. This is all work in progress.
-
-
-
-
-
-
12 How do I change metadata?
-
-
-
-
-
-
-
14 How do I change the source code?
-
-
-Go to our source code repositories, fork/clone the repository, change
-something and submit a pull request (PR). That easy! Check out how
-many PRs we already merged.
-
-
-
-
-
-
15 Should I choose CC-BY or CC0?
-
-
-Restrictive data licenses are hampering data sharing and reproducible
-research. CC0 is the preferred license because it gives researchers
-the most freedom. Since we provide metadata there is no reason for
-others not to honour your work. We also provide CC-BY as an option
-because we know people like the attribution clause.
-
-
-
-In all honesty: we prefer both data and software to be free.
-
-
-
-
-
-
16 How do I deal with private data and privacy?
-
-
-A public sequence resource is about public data. Metadata can refer to
-private data. You can use your own (anonymous) identifiers. We also
-plan to combine identifiers with clinical data stored securely at
-REDCap. See the relevant tracker for more information and contributing.
-
-
-
-
-
-
17 How do I communicate with you?
-
-
-We use a gitter channel you can join.
-
-
-
-
-
-
18 Who are the sponsors?
-
-
-The main sponsors are listed in the footer. In addition to the time
-generously donated by many contributors we also acknowledge Amazon AWS
-for donating COVID-19 related compute time.
-
-
-
+
About/FAQ
+
+
+
+
1 What is the 'public sequence resource' about?
+
+
+ The public sequence resource aims to provide a generic and useful
+ resource for COVID-19 research. The focus is on providing the best
+ possible sequence data with associated metadata that can be used for
+ sequence comparison and protein prediction.
+
+
+ We were at the Bioinformatics Community Conference 2020! Have a look at the
+ video talk
+ (alternative link)
+ and the poster.
+
+
+
+
+
+
2 Who created the public sequence resource?
+
+
+ The public sequence resource is an initiative by bioinformatics and
+ ontology experts who want to create something agile and useful for the
+ wider research community. The initiative started at the COVID-19
+ biohackathon in April 2020 and is ongoing. The main project drivers
+ are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino
+ (University of Rome Tor Vergata), Michael Crusoe (Common Workflow
+ Language), Thomas Liener (consultant, formerly EBI), Erik Garrison
+ (UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics).
+
+
+
+ Notably, as this is a free software initiative, the project represents
+ major work by hundreds of software developers and ontology and data
+ wrangling experts. Thank you everyone!
+
+
+
+
+
+
3 How does the public sequence resource compare to
+ other data resources?
+
+
+ The short version is that we use state-of-the-art practices in
+ bioinformatics using agile methods. Unlike the resources from large
+ institutes we can improve things on a dime and anyone can contribute
+ to building out this resource! Sequences from GenBank, EBI/ENA and
+	 others are regularly added to PubSeq. We encourage everyone to
+	 submit to PubSeq because of its superior live tooling and metadata
+ support (see the next question).
+
+
+
+ Importantly: all data is published under either the Creative Commons
+ 4.0 attribution license or the CC0 “No Rights Reserved”
+ license which
+	 means the data can be published and workflows can run in public
+ environments allowing for improved access for research and
+ reproducible results. This contrasts with some other public resources,
+ such as GISAID.
+
+
+
+
+
+
4 Why should I upload my data here?
+
+
+ - We champion truly shareable data without licensing restrictions - with proper
+ attribution
+
+	 - We provide full metadata support using state-of-the-art ontologies
+ - We provide a web-based sequence uploader and a command-line version
+ for bulk uploads
+
+ - We provide a live SPARQL end-point for all metadata
+ - We provide free data analysis and sequence comparison triggered on data upload
+ - We do real work for you, with this link
+ you can see the last
+ run took 5.5 hours!
+
+ - We provide free downloads of all computed output
+ - There is no need to set up pipelines and/or compute clusters
+ - All workflows get triggered on uploading a new sequence
+ - When someone (you?) improves the software/workflows and everyone benefits
+	 - Your data gets automatically integrated with the Swiss Institute of
+ Bioinformatics COVID-19 knowledge base
+ https://covid-19-sparql.expasy.org/ (Elixir
+ Switzerland)
+
+ - Your data will be used to develop drug targets
+
+
+
+ Finally, if you upload your data here we have workflows that output
+ formatted data suitable for uploading to EBI
+ resources (and soon
+	 others). Uploading your data here gets your data ready for upload to
+ multiple resources.
+
+
+
+
+
+
5 Why should I not upload my data here?
+
+
+ Funny question. There are only good reasons to upload your data here
+ and make it available to the widest audience possible.
+
+
+
+ In fact, you can upload your data here as well as to other
+ resources. It is your data after all. No one can prevent you from
+ uploading your data to multiple resources.
+
+
+
+ We recommend uploading to EBI and NCBI resources using our data
+	 conversion tools. That way you enter your data only once, which makes
+	 the process smooth. You can also use our command-line data uploader
+ for bulk uploads!
+
+
+
+
+
+
6 How does the public sequence resource work?
+
+
+ On uploading a sequence with metadata it will automatically be
+ processed and incorporated into the public pangenome with metadata
+ using workflows from the High Performance Open Biology Lab defined
+ here.
+
+
+
+
+
+
7 Who uses the public sequence resource?
+
+
+ The Swiss Institute of Bioinformatics has included this data in
+ https://covid-19-sparql.expasy.org/ and made it part
+ of Uniprot.
+
+
+
+ The Pantograph viewer uses PubSeq data for their
+ visualisations.
+
+
+
+ UTHSC (USA), ESR (New Zealand) and
+ ORNL (USA) use COVID-19 PubSeq data
+ for monitoring, protein prediction and drug development.
+
+
+
+
+
+
8 How can I contribute?
+
+
+	 You can contribute by submitting sequences, updating metadata, submitting
+	 issues on our issue tracker, and, most importantly, adding functionality.
+ See 'How do I change the source code' below. Read through our online
+ documentation at http://covid19.genenetwork.org/blog
+ as a starting
+ point.
+
+
+
+
+
+
9 Is this about open data?
+
+
+ All data is published under a Creative Commons
+ 4.0 attribution license
+ (CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA)
+ data and store it for further processing.
+
+
+
+
+
+
10 Is this about free software?
+
+
+ Absolutely. Free software allows for fully reproducible pipelines. You
+	 can take our workflows and data and run them elsewhere!
+
+
+
+
+
+
11 How do I upload raw data?
+
+
+ We are preparing raw sequence data pipelines (fastq and BAM). The
+ reason is that we want the best data possible for downstream analysis
+ (including protein prediction and test development). The current
+ approach where people publish final sequences of SARS-CoV-2 is lacking
+ because it hides how this sequence was created. For reasons of
+ reproducible and improved results we want/need to work with the raw
+ sequence reads (both short reads and long reads) and take alternative
+ assembly variations into consideration. This is all work in progress.
+
+
+
+
+
+
12 How do I change metadata?
+
+
+
+
+
+
+
14 How do I change the source code?
+
+
+ Go to our source code repositories,
+ fork/clone the repository, change
+ something and submit a pull request
+	 (PR). That's easy! Check out how
+ many PRs we already merged.
+
+
+
+
+
+
15 Should I choose CC-BY or CC0?
+
+
+ Restrictive data licenses are hampering data sharing and reproducible
+ research. CC0 is the preferred license because it gives researchers
+ the most freedom. Since we provide metadata there is no reason for
+ others not to honour your work. We also provide CC-BY as an option
+ because we know people like the attribution clause.
+
+
+
+ In all honesty: we prefer both data and software to be free.
+
+
+
+
+
+
16 How do I deal with private data and privacy?
+
+
+ A public sequence resource is about public data. Metadata can refer to
+ private data. You can use your own (anonymous) identifiers. We also
+ plan to combine identifiers with clinical data stored securely at
+ REDCap. See the relevant tracker for more information and
+ contributing.
+
+
+
+
+
+
17 How do I communicate with you?
+
+
+ We use a gitter
+ channel you can join.
+
+
+
+
+
+
18 Who are the sponsors?
+
+
+ The main sponsors are listed in the footer. In addition to the time
+ generously donated by many contributors we also acknowledge Amazon AWS
+ for donating COVID-19 related compute time.
+
+
+
-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-07-18 Sat 03:27.
+
+
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs
+ org-mode and a healthy dose of Lisp!
Modified 2020-07-18 Sat 03:27.
diff --git a/doc/web/about.org b/doc/web/about.org
index 39fb667..8a954bb 100644
--- a/doc/web/about.org
+++ b/doc/web/about.org
@@ -28,6 +28,8 @@ resource for COVID-19 research. The focus is on providing the best
possible sequence data with associated metadata that can be used for
sequence comparison and protein prediction.
+We were at the *Bioinformatics Community Conference 2020*! Have a look at the [[https://bcc2020.sched.com/event/coLw][video talk]] ([[https://drive.google.com/file/d/1skXHwVKM_gl73-_4giYIOQ1IlC5X5uBo/view?usp=sharing][alternative link]]) and the [[https://drive.google.com/file/d/1vyEgfvSqhM9yIwWZ6Iys-QxhxtVxPSdp/view?usp=sharing][poster]].
+
* Who created the public sequence resource?
The *public sequence resource* is an initiative by [[https://github.com/arvados/bh20-seq-resource/graphs/contributors][bioinformatics]] and
--
cgit v1.2.3
From 088d0a7fa47d4dff5a38f42fe4d12b9841bb033e Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Thu, 23 Jul 2020 22:28:37 +0200
Subject: new workflow for odgi building from spoa gfa
---
.../odgi-build-from-spoa-gfa.cwl | 29 ++++++++++++++++++++++
1 file changed, 29 insertions(+)
create mode 100644 workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
diff --git a/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
new file mode 100644
index 0000000..2459ce7
--- /dev/null
+++ b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
@@ -0,0 +1,29 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+ inputGFA: File
+outputs:
+ odgiGraph:
+ type: File
+ outputBinding:
+ glob: $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+requirements:
+ InlineJavascriptRequirement: {}
+ ShellCommandRequirement: {}
+hints:
+ DockerRequirement:
+ dockerPull: "quay.io/biocontainers/odgi:v0.3--py37h8b12597_0"
+ ResourceRequirement:
+ coresMin: 4
+ ramMin: $(7 * 1024)
+ outdirMin: $(Math.ceil((inputs.inputGFA.size/(1024*1024*1024)+1) * 2))
+ InitialWorkDirRequirement:
+ listing:
+ - entry: $(inputs.inputGFA)
+ writable: true
+arguments: [odgi, build, -g, $(inputs.inputGFA), -o, -,
+ {shellQuote: false, valueFrom: "|"},
+ odgi, unchop, -i, -, -o, -,
+ {shellQuote: false, valueFrom: "|"},
+ odgi, sort, -i, -, -p, s, -o, $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+ ]
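The `ResourceRequirement` above sizes scratch space with a JavaScript expression. A Python sketch of the same arithmetic (names here are illustrative, not part of the workflow):

```python
import math

def outdir_min_gib(gfa_size_bytes):
    # Mirrors $(Math.ceil((inputs.inputGFA.size/(1024*1024*1024)+1) * 2)):
    # twice the input size in GiB plus a 2 GiB floor, rounded up.
    return math.ceil((gfa_size_bytes / (1024 ** 3) + 1) * 2)

print(outdir_min_gib(3 * 1024 ** 3))  # a 3 GiB GFA reserves 8 GiB of scratch
```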
--
cgit v1.2.3
From e31d89f6b4c0d2a99eb6df90b85b4e51cb584817 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:04:00 +0200
Subject: added spoa workflow in a low memory consumption mode
---
workflows/pangenome-generate/spoa.cwl | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
create mode 100644 workflows/pangenome-generate/spoa.cwl
diff --git a/workflows/pangenome-generate/spoa.cwl b/workflows/pangenome-generate/spoa.cwl
new file mode 100644
index 0000000..1e390d8
--- /dev/null
+++ b/workflows/pangenome-generate/spoa.cwl
@@ -0,0 +1,27 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+ readsFA: File
+stdout: $(inputs.readsFA.nameroot).g6.gfa
+outputs:
+ spoaGFA:
+ type: stdout
+requirements:
+ InlineJavascriptRequirement: {}
+ ShellCommandRequirement: {}
+hints:
+ DockerRequirement:
+ dockerPull: "quay.io/biocontainers/spoa:3.0.2--hc9558a2_0"
+ ResourceRequirement:
+ coresMin: 1
+ ramMin: $(15 * 1024)
+ outdirMin: $(Math.ceil(inputs.readsFA.size/(1024*1024*1024) + 20))
+baseCommand: spoa
+arguments: [
+ $(inputs.readsFA),
+ -G,
+ -g, '-6'
+]
--
cgit v1.2.3
From 618f956eb03c6a6ad1cc16efc931f55b0dce83e1 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:27:07 +0200
Subject: added workflow to sort a multifasta by quality and length, and added
the overall new pangenome generation workflow with SPOA
---
.../pangenome-generate/pangenome-generate_spoa.cwl | 122 +++++++++++++++++++++
.../sort_fasta_by_quality_and_len.cwl | 18 +++
.../sort_fasta_by_quality_and_len.py | 35 ++++++
3 files changed, 175 insertions(+)
create mode 100644 workflows/pangenome-generate/pangenome-generate_spoa.cwl
create mode 100644 workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
create mode 100644 workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
diff --git a/workflows/pangenome-generate/pangenome-generate_spoa.cwl b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
new file mode 100644
index 0000000..958ffb6
--- /dev/null
+++ b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
@@ -0,0 +1,122 @@
+#!/usr/bin/env cwl-runner
+cwlVersion: v1.1
+class: Workflow
+requirements:
+ ScatterFeatureRequirement: {}
+ StepInputExpressionRequirement: {}
+inputs:
+ inputReads: File[]
+ metadata: File[]
+ metadataSchema: File
+ subjects: string[]
+ exclude: File?
+ bin_widths:
+ type: int[]
+ default: [ 1, 4, 16, 64, 256, 1000, 4000, 16000]
+ doc: width of each bin in basepairs along the graph vector
+ cells_per_file:
+ type: int
+ default: 100
+ doc: Cells per file on component_segmentation
+outputs:
+ odgiGraph:
+ type: File
+ outputSource: buildGraph/odgiGraph
+ odgiPNG:
+ type: File
+ outputSource: vizGraph/graph_image
+ spoaGFA:
+ type: File
+ outputSource: induceGraph/spoaGFA
+ odgiRDF:
+ type: File
+ outputSource: odgi2rdf/rdf
+ readsMergeDedup:
+ type: File
+ outputSource: dedup/reads_dedup
+ mergedMetadata:
+ type: File
+ outputSource: mergeMetadata/merged
+ indexed_paths:
+ type: File
+ outputSource: index_paths/indexed_paths
+ colinear_components:
+ type: Directory
+ outputSource: segment_components/colinear_components
+steps:
+ relabel:
+ in:
+ readsFA: inputReads
+ subjects: subjects
+ exclude: exclude
+ out: [relabeledSeqs, originalLabels]
+ run: relabel-seqs.cwl
+ dedup:
+ in: {reads: relabel/relabeledSeqs}
+ out: [reads_dedup, dups]
+ run: ../tools/seqkit/seqkit_rmdup.cwl
+ sort_by_quality_and_len:
+ in: {reads: dedup/reads_dedup}
+ out: [reads_sorted_by_quality_and_len]
+ run: sort_fasta_by_quality_and_len.cwl
+ induceGraph:
+ in:
+ readsFA: sort_by_quality_and_len/reads_sorted_by_quality_and_len
+ out: [spoaGFA]
+ run: spoa.cwl
+ buildGraph:
+ in: {inputGFA: induceGraph/spoaGFA}
+ out: [odgiGraph]
+ run: odgi-build-from-spoa-gfa.cwl
+ vizGraph:
+ in:
+ sparse_graph_index: buildGraph/odgiGraph
+ width:
+ default: 50000
+ height:
+ default: 500
+ path_per_row:
+ default: true
+ path_height:
+ default: 4
+ out: [graph_image]
+ run: ../tools/odgi/odgi_viz.cwl
+ odgi2rdf:
+ in: {odgi: buildGraph/odgiGraph}
+ out: [rdf]
+ run: odgi_to_rdf.cwl
+ mergeMetadata:
+ in:
+ metadata: metadata
+ metadataSchema: metadataSchema
+ subjects: subjects
+ dups: dedup/dups
+ originalLabels: relabel/originalLabels
+ out: [merged]
+ run: merge-metadata.cwl
+ bin_paths:
+ run: ../tools/odgi/odgi_bin.cwl
+ in:
+ sparse_graph_index: buildGraph/odgiGraph
+ bin_width: bin_widths
+ scatter: bin_width
+ out: [ bins, pangenome_sequence ]
+ index_paths:
+ label: Create path index
+ run: ../tools/odgi/odgi_pathindex.cwl
+ in:
+ sparse_graph_index: buildGraph/odgiGraph
+ out: [ indexed_paths ]
+ segment_components:
+ label: Run component segmentation
+ run: ../tools/graph-genome-segmentation/component_segmentation.cwl
+ in:
+ bins: bin_paths/bins
+ cells_per_file: cells_per_file
+ pangenome_sequence:
+ source: bin_paths/pangenome_sequence
+ valueFrom: $(self[0])
+ # the bin_paths step is scattered over the bin_width array, but always using the same sparse_graph_index
+ # the pangenome_sequence that is extracted is exactly the same for the same sparse_graph_index
+ # regardless of bin_width, so we take the first pangenome_sequence as input for this step
+ out: [ colinear_components ]
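The step wiring above forms a small DAG: each step consumes outputs of the steps named in its `in:` block. A sketch of the execution order a workflow runner could derive from that wiring (dependency sets transcribed from the workflow above):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# step -> upstream steps it consumes outputs from, per the in: blocks above
deps = {
    "relabel": set(),
    "dedup": {"relabel"},
    "sort_by_quality_and_len": {"dedup"},
    "induceGraph": {"sort_by_quality_and_len"},
    "buildGraph": {"induceGraph"},
    "vizGraph": {"buildGraph"},
    "odgi2rdf": {"buildGraph"},
    "mergeMetadata": {"relabel", "dedup"},
    "bin_paths": {"buildGraph"},
    "index_paths": {"buildGraph"},
    "segment_components": {"bin_paths"},
}

# static_order() yields every step after all of its predecessors
order = list(TopologicalSorter(deps).static_order())
print(order)
```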
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
new file mode 100644
index 0000000..59f027e
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
@@ -0,0 +1,18 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+ readsFA:
+ type: File
+ inputBinding: {position: 2}
+ script:
+ type: File
+ inputBinding: {position: 1}
+ default: {class: File, location: sort_fasta_by_quality_and_len.py}
+stdout: $(inputs.readsFA.nameroot).sorted_by_quality_and_len.fasta
+outputs:
+ sortedReadsFA:
+ type: stdout
+requirements:
+ InlineJavascriptRequirement: {}
+ ShellCommandRequirement: {}
+baseCommand: [python]
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
new file mode 100644
index 0000000..e48fd68
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+# Sort the sequences by quality (fraction of called, non-N bases; descending) and by length (descending).
+# The best sequence is the longest one, with no uncalled bases.
+
+import os
+import sys
+import gzip
+
+def open_gzipsafe(path_file):
+ if path_file.endswith('.gz'):
+ return gzip.open(path_file, 'rt')
+ else:
+ return open(path_file)
+
+path_fasta = sys.argv[1]
+
+header_to_seq_dict = {}
+header_percCalledBases_seqLength_list = []
+
+with open_gzipsafe(path_fasta) as f:
+ for fasta in f.read().strip('\n>').split('>'):
+ header = fasta.strip('\n').split('\n')[0]
+
+ header_to_seq_dict[
+ header
+ ] = ''.join(fasta.strip('\n').split('\n')[1:])
+
+ seq_len = len(header_to_seq_dict[header])
+ header_percCalledBases_seqLength_list.append([
+ header, header_to_seq_dict[header].count('N'), (seq_len - header_to_seq_dict[header].count('N'))/seq_len, seq_len
+ ])
+
+for header, n_count, perc_called_bases, seq_len in sorted(header_percCalledBases_seqLength_list, key=lambda record: (record[-2], record[-1]), reverse=True):
+ sys.stdout.write('>{}\n{}\n'.format(header, header_to_seq_dict[header]))
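The sort key above orders records by fraction of called bases, then length, both descending. A toy illustration of that ordering (the headers are made-up examples):

```python
# Toy records shaped like the script's list entries:
# (header, N count, fraction of called bases, length)
records = [
    ("seq_short_clean", 0, 1.0, 100),
    ("seq_long_clean", 0, 1.0, 300),
    ("seq_long_with_Ns", 30, 0.9, 300),
]

# Same key as the script: (fraction called, length), descending
ordered = [h for h, *_ in sorted(records, key=lambda r: (r[-2], r[-1]), reverse=True)]
print(ordered)  # the longest fully-called sequence comes first
```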
--
cgit v1.2.3
From 2d20bf90497588a297ca98a78ee0fbbcadf95569 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:34:03 +0200
Subject: added in the FAQ three questions from the BCC 2020 attendees
---
doc/web/about.org | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/doc/web/about.org b/doc/web/about.org
index 8a954bb..29a80bf 100644
--- a/doc/web/about.org
+++ b/doc/web/about.org
@@ -17,7 +17,10 @@
- [[#how-do-i-change-the-work-flows][How do I change the work flows?]]
- [[#how-do-i-change-the-source-code][How do I change the source code?]]
- [[#should-i-choose-cc-by-or-cc0][Should I choose CC-BY or CC0?]]
+  - [[#are-there-also-variants-in-the-rdf-databases][Are there also variants in the RDF databases?]]
- [[#how-do-i-deal-with-private-data-and-privacy][How do I deal with private data and privacy?]]
+  - [[#do-you-have-any-checks-or-concerns-if-human-sequence-accidentally-submitted-to-your-service-as-part-of-a-fastq][Do you have any checks or concerns if a human sequence is accidentally submitted to your service as part of a fastq?]]
+  - [[#does-pubseq-support-only-sars-cov-2-data][Does PubSeq support only SARS-CoV-2 data?]]
- [[#how-do-i-communicate-with-you][How do I communicate with you?]]
- [[#who-are-the-sponsors][Who are the sponsors?]]
@@ -173,6 +176,12 @@ because we know people like the attribution clause.
In all honesty: we prefer both data and software to be free.
+* Are there also variants in the RDF databases?
+
+We output an RDF file that encodes the pangenome, and variants can be extracted from it because they are represented implicitly in the graph.
+
+We are also writing tools to generate VCF files directly from the pangenome.
+
* How do I deal with private data and privacy?
A public sequence resource is about public data. Metadata can refer to
@@ -180,6 +189,15 @@ private data. You can use your own (anonymous) identifiers. We also
plan to combine identifiers with clinical data stored securely at
[[https://redcap-covid19.elixir-luxembourg.org/redcap/][REDCap]]. See the relevant [[https://github.com/arvados/bh20-seq-resource/issues/21][tracker]] for more information and contributing.
+* Do you have any checks or concerns if a human sequence is accidentally submitted to your service as part of a fastq?
+
+We are planning to remove reads that match the human reference.
+
+* Does PubSeq support only SARS-CoV-2 data?
+
+To date, PubSeq is a resource specific to SARS-CoV-2, but we are designing it to be able to support other species in the future.
+
+
* How do I communicate with you?
We use a [[https://gitter.im/arvados/pubseq?utm_source=share-link&utm_medium=link&utm_campaign=share-link][gitter channel]] you can join.
--
cgit v1.2.3