From ed0909ac8015310b00f78cfc125d52768a25f626 Mon Sep 17 00:00:00 2001 From: Andrea Guarracino Date: Sat, 18 Jul 2020 14:45:54 +0200 Subject: fixed repetition of CC licenses --- bh20simplewebuploader/templates/blurb.html | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/bh20simplewebuploader/templates/blurb.html b/bh20simplewebuploader/templates/blurb.html index 9eef7c2..067cc3b 100644 --- a/bh20simplewebuploader/templates/blurb.html +++ b/bh20simplewebuploader/templates/blurb.html @@ -2,12 +2,12 @@ This is the COVID-19 Public Sequence Resource (COVID-19 PubSeq) for SARS-CoV-2 virus sequences. COVID-19 PubSeq is a repository for sequences with a low barrier to entry for uploading sequence data - using best practices, including FAIR data. I.e., data published with a creative commons - CC0 or CC-4.0 license with metadata using state-of-the art standards + using best practices, including FAIR data. Data are published with + metadata using state-of-the art standards and, perhaps most importantly, providing standardised workflows that get triggered on upload, so that results are immediately available in standardised data formats. 
- + Your uploaded sequence will automatically be processed and incorporated into the public pangenome with metadata using worklows from the High Performance Open Biology Lab -- cgit v1.2.3 From 1c7ae3cc4a9261a6e0563c0b84cdecb40051cc03 Mon Sep 17 00:00:00 2001 From: AndreaGuarracino Date: Sun, 26 Jul 2020 18:04:11 +0200 Subject: added thumbnail and big poster in the footer --- .../BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf | Bin 0 -> 2971149 bytes .../BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png | Bin 0 -> 160370 bytes bh20simplewebuploader/static/main.css | 2 +- bh20simplewebuploader/templates/footer.html | 5 +++++ 4 files changed, 6 insertions(+), 1 deletion(-) create mode 100644 bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf create mode 100644 bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png diff --git a/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf new file mode 100644 index 0000000..7da8cd6 Binary files /dev/null and b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.pdf differ diff --git a/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png new file mode 100644 index 0000000..eae2721 Binary files /dev/null and b/bh20simplewebuploader/static/image/BCC2020_AndreaGuarracino_COVID19PubSeq_Poster.png differ diff --git a/bh20simplewebuploader/static/main.css b/bh20simplewebuploader/static/main.css index bdcc0bc..7c33d9c 100644 --- a/bh20simplewebuploader/static/main.css +++ b/bh20simplewebuploader/static/main.css @@ -177,7 +177,7 @@ span.dropt:hover {text-decoration: none; background: #ffffff; z-index: 6; } .about { display: grid; - grid-template-columns: 1fr 1fr; + grid-template-columns: 1fr 1fr 1fr; 
grid-auto-flow: row; } diff --git a/bh20simplewebuploader/templates/footer.html b/bh20simplewebuploader/templates/footer.html index 26ea82a..abf46c3 100644 --- a/bh20simplewebuploader/templates/footer.html +++ b/bh20simplewebuploader/templates/footer.html @@ -15,6 +15,11 @@

+
+ + BCC2020 Andrea Guarracino COVID19 PubSeq Poster + +
-- cgit v1.2.3 From 0bd9dadb8a2dabcd06deb9df3b1082f7e1d993fe Mon Sep 17 00:00:00 2001 From: AndreaGuarracino Date: Thu, 23 Jul 2020 15:13:27 +0200 Subject: updated homepage image, changing its name --- README.md | 2 +- image/homepage.png | Bin 0 -> 243544 bytes image/website.png | Bin 220860 -> 0 bytes 3 files changed, 1 insertion(+), 1 deletion(-) create mode 100644 image/homepage.png delete mode 100644 image/website.png diff --git a/README.md b/README.md index 8c3a589..03e4297 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ web interface. You can use it to upload the genomes of SARS-CoV-2 samples to make them publicly and freely available to other researchers. For more information see the [paper](./paper/paper.md). -![alt text](./image/website.png "Website") +![alt text](./image/homepage.png "Website") To get started, first [install the uploader](#installation), and use the `bh20-seq-uploader` command to [upload your data](#usage). diff --git a/image/homepage.png b/image/homepage.png new file mode 100644 index 0000000..f66f9fd Binary files /dev/null and b/image/homepage.png differ diff --git a/image/website.png b/image/website.png deleted file mode 100644 index fa57ca5..0000000 Binary files a/image/website.png and /dev/null differ -- cgit v1.2.3 From 8e052cb7355eed7ce4d7075b23c9b0439285f84e Mon Sep 17 00:00:00 2001 From: AndreaGuarracino Date: Sun, 26 Jul 2020 18:02:41 +0200 Subject: updated org files, removing unuseful information and adding the BCC2020 video talk and poster links in the about page --- doc/blog/using-covid-19-pubseq-part2.html | 33 +- doc/blog/using-covid-19-pubseq-part2.org | 25 +- doc/blog/using-covid-19-pubseq-part3.html | 2 +- doc/blog/using-covid-19-pubseq-part3.org | 2 +- doc/web/about.html | 1489 ++++++++++++++++++----------- doc/web/about.org | 2 + 6 files changed, 960 insertions(+), 593 deletions(-) diff --git a/doc/blog/using-covid-19-pubseq-part2.html b/doc/blog/using-covid-19-pubseq-part2.html index c047441..c041ebe 100644 --- 
a/doc/blog/using-covid-19-pubseq-part2.html +++ b/doc/blog/using-covid-19-pubseq-part2.html @@ -259,39 +259,12 @@ for the JavaScript code in this tag.
-

-As part of the COVID-19 Biohackathon 2020 we formed a working group to -create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. -

1 Finding output of workflows

-

-As part of the COVID-19 Biohackathon 2020 we formed a working group to -create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. -

-
-
-
-

2 Introduction

-
-

+

We are using Arvados to run common workflow language (CWL) pipelines. The most recent output is on display on a web page (with time stamp) and a full list is generated here. It is nice to start up, but for @@ -302,7 +275,7 @@ want to wade through thousands of output files!

-

3 The Arvados file interface

+

2 The Arvados file interface

Arvados has the web server, but it also has a REST API and associated @@ -384,7 +357,7 @@ arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839d

-

4 Using the Arvados API

+

3 TODO Using the Arvados API

diff --git a/doc/blog/using-covid-19-pubseq-part2.org b/doc/blog/using-covid-19-pubseq-part2.org index d2a1cbc..349fd06 100644 --- a/doc/blog/using-covid-19-pubseq-part2.org +++ b/doc/blog/using-covid-19-pubseq-part2.org @@ -8,36 +8,13 @@ #+HTML_LINK_HOME: http://covid19.genenetwork.org #+HTML_HEAD: -As part of the COVID-19 Biohackathon 2020 we formed a working group to -create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. - * Table of Contents :TOC:noexport: - [[#finding-output-of-workflows][Finding output of workflows]] - - [[#introduction][Introduction]] - [[#the-arvados-file-interface][The Arvados file interface]] - [[#using-the-arvados-api][Using the Arvados API]] * Finding output of workflows -As part of the COVID-19 Biohackathon 2020 we formed a working group to -create a COVID-19 Public Sequence Resource (COVID-19 PubSeq) for -Corona virus sequences. The general idea is to create a repository -that has a low barrier to entry for uploading sequence data using best -practices. I.e., data published with a creative commons 4.0 (CC-4.0) -license with metadata using state-of-the art standards and, perhaps -most importantly, providing standardised workflows that get triggered -on upload, so that results are immediately available in standardised -data formats. - -* Introduction - We are using Arvados to run common workflow language (CWL) pipelines. 
The most recent output is on display on a [[https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca][web page]] (with time stamp) and a full list is generated [[https://collections.lugli.arvadosapi.com/c=lugli-4zz18-z513nlpqm03hpca/][here]]. It is nice to start up, but for @@ -81,4 +58,4 @@ its listed UUID: : arv-get 2be6af7b4741f2a5c5f8ff2bc6152d73+1955623+Ab9ad65d7fe958a053b3a57d545839de18290843a@5ed7f3c5 -* Using the Arvados API +* TODO Using the Arvados API diff --git a/doc/blog/using-covid-19-pubseq-part3.html b/doc/blog/using-covid-19-pubseq-part3.html index 91879b0..df4a286 100644 --- a/doc/blog/using-covid-19-pubseq-part3.html +++ b/doc/blog/using-covid-19-pubseq-part3.html @@ -625,7 +625,7 @@ The web interface using this exact same script so it should just work

6.2 Example: uploading bulk GenBank sequences

-We also use above script to bulk upload GenBank sequences with a FASTA +We also use above script to bulk upload GenBank sequences with a FASTA and YAML extractor specific for GenBank. This means that the steps we took above for uploading a GenBank sequence are already automated.

diff --git a/doc/blog/using-covid-19-pubseq-part3.org b/doc/blog/using-covid-19-pubseq-part3.org index 03f37ab..e8fee36 100644 --- a/doc/blog/using-covid-19-pubseq-part3.org +++ b/doc/blog/using-covid-19-pubseq-part3.org @@ -234,6 +234,6 @@ The web interface using this exact same script so it should just work ** Example: uploading bulk GenBank sequences -We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/from_genbank_to_fasta_and_yaml.py][FASTA +We also use above script to bulk upload GenBank sequences with a [[https://github.com/arvados/bh20-seq-resource/blob/master/scripts/download_genbank_data/from_genbank_to_fasta_and_yaml.py][FASTA and YAML]] extractor specific for GenBank. This means that the steps we took above for uploading a GenBank sequence are already automated. diff --git a/doc/web/about.html b/doc/web/about.html index dfd4252..c971a4e 100644 --- a/doc/web/about.html +++ b/doc/web/about.html @@ -1,549 +1,964 @@ + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> - - - -About/FAQ - - - - + + + + About/FAQ + + + +
-

About/FAQ

- - -
-

1 What is the 'public sequence resource' about?

-
-

-The public sequence resource aims to provide a generic and useful -resource for COVID-19 research. The focus is on providing the best -possible sequence data with associated metadata that can be used for -sequence comparison and protein prediction. -

-
-
- -
-

2 Who created the public sequence resource?

-
-

-The public sequence resource is an initiative by bioinformatics and -ontology experts who want to create something agile and useful for the -wider research community. The initiative started at the COVID-19 -biohackathon in April 2020 and is ongoing. The main project drivers -are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino -(University of Rome Tor Vergata), Michael Crusoe (Common Workflow -Language), Thomas Liener (consultant, formerly EBI), Erik Garrison -(UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics). -

- -

-Notably, as this is a free software initiative, the project represents -major work by hundreds of software developers and ontology and data -wrangling experts. Thank you everyone! -

-
-
- -
-

3 How does the public sequence resource compare to other data resources?

-
-

-The short version is that we use state-of-the-art practices in -bioinformatics using agile methods. Unlike the resources from large -institutes we can improve things on a dime and anyone can contribute -to building out this resource! Sequences from GenBank, EBI/ENA and -others are regularly added to PubSeq. We encourage people to everyone -to submit on PubSeq because of its superior live tooling and metadata -support (see the next question). -

- -

-Importantly: all data is published under either the Creative Commons -4.0 attribution license or the CC0 “No Rights Reserved” license which -means it data can be published and workflows can run in public -environments allowing for improved access for research and -reproducible results. This contrasts with some other public resources, -such as GISAID. -

-
-
- -
-

4 Why should I upload my data here?

-
-
    -
  1. We champion truly shareable data without licensing restrictions - with proper -attribution
  2. -
  3. We provide full metadata support using state-of-the-art ontology's
  4. -
  5. We provide a web-based sequence uploader and a command-line version -for bulk uploads
  6. -
  7. We provide a live SPARQL end-point for all metadata
  8. -
  9. We provide free data analysis and sequence comparison triggered on data upload
  10. -
  11. We do real work for you, with this link you can see the last -run took 5.5 hours!
  12. -
  13. We provide free downloads of all computed output
  14. -
  15. There is no need to set up pipelines and/or compute clusters
  16. -
  17. All workflows get triggered on uploading a new sequence
  18. -
  19. When someone (you?) improves the software/workflows and everyone benefits
  20. -
  21. Your data gets automatically integrated with the Swiss Institure of -Bioinformatics COVID-19 knowledge base -https://covid-19-sparql.expasy.org/ (Elixir Switzerland)
  22. -
  23. Your data will be used to develop drug targets
  24. -
- -

-Finally, if you upload your data here we have workflows that output -formatted data suitable for uploading to EBI resources (and soon -others). Uploading your data here get your data ready for upload to -multiple resources. -

-
-
- -
-

5 Why should I not upload by data here?

-
-

-Funny question. There are only good reasons to upload your data here -and make it available to the widest audience possible. -

- -

-In fact, you can upload your data here as well as to other -resources. It is your data after all. No one can prevent you from -uploading your data to multiple resources. -

- -

-We recommend uploading to EBI and NCBI resources using our data -conversion tools. It means you only enter data once and make the -process smooth. You can also use our command line data uploader -for bulk uploads! -

-
-
- -
-

6 How does the public sequence resource work?

-
-

-On uploading a sequence with metadata it will automatically be -processed and incorporated into the public pangenome with metadata -using workflows from the High Performance Open Biology Lab defined -here. -

-
-
- -
-

7 Who uses the public sequence resource?

-
-

-The Swiss Institute of Bioinformatics has included this data in -https://covid-19-sparql.expasy.org/ and made it part of Uniprot. -

- -

-The Pantograph viewer uses PubSeq data for their visualisations. -

- -

-UTHSC (USA), ESR (New Zealand) and ORNL (USA) use COVID-19 PubSeq data -for monitoring, protein prediction and drug development. -

-
-
- -
-

8 How can I contribute?

-
-

-You can contribute by submitting sequences, updating metadata, submit -issues on our issue tracker, and more importantly add functionality. -See 'How do I change the source code' below. Read through our online -documentation at http://covid19.genenetwork.org/blog as a starting -point. -

-
-
- -
-

9 Is this about open data?

-
-

-All data is published under a Creative Commons 4.0 attribution license -(CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA) -data and store it for further processing. -

-
-
- -
-

10 Is this about free software?

-
-

-Absolutely. Free software allows for fully reproducible pipelines. You -can take our workflows and data and run it elsewhere! -

-
-
- -
-

11 How do I upload raw data?

-
-

-We are preparing raw sequence data pipelines (fastq and BAM). The -reason is that we want the best data possible for downstream analysis -(including protein prediction and test development). The current -approach where people publish final sequences of SARS-CoV-2 is lacking -because it hides how this sequence was created. For reasons of -reproducible and improved results we want/need to work with the raw -sequence reads (both short reads and long reads) and take alternative -assembly variations into consideration. This is all work in progress. -

-
-
- -
-

12 How do I change metadata?

- -
- -
-

13 How do I change the work flows?

-
-

-Workflows are on github and can be modified. See also the BLOG -http://covid19.genenetwork.org/blog on workflows. -

-
-
- -
-

14 How do I change the source code?

-
-

-Go to our source code repositories, fork/clone the repository, change -something and submit a pull request (PR). That easy! Check out how -many PRs we already merged. -

-
-
- -
-

15 Should I choose CC-BY or CC0?

-
-

-Restrictive data licenses are hampering data sharing and reproducible -research. CC0 is the preferred license because it gives researchers -the most freedom. Since we provide metadata there is no reason for -others not to honour your work. We also provide CC-BY as an option -because we know people like the attribution clause. -

- -

-In all honesty: we prefer both data and software to be free. -

-
-
- -
-

16 How do I deal with private data and privacy?

-
-

-A public sequence resource is about public data. Metadata can refer to -private data. You can use your own (anonymous) identifiers. We also -plan to combine identifiers with clinical data stored securely at -REDCap. See the relevant tracker for more information and contributing. -

-
-
- -
-

17 How do I communicate with you?

-
-

-We use a gitter channel you can join. -

-
-
- -
-

18 Who are the sponsors?

-
-

-The main sponsors are listed in the footer. In addition to the time -generously donated by many contributors we also acknowledge Amazon AWS -for donating COVID-19 related compute time. -

-
-
+

About/FAQ

+ + +
+

1 What is the 'public sequence resource' about?

+
+

+ The public sequence resource aims to provide a generic and useful + resource for COVID-19 research. The focus is on providing the best + possible sequence data with associated metadata that can be used for + sequence comparison and protein prediction. +

+

+ We were at the Bioinformatics Community Conference 2020! Have a look at the + video talk + (alternative link) + and the poster. +

+
+
+ +
+

2 Who created the public sequence resource?

+
+

+ The public sequence resource is an initiative by bioinformatics and + ontology experts who want to create something agile and useful for the + wider research community. The initiative started at the COVID-19 + biohackathon in April 2020 and is ongoing. The main project drivers + are Pjotr Prins (UTHSC), Peter Amstutz (Curii), Andrea Guarracino + (University of Rome Tor Vergata), Michael Crusoe (Common Workflow + Language), Thomas Liener (consultant, formerly EBI), Erik Garrison + (UCSC) and Jerven Bolleman (Swiss Institute of Bioinformatics). +

+ +

+ Notably, as this is a free software initiative, the project represents + major work by hundreds of software developers and ontology and data + wrangling experts. Thank you everyone! +

+
+
+ +
+

3 How does the public sequence resource compare to + other data resources?

+
+

+ The short version is that we use state-of-the-art practices in + bioinformatics using agile methods. Unlike the resources from large + institutes we can improve things on a dime and anyone can contribute + to building out this resource! Sequences from GenBank, EBI/ENA and + others are regularly added to PubSeq. We encourage people to everyone + to submit on PubSeq because of its superior live tooling and metadata + support (see the next question). +

+ +

+ Importantly: all data is published under either the Creative Commons + 4.0 attribution license or the CC0 “No Rights Reserved” + license which + means it data can be published and workflows can run in public + environments allowing for improved access for research and + reproducible results. This contrasts with some other public resources, + such as GISAID. +

+
+
+ +
+

4 Why should I upload my data here?

+
+
    +
  1. We champion truly shareable data without licensing restrictions - with proper + attribution +
  2. +
  3. We provide full metadata support using state-of-the-art ontology's
  4. +
  5. We provide a web-based sequence uploader and a command-line version + for bulk uploads +
  6. +
  7. We provide a live SPARQL end-point for all metadata
  8. +
  9. We provide free data analysis and sequence comparison triggered on data upload
  10. +
  11. We do real work for you, with this link + you can see the last + run took 5.5 hours! +
  12. +
  13. We provide free downloads of all computed output
  14. +
  15. There is no need to set up pipelines and/or compute clusters
  16. +
  17. All workflows get triggered on uploading a new sequence
  18. +
  19. When someone (you?) improves the software/workflows and everyone benefits
  20. +
  21. Your data gets automatically integrated with the Swiss Institure of + Bioinformatics COVID-19 knowledge base + https://covid-19-sparql.expasy.org/ (Elixir + Switzerland) +
  22. +
  23. Your data will be used to develop drug targets
  24. +
+ +

+ Finally, if you upload your data here we have workflows that output + formatted data suitable for uploading to EBI + resources (and soon + others). Uploading your data here get your data ready for upload to + multiple resources. +

+
+
+ +
+

5 Why should I not upload by data here?

+
+

+ Funny question. There are only good reasons to upload your data here + and make it available to the widest audience possible. +

+ +

+ In fact, you can upload your data here as well as to other + resources. It is your data after all. No one can prevent you from + uploading your data to multiple resources. +

+ +

+ We recommend uploading to EBI and NCBI resources using our data + conversion tools. It means you only enter data once and make the + process smooth. You can also use our command line data uploader + for bulk uploads! +

+
+
+ +
+

6 How does the public sequence resource work?

+
+

+ On uploading a sequence with metadata it will automatically be + processed and incorporated into the public pangenome with metadata + using workflows from the High Performance Open Biology Lab defined + here. +

+
+
+ +
+

7 Who uses the public sequence resource?

+
+

+ The Swiss Institute of Bioinformatics has included this data in + https://covid-19-sparql.expasy.org/ and made it part + of Uniprot. +

+ +

+ The Pantograph viewer uses PubSeq data for their + visualisations. +

+ +

+ UTHSC (USA), ESR (New Zealand) and + ORNL (USA) use COVID-19 PubSeq data + for monitoring, protein prediction and drug development. +

+
+
+ +
+

8 How can I contribute?

+
+

+ You can contribute by submitting sequences, updating metadata, submit + issues on our issue tracker, and more importantly add functionality. + See 'How do I change the source code' below. Read through our online + documentation at http://covid19.genenetwork.org/blog + as a starting + point. +

+
+
+ +
+

9 Is this about open data?

+
+

+ All data is published under a Creative Commons + 4.0 attribution license + (CC-BY-4.0). You can download the raw and published (GFA/RDF/FASTA) + data and store it for further processing. +

+
+
+ +
+

10 Is this about free software?

+
+

+ Absolutely. Free software allows for fully reproducible pipelines. You + can take our workflows and data and run it elsewhere! +

+
+
+ +
+

11 How do I upload raw data?

+
+

+ We are preparing raw sequence data pipelines (fastq and BAM). The + reason is that we want the best data possible for downstream analysis + (including protein prediction and test development). The current + approach where people publish final sequences of SARS-CoV-2 is lacking + because it hides how this sequence was created. For reasons of + reproducible and improved results we want/need to work with the raw + sequence reads (both short reads and long reads) and take alternative + assembly variations into consideration. This is all work in progress. +

+
+
+ +
+

12 How do I change metadata?

+ +
+ +
+

13 How do I change the work flows?

+
+

+ Workflows are on github + and can be modified. See also the BLOG + http://covid19.genenetwork.org/blog on workflows. +

+
+
+ +
+

14 How do I change the source code?

+
+

+ Go to our source code repositories, + fork/clone the repository, change + something and submit a pull request + (PR). That easy! Check out how + many PRs we already merged. +

+
+
+ +
+

15 Should I choose CC-BY or CC0?

+
+

+ Restrictive data licenses are hampering data sharing and reproducible + research. CC0 is the preferred license because it gives researchers + the most freedom. Since we provide metadata there is no reason for + others not to honour your work. We also provide CC-BY as an option + because we know people like the attribution clause. +

+ +

+ In all honesty: we prefer both data and software to be free. +

+
+
+ +
+

16 How do I deal with private data and privacy?

+
+

+ A public sequence resource is about public data. Metadata can refer to + private data. You can use your own (anonymous) identifiers. We also + plan to combine identifiers with clinical data stored securely at + REDCap. See the relevant tracker for more information and + contributing. +

+
+
+ +
+

17 How do I communicate with you?

+
+

+ We use a gitter + channel you can join. +

+
+
+ +
+

18 Who are the sponsors?

+
+

+ The main sponsors are listed in the footer. In addition to the time + generously donated by many contributors we also acknowledge Amazon AWS + for donating COVID-19 related compute time. +

+
+
-
Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!
Modified 2020-07-18 Sat 03:27
. +
+ Created by Pjotr Prins (pjotr.public768 at thebird 'dot' nl) using Emacs + org-mode and a healthy dose of Lisp!
Modified 2020-07-18 Sat 03:27
.
diff --git a/doc/web/about.org b/doc/web/about.org
index 39fb667..8a954bb 100644
--- a/doc/web/about.org
+++ b/doc/web/about.org
@@ -28,6 +28,8 @@ resource for COVID-19 research. The focus is on providing the best
 possible sequence data with associated metadata that can be used for
 sequence comparison and protein prediction.
 
+We were at the *Bioinformatics Community Conference 2020*! Have a look at the [[https://bcc2020.sched.com/event/coLw]][video talk] ([[https://drive.google.com/file/d/1skXHwVKM_gl73-_4giYIOQ1IlC5X5uBo/view?usp=sharing]][alternative link]) and the [[https://drive.google.com/file/d/1vyEgfvSqhM9yIwWZ6Iys-QxhxtVxPSdp/view?usp=sharing]][poster].
+
 * Who created the public sequence resource?
 
 The *public sequence resource* is an initiative by [[https://github.com/arvados/bh20-seq-resource/graphs/contributors][bioinformatics]] and
--
cgit v1.2.3

From 088d0a7fa47d4dff5a38f42fe4d12b9841bb033e Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Thu, 23 Jul 2020 22:28:37 +0200
Subject: new workflow for odgi building from spoa gfa

---
 .../odgi-build-from-spoa-gfa.cwl | 29 ++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl

diff --git a/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
new file mode 100644
index 0000000..2459ce7
--- /dev/null
+++ b/workflows/pangenome-generate/odgi-build-from-spoa-gfa.cwl
@@ -0,0 +1,29 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  inputGFA: File
+outputs:
+  odgiGraph:
+    type: File
+    outputBinding:
+      glob: $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+hints:
+  DockerRequirement:
+    dockerPull: "quay.io/biocontainers/odgi:v0.3--py37h8b12597_0"
+  ResourceRequirement:
+    coresMin: 4
+    ramMin: $(7 * 1024)
+    outdirMin: $(Math.ceil((inputs.inputGFA.size/(1024*1024*1024)+1) * 2))
+  InitialWorkDirRequirement:
+    listing:
+      - entry: $(inputs.inputGFA)
+        writable: true
+arguments: [odgi, build, -g, $(inputs.inputGFA), -o, -,
+            {shellQuote: false, valueFrom: "|"},
+            odgi, unchop, -i, -, -o, -,
+            {shellQuote: false, valueFrom: "|"},
+            odgi, sort, -i, -, -p, s, -o, $(inputs.inputGFA.nameroot).unchop.sorted.odgi
+            ]
--
cgit v1.2.3

From e31d89f6b4c0d2a99eb6df90b85b4e51cb584817 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:04:00 +0200
Subject: added spoa workflow in a low memory consumption mode

---
 workflows/pangenome-generate/spoa.cwl | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 workflows/pangenome-generate/spoa.cwl

diff --git a/workflows/pangenome-generate/spoa.cwl b/workflows/pangenome-generate/spoa.cwl
new file mode 100644
index 0000000..1e390d8
--- /dev/null
+++ b/workflows/pangenome-generate/spoa.cwl
@@ -0,0 +1,27 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  readsFA: File
+stdout: $(inputs.readsFA.nameroot).g6.gfa
+script:
+  type: File
+  default: {class: File, location: relabel-seqs.py}
+outputs:
+  spoaGFA:
+    type: stdout
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+hints:
+  DockerRequirement:
+    dockerPull: "quay.io/biocontainers/spoa:3.0.2--hc9558a2_0"
+  ResourceRequirement:
+    coresMin: 1
+    ramMin: $(15 * 1024)
+    outdirMin: $(Math.ceil(inputs.readsFA.size/(1024*1024*1024) + 20))
+baseCommand: spoa
+arguments: [
+  $(inputs.readsFA),
+  -G,
+  -g, '-6'
+]
--
cgit v1.2.3

From 618f956eb03c6a6ad1cc16efc931f55b0dce83e1 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:27:07 +0200
Subject: added workflow to sort a multifasta by quality and length, and added the overall new pangenome generation workflow with SPOA

---
 .../pangenome-generate/pangenome-generate_spoa.cwl | 122 +++++++++++++++++++++
 .../sort_fasta_by_quality_and_len.cwl | 18 +++
 .../sort_fasta_by_quality_and_len.py | 35 ++++++
 3 files changed, 175 insertions(+)
 create mode 100644 workflows/pangenome-generate/pangenome-generate_spoa.cwl
 create mode 100644 workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
 create mode 100644 workflows/pangenome-generate/sort_fasta_by_quality_and_len.py

diff --git a/workflows/pangenome-generate/pangenome-generate_spoa.cwl b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
new file mode 100644
index 0000000..958ffb6
--- /dev/null
+++ b/workflows/pangenome-generate/pangenome-generate_spoa.cwl
@@ -0,0 +1,122 @@
+#!/usr/bin/env cwl-runner
+cwlVersion: v1.1
+class: Workflow
+requirements:
+  ScatterFeatureRequirement: {}
+  StepInputExpressionRequirement: {}
+inputs:
+  inputReads: File[]
+  metadata: File[]
+  metadataSchema: File
+  subjects: string[]
+  exclude: File?
+  bin_widths:
+    type: int[]
+    default: [ 1, 4, 16, 64, 256, 1000, 4000, 16000]
+    doc: width of each bin in basepairs along the graph vector
+  cells_per_file:
+    type: int
+    default: 100
+    doc: Cells per file on component_segmentation
+outputs:
+  odgiGraph:
+    type: File
+    outputSource: buildGraph/odgiGraph
+  odgiPNG:
+    type: File
+    outputSource: vizGraph/graph_image
+  spoaGFA:
+    type: File
+    outputSource: induceGraph/spoaGFA
+  odgiRDF:
+    type: File
+    outputSource: odgi2rdf/rdf
+  readsMergeDedup:
+    type: File
+    outputSource: dedup/reads_dedup
+  mergedMetadata:
+    type: File
+    outputSource: mergeMetadata/merged
+  indexed_paths:
+    type: File
+    outputSource: index_paths/indexed_paths
+  colinear_components:
+    type: Directory
+    outputSource: segment_components/colinear_components
+steps:
+  relabel:
+    in:
+      readsFA: inputReads
+      subjects: subjects
+      exclude: exclude
+    out: [relabeledSeqs, originalLabels]
+    run: relabel-seqs.cwl
+  dedup:
+    in: {reads: relabel/relabeledSeqs}
+    out: [reads_dedup, dups]
+    run: ../tools/seqkit/seqkit_rmdup.cwl
+  sort_by_quality_and_len:
+    in: {reads: dedup/reads_dedup}
+    out: [reads_sorted_by_quality_and_len]
+    run: sort_fasta_by_quality_and_len.cwl
+  induceGraph:
+    in:
+      readsFA: sort_by_quality_and_len/reads_sorted_by_quality_and_len
+    out: [spoaGFA]
+    run: spoa.cwl
+  buildGraph:
+    in: {inputGFA: induceGraph/spoaGFA}
+    out: [odgiGraph]
+    run: odgi-build-from-spoa-gfa.cwl
+  vizGraph:
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+      width:
+        default: 50000
+      height:
+        default: 500
+      path_per_row:
+        default: true
+      path_height:
+        default: 4
+    out: [graph_image]
+    run: ../tools/odgi/odgi_viz.cwl
+  odgi2rdf:
+    in: {odgi: buildGraph/odgiGraph}
+    out: [rdf]
+    run: odgi_to_rdf.cwl
+  mergeMetadata:
+    in:
+      metadata: metadata
+      metadataSchema: metadataSchema
+      subjects: subjects
+      dups: dedup/dups
+      originalLabels: relabel/originalLabels
+    out: [merged]
+    run: merge-metadata.cwl
+  bin_paths:
+    run: ../tools/odgi/odgi_bin.cwl
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+      bin_width: bin_widths
+    scatter: bin_width
+    out: [ bins, pangenome_sequence ]
+  index_paths:
+    label: Create path index
+    run: ../tools/odgi/odgi_pathindex.cwl
+    in:
+      sparse_graph_index: buildGraph/odgiGraph
+    out: [ indexed_paths ]
+  segment_components:
+    label: Run component segmentation
+    run: ../tools/graph-genome-segmentation/component_segmentation.cwl
+    in:
+      bins: bin_paths/bins
+      cells_per_file: cells_per_file
+      pangenome_sequence:
+        source: bin_paths/pangenome_sequence
+        valueFrom: $(self[0])
+        # the bin_paths step is scattered over the bin_width array, but always using the same sparse_graph_index
+        # the pangenome_sequence that is extracted is exactly the same for the same sparse_graph_index
+        # regardless of bin_width, so we take the first pangenome_sequence as input for this step
+    out: [ colinear_components ]
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
new file mode 100644
index 0000000..59f027e
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.cwl
@@ -0,0 +1,18 @@
+cwlVersion: v1.1
+class: CommandLineTool
+inputs:
+  readsFA:
+    type: File
+    inputBinding: {position: 2}
+  script:
+    type: File
+    inputBinding: {position: 1}
+    default: {class: File, location: sort_fasta_by_quality_and_len.py}
+stdout: $(inputs.readsFA.nameroot).sorted_by_quality_and_len.fasta
+outputs:
+  sortedReadsFA:
+    type: stdout
+requirements:
+  InlineJavascriptRequirement: {}
+  ShellCommandRequirement: {}
+baseCommand: [python]
diff --git a/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
new file mode 100644
index 0000000..e48fd68
--- /dev/null
+++ b/workflows/pangenome-generate/sort_fasta_by_quality_and_len.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python3
+
+# Sort the sequences by quality (percentage of number of N bases not called, descending) and by length (descending).
+# The best sequence is the longest one, with no uncalled bases.
+
+import os
+import sys
+import gzip
+
+def open_gzipsafe(path_file):
+    if path_file.endswith('.gz'):
+        return gzip.open(path_file, 'rt')
+    else:
+        return open(path_file)
+
+path_fasta = sys.argv[1]
+
+header_to_seq_dict = {}
+header_percCalledBases_seqLength_list = []
+
+with open_gzipsafe(path_fasta) as f:
+    for fasta in f.read().strip('\n>').split('>'):
+        header = fasta.strip('\n').split('\n')[0]
+
+        header_to_seq_dict[
+            header
+        ] = ''.join(fasta.strip('\n').split('\n')[1:])
+
+        seq_len = len(header_to_seq_dict[header])
+        header_percCalledBases_seqLength_list.append([
+            header, header_to_seq_dict[header].count('N'), (seq_len - header_to_seq_dict[header].count('N'))/seq_len, seq_len
+        ])
+
+for header, x, percCalledBases, seqLength_list in sorted(header_percCalledBases_seqLength_list, key=lambda x: (x[-2], x[-1]), reverse = True):
+    sys.stdout.write('>{}\n{}\n'.format(header, header_to_seq_dict[header]))
-- 
cgit v1.2.3


From 2d20bf90497588a297ca98a78ee0fbbcadf95569 Mon Sep 17 00:00:00 2001
From: AndreaGuarracino
Date: Mon, 27 Jul 2020 17:34:03 +0200
Subject: added in the FAQ three questions from the BCC 2020 attendees

---
 doc/web/about.org | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/doc/web/about.org b/doc/web/about.org
index 8a954bb..29a80bf 100644
--- a/doc/web/about.org
+++ b/doc/web/about.org
@@ -17,7 +17,10 @@
  - [[#how-do-i-change-the-work-flows][How do I change the work flows?]]
  - [[#how-do-i-change-the-source-code][How do I change the source code?]]
  - [[#should-i-choose-cc-by-or-cc0][Should I choose CC-BY or CC0?]]
+ - [[#are-there-also-variant-in-the-RDF-databases]][Are there also variant in the RDF databases?]
  - [[#how-do-i-deal-with-private-data-and-privacy][How do I deal with private data and privacy?]]
+ - [[#do-you-have-any-checks-or-concerns-if-human-sequence-accidentally-submitted-to-your-service-as-part-of-a-fastq][Do you have any checks or concerns if human sequence accidentally submitted to your service as part of a fastq?]
+ - [[#does-PubSeq-support-only-SARS-CoV-2=data]][Does PubSeq support only SARS-CoV-2 data?]
  - [[#how-do-i-communicate-with-you][How do I communicate with you?]]
  - [[#who-are-the-sponsors][Who are the sponsors?]]
 
@@ -173,6 +176,12 @@ because we know people like the attribution clause. In all honesty: we prefer both data and software to be free.
+* Are there also variant in the RDF databases? *
+
+We do output a RDF file with the pangenome built in, and you can parse it because it has variants implicitly.
+
+We are also writing tools to generate VCF files directly from the pangenome.
+
 * How do I deal with private data and privacy?
 
 A public sequence resource is about public data. Metadata can refer to
@@ -180,6 +189,15 @@ private data. You can use your own (anonymous) identifiers. We also plan to combine identifiers with clinical data stored securely at [[https://redcap-covid19.elixir-luxembourg.org/redcap/][REDCap]]. See the relevant [[https://github.com/arvados/bh20-seq-resource/issues/21][tracker]] for more information and contributing.
+* Do you have any checks or concerns if human sequence accidentally submitted to your service as part of a fastq? *
+
+We are planning to remove reads that match the human reference.
+
+* Does PubSeq support only SARS-CoV-2 data? *
+
+To date, PubSeq is a resource specific to SARS-CoV-2, but we are designing it to be able to support other species in the future.
+
+
 * How do I communicate with you?
 
 We use a [[https://gitter.im/arvados/pubseq?utm_source=share-link&utm_medium=link&utm_campaign=share-link][gitter channel]] you can join.
-- 
cgit v1.2.3