<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <!-- 2020-08-22 Sat 07:43 --> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <title>COVID-19 PubSeq (part 4)</title> <meta name="generator" content="Org mode" /> <meta name="author" content="Pjotr Prins" /> <style type="text/css"> <!--/*--><![CDATA[/*><!--*/ .title { text-align: center; margin-bottom: .2em; } .subtitle { text-align: center; font-size: medium; font-weight: bold; margin-top:0; } .todo { font-family: monospace; color: red; } .done { font-family: monospace; color: green; } .priority { font-family: monospace; color: orange; } .tag { background-color: #eee; font-family: monospace; padding: 2px; font-size: 80%; font-weight: normal; } .timestamp { color: #bebebe; } .timestamp-kwd { color: #5f9ea0; } .org-right { margin-left: auto; margin-right: 0px; text-align: right; } .org-left { margin-left: 0px; margin-right: auto; text-align: left; } .org-center { margin-left: auto; margin-right: auto; text-align: center; } .underline { text-decoration: underline; } #postamble p, #preamble p { font-size: 90%; margin: .2em; } p.verse { margin-left: 3%; } pre { border: 1px solid #ccc; box-shadow: 3px 3px 3px #eee; padding: 8pt; font-family: monospace; overflow: auto; margin: 1.2em; } pre.src { position: relative; overflow: visible; padding-top: 1.2em; } pre.src:before { display: none; position: absolute; background-color: white; top: -10px; right: 10px; padding: 3px; border: 1px solid black; } pre.src:hover:before { display: inline;} /* Languages per Org manual */ pre.src-asymptote:before { content: 'Asymptote'; } pre.src-awk:before { content: 'Awk'; } pre.src-C:before { content: 'C'; } /* pre.src-C++ doesn't work in CSS */ pre.src-clojure:before { content: 
'Clojure'; } pre.src-css:before { content: 'CSS'; } pre.src-D:before { content: 'D'; } pre.src-ditaa:before { content: 'ditaa'; } pre.src-dot:before { content: 'Graphviz'; } pre.src-calc:before { content: 'Emacs Calc'; } pre.src-emacs-lisp:before { content: 'Emacs Lisp'; } pre.src-fortran:before { content: 'Fortran'; } pre.src-gnuplot:before { content: 'gnuplot'; } pre.src-haskell:before { content: 'Haskell'; } pre.src-hledger:before { content: 'hledger'; } pre.src-java:before { content: 'Java'; } pre.src-js:before { content: 'Javascript'; } pre.src-latex:before { content: 'LaTeX'; } pre.src-ledger:before { content: 'Ledger'; } pre.src-lisp:before { content: 'Lisp'; } pre.src-lilypond:before { content: 'Lilypond'; } pre.src-lua:before { content: 'Lua'; } pre.src-matlab:before { content: 'MATLAB'; } pre.src-mscgen:before { content: 'Mscgen'; } pre.src-ocaml:before { content: 'Objective Caml'; } pre.src-octave:before { content: 'Octave'; } pre.src-org:before { content: 'Org mode'; } pre.src-oz:before { content: 'OZ'; } pre.src-plantuml:before { content: 'Plantuml'; } pre.src-processing:before { content: 'Processing.js'; } pre.src-python:before { content: 'Python'; } pre.src-R:before { content: 'R'; } pre.src-ruby:before { content: 'Ruby'; } pre.src-sass:before { content: 'Sass'; } pre.src-scheme:before { content: 'Scheme'; } pre.src-screen:before { content: 'Gnu Screen'; } pre.src-sed:before { content: 'Sed'; } pre.src-sh:before { content: 'shell'; } pre.src-sql:before { content: 'SQL'; } pre.src-sqlite:before { content: 'SQLite'; } /* additional languages in org.el's org-babel-load-languages alist */ pre.src-forth:before { content: 'Forth'; } pre.src-io:before { content: 'IO'; } pre.src-J:before { content: 'J'; } pre.src-makefile:before { content: 'Makefile'; } pre.src-maxima:before { content: 'Maxima'; } pre.src-perl:before { content: 'Perl'; } pre.src-picolisp:before { content: 'Pico Lisp'; } pre.src-scala:before { content: 'Scala'; } pre.src-shell:before { 
content: 'Shell Script'; } pre.src-ebnf2ps:before { content: 'ebfn2ps'; } /* additional language identifiers per "defun org-babel-execute" in ob-*.el */ pre.src-cpp:before { content: 'C++'; } pre.src-abc:before { content: 'ABC'; } pre.src-coq:before { content: 'Coq'; } pre.src-groovy:before { content: 'Groovy'; } /* additional language identifiers from org-babel-shell-names in ob-shell.el: ob-shell is the only babel language using a lambda to put the execution function name together. */ pre.src-bash:before { content: 'bash'; } pre.src-csh:before { content: 'csh'; } pre.src-ash:before { content: 'ash'; } pre.src-dash:before { content: 'dash'; } pre.src-ksh:before { content: 'ksh'; } pre.src-mksh:before { content: 'mksh'; } pre.src-posh:before { content: 'posh'; } /* Additional Emacs modes also supported by the LaTeX listings package */ pre.src-ada:before { content: 'Ada'; } pre.src-asm:before { content: 'Assembler'; } pre.src-caml:before { content: 'Caml'; } pre.src-delphi:before { content: 'Delphi'; } pre.src-html:before { content: 'HTML'; } pre.src-idl:before { content: 'IDL'; } pre.src-mercury:before { content: 'Mercury'; } pre.src-metapost:before { content: 'MetaPost'; } pre.src-modula-2:before { content: 'Modula-2'; } pre.src-pascal:before { content: 'Pascal'; } pre.src-ps:before { content: 'PostScript'; } pre.src-prolog:before { content: 'Prolog'; } pre.src-simula:before { content: 'Simula'; } pre.src-tcl:before { content: 'tcl'; } pre.src-tex:before { content: 'TeX'; } pre.src-plain-tex:before { content: 'Plain TeX'; } pre.src-verilog:before { content: 'Verilog'; } pre.src-vhdl:before { content: 'VHDL'; } pre.src-xml:before { content: 'XML'; } pre.src-nxml:before { content: 'XML'; } /* add a generic configuration mode; LaTeX export needs an additional (add-to-list 'org-latex-listings-langs '(conf " ")) in .emacs */ pre.src-conf:before { content: 'Configuration File'; } table { border-collapse:collapse; } caption.t-above { caption-side: top; } caption.t-bottom 
{ caption-side: bottom; } td, th { vertical-align:top; } th.org-right { text-align: center; } th.org-left { text-align: center; } th.org-center { text-align: center; } td.org-right { text-align: right; } td.org-left { text-align: left; } td.org-center { text-align: center; } dt { font-weight: bold; } .footpara { display: inline; } .footdef { margin-bottom: 1em; } .figure { padding: 1em; } .figure p { text-align: center; } .equation-container { display: table; text-align: center; width: 100%; } .equation { vertical-align: middle; } .equation-label { display: table-cell; text-align: right; vertical-align: middle; } .inlinetask { padding: 10px; border: 2px solid gray; margin: 10px; background: #ffffcc; } #org-div-home-and-up { text-align: right; font-size: 70%; white-space: nowrap; } textarea { overflow-x: auto; } .linenr { font-size: smaller } .code-highlighted { background-color: #ffff00; } .org-info-js_info-navigation { border-style: none; } #org-info-js_console-label { font-size: 10px; font-weight: bold; white-space: nowrap; } .org-info-js_search-highlight { background-color: #ffff00; color: #000000; font-weight: bold; } .org-svg { width: 90%; } /*]]>*/--> </style> <link rel="Blog stylesheet" type="text/css" href="blog.css" /> <script type="text/javascript"> /* @licstart The following is the entire license notice for the JavaScript code in this tag. Copyright (C) 2012-2020 Free Software Foundation, Inc. The JavaScript code in this tag is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GNU GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. The code is distributed WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU GPL for more details. 
As additional permission under GNU GPL version 3 section 7, you may distribute non-source (e.g., minimized or compacted) forms of that code without the copy of the GNU GPL normally required by section 4, provided you include this license notice and a URL through which recipients can access the Corresponding Source. @licend The above is the entire license notice for the JavaScript code in this tag. */ <!--/*--><![CDATA[/*><!--*/ function CodeHighlightOn(elem, id) { var target = document.getElementById(id); if(null != target) { elem.cacheClassElem = elem.className; elem.cacheClassTarget = target.className; target.className = "code-highlighted"; elem.className = "code-highlighted"; } } function CodeHighlightOff(elem, id) { var target = document.getElementById(id); if(elem.cacheClassElem) elem.className = elem.cacheClassElem; if(elem.cacheClassTarget) target.className = elem.cacheClassTarget; } /*]]>*///--> </script> </head> <body> <div id="content"> <h1 class="title">COVID-19 PubSeq (part 4)</h1> <div id="table-of-contents"> <h2>Table of Contents</h2> <div id="text-table-of-contents"> <ul> <li><a href="#org935d151">1. Modify Metadata</a></li> <li><a href="#orgfb70872">2. What is the schema?</a></li> <li><a href="#orga76b489">3. How is the website generated?</a></li> <li><a href="#org80bb905">4. Changing the license field</a> <ul> <li><a href="#org3689b60">4.1. Modifying the schema</a></li> <li><a href="#org07e0c66">4.2. Adding fields to the form</a></li> <li><a href="#org1cfb94a">4.3. <span class="todo TODO">TODO</span> Testing the license fields</a></li> </ul> </li> <li><a href="#org88d4555">5. Changing GEO or location field</a> <ul> <li><a href="#org063bcfa">5.1. 
Relaxing the shex constraint</a></li> </ul> </li> </ul> </div> </div> <div id="outline-container-org935d151" class="outline-2"> <h2 id="org935d151"><span class="section-number-2">1</span> Modify Metadata</h2> <div class="outline-text-2" id="text-1"> <p> The public sequence resource uses multiple data formats listed on the <a href="http://covid19.genenetwork.org/download">download</a> page. One of the most exciting features is the full support for RDF and semantic web/linked data ontologies. This technology allows for querying data in unprescribed ways - that is, you can formulate your own queries without dealing with a preset model of that data (which is how one has to approach CSV files and SQL tables). Examples of exploring data are listed <a href="http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1">here</a>. </p> <p> In this blog post we are going to look at the metadata entered on the COVID-19 PubSeq website (or through the command line client). It is important to understand that anyone, including you, can change that information! </p> </div> </div> <div id="outline-container-orgfb70872" class="outline-2"> <h2 id="orgfb70872"><span class="section-number-2">2</span> What is the schema?</h2> <div class="outline-text-2" id="text-2"> <p> The default metadata schema is listed <a href="https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml">here</a>. 
</p> </div> </div> <div id="outline-container-orga76b489" class="outline-2"> <h2 id="orga76b489"><span class="section-number-2">3</span> How is the website generated?</h2> <div class="outline-text-2" id="text-3"> <p> Starting from the schema we use <a href="https://pypi.org/project/PyShEx/">pyshex</a> ShEx expressions and <a href="https://github.com/common-workflow-language/schema_salad">schema salad</a> to generate the <a href="https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47">input form</a>, to <a href="https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13">validate</a> the user input and to build <a href="https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24">RDF</a>! All from that one metadata schema. </p> </div> </div> <div id="outline-container-org80bb905" class="outline-2"> <h2 id="org80bb905"><span class="section-number-2">4</span> Changing the license field</h2> <div class="outline-text-2" id="text-4"> </div> <div id="outline-container-org3689b60" class="outline-3"> <h3 id="org3689b60"><span class="section-number-3">4.1</span> Modifying the schema</h3> <div class="outline-text-3" id="text-4-1"> <p> One of the first things we wanted to do was to add a field for the data license. Initially we only supported CC-BY-4.0 as a license, but we wanted to give uploaders the option to use the even more liberal CC0 license. The first step was to find a good ontology term for the field. Searching for `creative commons cc0 rdf' turned up this useful <a href="https://creativecommons.org/ns">page</a>. We also found an <a href="https://wiki.creativecommons.org/wiki/CC_License_Rdf_Overview">overview</a> where CC0 is represented as the URI <a href="https://creativecommons.org/publicdomain/zero/1.0/">https://creativecommons.org/publicdomain/zero/1.0/</a>. 
Meanwhile the attribution license is represented as <a href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</a>. According to this <a href="https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf">document</a> we should really also add fields for attributionName and attributionURL. </p> <p> A minimal triple should be </p> <pre class="example">
id  xhtml:license  &lt;http://creativecommons.org/licenses/by/4.0/&gt; .
</pre> <p> Other suggestions are </p> <pre class="example">
id  dc:title  "Description" .
id  cc:attributionName  "Your Name" .
id  cc:attributionURL  &lt;http://resource.org/id&gt;
</pre> <p> and 'dc:source', which indicates the original source of any modified work, specified as a URI. The prefix 'cc:' is an abbreviation for <a href="http://creativecommons.org/ns">http://creativecommons.org/ns</a>#. </p> <p> Going back to the schema, where does it fit? Under the host, sample, virus, technology or submitter block? It could fit under sample, but actually the license concerns the whole metadata block and sequence, so I think it fits best under its own license tag. For example </p> <p> id: placeholder </p> <pre class="example">
license:
    license_type: http://creativecommons.org/licenses/by/4.0/
    attribution_title: "Sample ID"
    attribution_name: "John doe, Joe Boe, Jonny Oe"
    attribution_url: http://covid19.genenetwork.org/id
    attribution_source: https://www.ncbi.nlm.nih.gov/pubmed/323088888
</pre> <p> So, let's update the example. Notice the license info is optional - if it is missing we just assume the default CC-BY-4.0. </p> <p> Interestingly, the namespace <a href="https://creativecommons.org/ns">https://creativecommons.org/ns</a> makes no mention of a title. I think a title is useful, however, because we have no such field. So, we'll simply add it as a title field. 
Now the draft schema is </p> <div class="org-src-container"> <pre class="src src-js">- name: licenseSchema
  type: record
  fields:
    license_type:
      doc: License types as defined <span style="color: #fff59d;">in</span> https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf</span>
      type: string?
      jsonldPredicate:
        _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#License</span>
    title:
      doc: Attribution title related to license
      type: string?
      jsonldPredicate:
        _id: http:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">semanticscience.org/resource/SIO_001167</span>
    attribution_url:
      doc: Attribution URL related to license
      type: string?
      jsonldPredicate:
        _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#Work</span>
    attribution_source:
      doc: Attribution source URL
      type: string?
      jsonldPredicate:
        _id: https:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">creativecommons.org/ns#Work</span>
</pre> </div> <p> Now, we are no ontology experts, right? So, next we submitted a patch to our source tree and asked for feedback before wiring it up in the data entry form. The pull request was submitted <a href="https://github.com/arvados/bh20-seq-resource/pull/97">here</a>, reviewed on the gitter channel, and then I merged it. </p> </div> </div> <div id="outline-container-org07e0c66" class="outline-3"> <h3 id="org07e0c66"><span class="section-number-3">4.2</span> Adding fields to the form</h3> <div class="outline-text-3" id="text-4-2"> <p> To add the new fields to the form we have to modify it a little. If we go to the upload form we need to add the license box. The schema is loaded in <a href="https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229">main.py</a> in the 'generate_form' function. 
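</p> <p> To give a feel for what generating form fields from a schema can look like, here is a minimal Python sketch. It is not the project's form-generation code: LICENSE_SCHEMA is a hand-written dictionary that mirrors the draft license record above, and the function name form_fields and the field prefix metadata.license are hypothetical names chosen for illustration. </p>
<div class="org-src-container">
<pre class="src src-python"># Hypothetical, hand-written stand-in for the draft license record.
LICENSE_SCHEMA = {
    "name": "licenseSchema",
    "type": "record",
    "fields": {
        "license_type": {"doc": "License type", "type": "string?"},
        "title": {"doc": "Attribution title related to license", "type": "string?"},
        "attribution_url": {"doc": "Attribution URL related to license", "type": "string?"},
        "attribution_source": {"doc": "Attribution source URL", "type": "string?"},
    },
}


def form_fields(record, prefix):
    """Turn one schema record into a list of form field descriptions."""
    fields = []
    for name, spec in record["fields"].items():
        field_type = spec["type"]
        fields.append({
            # Dotted id so nested records stay unique on the form.
            "id": prefix + "." + name,
            # Fall back to the field name when no doc string is given.
            "label": spec.get("doc", name),
            # In schema salad a trailing "?" marks an optional field.
            "required": not field_type.endswith("?"),
        })
    return fields


for field in form_fields(LICENSE_SCHEMA, "metadata.license"):
    print(field["id"], "(required)" if field["required"] else "(optional)")
</pre>
</div>
<p> Each printed line corresponds to one input box the form would need. Because every field in the draft record is declared as string?, all four come out as optional. 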
</p> <p> With this <a href="https://github.com/arvados/bh20-seq-resource/commit/b9691c7deae30bd6422fb7b0681572b7b6f78ae3">patch</a> the website adds the license input fields to the form. </p> <p> Finally, to make RDF output work we need to add expressions to bh20seq-shex.rdf. This was done with this <a href="https://github.com/arvados/bh20-seq-resource/commit/f4ed46dae20abe5147871495ede2d6ac2b0854bc">patch</a>. In the end we decided to use the Dublin Core title, <a href="http://purl.org/metadata/dublin_core_elements#Title">http://purl.org/metadata/dublin_core_elements#Title</a>: </p> <div class="org-src-container"> <pre class="src src-js">:licenseShape{
    cc:License xsd:string;
    dc:Title xsd:string ?;
    cc:attributionName xsd:string ?;
    cc:attributionURL xsd:string ?;
    cc:attributionSource xsd:string ?;
}
</pre> </div> <p> Note that cc:attributionSource is not really defined in the cc standard. </p> <p> When pushing the license info we discovered the workflow broke because the existing data had no licensing info. So we changed the license field to be optional - a missing license is assumed to be CC-BY-4.0. </p> </div> </div> <div id="outline-container-org1cfb94a" class="outline-3"> <h3 id="org1cfb94a"><span class="section-number-3">4.3</span> <span class="todo TODO">TODO</span> Testing the license fields</h3> </div> </div> <div id="outline-container-org88d4555" class="outline-2"> <h2 id="org88d4555"><span class="section-number-2">5</span> Changing GEO or location field</h2> <div class="outline-text-2" id="text-5"> <p> When fetching information from GenBank and EBI/ENA we also translate the location into an unambiguous identifier. We opted for the Wikidata tag. E.g. for New York city it is <a href="https://www.wikidata.org/wiki/Q60">https://www.wikidata.org/wiki/Q60</a> and for New York state it is <a href="https://www.wikidata.org/wiki/Q1384">https://www.wikidata.org/wiki/Q1384</a>. If everyone uses these metadata URIs it is easy to group records by location when making queries. 
Note that we should be using <a href="http://www.wikidata.org/entity/Q60">http://www.wikidata.org/entity/Q60</a> in the dataset (http instead of https and entity instead of wiki). </p> <p> Unfortunately the main repositories of SARS-CoV-2 use variable strings of text and/or GPS coordinates for location. To satisfy our schema we had to translate every variant, and this proved expensive. </p> </div> <div id="outline-container-org063bcfa" class="outline-3"> <h3 id="org063bcfa"><span class="section-number-3">5.1</span> Relaxing the shex constraint</h3> <div class="outline-text-3" id="text-5-1"> <p> So we decided to relax the enforcement of this type of metadata and to allow a free-form string. </p> <p> The schema already used <a href="http://purl.obolibrary.org/obo/GAZ_00000448">http://purl.obolibrary.org/obo/GAZ_00000448</a> which states: </p> <div class="org-src-container"> <pre class="src src-js">Class: geographic location
Term IRI: http:<span style="color: #b0bec5;">//</span><span style="color: #b0bec5;">purl.obolibrary.org/obo/GAZ_00000448</span>
Definition: A reference to a place on the Earth, by its name or by its geographical location.
</pre> </div> <p> and when you check the count by location in the <a href="./demo">DEMO</a> it lists free-format strings. </p> <p> So, why did the validation step balk when importing from GenBank? The problem was in the <a href="https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39">shex check</a> for RDF generation. Removing the Wikidata requirement with this <a href="https://github.com/arvados/bh20-seq-resource/commit/f776816ee2b1af7ccc84afb494f68a81a51f5a76">patch</a> relaxed the imports. </p> </div> </div> </div> </div> <div id="postamble" class="status"> <hr /><small>Created by <a href="http://thebird.nl/">Pjotr Prins</a> (pjotr.public768 at thebird 'dot' nl) using Emacs org-mode and a healthy dose of Lisp!<br />Modified 2020-08-22 Sat 07:42</small>. 
</div> </body> </html>