#+TITLE: COVID-19 PubSeq (part 4)
#+AUTHOR: Pjotr Prins
# C-c C-e h h   publish
# C-c !         insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
# C-c C-t       task rotate
# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png

#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />


* Table of Contents                                                     :TOC:noexport:
 - [[#modify-metadata][Modify Metadata]]
 - [[#what-is-the-schema][What is the schema?]]
 - [[#how-is-the-website-generated][How is the website generated?]]
 - [[#modifying-the-schema][Modifying the schema]]

* Modify Metadata

The public sequence resource uses multiple data formats listed on the
[[./download][DOWNLOAD]] page. One of the most exciting features is the full support
for RDF and semantic web/linked data ontologies. This technology
allows for querying data in unprescribed ways - that is, you can
formulate your own queries without dealing with a preset model of that
data (so typical of CSV files and SQL tables). Examples of exploring
data are listed [[./blog?id=using-covid-19-pubseq-part1][here]].

In this BLOG we are going to look at the metadata entered on the
[[./][COVID-19 PubSeq]] website (or command line client). It is important to
understand that anyone, including you, can change that information!

* What is the schema?

The default metadata schema is listed [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][here]].

* How is the website generated?

Using the schema we use [[https://pypi.org/project/PyShEx/][pyshex]] shex expressions and [[https://github.com/common-workflow-language/schema_salad][schema salad]] to
generate the [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47][input form]], [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13][validate]] the user input and to build [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24][RDF]]!
All from that one metadata schema.

* Modifying the schema

One of the first things we want to do is to add a field for the data
license. Initially we only support CC-4.0 as a license by default, but
now we want to give uploaders the option to make it an even more
liberal CC0 license. The first step is to find a good ontology term
for the field. Searching for `creative commons cc0 rdf' rendered this
useful [[https://creativecommons.org/ns][page]].  We also find an [[https://wiki.creativecommons.org/wiki/CC_License_Rdf_Overview][overview]] where CC0 is represented as URI
https://creativecommons.org/publicdomain/zero/1.0/.  Meanwhile the
attribution license https://creativecommons.org/licenses/by/4.0/.
According to this [[https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf][document]] we should really also add fields for
attributionName and attributionURL.

A minimal triple should be

: id  xhtml:license  <http://creativecommons.org/licenses/by/4.0/> .

Other suggestions are

: id  dc:title "Description" .
: id  cc:attributionName "Your Name" .
: id  cc:attributionURL <http://resource.org/id>

and 'dc:source' which indicates the original source of any modified
work, specified as a URI.
The prefix 'cc:' is an abbreviation for http://creativecommons.org/ns#.

Going back to the schema, where does it fit? Under host, sample,
virus, technology or submitter block? It could fit under sample, but
actually the license concerns the whole metadata block and sequence,
so I think we can fit under its own license tag. For example


id: placeholder

: license:
:     license_type: http://creativecommons.org/licenses/by/4.0/
:     attribution_title: "Sample ID"
:     attribution_name: "John doe, Joe Boe, Jonny Oe"
:     attribution_url: http://covid19.genenetwork.org/id
:     attribution_source: https://www.ncbi.nlm.nih.gov/pubmed/323088888

So, let's update the example. Notice the license info is optional - if it is missing
we just assume the default CC-4.0.

One thing that is interesting is that in the name space https://creativecommons.org/ns there
is no mention of a title. I think it is useful, however, because we have no such field.
So, we'll add it simply as a title field. Now the draft schema is

#+BEGIN_SRC js
- name: licenseSchema
  type: record
  fields:
    license_type:
      doc: License types as refined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#License
    title:
      doc: Attribution title related to license
      type: string?
      jsonldPredicate:
          _id: http://semanticscience.org/resource/SIO_001167
    attribution_url:
      doc: Attribution URL related to license
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work
    attribution_source:
      doc: Attribution source URL
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work
#+END_SRC

Now, we are no ontology experts, right? So, next we submit a patch to our source tree and
ask for feedback before wiring it up in the data entry form. The pull request was
submitted here FIXME.

/Note: work in progress/