aboutsummaryrefslogtreecommitdiff
path: root/doc/blog/using-covid-19-pubseq-part5.org
blob: fe1908ac185a572633649e7cfe38fe0ba2c41b1e (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
#+TITLE: COVID-19 PubSeq (part 4)
#+AUTHOR: Pjotr Prins
# C-c C-e h h   publish
# C-c !         insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
# C-c C-t       task rotate
# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png

#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />


* Table of Contents                                                     :TOC:noexport:
 - [[#modify-metadata][Modify Metadata]]
 - [[#what-is-the-schema][What is the schema?]]
 - [[#how-is-the-website-generated][How is the website generated?]]
 - [[#modifying-the-schema][Modifying the schema]]

* Modify Metadata

The public sequence resource uses multiple data formats listed on the
[[./download][DOWNLOAD]] page. One of the most exciting features is the full support
for RDF and semantic web/linked data ontologies. This technology
allows for querying data in unprescribed ways - that is, you can
formulate your own queries without dealing with a preset model of that
data (so typical of CSV files and SQL tables). Examples of exploring
data are listed [[./blog?id=using-covid-19-pubseq-part1][here]].

In this BLOG we are going to look at the metadata entered on the
[[./][COVID-19 PubSeq]] website (or command line client). It is important to
understand that anyone, including you, can change that information!

* What is the schema?

The default metadata schema is listed [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][here]].

* How is the website generated?

Using the schema we use [[https://pypi.org/project/PyShEx/][pyshex]] shex expressions and [[https://github.com/common-workflow-language/schema_salad][schema salad]] to
generate the [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47][input form]], [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13][validate]] the user input and to build [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24][RDF]]!
All from that one metadata schema.

* Modifying the schema

One of the first things we wanted to do is to add a field for the data
license. Initially we only support CC-4.0 as a license by default, but
now we want to give uploaders the option to make it an even more
liberal CC0 license. The first step is to find a good ontology term
for the field. Searching for `creative commons cc0 rdf' rendered this
useful [[https://creativecommons.org/ns][page]].  We also find an [[https://wiki.creativecommons.org/wiki/CC_License_Rdf_Overview][overview]] where CC0 is represented as URI
https://creativecommons.org/publicdomain/zero/1.0/.  Meanwhile the
attribution license https://creativecommons.org/licenses/by/4.0/.
According to this [[https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf][document]] we should really also add fields for
attributionName and attributionURL.

/Note: work in progress/