doc/blog/using-covid-19-pubseq-part5.org


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195

#+TITLE: COVID-19 PubSeq (part 4)
#+AUTHOR: Pjotr Prins
# C-c C-e h h   publish
# C-c !         insert date (use . for active agenda, C-u C-c ! for date, C-u C-c . for time)
# C-c C-t       task rotate
# RSS_IMAGE_URL: http://xxxx.xxxx.free.fr/rss_icon.png

#+HTML_HEAD: <link rel="Blog stylesheet" type="text/css" href="blog.css" />


* Table of Contents                                                     :TOC:noexport:
 - [[#modify-metadata][Modify Metadata]]
 - [[#what-is-the-schema][What is the schema?]]
 - [[#how-is-the-website-generated][How is the website generated?]]
 - [[#changing-the-license-field][Changing the license field]]
   - [[#modifying-the-schema][Modifying the schema]]
   - [[#adding-fields-to-the-form][Adding fields to the form]]
   - [[#testing-the-license-fields][Testing the license fields]]
 - [[#changing-geo-or-location-field][Changing GEO or location field]]
   - [[#relaxing-the-shex-constraint][Relaxing the shex constraint]]

* Modify Metadata

The public sequence resource uses multiple data formats listed on the
[[http://covid19.genenetwork.org/download][download]] page. One of the most exciting features is the full support
for RDF and semantic web/linked data ontologies. This technology
allows for querying data in unprescribed ways - that is, you can
formulate your own queries without dealing with a preset model of that
data (which is how one has to approach CSV files and SQL
tables). Examples of exploring data are listed [[http://covid19.genenetwork.org/blog?id=using-covid-19-pubseq-part1][here]].

In this BLOG we are going to look at the metadata entered on the
COVID-19 PubSeq website (or command line client). It is important to
understand that anyone, including you, can change that information!

* What is the schema?

The default metadata schema is listed [[https://github.com/arvados/bh20-seq-resource/blob/master/bh20sequploader/bh20seq-schema.yml][here]].

* How is the website generated?

Using the schema we use [[https://pypi.org/project/PyShEx/][pyshex]] shex expressions and [[https://github.com/common-workflow-language/schema_salad][schema salad]] to
generate the [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47][input form]], [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13][validate]] the user input and to build [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24][RDF]]!
All from that one metadata schema.

* Changing the license field

** Modifying the schema

One of the first things we want to do is to add a field for the data
license. Initially we only supported CC-4.0 as a license, but
we wanted to give uploaders the option to use an even more
liberal CC0 license. The first step is to find a good ontology term
for the field. Searching for `creative commons cc0 rdf' rendered this
useful [[https://creativecommons.org/ns][page]].  We also find an [[https://wiki.creativecommons.org/wiki/CC_License_Rdf_Overview][overview]] where CC0 is represented as URI
https://creativecommons.org/publicdomain/zero/1.0/.  Meanwhile the
attribution license https://creativecommons.org/licenses/by/4.0/.
According to this [[https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf][document]] we should really also add fields for
attributionName and attributionURL.

A minimal triple should be

: id  xhtml:license  <http://creativecommons.org/licenses/by/4.0/> .

Other suggestions are

: id  dc:title "Description" .
: id  cc:attributionName "Your Name" .
: id  cc:attributionURL <http://resource.org/id>

and 'dc:source' which indicates the original source of any modified
work, specified as a URI.
The prefix 'cc:' is an abbreviation for http://creativecommons.org/ns#.

Going back to the schema, where does it fit? Under host, sample,
virus, technology or submitter block? It could fit under sample, but
actually the license concerns the whole metadata block and sequence,
so I think we can fit under its own license tag. For example


id: placeholder

: license:
:     license_type: http://creativecommons.org/licenses/by/4.0/
:     attribution_title: "Sample ID"
:     attribution_name: "John doe, Joe Boe, Jonny Oe"
:     attribution_url: http://covid19.genenetwork.org/id
:     attribution_source: https://www.ncbi.nlm.nih.gov/pubmed/323088888

So, let's update the example. Notice the license info is optional - if it is missing
we just assume the default CC-4.0.

One thing that is interesting is that in the name space https://creativecommons.org/ns there
is no mention of a title. I think it is useful, however, because we have no such field.
So, we'll add it simply as a title field. Now the draft schema is

#+BEGIN_SRC js
- name: licenseSchema
  type: record
  fields:
    license_type:
      doc: License types as refined in https://wiki.creativecommons.org/images/d/d6/Ccrel-1.0.pdf
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#License
    title:
      doc: Attribution title related to license
      type: string?
      jsonldPredicate:
          _id: http://semanticscience.org/resource/SIO_001167
    attribution_url:
      doc: Attribution URL related to license
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work
    attribution_source:
      doc: Attribution source URL
      type: string?
      jsonldPredicate:
          _id: https://creativecommons.org/ns#Work
#+END_SRC

Now, we are no ontology experts, right? So, next we submit a patch to
our source tree and ask for feedback before wiring it up in the data
entry form. The pull request was submitted [[https://github.com/arvados/bh20-seq-resource/pull/97][here]] and reviewed on the
gitter channel and I merged it.

** Adding fields to the form

To add the new fields to the form we have to modify it a little. If we
go to the upload form we need to add the license box. The schema is
loaded in [[https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229][main.py]] in the 'generate-form' function.

With this [[https://github.com/arvados/bh20-seq-resource/commit/b9691c7deae30bd6422fb7b0681572b7b6f78ae3][patch]] the website adds the license input fields on the form.

Finally, to make RDF output work we need to add expressions to bh20seq-shex.rdf. This
was done with this [[https://github.com/arvados/bh20-seq-resource/commit/f4ed46dae20abe5147871495ede2d6ac2b0854bc][patch]]. In the end we decided to use the Dublin core title,
http://purl.org/metadata/dublin_core_elements#Title:

#+BEGIN_SRC js
:licenseShape{
    cc:License xsd:string;
    dc:Title xsd:string ?;
    cc:attributionName xsd:string ?;
    cc:attributionURL xsd:string ?;
    cc:attributionSource xsd:string ?;
}
#+END_SRC

Note that cc:AttributionSource is not really defined in the cc standard.

When pushing the license info we discovered the workflow broke because
the existing data had no licensing info. So we changed the license
field to be optional - a missing license assumes it is CC-BY-4.0.

** TODO Testing the license fields

* Changing GEO or location field

When fetching information from GenBank and EBI/ENA we also translate
the location into an unambiguous identifier. We opted for the wikidata
tag. E.g. for New York city it is https://www.wikidata.org/wiki/Q60
and for New York state it is https://www.wikidata.org/wiki/Q1384. If
everyone uses these metadata URIs it is easy to group when making
queries. Note that we should be using
http://www.wikidata.org/entity/Q60 in the dataset (http instead of
https and entitity instead of wiki).

Unfortunately the main repositories of SARS-CoV-2 have variable
strings of text for location and/or GPS coordinates. For us to support
our schema we had to translate all options and this proves expensive.

** Relaxing the shex constraint

So we decide to relax the enforcement of this type of metadata and to
allow for a free form string.

The schema already used http://purl.obolibrary.org/obo/GAZ_00000448
which states:

#+BEGIN_SRC js
Class: geographic
  location
  Term IRI: http://purl.obolibrary.org/obo/GAZ_00000448
Definition: A reference to a place on
  the Earth, by its name or by its geographical location.
#+END_SRC

and when you check count by location in the [[./demo][DEMO]] it lists a free
format.

So, why does the validation step balk when importing GenBank?
The problem was in the [[https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39][shex check]] for RDF generation.
Removing the wikidata requirement relaxed the imports with this
[[https://github.com/arvados/bh20-seq-resource/commit/f776816ee2b1af7ccc84afb494f68a81a51f5a76][patch]].