From f776816ee2b1af7ccc84afb494f68a81a51f5a76 Mon Sep 17 00:00:00 2001
From: Pjotr Prins
Date: Sat, 22 Aug 2020 13:40:46 +0100
Subject: Relaxing the location attribute
---
bh20sequploader/bh20seq-shex.rdf | 4 +--
doc/blog/using-covid-19-pubseq-part5.org | 54 +++++++++++++++++++++++++++-----
2 files changed, 49 insertions(+), 9 deletions(-)
diff --git a/bh20sequploader/bh20seq-shex.rdf b/bh20sequploader/bh20seq-shex.rdf
index bbc7309..4ed203b 100644
--- a/bh20sequploader/bh20seq-shex.rdf
+++ b/bh20sequploader/bh20seq-shex.rdf
@@ -36,7 +36,7 @@ PREFIX wikidata:
:sampleShape {
sio:SIO_000115 xsd:string;
evs:C25164 xsd:string;
- obo:GAZ_00000448 [wikidata:~] ;
+ obo:GAZ_00000448 xsd:string;
obo:OBI_0001895 xsd:string ?;
obo:NCIT_C41206 xsd:string ?;
obo:OBI_0001479 IRI {0,2};
@@ -76,4 +76,4 @@ PREFIX wikidata:
cc:attributionName xsd:string ?;
cc:attributionURL xsd:string ?;
cc:attributionSource xsd:string ?;
-}
\ No newline at end of file
+}
diff --git a/doc/blog/using-covid-19-pubseq-part5.org b/doc/blog/using-covid-19-pubseq-part5.org
index 99c8ebf..ec768ed 100644
--- a/doc/blog/using-covid-19-pubseq-part5.org
+++ b/doc/blog/using-covid-19-pubseq-part5.org
@@ -12,9 +12,11 @@
- [[#modify-metadata][Modify Metadata]]
- [[#what-is-the-schema][What is the schema?]]
- [[#how-is-the-website-generated][How is the website generated?]]
- - [[#modifying-the-schema][Modifying the schema]]
- - [[#adding-fields-to-the-form][Adding fields to the form]]
- - [[#testing-the-license-fields][Testing the license fields]]
+ - [[#changing-the-license-field][Changing the license field]]
+ - [[#modifying-the-schema][Modifying the schema]]
+ - [[#adding-fields-to-the-form][Adding fields to the form]]
+ - [[#testing-the-license-fields][Testing the license fields]]
+ - [[#changing-geo-or-location-field][Changing GEO or location field]]
* Modify Metadata
@@ -40,7 +42,9 @@ Using the schema we use [[https://pypi.org/project/PyShEx/][pyshex]] shex expres
generate the [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20simplewebuploader/templates/form.html#L47][input form]], [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/bh20sequploader/qc_metadata.py#L13][validate]] the user input and to build [[https://github.com/arvados/bh20-seq-resource/blob/edb17e7f7caebfa1e76b21006b1772a33f4f7887/workflows/pangenome-generate/merge-metadata.py#L24][RDF]]!
All from that one metadata schema.
-* Modifying the schema
+* Changing the license field
+
+** Modifying the schema
One of the first things we want to do is to add a field for the data
license. Initially we only supported CC-4.0 as a license, but
@@ -120,11 +124,11 @@ our source tree and ask for feedback before wiring it up in the data
entry form. The pull request was submitted [[https://github.com/arvados/bh20-seq-resource/pull/97][here]] and reviewed on the
gitter channel and I merged it.
-* Adding fields to the form
+** Adding fields to the form
To add the new fields to the form we have to modify it a little. If we
go to the upload form we need to add the license box. The schema is
-loaded in [[https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229][main.py]] in the 'generate_form' function.
+loaded in [[https://github.com/arvados/bh20-seq-resource/blob/a0c8ebd57b875f265e8b0efec4abfaf892eb6c45/bh20simplewebuploader/main.py#L229][main.py]] in the 'generate-form' function.
With this [[https://github.com/arvados/bh20-seq-resource/commit/b9691c7deae30bd6422fb7b0681572b7b6f78ae3][patch]] the website adds the license input fields on the form.
@@ -148,4 +152,40 @@ When pushing the license info we discovered the workflow broke because
the existing data had no licensing info. So we changed the license
field to be optional - a missing license assumes it is CC-BY-4.0.
-* TODO Testing the license fields
+** TODO Testing the license fields
+
+* Changing GEO or location field
+
+When fetching information from GenBank and EBI/ENA we also translate
+the location into an unambiguous identifier. We opted for the wikidata
+tag. E.g. for New York city it is https://www.wikidata.org/wiki/Q60
+and for New York state it is https://www.wikidata.org/wiki/Q1384. If
+everyone uses these metadata URIs it is easy to group when making
+queries. Note that we should be using
+http://www.wikidata.org/entity/Q60 in the dataset (http instead of
+https and entity instead of wiki).
+
+Unfortunately the main repositories of SARS-CoV-2 have variable
+strings of text for location and/or GPS coordinates. For us to support
+our schema we had to translate all options and this proved expensive.
+
+So we decided to relax the enforcement of this type of metadata and to
+allow for a free form string.
+
+The schema already used http://purl.obolibrary.org/obo/GAZ_00000448
+which states:
+
+#+BEGIN_SRC js
+Class: geographic
+ location
+ Term IRI: http://purl.obolibrary.org/obo/GAZ_00000448
+Definition: A reference to a place on
+ the Earth, by its name or by its geographical location.
+#+END_SRC
+
+and when you check the count by location in the [[./demo][DEMO]], locations are
+already listed as free-form text.
+
+So, why does the validation step balk when importing GenBank?
+The problem was in the [[https://github.com/arvados/bh20-seq-resource/blob/46d4b7a3a31f6605f81d43ecd6651d60a5782364/bh20sequploader/bh20seq-shex.rdf#L39][shex check]] for RDF generation.
+Removing the wikidata requirement relaxed the imports.
--
cgit v1.2.3