RE: SKOS Quality Checkers from Vladimir Alexiev on 2014-01-20 (public-esw-thes@w3.org from January 2014)

From: Vladimir Alexiev <vladimir.alexiev@ontotext.com>
Date: Mon, 20 Jan 2014 18:52:50 +0200
To: "'Christian Mader'" <c.mader@semantic-web.at>, "'Osma Suominen'" <osma.suominen@helsinki.fi>
Cc: <public-esw-thes@w3.org>, "'Gregg Garcia'" <GGarcia@getty.edu>, "'Joan Cobb'" <JCobb@getty.edu>
Message-ID: <00a401cf1600$0dbb3f20$2931bd60$@alexiev@ontotext.com>
> Maybe the fastest way to learn about them is this joint paper?
> Osma Suominen and Christian Mader: Assessing and Improving the Quality of SKOS Vocabularies. Journal on Data Semantics, 2013.
> http://www.seco.tkk.fi/publications/2013/suominen-mader-skosquality.pdf

The paper is very nice indeed! 
I've read it in detail; here are some remarks on some of the validation criteria from AAT's standpoint.

** 4.2.1 Omitted or Invalid Language Tags

Ok, but make sure you're not too restrictive with parsing the tags. E.g. we use
qqq-002 "private language, region Africa" to denote what Getty calls "African language"

We also use private subtags in various positions, e.g. 
la vs 
la-x-liturgic vs 
la-x-medieval

and 
zh-Latn-pinyin vs 
zh-Latn-pinyin-x-hanyu vs 
zh-Latn-pinyin-x-notone
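
A minimal Python sketch of what "not too restrictive" could mean: check only that a tag is well-formed (a letter-only primary subtag, then hyphen-separated alphanumeric subtags of 1 to 8 characters) and do not look subtags up in a registry. The name is_acceptable_langtag and the exact pattern are assumptions for illustration, not how any existing checker works.

```python
import re

# Lenient well-formedness pattern (an assumption for this sketch):
# a primary subtag of 2-8 letters, then any number of "-subtag" parts,
# each 1-8 letters or digits. No registry lookup is done, so private-use
# codes (qqq) and private-use subtags (x-liturgic) are accepted.
_LENIENT_TAG = re.compile(r"[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*")


def is_acceptable_langtag(tag: str) -> bool:
    """Return True if the tag is syntactically plausible."""
    return _LENIENT_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ["qqq-002", "la-x-medieval", "zh-Latn-pinyin-x-notone", "en_US", ""]:
        print(repr(tag), is_acceptable_langtag(tag))
```

With this pattern all the tags listed above pass, while an empty tag or one using an underscore (en_US) is still reported.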

** 4.2.2 Incomplete Language Coverage
This may be relevant to Eurovoc (a relatively small vocab that's intended to have uniform/full coverage in numerous languages).

But it's not relevant to AAT, which has:
3 core languages (English, Spanish, Dutch)
1 core language in-progress (Chinese) 
over 100 languages (from Afrikaans to Zulu) that provide a few vernacular/loan terms and were never intended to have complete coverage.

The same can be observed for Rameau, and I'd guess any large Library or Cultural Heritage vocab.

So it would be nice to add a "core languages" option and check coverage only against them.
And compare only the first part of the langtag (e.g. with SPARQL's langMatches()), because in AAT Chinese is covered with different transcriptions:
zh-Hant
zh-Latn-wadegile
zh-Latn-pinyin-x-hanyu
zh-Latn-pinyin-x-notone
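
A rough Python sketch of such a "core languages" check. It compares only the primary subtag, which approximates what langMatches(lang(?l), "zh") would do in SPARQL. The function names, the input shape (concept -> list of label language tags) and the ex: identifiers are assumptions made up for the example.

```python
def primary_subtag(tag):
    """'zh-Latn-pinyin-x-hanyu' -> 'zh' (first subtag, lowercased)."""
    return tag.split("-", 1)[0].lower()


def missing_core_languages(label_tags_by_concept, core_languages):
    """Report, per concept, the core languages that have no label.

    label_tags_by_concept: dict mapping a concept id to the language tags
    of its labels. Languages outside core_languages are ignored.
    """
    core = {lang.lower() for lang in core_languages}
    report = {}
    for concept, tags in label_tags_by_concept.items():
        covered = {primary_subtag(t) for t in tags}
        missing = core - covered
        if missing:
            report[concept] = sorted(missing)
    return report


if __name__ == "__main__":
    # Hypothetical data: ex:c1 has Chinese only in transcription.
    labels = {
        "ex:c1": ["en", "es", "nl", "zh-Latn-pinyin-x-notone", "af"],
        "ex:c2": ["en", "nl", "zu"],
    }
    print(missing_core_languages(labels, ["en", "es", "nl", "zh"]))
```

Here ex:c1 counts as covered for Chinese although it only has a pinyin label, and the stray af/zu labels do not trigger any report.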

** 4.2.4 Overlapping Labels

Two problems with this criterion as formulated:

a. AAT systematically includes the plural noun as prefLabel, and singular noun as altLabel.
E.g. the @en labels of http://getty.ontotext.com/resource/aat/300198841 include:
  prefLabel=rhyta, altLabel=rhyton, altLabel=rhytons
Your default similarity matching (I guess Levenshtein with distance 1) would flag those.

b. It is quite legitimate to have the prefLabel of one concept and altLabel of another be the same.
The query  select ?l ?x ?y {?x skos:prefLabel ?l. ?y skos:altLabel ?l}
at http://getty.ontotext.com/sparql 
finds 866 such pairs.

E.g. 300055155 prefLabel=awe (positive emotions, emotion, ... Associated Concepts Facet)
vs 300387898 altLabel=awe (the Aweti language)

Please note that AAT often includes a (qualifier) in parens to ensure that prefLabels are unique, e.g.:
300111178 English (culture or style) vs
300388277 English (language)
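
Both points can be illustrated with a short Python sketch. The first function is plain Levenshtein edit distance (only my guess at the similarity measure, as said above); the second flags a clash only when two different concepts share the same prefLabel in the same language, and ignores prefLabel/altLabel overlaps. The function names and the (concept, label, lang) input shape are assumptions for the example, and the "en" tag on the sample labels is assumed too.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def pref_label_clashes(pref_labels):
    """pref_labels: iterable of (concept, label, lang) triples.

    Returns {(label, lang): [concepts]} for prefLabels shared by more
    than one concept. altLabels are deliberately not looked at.
    """
    by_label = defaultdict(set)
    for concept, label, lang in pref_labels:
        by_label[(label, lang)].add(concept)
    return {k: sorted(v) for k, v in by_label.items() if len(v) > 1}


if __name__ == "__main__":
    print(levenshtein("rhyton", "rhytons"))  # singular altLabel vs its variant
    print(pref_label_clashes([
        ("aat:300111178", "English (culture or style)", "en"),
        ("aat:300388277", "English (language)", "en"),
    ]))
```

rhyton/rhytons are one edit apart and rhyta/rhyton two, so a distance threshold of 1 or 2 on labels of the same concept would flag regular singular/plural pairs. The qualified prefLabels produce no clash.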

** 4.3.1 Orphan Concepts

AAT is 8-9 levels deep.
Yet, there is a surprisingly large number of topConcepts: 4291 out of 37058, or about 11.6%;
and many of them may not have skos:Concept children. 
 
E.g. 300054031 "drawing (metalworking)" is a top concept, although it's nested 8 levels deep:
<metal forming processes and techniques>, <metalworking processes and techniques>, <metalworking and metalworking processes and techniques>, <processes and techniques by material>, <processes and techniques by specific type>, <processes and techniques>, Processes and Techniques, Activities Facet
But all these levels are NOT skos:Concepts.

E.g. 300388277 English (language) http://getty.ontotext.com/resource/aat/300388277 is nested 5 levels deep:
<languages and writing systems by specific type>, <languages and writing systems>, language-related concepts, Associated Concepts, Associated Concepts Facet
but it doesn't have any children, nor skos:Concept parents.

Furthermore, AAT has 70 associative relations. 
But none of them are mapped to skos:related yet, because some connect non-Concepts while skos:related can connect only concepts.

So this criterion should also take into account skos:Collection parents (i.e. the inverse of skos:member, written ^skos:member in a SPARQL property path).
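
A small Python sketch of the adjusted check, working on plain (subject, predicate, object) tuples with prefixed names as strings. The function name, the count_collections switch and the ex:languagesByType collection are made up for illustration; a real checker would of course work on the RDF graph.

```python
# Relations that already make a concept "non-orphan" in the usual sense.
SEMANTIC_RELATIONS = {"skos:broader", "skos:narrower", "skos:related"}


def orphan_concepts(concepts, triples, count_collections=True):
    """Return concepts that have no semantic relation and, when
    count_collections is True, are not a skos:member of any collection."""
    linked = set()
    for s, p, o in triples:
        if p in SEMANTIC_RELATIONS:
            linked.update((s, o))
        elif count_collections and p == "skos:member":
            linked.add(o)  # s is the collection, o its member
    return sorted(set(concepts) - linked)


if __name__ == "__main__":
    concepts = ["aat:300388277", "ex:reallyAlone"]
    triples = [
        # Hypothetical collection standing in for
        # <languages and writing systems by specific type>
        ("ex:languagesByType", "skos:member", "aat:300388277"),
    ]
    print(orphan_concepts(concepts, triples, count_collections=False))
    print(orphan_concepts(concepts, triples))
```

Without the collection link both concepts are reported; with it, only ex:reallyAlone remains.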

** 4.3.2 Disconnected Concept Clusters

Similarly for this criterion, you should consider the deeper skos:Collection structure.
Getty even has Concepts above some Collections (called Guide Terms).
In such cases the links are:
skos:Collection -> skos:member -> skos:Concept -> iso:subordinateArray -> skos:Collection,iso:ThesaurusArray
(You'll find illustrations in other posts in this mailing list)

So: you should consider iso:subordinateArray and skos:member in addition to skos:narrower when making up the structure.
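
To illustrate, here is a Python sketch that computes connected clusters over plain (subject, predicate, object) tuples, with a configurable set of predicates treated as structural links. All ex: identifiers and the function name are hypothetical; the iso: prefix is used as in the link chain above.

```python
from collections import defaultdict

# Predicates treated as (undirected) structural links in this sketch.
STRUCTURAL = frozenset({"skos:narrower", "skos:broader",
                        "skos:member", "iso:subordinateArray"})


def connected_components(triples, predicates=STRUCTURAL):
    """Group all nodes linked through the given predicates into clusters."""
    adj = defaultdict(set)
    for s, p, o in triples:
        if p in predicates:
            adj[s].add(o)
            adj[o].add(s)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node] - seen:
                seen.add(neighbour)
                stack.append(neighbour)
        components.append(component)
    return components


if __name__ == "__main__":
    triples = [
        ("ex:facet", "skos:member", "ex:guideTerm"),          # Collection -> Concept
        ("ex:guideTerm", "iso:subordinateArray", "ex:array"),  # Concept -> Collection
        ("ex:array", "skos:member", "ex:c1"),
        ("ex:array", "skos:member", "ex:c3"),
        ("ex:c1", "skos:narrower", "ex:c2"),
        ("ex:c3", "skos:narrower", "ex:c4"),
    ]
    print(len(connected_components(triples, {"skos:narrower"})))
    print(len(connected_components(triples)))
```

With skos:narrower alone the sample data falls apart into two clusters (c1/c2 and c3/c4); once skos:member and iso:subordinateArray are followed as well, everything forms a single cluster.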

Best regards!
--
Vladimir Alexiev, PhD, PMP
Lead, Data and Ontology Management Group
Ontotext Corp, www.ontotext.com
Sirma Group Holding, www.sirma.com
Email: vladimir.alexiev@ontotext.com, skype:valexiev1  
Mobile: +359 888 568 132, SMS: 359888568132@sms.mtel.net
Landline: +359 (988) 106 084, Fax: +359 (2) 975 3226
Calendar: https://www.google.com/calendar/embed?src=vladimir%40sirma.bg
Received on Monday, 20 January 2014 16:53:15 UTC