Re: Hidden hierarchy example

On 30/05/14 13:31, Alf Eaton wrote:
> On 30 May 2014 12:33, Dan Brickley <danbri@google.com> wrote:
>> On 30 May 2014 12:24, Alf Eaton <eaton.alf@gmail.com> wrote:
>>> On 27 May 2014 11:07, Dan Brickley <danbri@google.com> wrote:
>>>
>>>> Here's an example of a CSV structure that hides a hierarchy within cell values.
>>>>
>>>> My expectation is that we won't specify a way to access such
>>>> complexity in our core work but it is worth bearing in mind when
>>>> thinking about extensions, hooks for other languages etc.
>>>>
>>>> This link has raw CSV and prettified HTML,
>>>> http://www.onetcenter.org/taxonomy/2010/list.html?d=1 ...
>>>>
>>>> Schema.org currently mentions this dataset as supplying possible
>>>> values to use in http://schema.org/JobPosting in the
>>>> http://schema.org/occupationalCategory property. It is very SKOS-like
>>>> data, consisting of a controlled code, with short text, long text, and
>>>> a hierarchy represented within the numeric structure of the codes. A
>>>> simple CSV mapping could expand these out into SKOS Concept like
>>>> structures; a fancy/custom mapping might figure out broader/narrower
>>>> relations that show e.g. 11-9041.01,Biofuels/Biodiesel Technology and
>>>> Product Development Managers as a specialization of
>>>> 11-9041.00,Architectural and Engineering Managers...
>>>>
>>>> I haven't figured out the exact rules to parse a hierarchy yet, but at
>>>> first look I'd guess it needs procedural code.
>>>
>>> I had a go at parsing the CSV into something that made the
>>> organisation structure browseable, and this mapping seemed to work
>>> quite well:
>>>
>>> 11-9041.01,Biofuels/Biodiesel Technology =>
>>> {
>>>    title: 'Biofuels/Biodiesel Technology',
>>>    subsubcategory: '11-9041.01',
>>>    subcategory: '11-9041',
>>>    category: '11',
>>> }
>>>
>>> This could be done declaratively: a regular expression
>>> (/^(\d+)-(\d+)\.(\d+)$/) specifies how to parse the hierarchical code
>>> into its constituent parts, then they just need to be combined one
>>> part at a time to get the ids for each level of the hierarchy. In the
>>> user interface, selecting category "11" shows only the items in that
>>> category (for want of a better term), and selecting subcategory
>>> "11-9041" shows only the items in that subcategory.
>>
>> Interesting - I was thinking of this mapping more directly into SKOS.
>> But perhaps exploding from regex into this fixed structure would be
>> enough to make the final step to SKOS feasible via SPARQL 1.1
>> CONSTRUCT? Ok, maybe that's getting arcane, but at least it's an
>> existing standard :)
>>
>>> In this particular case, there doesn't seem to be (as far as I can
>>> tell) an ontology providing labels or relationships between each level
>>> of the hierarchy, which would be useful.
>>
>> I believe this CSV is as close as we get to having such an ontology :)
>
> If the aim is to build a SKOS ontology from the CSV data, then I guess
> the end result would be something like this:
>
> <onetsoc:11-9041.01> <skos:prefLabel> "Biofuels/Biodiesel Technology"
> <onetsoc:11-9041.01> <skos:broader> <onetsoc:11-9041>
> <onetsoc:11-9041> <skos:broader> <onetsoc:11>
>
> What would the input data need to be, for SPARQL CONSTRUCT to be able
> to build that output?

No need - you can get that by using a template, if the template lets you 
apply the regex /^(\d+)-(\d+)\.(\d+)$/ to the code.
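
As a rough sketch of what such a template step amounts to (plain Python for illustration only, not dclib template syntax; the function name hierarchy_triples and the prefixed-name strings are assumptions made up for this example):

```python
import re

# Same pattern as quoted above: category, subcategory part, suffix.
CODE_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def hierarchy_triples(code, label):
    """Return the triples for one CSV row as (subject, predicate, object)."""
    m = CODE_RE.match(code)
    if m is None:
        raise ValueError("unexpected code format: %r" % code)
    category, sub, _suffix = m.groups()
    # Build the id for each level by combining the parts one at a time.
    subcategory = category + "-" + sub
    return [
        ("onetsoc:" + code, "skos:prefLabel", label),
        ("onetsoc:" + code, "skos:broader", "onetsoc:" + subcategory),
        ("onetsoc:" + subcategory, "skos:broader", "onetsoc:" + category),
    ]


for triple in hierarchy_triples("11-9041.01", "Biofuels/Biodiesel Technology"):
    print(triple)
```

Each row yields its own label plus the two broader links, with no SPARQL step needed.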

We (Epimorphics) [1] have found that you can get a lot done by applying 
multiple templates to the same data, CSV files often being denormalized 
tables.  For example, one template to extract the hierarchy and one to 
extract the instance data.

Partly, it's the pragmatics of not putting everything into one 
all-singing-all-dancing template.

The "trick" is that an RDF graph is a set of triples, so if you expand to get

<onetsoc:11-9041> <skos:broader> <onetsoc:11>

on multiple rows, then you still only get one triple in the converted data.
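
To illustrate the set behaviour, here is a minimal sketch, again in plain Python rather than any real template language. The two rows are the ones mentioned earlier in the thread; the variable names and the use of a Python set to stand in for the graph are assumptions for the example.

```python
import re

CODE_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")

# Two rows that share the same subcategory and category.
rows = [
    ("11-9041.00", "Architectural and Engineering Managers"),
    ("11-9041.01",
     "Biofuels/Biodiesel Technology and Product Development Managers"),
]

graph = set()  # stands in for the RDF graph: duplicate triples collapse
for code, _label in rows:
    category, sub, _suffix = CODE_RE.match(code).groups()
    subcategory = category + "-" + sub
    graph.add(("onetsoc:" + code, "skos:broader", "onetsoc:" + subcategory))
    # This triple is generated once per row, but the set keeps one copy.
    graph.add(("onetsoc:" + subcategory, "skos:broader", "onetsoc:" + category))

shared = ("onetsoc:11-9041", "skos:broader", "onetsoc:11")
print(len(graph), shared in graph)
```

Four triples are generated across the two rows, but the graph ends up holding three, because the shared 11-9041 to 11 link is stored only once.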

 Andy

[1] https://github.com/epimorphics/dclib

>
> I note that although other categorisation systems (e.g. MeSH) express
> their hierarchy in this way, others (e.g. Dewey Decimal) use a
> different system of identifiers that would be harder (impossible?) to
> split into a hierarchy with just a regular expression:
>
> <ddc:001.012> <skos:broader> <ddc:001.01>
> <ddc:001.01> <skos:broader> <ddc:001>
> <ddc:001> <skos:broader> <ddc:000>
>
> Alf
>

Received on Saturday, 31 May 2014 14:16:04 UTC