RE: Attempted example CSV metadata document and template from Tandy, Jeremy on 2014-06-23 (public-csv-wg@w3.org from June 2014)

From: Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
Date: Mon, 23 Jun 2014 16:03:51 +0000
To: Ivan Herman <ivan@w3.org>
CC: Dan Brickley <danbri@google.com>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-ID: <2624871D9A05174691BD59F8EFD68AE20884B55F@EXXCMPD1DAG3.cmpd1.metoffice.gov.uk>
> -----Original Message-----
> From: Ivan Herman [mailto:ivan@w3.org]
> Sent: 21 June 2014 08:38
> To: Tandy, Jeremy
> Cc: Dan Brickley; W3C CSV on the Web Working Group
> Subject: Re: Attempted example CSV metadata document and template
> 
> Jeremy,
> 
> one thing that I was wondering about was that the simple naming
> mechanism for the various microsyntaxes may not work out. Consider
> 
> 	"columns" : [
> 		{ "name" : "datetime",
> 		  ...
>                   "microsyntax": [
> 			{ "name" : N1,
> 			  "regexp" : "...."
> 			},
> 			.....
>                   ]
> 		},
> 		{ "name" : "anothercolumn",
> 		  ...
> 		  "microsyntax": [
> 			{ "name" : N1,
> 			  "regexp" : "...."
> 			},
> 			.....
> 		  ]
> 		}
> 
> 	]
> 
> 
> When working through the cells in a row, what would 'N1' refer to?
> Unless we want to require the unicity of the microsyntax names, we may
> hit an issue. And I do not think requiring a unique name is a good
> idea; if the metadata becomes big, this may become a nuisance.

Agreed. I made the assumption that all instances of "name" within a given metadata document would need to be unique. I had not considered any mechanisms to make this easy for users; e.g. using the "name" from an enclosing object to automatically _namespace_ sub-names.

We could leave it to the user to ensure uniqueness (easy for us; adds load to the end user which is less good); in which case the example above would fail to validate.

Alternatively, we could apply a form of name-spacing; e.g. "datetime/N1" and "anothercolumn/N1" within your example above.
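
As a rough sketch of the name-spacing option (Python; the function name `namespaced_microsyntax` and the "/" separator are assumptions for illustration, not anything agreed by the group), a processor could build its lookup table by prefixing each microsyntax name with the name of its enclosing column. The regexp values are the same placeholders used in the example above.

```python
def namespaced_microsyntax(columns, sep="/"):
    """Map '<column name><sep><microsyntax name>' to its regexp string."""
    lookup = {}
    for column in columns:
        for entry in column.get("microsyntax", []):
            key = f"{column['name']}{sep}{entry['name']}"
            if key in lookup:
                # Names only need to be unique within one column.
                raise ValueError(f"duplicate microsyntax name: {key}")
            lookup[key] = entry["regexp"]
    return lookup


# Two columns that both define a microsyntax element called "N1".
columns = [
    {"name": "datetime", "microsyntax": [{"name": "N1", "regexp": "..."}]},
    {"name": "anothercolumn", "microsyntax": [{"name": "N1", "regexp": "...."}]},
]
lookup = namespaced_microsyntax(columns)
```

With this approach the two "N1" entries no longer collide, at the cost of a separator character that would itself need escaping rules.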

> 
> What this means is that the syntax becomes more complicated. Something
> like {datetime:N1} or something similar (which raises the issue of
> escape characters, too:-(

Agreed! I chose a different separator character to you, but the same issue applies.

> 
> As for the conditionals: mustache has some syntax for this which is a
> bit different
> 
> {{#bla}}
>    .. any template here
> {{/bla}}
> 
> although the mustache semantics is a bit different (afaik it relies on
> the existence or not of a key in an object). We could use the mustache
> semantics but we probably need something more, too, like "if 'bla' is a
> microsyntax name and is true if the value of the cell matches the
> regexp then it is true".

Syntax-wise, we want our metadata document to be valid JSON, so we would need something different to mustache. However, I agree that our use cases call for similar semantics. Perhaps the syntax might be something like:

"condition": {
    "operator": "if ({bla})",
    "template": {
        "name": "2010_Occupations-csv-to-ttl",
        "description": "Template converting CSV content to SKOS/RDF (expressed in Turtle syntax).",
        "type": "template",
        "path": "2010_Occupations-csv-to-ttl.ttl",
        "hasFormat": "text/turtle"
    }
}

In this case, I'm trying to say that the template will be triggered if the value of {bla} is true / not null etc. ... the value of {bla} is taken by evaluating the column (or microsyntax element) with "name" = "bla" for the row being processed. Like you say: """it relies on the existence or not of a key in an object"""

(I don't really like the syntax; I guess that others can come up with better.)

> 
> But I agree that the conditional complicates the templates a lot. Here
> is where our use cases may have to switch in: do our use cases justify
> the need for conditionals (remembering that, though we are discussing
> turtle here, I do not see any difference between generating turtle and
> generating XML or JSON through the same mechanism).

The requirement is ["R-ConditionalProcessingBasedOnCellValues"][1], motivated by the [ExpressingHierarchyWithinOccupationalListings][2] use case. This use case gives us two requirements:

i) triggering a template if a value of a cell is not null; e.g. to generate the SKOS concept scheme from the SOC structure ...

15-0000,,,,Computer and Mathematical Occupations,,,,,
,15-1100,,,Computer Occupations,,,,,
,,15-1110,,Computer and Information Research Scientists,,,,,
,,,15-1111,Computer and Information Research Scientists,,,,,

Here we can see that I only want an ex:SOC-MajorGroup entity created on the first row shown above (where col 1 is populated).
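
To make case (i) concrete, here is a minimal Python sketch that selects only the rows where a given column is populated. The helper name `rows_triggering` is hypothetical, and the idea that a template is "triggered" simply by selecting the row is an assumption about how a processor might work; the data is the four rows shown above.

```python
import csv
import io

# The four SOC structure rows quoted above.
SOC_ROWS = """15-0000,,,,Computer and Mathematical Occupations,,,,,
,15-1100,,,Computer Occupations,,,,,
,,15-1110,,Computer and Information Research Scientists,,,,,
,,,15-1111,Computer and Information Research Scientists,,,,,
"""


def rows_triggering(text, column_index):
    """Return the rows whose cell at column_index is not empty."""
    reader = csv.reader(io.StringIO(text))
    return [row for row in reader if row[column_index].strip()]


# Only rows with col 1 populated would yield an ex:SOC-MajorGroup entity.
major_groups = rows_triggering(SOC_ROWS, 0)
```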

ii) triggering a template if a value of a cell equates to a particular string (or the opposite); e.g. when the value of "onetsoc-occupation" = "00", as in the example [earlier in this email thread][3]. ...

"operator": "if ({onetsoc-occupation} == '00')"

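A rough Python sketch of how a processor might evaluate such an "operator" string follows. The grammar it accepts (a bare `if ({name})` truthiness test, or `==` / `!=` against a single-quoted string) is only my guess at the minimum needed for cases (i) and (ii); the names `OPERATOR` and `should_trigger` are made up for illustration.

```python
import re

# Assumed grammar: "if ({name})" or "if ({name} == 'value')" / "!=".
OPERATOR = re.compile(
    r"^if \(\{(?P<name>[^{}]+)\}"
    r"(?:\s*(?P<op>==|!=)\s*'(?P<value>[^']*)')?\)$"
)


def should_trigger(operator, values):
    """Decide whether a template fires for one row.

    values maps column or microsyntax element names to the strings
    extracted from the row being processed.
    """
    match = OPERATOR.match(operator)
    if match is None:
        raise ValueError(f"unsupported operator: {operator!r}")
    actual = values.get(match["name"])
    if match["op"] is None:
        # Case (i): fire when the value is present and non-empty.
        return bool(actual)
    if match["op"] == "==":
        return actual == match["value"]
    return actual != match["value"]
```

For example, `should_trigger("if ({onetsoc-occupation} == '00')", {"onetsoc-occupation": "00"})` evaluates to `True`, while the same call with `"03"` evaluates to `False`.
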
Perhaps there are cases for more complex operations? I don't know. Perhaps this is where call-back functions or promises could be used to parse a row and provide a Boolean response as to whether the template should be triggered? Again, I don't know ... and some considerable thought would be required to work out the details of such.
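
To illustrate the call-back idea (again only a sketch in Python, with hypothetical names `select_rows` and `is_onet_extension`, and assuming each row has already been parsed into a dictionary of column and microsyntax values), the processor would hand each row to a user-supplied function and trigger the template only when it returns true. With no call-back supplied, every row triggers the template.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def select_rows(rows: Iterable[Row], predicate: Optional[Predicate] = None) -> List[Row]:
    """Return the rows for which the template should be triggered."""
    if predicate is None:
        # Default behaviour: always trigger the template.
        return list(rows)
    return [row for row in rows if predicate(row)]


def is_onet_extension(row: Row) -> bool:
    """User-supplied call-back: true for codes that extend the SOC taxonomy."""
    return row["onetsoc-occupation"] != "00"


rows = [
    {"onet-soc-2010-code": "15-1199.00", "onetsoc-occupation": "00"},
    {"onet-soc-2010-code": "15-1199.03", "onetsoc-occupation": "03"},
]
selected = select_rows(rows, is_onet_extension)
```

This keeps comparison operators out of the metadata vocabulary entirely, but it also means the condition is no longer declarative and cannot be validated from the metadata document alone.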

Jeremy

 

[1]: http://w3c.github.io/csvw/use-cases-and-requirements/index.html#R-ConditionalProcessingBasedOnCellValues 
[2]: http://w3c.github.io/csvw/use-cases-and-requirements/index.html#UC-ExpressingHierarchyWithinOccupationalListings 
[3]: http://lists.w3.org/Archives/Public/public-csv-wg/2014Jun/0127.html 

> 
> My 2 cents...
> 
> Ivan
> 
> 
> 
> 
> On 19 Jun 2014, at 14:36 , Tandy, Jeremy
> <jeremy.tandy@metoffice.gov.uk> wrote:
> 
> >> -----Original Message-----
> >> From: Dan Brickley [mailto:danbri@google.com]
> >> Sent: 18 June 2014 12:46
> >> To: Tandy, Jeremy
> >> Cc: CSV on the Web Working Group
> >> Subject: Re: Attempted example CSV metadata document and template
> >>
> >> On 12 June 2014 12:57, Tandy, Jeremy <jeremy.tandy@metoffice.gov.uk>
> >> wrote:
> >>> All -
> >>>
> >>> I've just uploaded to [GitHub][1] a rework of the "Simple Weather
> >> Observation" example. I've tried to create a CSV metadata document
> >> following the rules in the [Metadata Vocabulary for Tabular Data][2]
> >> and [Generating RDF from Tabular Data on the Web][3] documents.
> >>>
> >>> I would be particularly interested in:
> >>>
> >>> - corrections to errors!
> >>> - comments on additional proposed properties in the metadata
> >>> document ("short-name", "template", "microsyntax")
> >>> - use of "hasFormat" to specify the Content-Type associated with a
> >>> Template
> >>> - use of a REGEXP within a URI Template to convert ISO 8601 syntax
> >>> to a simplified form
> >>
> >> I don't completely understand this mechanism yet, but do you think
> it
> >> could be stretched to address the SKOS/codes issue in
> >> http://w3c.github.io/csvw/use-cases-and-requirements/#UC-
> >> ExpressingHierarchyWithinOccupationalListings
> >> where we'd want to explode strings like "15-1199.00", "15-1199.01"
> >> and emit triples like 'broader' when certain patterns matched?
> >>
> >> Dan
> >>
> >
> > OK ... let's have a go.
> >
> > Here's the header and a line of data:
> >
> > ---
> > O*NET-SOC 2010 Code,O*NET-SOC 2010 Title,O*NET-SOC 2010 Description
> > 15-1199.03,Web Administrators,"Manage web environment design,
> deployment, development and maintenance activities. [...]"
> > ---
> >
> > Here's a guess at the CSV metadata description in which I am using
> the ["multiple regexp each extracting a single value" pattern][1]:
> >
> > ---
> > {
> >   "name": "2010_Occupations",
> >   "title": "O*NET-SOC Occupational listing for 2010",
> >   "publisher": [{
> >       "name": "O*Net Resource Center",
> >       "web": "http://www.onetcenter.org/"
> >   }],
> >   "resources": [{
> >       "name": "2010_Occupations-csv",
> >       "path": "2010_Occupations.csv",
> >       "schema": {"columns": [
> >           {
> >               "name": "onet-soc-2010-code",
> >               "title": "O*NET-SOC 2010 Code",
> >               "description": "O*NET Standard Occupational
> Classification Code (2010).",
> >               "type": "string",
> >               "required": true,
> >               "unique": true,
> >               "microsyntax": [{
> >                       "name": "soc-major-group",
> >                       "regexp": "/^(\d{2})-\d{4}.\d{2}$/"
> >                   },{
> >                       "name": "soc-minor-group",
> >                       "regexp": "/^\d{2}-(\d{2})\d{2}.\d{2}$/"
> >                   },{
> >                       "name": "soc-broad-group",
> >                       "regexp": "/^\d{2}-\d{2}(\d)\d.\d{2}$/"
> >                   },{
> >                       "name": "soc-detailed-occupation",
> >                       "regexp": "/^\d{2}-\d{3}(\d).\d{2}$/"
> >                   },{
> >                       "name": "onetsoc-occupation",
> >                       "regexp": "/^\d{2}-\d{4}.(\d{2})$/"
> >                   }
> >
> >               ]
> >           },
> >           {
> >               "name": "title",
> >               "title": "O*NET-SOC 2010 Title",
> >               "description": "Title of occupational classification.",
> >               "type": "string",
> >               "required": true
> >           },
> >           {
> >               "name": "description",
> >               "title": "O*NET-SOC 2010 Description",
> >               "description": "Description of occupational
> classification.",
> >               "type": "string",
> >               "required": true
> >           }
> >       ]},
> >       "template": {
> >           "name": "2010_Occupations-csv-to-ttl",
> >           "description": "Template converting CSV content to SKOS/RDF
> (expressed in Turtle syntax).",
> >           "type": "template",
> >           "path": "2010_Occupations-csv-to-ttl.ttl",
> >           "hasFormat": "text/turtle"
> >       }
> >   }]
> > }
> > ---
> >
> > You can see that I've used the `microsyntax` object to capture the 5
> independent elements of the O*NET-SOC code, each with its own regexp:
> "soc-major-group", "soc-minor-group", "soc-broad-group",
> "soc-detailed-occupation" and "onetsoc-occupation". Whether this is the
> _best_ way to do it, I don't know ... it's just an idea to get us talking
> about possibilities and options!
> >
> > The template (prefixes etc. intentionally left out) might then be:
> >
> > ---
> > ex:{onet-soc-2010-code} a ex:ONETSOC-Occupation ;
> >    skos:notation "{onet-soc-2010-code}" ;
> >    skos:prefLabel "{title}" ;
> >    dct:description "{description}" ;
> >    skos:broader ex:{soc-major-group}-0000,
> >                 ex:{soc-major-group}-{soc-minor-group}00,
> >                 ex:{soc-major-group}-{soc-minor-group}{soc-broad-
> group}0,
> >                 ex:{soc-major-group}-{soc-minor-group}{soc-broad-
> group}{soc-detailed-occupation} .
> > ---
> >
> > However, this does not help when we look at the required _conditional
> > behaviour_: when the value of "onetsoc-occupation" = "00" this is
> > identical to the term from the SOC taxonomy, and the template should
> > be more like
> >
> > ---
> > ex:{soc-major-group}-{soc-minor-group}{soc-broad-group}{soc-detailed-
> occupation} a ex:SOC-DetailedOccupation ;
> >    skos:notation "{soc-major-group}-{soc-minor-group}{soc-broad-
> group}{soc-detailed-occupation}" ;
> >    skos:prefLabel "{title}" ;
> >    dct:description "{description}" ;
> >    skos:broader ex:{soc-major-group}-0000,
> >                 ex:{soc-major-group}-{soc-minor-group}00,
> >                 ex:{soc-major-group}-{soc-minor-group}{soc-broad-
> group}0 .
> > ---
> >
> > It occurs to me that we may wish to trigger different templates based
> on a conditional response - or even decide whether we wish to trigger a
> template at all for a given line!
> >
> > Thinking out of the box (is that a euphemism for "making it up as I
> go along"?), it would seem that each "template" block in the CSV
> metadata might have a "condition" statement that tells it when to fire
> - using values of column names or microsyntax element names? e.g.
> >
> > ---
> >       "template": {
> >           "name": "2010_Occupations-csv-to-ttl",
> >           "description": "Template converting CSV content to SKOS/RDF
> (expressed in Turtle syntax).",
> >           "type": "template",
> >           "path": "2010_Occupations-csv-to-ttl.ttl",
> >           "hasFormat": "text/turtle",
> >           "condition": "if {onetsoc-occupation} != '00'"
> >       }
> > ---
> >
> > Default behaviour (if no "condition" statement included) would be
> _always_ to trigger the template for each row.
> >
> > However, looking at this, I am immediately concerned that including
> if-then-else blocks and comparison operators hugely increases the
> complexity of our work. Perhaps this is a good point to "bug out" to
> some external agent (e.g. call-back function or promise).
> >
> > Jeremy
> >
> > [1]:
> > https://github.com/w3c/csvw/blob/gh-pages/examples/csv-metadata-and-
> te
> > mplate-for-simple-weather-obs-example.md#multiple-regexp-each-
> extracti
> > ng-single-value
> >
> >>
> >>> - thoughts about a way to describe that microsyntax format within
> >>> the
> >> metadata document (see [CellMicrosyntax requirement][4]), e.g. to
> >> define the sub-elements within the microsyntax that may be extracted
> >> for use later - see [Parsing cell microsyntax][5].
> >>>
> >>> Comments welcome.
> >>>
> >>> Jeremy
> >>>
> >>>
> >>> [1]:
> >>> https://github.com/w3c/csvw/blob/gh-pages/examples/csv-metadata-
> and-
> >> te
> >>> mplate-for-simple-weather-obs-example.md
> >>> [2]: http://w3c.github.io/csvw/metadata/index.html
> >>> [3]: http://w3c.github.io/csvw/csv2rdf/
> >>> [4]:
> >>> http://w3c.github.io/csvw/use-cases-and-requirements/#R-
> >> CellMicrosynta
> >>> x
> >>> [5]:
> >>> https://github.com/w3c/csvw/blob/gh-pages/examples/csv-metadata-
> and-
> >> te
> >>> mplate-for-simple-weather-obs-example.md#parsing-cell-microsyntax
> 
> 
> ----
> Ivan Herman, W3C
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> GPG: 0x343F1A3D
> WebID: http://www.ivan-herman.net/foaf#me
> 
> 
> 
>
Received on Monday, 23 June 2014 16:04:24 UTC