Re: R-CellValueMicroSyntax

> On 30 Apr 2014, at 19:57 , Jeni Tennison <jeni@jenitennison.com> wrote:
> 
>> See http://w3c.github.io/csvw/use-cases-and-requirements/#R-CellValueMicroSyntax
>> 
>> I’d like to have a quick discussion about this requirement because I think it’s covering a wide range of things which we might take different positions on when considering whether they’re in scope.
>> 
>> The use cases show four types of microsyntax:
>> 
>>  1. various date/time syntaxes (not just ISO-8601 ones)
>>  2. comma-separated lists of editors within fields in UC-JournalArticleSearch
>>  3. embedded structured data (eg XML (VML) in UC-PaloAltoTreeData)
>>  4. semi-structured text in UC-PaloAltoTreeData
>> 
>> And I can see four things you might want to do with them:
>> 
>>  A. document the microsyntax so that humans can understand what it’s conveying
>>  B. validate the values to make sure they conform to the microsyntax you expect
>>  C. label the value as being in a particular microsyntax when converting into JSON/XML/RDF (eg marking an XML value as an XMLLiteral)
>>  D. process the microsyntax into an appropriate data structure when converting into JSON/XML/RDF (eg mapping the XML value into an appropriate JSON object)
>> 
>> I want to suggest that:
>> 
>> * We should mark as Deferred the intersection of 3 & D — we shouldn’t expect CSV processors to be able to take values that are XML and convert them into RDF or into JSON.
>> 
>> * We should mark as Deferred the intersection of 4 & D — similarly, we shouldn’t expect CSV processors to be able to take arbitrary semi-structured text and convert it into XML/JSON/RDF.
> 
> I agree with both.
> 
> But what about 2 and D? We could say (in JSON or RDF) that this means putting the values into a list, but I am worried about the situation where the result should be a list of a particular datatype, for example. It might be complicated (but we should try).

I would strongly support this - it is a classic case we deal with at GBIF, where the source data contains e.g. multiple comma-separated image URLs that are extracted to lists. For the simple case of tokenized lists, would the resultant datatype of each token not be expected to be the same as if no microsyntax was present?

E.g. annotating a CSV such that:

column 7 is “dc:references” with datatype of URL and delimited by “ | ”:
  http://www.flickr.com/photos/dhobern/8707604066/ | http://www.flickr.com/photos/dhobern/8707604076/

Could be handled as a list of 2 URLs by any CSV parser supporting microsyntax.

In the CSVs we’ve observed this is a classic case that can be handled relatively simply with consistency, while the other examples provided (e.g. embedded XML) are much more tricky to handle consistently.
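
To make the tokenized-list case concrete, here is a minimal sketch in Python of how a parser might apply such an annotation. The function names (split_cell, is_url, parse_url_list) and the URL check are purely illustrative assumptions, not anything the group has specified; the point is only that each token gets the column's datatype, exactly as a single-valued cell would:

```python
from urllib.parse import urlparse


def split_cell(value, delimiter="|"):
    """Split a delimited cell into trimmed, non-empty tokens."""
    return [token.strip() for token in value.split(delimiter) if token.strip()]


def is_url(token):
    """Loose URL check (an assumption): http(s) scheme plus a host."""
    parts = urlparse(token)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def parse_url_list(value, delimiter="|"):
    """Return the cell as a list of URLs, rejecting any token that is not one."""
    tokens = split_cell(value, delimiter)
    invalid = [token for token in tokens if not is_url(token)]
    if invalid:
        raise ValueError(f"tokens are not URLs: {invalid}")
    return tokens


# The column 7 example from above.
cell = (
    "http://www.flickr.com/photos/dhobern/8707604066/ | "
    "http://www.flickr.com/photos/dhobern/8707604076/"
)
urls = parse_url_list(cell)
```

A cell with no delimiter simply yields a one-element list, so the single-value and multi-value cases share the same datatype handling.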

>> Otherwise I’m happy to include those requirements. WRT the data model, I don’t think that means we need the data model to say that values in a CSV file *are* lists or object structures; I think we can continue to say that they’re annotated strings, and the annotation (which might include a definition of the format of the string) can be used to validate the string and (in some cases) convert it into a suitable value or data structure.
> 
> 
> I am also not sure what 2+'B' means. Do you mean we should have some sort of a 'schema'-like description of the structure of a particular microsyntax when converting the CSV file into the Data Model? Ie, that the microsyntax should be a number, followed by a date, followed by something else? I am tempted to push this into a Deferred category as well, ie, the conversion into the Data Model should be opaque with a possible human readable description.
> 
> If I put a possible implementer's hat on, I would probably implement the conversion into a Data Model (I actually did something like that in node.js as a JavaScript-learning exercise recently) by giving the user the possibility to add a callback function on cells to make any conversion that is possible. I wonder whether this should remain an implementation-specific trick or something we would describe in the conversion process.
> 
> Ivan
> 
> 
>> 
>> Cheers,
>> 
>> Jeni
>> --  
>> Jeni Tennison
>> http://www.jenitennison.com/
>> 
> 
> 
> ----
> Ivan Herman, W3C 
> Digital Publishing Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> GPG: 0x343F1A3D
> FOAF: http://www.ivan-herman.net/foaf
> 
> 
> 
> 
> 

----------------------------------------------------------------------------------------
Tim Robertson - GBIF Head of Informatics - trobertson@gbif.org
Global Biodiversity Information Facility http://www.gbif.org/
GBIF Secretariat, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark
Tel: +45 3532 1487  Mob: +45 2826 1487  Fax: +45 2875 1480
----------------------------------------------------------------------------------------

Received on Thursday, 1 May 2014 08:28:50 UTC