Re: CSV-LD proposal from Ivan Herman on 2014-02-04 (public-csv-wg@w3.org from February 2014)

From: Ivan Herman <ivan@w3.org>
Date: Tue, 4 Feb 2014 09:49:49 +0100
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <9FB62FF0-CF1A-4A9E-A0EE-0DF5B71EE44D@w3.org>
On 03 Feb 2014, at 20:45 , Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Feb 3, 2014, at 6:50 AM, Ivan Herman <ivan@w3.org> wrote:
> 
>> Hey Gregg,
>> 
>> - A clarification please... In the section on Table Join representation[1] you say 'Data such as this does not readily transform to JSON-LD'. I want to understand this better.
> 
> What I meant was that such data does not readily transform to JSON-LD using a single node definition with one node per row, as it contains data from multiple entities. Also, JSON-LD keyword aliases allow for multiple aliases to represent the same keyword (e.g. doap_id and foaf_id both are aliases for @id), but when transforming back from JSON-LD only one of these will be selected (the shortest and lexicographically first). This is the motivation for the entity mapping section.
> 
>> It is correct, isn't it, that you can transform that into a set of JSON-LD objects, one object per row (in RDF terms, a row into a set of properties having a common bnode subject, each row being a different one). I guess what you mean is that, ideally, you want a mapping resulting in what you describe in the Entity mapping section[2], ie, making use of the fact that these have similar subjects.
> 
> Yes; certainly some transformation to an object with a bnode subject would be possible, just not very useful IMO. This is why I suggest entity mapping as a way of recovering the entities described in a single row. There are certainly pathological use cases, and these may constrain the forms of CSV for which this can be performed. For example, the table shown did not contain an equivalent doap:developer column, which is necessary when relating the DOAP properties to the FOAF properties in the same row; however, this could be inferred in the entity mapping.
> 
>> The similar issue in the RDB Direct Mapping spec[3] is taken care of by the fact that, in a relational database, one may have a primary key; in the direct mapping, if there is a primary key in a table, that (well, a URI representation thereof) will be chosen as the common subject (instead of a blank node). Isn't this what you are looking for?
> 
> Well, in a CSV context, it might be hard to distinguish this from data in a single row. From an RDF perspective, if two different tables (or rows) had the same primary identifier, then they would denote the same entity. The use case I was noting is when a row has multiple columns which are identifiers for a subset of the columns within the row, for example doap_id and foaf_id are each identifiers, with the doap_* and foaf_* columns being apportioned to one or the other. The JSON-LD frame in the entity mapping example defines this mapping. Certainly, one of the identifier columns may be more _primary_ than the other, that likely being the left-hand-side of a join.

We may be in violent agreement here, but maybe not; I am not sure I fully understand everything you write in detail :-(

Let me try to explain my thoughts here. What I would like is to simplify things insofar as the various specs we are supposed to develop should relate to one another. That means:

1. We will have to develop some metadata for CSV. I presume this metadata is supposed to describe things like datatypes for columns (possibly) and some semantics attached to the column names in general. It *may* also designate one of the columns as 'primary' key, a bit like in a relational database. I think that is an important piece of information, ie, metadata, regardless of any conversions. Other metadata *may* attach additional information to columns and cells roughly along the lines of the information used in a JSON-LD @context: whether the value is a URI Reference or a string, whether the column refers to a vocabulary item, etc.

2. We have a CSV->JSON conversion that, roughly:
	- produces an array of objects; each object corresponds to a row in the CSV file, with column names as keys and cell values as, well, values
	- takes the primary key into account, insofar as row objects are merged if their primary keys are identical

3. We have a mapping of the metadata to a JSON-LD @context file
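
To make step 2 a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name csv_to_json, the sample rows, and above all the merge policy (when rows sharing a primary key disagree on a column, the distinct values are collected into a list), which nothing above prescribes.

```python
import csv
import io

# Hypothetical sample data; the column names follow the doap_*/foaf_* example.
SAMPLE = """doap_id,doap_name,foaf_id,foaf_name
http://example.org/proj,Example Project,http://example.org/alice,Alice
http://example.org/proj,Example Project,http://example.org/bob,Bob
"""


def csv_to_json(text, primary_key=None):
    """Return a list of row objects; rows sharing a primary key are merged."""
    rows = csv.DictReader(io.StringIO(text))
    if primary_key is None:
        # No primary key declared: one object per row, nothing merged.
        return [dict(row) for row in rows]

    merged = {}  # primary key value -> merged object (insertion order kept)
    for row in rows:
        obj = merged.setdefault(row[primary_key], {})
        for column, value in row.items():
            if column not in obj:
                obj[column] = value
            elif obj[column] != value:
                # Assumed policy: keep every distinct value in a list.
                if not isinstance(obj[column], list):
                    obj[column] = [obj[column]]
                if value not in obj[column]:
                    obj[column].append(value)
    return list(merged.values())
```

With doap_id as the primary key the two sample rows collapse into one object whose foaf_id and foaf_name hold both values; without a primary key the result is simply one object per row.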

I *think* that this covers your example, too. If the metadata correctly identifies
	- doap_id as a primary key
	- doap_id and foaf_id as @id-s in the JSON-LD sense (note that this is not the same as primary key!)
	- foaf_name and doap_name as strings

then the correct JSON-LD can be generated, if needed; that also means that, via this mapping, correct RDF can be generated. If the user does not care about RDF, however, #2 above already provides a perfectly usable JSON mapping of the data, and he/she can stop there.
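
As a rough illustration of step 3, the sketch below derives a JSON-LD @context from column metadata. The metadata layout (the "role" and "property" keys) and the function name metadata_to_context are hypothetical, since the metadata format is still to be defined. Identifier columns are mapped as keyword aliases for @id, following the aliasing mentioned earlier in this thread; that is one possible reading, not a settled design. The sketch only builds the context and leaves open how a row carrying two identifier columns is split into separate nodes.

```python
import json

# Hypothetical metadata for the doap_*/foaf_* example.
METADATA = {
    # The primary key matters for the CSV->JSON merge, not for the context.
    "primary_key": "doap_id",
    "columns": {
        "doap_id": {"role": "id"},
        "foaf_id": {"role": "id"},
        "doap_name": {"role": "string",
                      "property": "http://usefulinc.com/ns/doap#name"},
        "foaf_name": {"role": "string",
                      "property": "http://xmlns.com/foaf/0.1/name"},
    },
}


def metadata_to_context(metadata):
    """Build a JSON-LD @context dict from the (assumed) column metadata."""
    context = {}
    for column, info in metadata["columns"].items():
        if info["role"] == "id":
            # Assumption: identifier columns become aliases of the @id keyword.
            context[column] = "@id"
        elif info["role"] == "string":
            # Plain term definition: column name -> vocabulary property IRI.
            context[column] = info["property"]
    return {"@context": context}


if __name__ == "__main__":
    print(json.dumps(metadata_to_context(METADATA), indent=2))
```

Note that the primary key designation plays no role in the generated context, which matches the point that being a primary key and being an @id are two separate pieces of metadata.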

Is it different from what you have in mind? I actually do not think so, I may just be describing it in other terms...

Cheers

Ivan




> 
>> Putting this into this context, I believe the issue is what will the metadata of the CSV file contain; that metadata (whose definition is one of the goals of this WG!) may do exactly that: designate one column as the 'primary key'. Once that is done, mapping into JSON (but, probably, XML and of course RDF) becomes way more obvious. 
> 
> Yes, but determining that one entity "contains" the other.
> 
>> Is this what you call an 'entity mapping' to JSON(-LD)?
> 
> yes.
> 
>> - As for the general approach: I think there are similarities to the mapping of JSON, XML, and RDF that we have to exploit. I would probably look at [3] for a general line of thought, which may be moderated by some metadata (nothing as complex as R2RML[4], though) like the primary key above. I would leave XML aside for a moment; I guess what would be very important for our users is indeed, as you propose, to map the CSV file on JSON but following as much as we can the JSON-LD structures, so that the result can be turned into RDF if necessary by a suitable @context (and that @context may also be generated through the metadata). Ie, if somebody just wants JSON and does not even want to utter the term 'RDF', then that is fine, he/she can use JSON; if somebody wants RDF for whatever reasons, then, say, the @context+JSON -> Turtle mapping is already provided by current specifications.
> 
> Yes, exactly. The lesson of JSON-LD is that you can create a format (or transformation, in this case) which appeals to developers as is, without requiring them to buy into the whole RDF ecosystem. This is why I thought the term CSV-LD useful in invoking the same developer-friendly view of turning CSV into structured data. Of course, JSON could also be turned into XML without going through RDF as well.
> 
> Although I don't expect to attend telecons directly, if people would like to discuss this further in a telecon, I will of course make myself available.
> 
> Gregg
> 
>> Thx
>> 
>> Ivan
>> 
>> 
>> 
>> [1] https://www.w3.org/2013/csvw/wiki/CSV-LD#Table_Join_representation
>> [2] https://www.w3.org/2013/csvw/wiki/CSV-LD#Entity_Mapping
>> [3] http://www.w3.org/TR/rdb-direct-mapping/
>> [4] http://www.w3.org/TR/r2rml/
>> 
>> On 01 Feb 2014, at 02:52 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
>>> I added a proposal for something I call CSV-LD to the wiki [1]. As the name might suggest, this is strongly tied to JSON-LD, and uses JSON-LD context and frame definitions both to provide meaning to CSV, allowing it to be losslessly transformed to JSON-LD, and to create CSV from JSON-LD (with or without embedding).
>>> 
>>> Consider this a straw-man proposal. It does lay out some use cases that are generally useful (and perhaps should be copied to other pages on the wiki), but there may be more use cases to consider. IMO, creating a specification for this, and extending an existing JSON-LD implementation to support this would not be too difficult.
>>> 
>>> Gregg Kellogg
>>> gregg@greggkellogg.net
>>> 
>>> [1] https://www.w3.org/2013/csvw/wiki/CSV-LD
>>> 
>>> 
>> 
>> 
>> ----
>> Ivan Herman, W3C 
>> Digital Publishing Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> GPG: 0x343F1A3D
>> FOAF: http://www.ivan-herman.net/foaf
>> 
>> 
>> 
>> 
>> 
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf
Received on Tuesday, 4 February 2014 08:50:18 UTC