Re: CSV-LD proposal from Gregg Kellogg on 2014-02-04 (public-csv-wg@w3.org from February 2014)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Tue, 4 Feb 2014 10:36:04 -0800
To: Ivan Herman <ivan@w3.org>
Cc: W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <18ECACCD-9FC2-4FB6-96A4-588FF6E3FA51@greggkellogg.net>
On Feb 4, 2014, at 12:49 AM, Ivan Herman <ivan@w3.org> wrote:

> 
> On 03 Feb 2014, at 20:45 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
> 
>> On Feb 3, 2014, at 6:50 AM, Ivan Herman <ivan@w3.org> wrote:
>> 
>>> Hey Gregg,
>>> 
>>> - A clarification please... In the section on Table Join representation[1] you say 'Data such as this does not readily transform to JSON-LD'. I want to understand this better.
>> 
>> What I meant was that such data does not readily transform to JSON-LD using a single node definition with one node per row, as it contains data from multiple entities. Also, JSON-LD keyword aliases allow for multiple aliases to represent the same keyword (e.g. doap_id and foaf_id both are aliases for @id), but when transforming back from JSON-LD only one of these will be selected (the shortest and lexicographically first). This is the motivation for the entity mapping section.
>> 
>>> It is correct, isn't it, that you can transform that into a set of JSON-LD objects, one object per row (in RDF terms, a row into a set of properties having a common bnode subject, each row being a different one). I guess what you mean is that, ideally, you want a mapping resulting in what you describe in the Entity mapping section[2], ie, making use of the fact that these have similar subjects.
>> 
>> Yes; certainly some transformation to an object with a bnode subject would be possible, just not very useful IMO. This is why I suggest entity mapping as a way of recovering the entities described in a single row. There are certainly pathological use cases, and this approach may constrain the forms of CSV that can be transformed. For example, the table shown did not contain an equivalent doap:developer column, which is necessary when relating the DOAP properties to the FOAF properties in the same row; however, this could be inferred in the entity mapping.
>> 
>>> The similar issue in the RDB Direct Mapping spec[3] is taken care of by the fact that, in a relational database, one may have a primary key; in the direct mapping, if there is a primary key in a table, that (well, a URI representation thereof) will be chosen as the common subject (instead of a blank node). Isn't this what you are looking for?
>> 
>> Well, in a CSV context, it might be hard to distinguish this from data in a single row. From an RDF perspective, if two different tables (or rows) had the same primary identifier, then they would denote the same entity. The use case I was noting is when a row has multiple columns which are identifiers for a subset of the columns within the row, for example doap_id and foaf_id are each identifiers, with the doap_* and foaf_* columns being apportioned to one or the other. The JSON-LD frame in the entity mapping example defines this mapping. Certainly, one of the identifier columns may be more _primary_ than the other, that likely being the left-hand-side of a join.
> 
> We may be in violent agreement here, but maybe not; I am not sure I fully understand what you write in all details:-(
> 
> Let me try to explain my thoughts here. What I would like is to simplify things insofar as the various specs we are supposed to develop should relate to one another. That means:
> 
> 1. We will have to develop some metadata for CSV. I presume this metadata is supposed to describe things like datatypes for columns (possibly), some semantics attached to the column names in general. It *may* also designate one of the columns as 'primary' key, a bit like in a relational database. I think that is an important piece of information, ie, metadata, regardless of any conversions. Other metadata information *may* attach additional information to columns and cells roughly along the lines of the information used in JSON-LD @context: whether the value is a URI Reference or a string, whether the column refers to a vocabulary item, etc.

Yes.

> 2. We have a CSV->JSON conversion that, roughly:
> 	- produces an array of objects; each object corresponds to a row in the CSV file, with column names as keys and cell values as, well, values
> 	- takes into account the primary key, insofar as row objects are merged if their primary keys are identical

I think that there are a couple of modes here. When a row roughly corresponds to a database record, or a set of records sharing a single or composite primary key, then yes, there is basically a JSON object for each row. If multiple rows share a primary key but some other value differs, it is a multi-valued relationship, which could be a second table with a foreign key relationship to the primary key (e.g. SQL), or multiple values of a given RDF predicate on the same subject.
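
To make the first mode concrete, here is a minimal Python sketch, not part of any proposal. It assumes a header row and a single-column primary key; the function name rows_to_objects and the sample data are made up for illustration. Rows sharing a key are merged, and a column whose values differ across those rows becomes multi-valued.

```python
import csv
import io


def rows_to_objects(csv_text, primary_key):
    """Merge CSV rows sharing a primary key into one object per key.

    A column whose values differ between merged rows becomes a list,
    i.e. a multi-valued property. (Hypothetical helper, for illustration.)
    """
    merged = {}  # primary key value -> object, in first-seen order
    for row in csv.DictReader(io.StringIO(csv_text)):
        obj = merged.setdefault(row[primary_key], {})
        for column, value in row.items():
            if column not in obj:
                obj[column] = value
            elif isinstance(obj[column], list):
                if value not in obj[column]:
                    obj[column].append(value)
            elif obj[column] != value:
                # Second distinct value seen: promote to a list.
                obj[column] = [obj[column], value]
    return list(merged.values())


# Made-up sample data: two rows share the primary key "alice".
data = """id,name,knows
alice,Alice,bob
alice,Alice,carol
bob,Bob,alice
"""
objects = rows_to_objects(data, "id")
```

Here the two "alice" rows collapse into one object whose "knows" value is a list, while "bob" keeps a single string value.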

However, where the row represents data from multiple tables where there is something like a foreign-key relationship between the tables, and columns are apportioned to one or the other of the tables, something like JSON-LD chaining is necessary. This was the point of my DOAP/FOAF table examples.

This also relates to data that may be duplicated between rows, as would be the case for a given person knowing other people: the primary key and name of the first person would be the same in each row, while the primary key and name of the second person would be unique for each row (similarly for my DOAP/FOAF example).

I think this use case is important.

> 3. We have a mapping of the metadata to a JSON-LD @context file
> 
> I *think* that this covers your example, too. If the metadata correctly identifies
> 	- doap_id as a primary key
> 	- doap_id and foaf_id as @id-s in the JSON-LD sense (note that this is not the same as primary key!)
> 	- foaf_name and doap_name as strings

Yes, but this can't be done entirely using a JSON-LD context file, which is why my example used a frame. A frame is basically a context plus entity linking. This is what allows me to allocate doap_id to the outer object, and foaf_id to the inner object.

For example, if I have the following:

doap_id  doap_name  foaf_id  foaf_name
<rdf>    "RDF"      <gregg>  "Gregg"
<rdf>    "RDF"      <arto>   "Arto"

I probably have in mind something like:

<rdf> doap:name "RDF"; doap:developer <gregg>, <arto> .
<gregg> foaf:name "Gregg" .
<arto> foaf:name "Arto" .

It's the framing which allows me to do this (and insert the inferred doap:developer property).
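
A rough Python sketch of the result I'm after, under loud assumptions: this is not the JSON-LD framing algorithm, the function map_entities is hypothetical, and the apportioning of doap_* columns to the outer object and foaf_* columns to the inner one is hard-coded where a frame would declare it.

```python
def map_entities(rows):
    """Group rows into DOAP projects with nested FOAF developers.

    Hypothetical stand-in for what a frame would express: doap_id
    identifies the outer object, foaf_id the inner one, and the
    doap:developer link between them is inferred, not read from a column.
    """
    projects = {}  # doap_id -> project object, in first-seen order
    for row in rows:
        project = projects.setdefault(row["doap_id"], {
            "@id": row["doap_id"],
            "doap:name": row["doap_name"],
            "doap:developer": [],
        })
        developer = {"@id": row["foaf_id"], "foaf:name": row["foaf_name"]}
        # Skip developers already attached by an earlier, duplicate row.
        if developer not in project["doap:developer"]:
            project["doap:developer"].append(developer)
    return list(projects.values())


# The two rows of the table above, as already-parsed dictionaries.
rows = [
    {"doap_id": "rdf", "doap_name": "RDF", "foaf_id": "gregg", "foaf_name": "Gregg"},
    {"doap_id": "rdf", "doap_name": "RDF", "foaf_id": "arto", "foaf_name": "Arto"},
]
result = map_entities(rows)
```

Both rows end up in a single project object with two nested developers, which mirrors the Turtle above.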

> Then the correct JSON-LD can be generated, if needed; that also means that, via this mapping, correct RDF can be generated. If the user does not care about RDF, however, #2 above provides a perfectly usable JSON mapping of the data, and he/she can stop there.

Yes.

> Is it different from what you have in mind? I actually do not think so, I may just be describing it in other terms...

I think we basically have the same idea here, but there are some implementation details I've hand-waved about. If there's interest, I can work up more formal text and could probably do a quick implementation to use online.

Gregg

> Cheers
> 
> Ivan
> 
> 
> 
> 
>> 
>>> Putting this into this context, I believe the issue is what the metadata of the CSV file will contain; that metadata (whose definition is one of the goals of this WG!) may do exactly that: designate one column as the 'primary key'. Once that is done, mapping into JSON (but, probably, XML and of course RDF) becomes way more obvious.
>> 
>> Yes, but there is also the matter of determining that one entity "contains" the other.
>> 
>>> Is this what you call an 'entity mapping' to JSON(-LD)?
>> 
>> yes.
>> 
>>> - As for the general approach: I think there are similarities to the mapping of JSON, XML, and RDF that we have to exploit. I would probably look at [3] for a general line of thought, which may be moderated by some metadata (nothing as complex as R2RML[4], though) like the primary key above. I would leave XML aside for a moment; I guess what would be very important for our users is indeed, as you propose, to map the CSV file to JSON but following as much as we can the JSON-LD structures, so that the result can be turned into RDF if necessary by a suitable @context (and that @context may also be generated through the metadata). Ie, if somebody just wants JSON and does not even want to utter the term 'RDF', then that is fine, he/she can use JSON; if somebody wants RDF for whatever reasons, then, say, the @context+JSON -> Turtle mapping is already provided by current specifications.
>> 
>> Yes, exactly. The lesson of JSON-LD is that you can create a format (or transformation, in this case) which appeals to developers as is, without requiring them to buy into the whole RDF ecosystem. This is why I thought the term CSV-LD useful in invoking the same developer-friendly view of turning CSV into structured data. Of course, JSON could also be turned into XML without going through RDF as well.
>> 
>> Although I don't expect to attend telecons directly, if people would like to discuss this further in a telecon, I will of course make myself available.
>> 
>> Gregg
>> 
>>> Thx
>>> 
>>> Ivan
>>> 
>>> 
>>> 
>>> [1] https://www.w3.org/2013/csvw/wiki/CSV-LD#Table_Join_representation
>>> [2] https://www.w3.org/2013/csvw/wiki/CSV-LD#Entity_Mapping
>>> [3] http://www.w3.org/TR/rdb-direct-mapping/
>>> [4] http://www.w3.org/TR/r2rml/
>>> 
>>> On 01 Feb 2014, at 02:52 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>>> 
>>>> I added a proposal for something I call CSV-LD to the wiki [1]. As the name might suggest, this is strongly tied to JSON-LD, and uses JSON-LD context and frame definitions to provide meaning to CSV, allowing it to be losslessly transformed to JSON-LD, and to create CSV from JSON-LD (with or without embedding).
>>>> 
>>>> Consider this a straw-man proposal. It does lay out some use cases that are generally useful (and perhaps should be copied to other pages on the wiki), but there may be more use cases to consider. IMO, creating a specification for this, and extending an existing JSON-LD implementation to support this would not be too difficult.
>>>> 
>>>> Gregg Kellogg
>>>> gregg@greggkellogg.net
>>>> 
>>>> [1] https://www.w3.org/2013/csvw/wiki/CSV-LD
>>>> 
>>>> 
>>> 
>>> 
>>> ----
>>> Ivan Herman, W3C 
>>> Digital Publishing Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> GPG: 0x343F1A3D
>>> FOAF: http://www.ivan-herman.net/foaf
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
>
Received on Tuesday, 4 February 2014 18:36:33 UTC