Re: Some comments on the RDF->CSV document from Ivan Herman on 2014-04-27 (public-csv-wg@w3.org from April 2014)

From: Ivan Herman <ivan@w3.org>
Date: Sun, 27 Apr 2014 06:59:37 +0200
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: Andy Seaborne <andy@apache.org>, W3C CSV on the Web Working Group <public-csv-wg@w3.org>
Message-Id: <B9492E3C-547C-4850-A08A-7D9B4CD15024@w3.org>
On 26 Apr 2014, at 21:22 , Gregg Kellogg <gregg@greggkellogg.net> wrote:

> On Apr 24, 2014, at 1:28 AM, Ivan Herman <ivan@w3.org> wrote:
> 
>> 
>> On 23 Apr 2014, at 19:55 , Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
>>> On Apr 23, 2014, at 8:13 AM, Ivan Herman <ivan@w3.org> wrote:
>>> 
>>>> (To avoid any misunderstandings, I looked at http://w3c.github.io/csvw/csv2rdf/)
>>>> 
>>>> I am o.k. with the general approach, and with the level of simplicity/complexity of the templates. I would probably want each feature in the templates to be backed up with a reasonable use case (ideally, a use case in real use), but the 'melody', as it is documented now, is fine to me. My litmus test is whether the mapping is implementable in a simple and small JS library running on the client side (not exclusively there, but also there). I think this is essential if we want any acceptance of this by client side web apps, ie, if we want to maintain a minimal level of hope that client side applications would use this :-).
>>> 
>>> Agreed. If we have general agreement on the approach, we can begin to flesh out the document; it certainly should include concrete examples that illustrate the different cases, as well as defined test cases. Using real-world use cases, if simple, is fine, but they can be messy, so settling on a simplified set of use cases may work better within such a spec. We can always make more complicated test cases, which also serve to illustrate behavior.
>> 
>> We should, nevertheless, try to see real-life use cases. See my reply to Jeni: use cases should determine whether a specific feature is necessary or not.
> 
> Absolutely. Even if the spec uses simpler examples to illustrate the functionality, we can create test cases with more complicated data. I think the next step, beyond expanding the descriptions in the CSV2RDF doc a bit, is to invent some formats and use them in new tests.
> 
>> (I think that the current templates may indeed be backed up with use cases; they are pretty elementary. But, for example, and in spite of my comments below, it may happen that the use case for setting the language tag at the per-cell level is not that important, in which case we can forget about that issue altogether...)
> 
> Agreed.
> 
>>>> For the syntax question: I think my litmus test also means that a JSON syntax is almost a must: I do not expect anybody to start writing a turtle parser in JS for the purpose of an RDF mapping. The template seems to be fairly simple and probably has a straightforward description in JSON, ie, I do not believe that to be an issue...
>>> 
>>> The methods I used in the CSV2RDF document are substantially the same as those used in CSV-LD for creating JSON-LD. In fact, the transformation process can probably be done simply at the syntax layer except for datatype coercions. The CSV-LD use case also benefits from information in an associated context, in particular in mapping fields with sub-delimiters for representing multiple values. But I think that looking further at RFC 6570 provides some other means (for example, {list*} could expand to an array of the delimited values). At this point, a purely syntactic transformation becomes unfeasible, at least for multiple subject or predicate values.
>> 
>> ... in which case we fall back on the alternative of trusting some external processing to handle this, and we should not go there. I think we should keep away from a declarative description of sub-delimiters...
> 
> If it's not declarative, then perhaps it needs to be handled as an input parameter. I think we have a requirement to handle field microsyntaxes, which include a sub-field separator (see R-CellValueMicroSyntax [1]), although it's not accepted yet. I think it might be enough to designate a column as having a specific microsyntax, including a sub-field delimiter. If we accept Jeni's point #3, we'll need a way to provide such information on a per-field basis anyway.

I am not sure I fully agree here.

First of all, as a more general issue, we have to decide what accepting an 'R-*' really means for us. I know I am vague (I have not cleared this in my own mind yet), but accepting an 'R-*' may mean that it is a genuine issue with data being published out there, ie, a genuine requirement in general, without this necessarily meaning that we have to provide a complete solution right now (at least in the first version of our specs). But maybe 'accepting' means that we have to cover it, and we could say 'acknowledge' for requirements that are indeed out there but that we cannot cover.

As for the microsyntaxes: yes, I believe this is the typical case where the external Turing machine should be called for. It is not clear to me what the reference to that will be, whether it is a 'callback' per cell, per row, per column, or per the whole dataset; each has its value and maybe each has to be defined. I do not know. Clearly, a single callback on the full dataset is the simplest, and this is where the relative weight of the use cases comes in: if a particular feature, though genuine and present out there, is relatively infrequently used, then we may decide to cover it with a per-table callback and thereby push it aside...

(I guess this is the typical case of an 80/20 cut)
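
To make the callback granularities a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: `Hooks`, `run_hooks` and `split_tags` are made-up names, nothing here comes from the CSV2RDF draft, and the sketch assumes a rectangular table whose cells have already been parsed into strings.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

Row = List[Any]
Table = List[Row]


@dataclass
class Hooks:
    """Hypothetical extension points, one per granularity."""
    cell: Optional[Callable[[int, int, Any], Any]] = None       # (row index, column index, value)
    row: Optional[Callable[[int, Row], Row]] = None             # (row index, row)
    column: Optional[Callable[[int, List[Any]], List[Any]]] = None  # (column index, column values)
    table: Optional[Callable[[Table], Table]] = None             # whole dataset


def run_hooks(table: Table, hooks: Hooks) -> Table:
    """Apply whichever hooks are set, from finest to coarsest granularity."""
    out = [list(r) for r in table]  # copy, so the caller's rows are untouched
    if hooks.cell:
        out = [[hooks.cell(i, j, v) for j, v in enumerate(r)]
               for i, r in enumerate(out)]
    if hooks.row:
        out = [list(hooks.row(i, r)) for i, r in enumerate(out)]
    if hooks.column and out:
        # Assumes every row has the same width and the hook keeps the column length.
        width = len(out[0])
        cols = [list(hooks.column(j, [r[j] for r in out])) for j in range(width)]
        out = [[cols[j][i] for j in range(width)] for i in range(len(out))]
    if hooks.table:
        out = hooks.table(out)
    return out


def split_tags(i: int, j: int, value: Any) -> Any:
    """Per-cell handling of a ';'-separated microsyntax in column 1 (made-up example)."""
    return value.split(";") if j == 1 else value


rows = [["Paris", "fr;en"], ["Wien", "de"]]

# Same microsyntax handled once per cell, and once with a single per-table callback.
per_cell = run_hooks(rows, Hooks(cell=split_tags))
per_table = run_hooks(rows, Hooks(table=lambda t: [[r[0], r[1].split(";")] for r in t]))
```

Both variants give the same result for this input. The per-table hook is the simplest thing to specify, at the cost of pushing all of the work onto whoever writes the callback.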

Ivan



> 
> Gregg
> 
> [1] http://w3c.github.io/csvw/use-cases-and-requirements/index.html#R-CellValueMicroSyntax
> 
>>>> ---
>>>> 
>>>> The templates are on rows and columns, which presupposes a homogeneity of the table; again, I would want to check that against use cases. In particular, I wonder whether a template that sets the language tag for a whole column is o.k. (e.g., if the column is something like 'native name' for cities, then each cell may have a different language tag; I am not sure how we would handle that.)
>>> 
>>> Using a Turtle syntax, I don't see how we can represent a literal language using a template representation. This would likely require using some non-literal representation in which language was a property. OTOH, a CSV-LD representation could use a template for languages (or datatypes).
>>> 
>>>> ---
>>>> 
>>>> From a more general point of view, an obvious issue to which we will have to give an answer is the relationship of the template language to R2RML. As far as I could see, the features in the current template language are an almost strict subset of R2RML (I am not sure about the datatype mappings; R2RML makes use of SQL datatypes which we do not want to refer to).
>>>> 
>>>> That being said, if we just referred to R2RML in our spec we would scare away a lot of people, meaning that we should probably not do it. However, a precise mapping to R2RML may still need to be written down in the document, in case somebody wants to use an existing R2RML engine. We should also check that the simple (template-less) mapping is similarly a subset of Direct Mapping, and document that.
>>> 
>>> I was reaching for something that doesn't require a deep understanding of RDF, which IMO, R2RML does. I think R2RML is important for handling complex use cases (maybe this includes the language issue you mentioned), and we should reference it as such. This could allow us to focus on the 80% use case and keep things as simple as possible.
>> 
>> I fully agree with your assessment on R2RML. What I am saying is that the question about the relationship with R2RML will inevitably come, and having some sort of a comparison (which can be a non-normative appendix) in the document is probably good to have.
>> 
>> Cheers
>> 
>> Ivan
>> 
>>> 
>>>> ---
>>>> 
>>>> I was also wondering on the call, whether the template is RDF specific, or whether at least the general direction could be reused for a JSON mapping or, if needed, XML. I guess this is certainly true for JSON: the templates to use the right predicate names can be reused to generate the keys, for example. But I have not done a detailed analysis on this, and there are, almost surely, RDF specific features. But we should probably try to factor out the common parts.
>>> 
>>> Probably what's RDF specific is performing URI- or datatype-specific processing when casting field values to the appropriate representation, and dealing with multiple sub-values, where it comes to treating them as subject or predicate.
>>> 
>>>> (Of course, there is a question whether we need a separate JSON, or whether the current mapping would simply produce JSON-LD, ie, JSON. I am a little bit afraid that RDF features, like blank nodes or @type, would transpire into generic JSON, which people may not want...)
>>>> 
>>>> ---
>>>> 
>>>> Minor issue: the automatic numbering/naming of predicates should take into account RTL writing direction, see Yakov's examples for CSV files in Arabic or Hebrew...
> 
> 
>>> We haven't gotten into metadata about the entire document, which would include RTL and other things like skip rows and skip columns. Presumably, RTL would just cause a processor to reverse each record, which may also be a function of an underlying CSV library.
>>> 
>>> Gregg
>>> 
>>>> Ivan
>>>> 
>>>> ----
>>>> Ivan Herman, W3C 
>>>> Digital Publishing Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> GPG: 0x343F1A3D
>>>> FOAF: http://www.ivan-herman.net/foaf
> 


----
Ivan Herman, W3C 
Digital Publishing Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
GPG: 0x343F1A3D
FOAF: http://www.ivan-herman.net/foaf
Received on Sunday, 27 April 2014 05:00:09 UTC