- From: David Booth <david@dbooth.org>
- Date: Mon, 03 Mar 2014 20:14:26 -0500
- To: Gregg Kellogg <gregg@greggkellogg.net>, Andy Seaborne <andy@apache.org>
- CC: Richard Cyganiak <richard@cyganiak.de>, Niklas Lindström <lindstream@gmail.com>, "public-csv-wg@w3.org" <public-csv-wg@w3.org>
Hi Gregg, Andy, Richard and others,

On 03/02/2014 12:57 PM, Gregg Kellogg wrote:
> On Mar 2, 2014, at 9:43 AM, Andy Seaborne <andy@apache.org> wrote:
>>
>> On 01/03/14 14:59, Richard Cyganiak wrote:
>>> David,
>>>
>>> Let me first add one more clarification. I don't think of a
>>> Tarql mapping as a CSV-to-RDF mapping. I think of it as a
>>> logical-table-to-RDF mapping. Whether the table comes from CSV,
>>> TSV, SAS, SPSS or relational doesn't matter, as long as we define
>>> a sensible mapping from each of these syntaxes to a table of RDF
>>> terms with named columns. These mappings are generally easy to
>>> define, lossless, and don't add much arbitrary extra
>>> information.
>>
>> +1 to having this step brought out explicitly. We can deal with
>> the syntax-to-RDF-terms step, involving syntax details and any
>> additional information to guide the choice of datatypes (is 2014 a
>> string, an integer, a Gregorian year?), and then have a step of
>> putting it into RDF, whether direct or mapped.
>
> +1 too.
>
> IMO, 2014 is an integer, "2014" is a string. Column metadata should
> be able to type a field as a datatyped literal, reference or identifier.
>
> Direct mapping simply generates either anonymous records, or records
> identified by fragid, also using fragids to compose properties based
> on column names: the simplest possible transformation to triples in the
> absence of metadata. Mapping metadata allows more sophisticated
> mappings.

Correct, but therein also lies the danger.

I'm afraid the following explanation is rather long, and may be obvious to many -- and if so, I apologize -- but I want to be sure that I'm being as clear as possible, because I realized that I was *not* sufficiently clear in the use case that I submitted, plus I've learned (the hard way) that different people sometimes come into these efforts with quite different expectations and objectives. What one person thinks is obvious may not be at all obvious to someone else who has different objectives.

The reason I like the Direct Mapping approach is that it cleanly factors out the simple, syntactic mapping from the semantic transformations that are needed to achieve alignment with some target model. It then allows *all* of the semantic mappings to be done in the same language, regardless of source format, rather than mixing the syntactic mapping with the semantic alignment step. I like this because, to my mind, when integrating data from diverse sources, there will almost always be semantic transformations needed to achieve semantic alignment, and I would rather use a single, common way to do those semantic transformations than have several ways to do them, each one specific to a particular source format. (One question, though, is where the line between syntactic and semantic transformation lies; I'll come back to that later.)

Suppose an application uses a particular **target RDF model** or ontology, i.e., the application expects its input data to use certain classes, predicates, namespaces, and other usage patterns. To consume data from a particular source, two things logically need to happen: (a) syntactic mapping, to convert the data from whatever format it is in to RDF; and (b) semantic mapping, to align the source data model with the target RDF model. These two logical steps can be done either as a single physical step or as more than one physical step.
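Just to make the "direct" step concrete, here is the sort of thing I have in mind, using a made-up three-column table; the prefix, fragids and URI-minting conventions below are purely illustrative, since the actual conventions would be whatever this group standardizes. Given a table like:

    id,name,birthYear
    1,Alice,2014

a Direct Mapping, with no metadata at all, might yield something like this Turtle:

    @prefix csv: <http://example.org/data.csv#> .

    csv:row1  csv:id        "1" ;
              csv:name      "Alice" ;
              csv:birthYear "2014" .

Everything is a plain string: subjects and properties are minted mechanically from row positions and column headers, and nothing is guessed about the publisher's intended datatypes or vocabulary. Whether "2014" should become an xsd:integer or an xsd:gYear is exactly the kind of decision that metadata, or a later semantic mapping, has to supply.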
In general, the publisher of the source data has no knowledge of the application that is consuming that data, and hence cannot be expected to provide sufficient metadata to map the source data all the way to that application's target RDF model. Indeed, there may be *many* such applications, each one with its own target RDF model.

Hence, all the publisher can do is to (at most) provide metadata that allows a consumer to automatically map the source data to a **source RDF model**. There are various ways the publisher can conceive of the source RDF model. In general, the more complex the mapping, the more it becomes biased toward a particular assumed application, rather than simply reflecting the intended meaning of the published data. Ideally, the publisher should supply metadata that expresses the intended meaning of the published data, *without* biasing it toward any particular application.

Metadata that can be automatically and deterministically discovered (merely by following standards) could certainly include mapping rules intended to serve exactly this purpose, and I assume that that is what you and others have in mind in defining a standard way to locate and represent CSV+ metadata. To my mind, this should be a major focus of the working group's efforts. So far so good.

But when something like Tarql is used, there is no clean division between the syntactic mapping and the semantic mapping, because Tarql allows *any* semantic transformation to be performed. This can be both good and bad. Obviously it can be useful to have that expressive power. But there is also a danger that the publisher may commingle the task of exposing the meaning of the data with the task of aligning its RDF model to that of a particular application, rather than cleanly separating the two. Indeed, if the publisher also has an application that needs to consume the data as RDF, then the publisher will be strongly tempted to do so, because it will simplify his/her immediate task.

But doing so may introduce model bias that makes it harder for other applications to use the data -- particularly if the resulting model involves information loss. This can easily happen if the application that the publisher has in mind does not need some of the information that is in the data, and it can happen completely unconsciously, as the publisher may not have conceived of the many creative uses to which that data may be put. (Of course, it could also make the data *easier* for other applications to use, if those other applications happen to use target RDF models that are the same or almost the same as the model used by the publisher's application.)

In spite of such a danger, it is only reasonable to assume that the publisher best knows the intended meaning of his/her data, and thus in some sense we simply have to trust him/her to exercise good judgement in publishing metadata that reflects, as faithfully as possible, the intended meaning of the data without bias toward any particular target model. Following this line of thinking, one could reasonably argue that publishers should have the power, when writing their CSV+ metadata, to specify arbitrary semantic transformations, even though that power may sometimes be abused.

But do CSV+ publishers *need* the power to express arbitrary semantic transformations, which Tarql (or SPARQL) provides, just to expose the intended meaning of a table? I'm not sure that they do.
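To illustrate the kind of expressive power I am talking about, a Tarql-style mapping of the little table above might look roughly like this (the ex: vocabulary and person URIs are again invented purely for illustration):

    PREFIX ex:  <http://example.org/vocab#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    CONSTRUCT {
      ?person a ex:Person ;
              ex:name      ?name ;
              ex:birthYear ?year .
    }
    WHERE {
      BIND (URI(CONCAT('http://example.org/person/', ?id)) AS ?person)
      BIND (xsd:integer(?birthYear) AS ?year)
    }

In Tarql, ?id, ?name and ?birthYear are bound per row from the column headers. Notice that this one query does the syntactic lifting (minting URIs, choosing datatypes) *and* the semantic alignment (choosing ex:Person and ex:name) in a single step -- which is exactly the commingling I am worried about.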
I'm hoping that a more constrained, declarative form will suffice for the simple task of exposing the data's intended meaning. This may indeed be what most members of the working group have in mind already, and if so then that's great, but I thought I should bring it up explicitly.

I also think there's another factor that should be considered, which I'll try to illustrate with two scenarios.

In scenario #1, a data consumer locates a published CSV+ table that has *no* accompanying metadata. A CSV+ Direct Mapping is applied, to interpret that table as RDF. Two mappings are then crafted to transform that RDF into RDF that is semantically aligned with the consumer's target model: the first mapping, MS, transforms the directly mapped RDF to -- hopefully -- the publisher's intended source model by replacing default URI prefixes and datatypes with the intended prefixes and datatypes, and maybe a little more (a sketch of what I mean by MS appears at the end of this message); the second mapping, MT, transforms that source model to the consumer's target model. Clearly MT must be written by the consumer, but MS might actually be written by the publisher, or by someone who at least understands the data's meaning.

In scenario #2, the publisher of that same CSV+ table installs accompanying metadata for that spreadsheet. In doing so, it would be nice if:

 (a) the publisher could simply install mapping MS from scenario #1 in the right location without change (assuming MS does indeed reflect the intended source model);
 (b) by following the W3C standard, the consumer would then view the data as RDF that reflects the intended source model;
 (c) the consumer could still use mapping MT (unchanged) to transform from the source model to the consumer's target model; and
 (d) MS and MT are written in the same language.

In other words, it would be nice if these mappings did not have to be rewritten just because MS is moved to accompany the published table.

BTW, in going through this explanation, it occurs to me that I was not sufficiently clear in my use case description
http://lists.w3.org/Archives/Public/public-csv-wg-comments/2014Feb/0007.html
because I only addressed the case in which there is no metadata available.

I hope that this lengthy explanation has helped to clarify my goals. In particular, I hope:

 - for a standard, deterministic mapping from *any* published CSV+ table to RDF;
 - that such a mapping will use any associated authoritative metadata to best capture the publisher's intended meaning of the data;
 - that syntactic mappings are *decoupled* from semantic mappings;
 - that there is a CSV+ Direct Mapping style that prevents or discourages model bias (as described above) by preventing or discouraging semantic mappings in the metadata; and
 - that semantic mappings are SPARQL-rules-friendly, either by using SPARQL conventions or by using conventions that can be conveniently used from SPARQL.

In this distinction, I am viewing semantic mappings as transformations from RDF to RDF. I think there is some gray area in what operations should be categorized as syntactic mappings versus semantic mappings, as some might legitimately be viewed as both. Rules of thumb that come to mind for semantic mappings:

 - those that are also used to achieve model alignment, i.e., they are not solely used for one data format; or
 - those that change the structure of the data model, rather than just the terms.

Again, I apologize for the length of this explanation, but I hope it has added more clarity.

Thanks!
David Booth
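P.S. Here is a sketch of the kind of MS mapping I have in mind for scenario #1, again using the invented table and URIs from the examples above: an ordinary SPARQL CONSTRUCT over the directly mapped RDF that replaces the default prefix and plain-string values with the intended vocabulary and datatypes.

    PREFIX csv: <http://example.org/data.csv#>
    PREFIX ex:  <http://example.org/vocab#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    CONSTRUCT {
      ?row ex:name      ?name ;
           ex:birthYear ?year .
    }
    WHERE {
      ?row csv:name      ?name ;
           csv:birthYear ?by .
      BIND (xsd:integer(?by) AS ?year)
    }

MT would then be another CONSTRUCT (or SPARQL rule) of the same shape, mapping the ex: terms to whatever the consumer's target ontology uses -- which is what I mean by wish (d) above.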
Received on Tuesday, 4 March 2014 01:14:55 UTC