- From: Andy Seaborne <andy@apache.org>
- Date: Wed, 28 May 2014 12:17:44 +0100
- To: Ivan Herman <ivan@w3.org>
- CC: W3C CSV on the Web Working Group <public-csv-wg@w3.org>, Jeni Tennison <jeni@jenitennison.com>
Ivan,

You are taking the simple cases used to illustrate the approach and then assuming it is a complete design.

It's simple to me:

* Metadata and code are used to produce a bunch of bindings (strictly, RDF terms)
* Bindings are used to produce RDF fragments.

I don't see why you require writing complex expressions in the template. It's hard to do, requires the user to write expressions multiple times, and works in only one syntax at a time.

Andy

On 28/05/14 11:34, Ivan Herman wrote:
> Well,
>
> Jeremy's example is fine, but let me change it a little bit to make my point. Let us suppose that the data set is of the form:
>
> Date-time                    Air temperature (Cel)    Dew-point temperature (Cel)
> 13 December, 2013, at 8am    11.2                     10.2
> 13 December, 2013, at 9am    -unknown-                10.2
>
> - I.e., the date is not stored in ISO format, but in some sort of a more readable format. My understanding of the metadata (I cc Jeni explicitly to check whether I am right) is that the metadata for both fields includes a type:datetime and an accompanying format:"dd MM,...." formatting string.
>
> Clearly, the field content is not appropriate for RDF as a direct replacement to produce an xsd:date. We have then two possibilities: either we say
>
> <> <URLforDate-Time> {Date-Time}^^xsd:dateTime ;
>
> in which case we say that the template engine has to go through some heuristics to try to match the string to a dateTime, i.e., to produce a correct object value; or we say something like (ad-hoc syntax here):
>
> <> <URLforDate-Time> {Date-Time}^^{{type}:{format}} ;
>
> which instructs the template engine to generate a datatype using the metadata values to do it, thereby relying on the metadata provided by the data publisher
>
> - the second field in the final row is a bit similar (say there was a problem with the sensor, hence the non-numeric field).
>
> <> <URLforAir-Temperature> {Air temperature (Cel)}^^xsd:double ;
>
> With this, a mechanical template replacement will go wrong unless, again, some sort of replacement heuristics is involved. However, the metadata specification is such that the original author can provide a 'type' entry for the column as a whole (setting it to, say, float), but can also add a specific metadata value for 'type' on each individual field, in this case adding a 'type:string'. Which would make
>
> <> <URLforAir-Temperature> {Air temperature (Cel)}^^{{type}} ;
>
> a better approach.
>
> Some more answers below:
>
> On 27 May 2014, at 17:29 , Andy Seaborne <andy@apache.org> wrote:
>
>> Ivan,
>>
>> I am not saying the metadata is disjoint - it is a valuable input to conversion and I fully expect e.g. it to be used to get datatypes right if that is what the data consumer wants.
>
> Well, we do seem to agree then... What I am saying is that the metadata should be reused, when necessary, but the 'how' needs specification! Because at this moment we do not have it...
>
>>
>> The metadata as it stands at the moment is insufficient to capture the conceptual meaning of denormalized data.
>>
>> There is no guarantee there is metadata at all.
>
> True on both counts. And remember that the mechanical approach was saying: do a simple RDF/XML/JSON conversion as simply as possible based on the availability (or not!) of the metadata, defining in detail which metadata entries are meaningful, possibly adding one or two basic metadata entries that are meaningful for XML/RDF/JSON only, respectively; if the result is not appropriate for the final processing, use some XSLT/SPARQL CONSTRUCT/???? as a second, more complex and data-format-specific step.
>
> (I think we are at the same level of discussion as the contrast between Direct Mapping and R2RML:-)
>
>>
>> The roles and intentions of data publisher and data consumer are not the same.
>
> Sure.
>
> Again: I am not trying to make some ideological issue out of it, very far from it. The only thing I am saying is that we have to specify the template-based approach so that it is bound, when possible and necessary, to the rest of the work we do, before making a decision. Once we have a skeleton for that (which I have not seen yet) we can also decide whether a detailed specification of the templates would require two years' work for each of the formats (which, as you correctly said, is my _main_ concern) or whether it is a trivial thing to do.
>
> Andy, just to make it even clearer, in case it was not: if you can convince me that such a specification _can_ be done very easily indeed, and that it does not require an implementation as complex as, say, the complete parsing of RDFa or even JSON-LD, then I am all for it! (Because I do like the approach, conceptually).
>
> (By the way, the second point is also important: to be able to keep Web Developers on board, whatever we produce should be such that, say, a CSV->JSON conversion should be doable by a reasonably seasoned Javascript programmer quickly and in a few pages (let us put the complexity of CSV parsing aside, which is complex in its own right...).)
>
> (Clearly, my experience as an RDFa implementer influences me. RDFa is fine and properly defined and useful, but the amount of work I had to put into the pyRdfa parser is a warning sign.)
>
> You guys have a good meeting!
>
> Cheers
>
> Ivan
>
>
>>
>> The duplication I see is that we would have to define flat conversion and, separately, reshaping conversion.
>>
>> One of the examples we have is
>>
>> https://github.com/w3c/csvw/blob/gh-pages/examples/simple-weather-observation.md
>>
>> How would you approach that?
>>
>> Andy
>>
>> On 27/05/14 20:32, Ivan Herman wrote:
>>> Well... I think we disagree here.
>>>
>>> Of course, the metadata is *also* used for, say, displaying the CSV file.
>>> But making the metadata disjoint from the RDF/JSON/XML conversion means that there is an unnecessary duplication of the terms, and I am pretty much against that at this point. Obviously, some of the terms may be meaningful for, say, an XML conversion only, others have no meaning for any of the syntaxes, but I find it counterintuitive if things have to be repeated. The obvious examples are language tags or the datatype for a field, just off the top of my head; the choice of the primary key columns that would determine the subject for JSON or RDF may be another. (I have just arrived at my hotel in NYC and I have to go out, so I could not check all the details.)
>>>
>>> What this means, in my view, is that
>>>
>>> - if we go for the 'mechanical' approach that I wrote down then, for *some* of the metadata entries we provide a natural mapping to the RDF concepts (which may always be overridden somehow with RDF-specific values). This is more or less what I wrote down, though not all keys have a systematic RDF equivalent
>>>
>>> - if we go for the graph templating approach then the graph templates should be defined in a way that, for some of the values (like the ones I cited above) there is a syntax extracting those. The syntax may be very simple (something like {{key}} meaning that the value of a key valid for that field is used), but I am not sure we would not get to some 'if-then-else' issues ('if the language is set, generate a language tagged literal, otherwise a plain literal').
>>>
>>> I am *not* saying the template mechanism cannot be defined and, if so, it may well be superior. But I do not believe the specification would be as simple as in the examples you had... But I believe we should have a more detailed sketch for a specification, a bit like what I did for the mechanical approach, before making an informed decision.
>>>
>>> Cheers
>>>
>>> Ivan
>>>
>>>
>>> On 27 May 2014, at 06:56 , Andy Seaborne <andy@apache.org> wrote:
>>>
>>>> Ivan,
>>>>
>>>> What I gave was a description of the graph templating approach. It is not a complete spec. As I see it, we are trying to establish the scope of part of the technical work of the working group and Jeremy's example is a (the) example we have of CSV to RDF conversion.
>>>>
>>>> Part of the scoping is the relationship of metadata to conversion.
>>>>
>>>> The metadata is about what the CSV file "is", and details about its publication. So it is not capturing everything about the conceptual information that the CSV file is about.
>>>>
>>>> An explicitly provided template for RDF conversion is what the user wants and puts in structure that isn't obvious from the CSV file alone nor declared in the metadata.
>>>>
>>>> They may be different; it may be intended. The authorship roles are different.
>>>>
>>>> Metadata is going to help display CSV files in HTML and it's a great help in finding CSV files on the web, and validating them. Metadata comes primarily from the data publisher. An advanced template comes from the data consumer and is the format used by the conversion tool.
>>>>
>>>> Only deriving conversion from metadata makes assumptions about the emergence of provided metadata - I doubt that metadata info for existing CSV publications is going to emerge quickly and there is a lot of existing CSV data. It seems dubious to me to assume the data consumer is going to write missing metadata to drive flat conversion, when they still have further steps to perform to get what they want. Instead, they'll write code to go CSV to what they want.
>>>>
>>>> When there is a CSV publisher, and a data consumer wanting RDF (or JSON, or XML), so they aren't reading the CSV file directly as CSV, all you need is a template, written by the data consumer, and a tool that processes templates.
>>>>
>>>> Andy
>>>>
>>>> On 22/05/14 15:18, Ivan Herman wrote:
>>>>> Hi Andy,
>>>>>
>>>>> thanks.
>>>>>
>>>>> My problem is not with these simple cases. My problem is to understand how templates will be combined with the metadata definition in general; at the moment these are fairly disconnected.
>>>>>
>>>>> Looking at Jeni's latest draft, each field may have its own particular set of properties (although some of them can be set for the column as a whole, they can be specialized for a specific field). This means that a pattern of the sort
>>>>>
>>>>> <something> <something> {colname} .
>>>>>
>>>>> may become slightly underspecified. For example, in your example, you translated the metadata including a datatype definition into something like
>>>>>
>>>>> <something> <something> {colname}^^xsd:double
>>>>>
>>>>> but that may not be o.k.; it should be something like
>>>>>
>>>>> <something> <something> {colname}^^xsd:{{datatype}}
>>>>>
>>>>> where '{{datatype}}' is my ad-hoc syntax to denote the _value_ of the property "datatype".
>>>>> Actually, it may become more complicated insofar as the datatype value should probably not be taken verbatim, i.e., if it says 'number', then it should be translated to its XML Schema counterpart (either we include an if-then-else into the template language or we have to write down a specification on how exactly the template processor works for each field and its properties). Another example is the 'separator' field; if a field includes a 'separator' property, then the result of the template expansion may become something like
>>>>>
>>>>> <something> <something> (l1 l2 l3 l4) .
>>>>>
>>>>> It all can be done of course. But, unless we keep the templates completely disjoint from the metadata (which I think would be a mistake), we have quite some work to do reconciling the templates with the metadata definition:-( Did you have any thoughts on that already?
>>>>>
>>>>> Ivan
>>>>>
>>>>> P.S. Sorry, I am off-line at the moment due to a power outage, I cannot check Gregg's older document; maybe he did deal with these.
>>>>>
>>>>>
>>>>>
>>>>> On 21 May 2014, at 19:46 , Andy Seaborne <andy@apache.org> wrote:
>>>>>
>>>>>> I have written up more on graph templates:
>>>>>>
>>>>>> https://github.com/w3c/csvw/blob/gh-pages/examples/graph-templating.md
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>
>>>>>
>>>>> ----
>>>>> Ivan Herman, W3C
>>>>> Digital Publishing Activity Lead
>>>>> Home: http://www.w3.org/People/Ivan/
>>>>> mobile: +31-641044153
>>>>> GPG: 0x343F1A3D
>>>>> WebID: http://www.ivan-herman.net/foaf#me
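
For illustration only, the two-stage split discussed in this thread (metadata and code produce bindings; bindings fill a template) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not anything the working group has specified: the metadata key names, the strptime-style format string, the per-cell 'type' override and the example.org property URIs are all hypothetical. Month-name parsing assumes an English (C) locale, and literals are not escaped.

```python
import re
from datetime import datetime

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical column metadata. The key names and the strptime-style format
# string are assumptions, not the syntax of any metadata draft.
COLUMN_METADATA = {
    "Date-time": {"type": "datetime", "format": "%d %B, %Y, at %I%p"},
    "Air temperature (Cel)": {"type": "double"},
    "Dew-point temperature (Cel)": {"type": "double"},
}

def literal(lexical, datatype):
    """Write a typed literal in Turtle-like syntax (no escaping; sketch only)."""
    return f'"{lexical}"^^<{XSD}{datatype}>'

def to_term(value, meta):
    """Stage 1: turn one cell into an RDF term, driven by its metadata."""
    kind = meta.get("type", "string")
    if kind == "datetime":
        parsed = datetime.strptime(value, meta["format"])
        return literal(parsed.isoformat(), "dateTime")
    if kind == "double":
        return literal(str(float(value)), "double")
    return literal(value, "string")

def make_bindings(row, cell_overrides=None):
    """Build column-name -> RDF-term bindings; per-cell metadata wins over column metadata."""
    overrides = cell_overrides or {}
    bindings = {}
    for name, value in row.items():
        meta = {**COLUMN_METADATA.get(name, {}), **overrides.get(name, {})}
        bindings[name] = to_term(value, meta)
    return bindings

# Stage 2: the template only names columns; it carries no datatype logic.
TEMPLATE = (
    "<> <http://example.org/def#dateTime> {Date-time} ;\n"
    "   <http://example.org/def#airTemperature> {Air temperature (Cel)} ."
)

def fill(template, bindings):
    """Replace each {column name} with the term bound to that column."""
    return re.sub(r"\{([^{}]+)\}", lambda m: bindings[m.group(1)], template)

ROWS = [
    {"Date-time": "13 December, 2013, at 8am",
     "Air temperature (Cel)": "11.2", "Dew-point temperature (Cel)": "10.2"},
    {"Date-time": "13 December, 2013, at 9am",
     "Air temperature (Cel)": "-unknown-", "Dew-point temperature (Cel)": "10.2"},
]

# Hypothetical per-cell override for the second row, as in the 'type:string' case.
OVERRIDES = [None, {"Air temperature (Cel)": {"type": "string"}}]

if __name__ == "__main__":
    for row, override in zip(ROWS, OVERRIDES):
        print(fill(TEMPLATE, make_bindings(row, override)))
```

In this sketch the datatype decision lives entirely in the binding step, so the same bindings could in principle feed a JSON or XML template as well; whether that division is easy to specify in general is exactly the open question in the thread.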
Received on Wednesday, 28 May 2014 11:18:18 UTC