- From: Hans-Jürgen Rennau <hjrennau@gmail.com>
- Date: Sat, 26 Feb 2022 01:43:03 +0100
- To: Martynas Jusevičius <martynas@atomgraph.com>
- Cc: Christian Chiarcos <christian.chiarcos@web.de>, Paul Tyson <phtyson@sbcglobal.net>, semantic-web@w3.org
- Message-ID: <CA+H2zTCb=YeAbJ4cJFxFy=4jX7gWV8Y-50bcN6csPVWSA5GDAA@mail.gmail.com>
Absolutely agree - to me, that's the idea behind rml.io. Your concern about optimization puzzles me. A declarative mapping language à la RML is not optimized for anything except a clean and clear statement of the intended result - the what, not the how. That is the art of it. Optimization comes later, and the clearer our thought, and the more well-structured and intuitive its capture, the larger the scope for optimization behind the scenes. And the more sustainable and enduring the result of the time spent, because a mapping defined today, investing, say, eight hours, may be processed slowly today, faster in a month, and much faster in a year, without me spending another minute on it. So a key benefit is the potential return on investment.

And I mention in passing: how much cheaper is the *maintenance* of 100 simple artifacts (say, YARRRML documents) that do not bother with optimization and are backed by a sophisticated implementation, compared with 100 artifacts twice or ten times as complex in their quest for speed?

Did I really understand you correctly - in spite of its amazing generality, the RML approach is not promising because it is not concerned with optimization?

With kind regards - Hans-Jürgen

On Fri, 25 Feb 2022 at 21:58, Martynas Jusevičius <martynas@atomgraph.com> wrote:

>
> On Fri, 25 Feb 2022 at 21:49, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:
>
>> Thank you, Christian, that's helpful!
>>
>> We must take care not to stumble over differences of terminology. When I say "formats", I mean syntax types: XML, JSON, HTML, CSV, TSV, ... So there are not two XML formats; rather, XML is a format, JSON is a format, etc. Whether this is a fortunate choice of words is not important here, as long as we avoid a misunderstanding.
>>
>> What you call "formats" is something different, something I usually call "document types": a vocabulary or, in a narrower sense, an explicit or implicit model of structure, names and meanings. For example, two different web service messages, say FooRequest and BarResponse, are two document types. They certainly require two different mappings. We need not waste time discussing the fact that every document type requires its own custom mapping to RDF. It is obvious, and therefore I would not speak of the absence of a one-stop solution. We have to map non-RDF documents to RDF again and again, again and again having to deal with different document types. The permanent need to speak (to map) is the very reason one may ask for a language (a dedicated mapping language).
>>
>> To summarize my position: the necessity to perform custom mapping has nothing to do with the state of technology, but with the nature of things. It is from this perspective that the goal of a uniform mapping language, applicable to all or many formats (syntax types), becomes interesting. Just as we appreciate the benefits of a uniform styling language (CSS), a uniform document transformation language (XSLT and XQuery), a uniform modeling language (UML), a uniform locator model (URI), etc., we might also appreciate a uniform to-RDF mapping language. (Mind you - uniform does not mean that it is always the best choice, but it often is.)
>>
>> Concerning the raw RDF: perhaps an appropriate approach in some scenarios, but of little interest from the point of view of a unified mapping language, as one is thrown back on a generic transformation task. Which is exactly the baseline from which to depart.
>
> I think that is the idea behind https://rml.io/.
>
>> Another misunderstanding concerns the term "declarative"; I'll return to that later.
>>
>> Kind regards, Hans-Jürgen
>>
>> PS: I wonder which crosses you would make:
>> A uniform to-RDF mapping language?
>> o Too unclear what it means
>> o Not feasible
>> o Not useful
>> o Pointless because: ______
>
> I would cross all of the above. You can't have a mapping language that is equally optimized for both tabular and tree data. For example, streaming transformation of CSV is trivial, but streaming transformation of XML is complex. You might succeed in creating a general mapping language, but it would be very shallow and un-optimizable.
>
>> On Fri, 25 Feb 2022 at 17:24, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>>
>>> On Fri, 25 Feb 2022 at 15:44, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:
>>>
>>>> Thank you, Christian. Before responding, I have a couple of questions. You write:
>>>>
>>>> "(a) source to RDF (specific to the source format[s]; there *cannot* be a one-stop solution because the sources are heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts and YARRRML are of that kind, and except for being physically integrated with the source data, RDFa is, too)"
>>>>
>>>> I do not understand - YARRRML *is* a one-stop solution for all source formats included in an extensible list of formats, already now comprising RDB, CSV, JSON and XML. So in principle it is a comprehensive solution for a given heterogeneous set of data sources. Could you explain what you mean when you say "there cannot be a one-stop solution"?
>>>
>>> It is, in fact, an example of aggregation. So, even if it provides a common level of abstraction for different formats, the underlying machinery has to be specific to these source formats. Coverage for XML is great, of course, but "support for XML" doesn't necessarily mean that all XML formats are covered. A generic XML converter is not very helpful if your data requires keeping track of dependencies between multiple XML files, for example. You can convert/extract from DOCX documents with generic XML technology (+ZIP), but it's a nightmare that -- if done right -- requires you to understand hundreds of pages of documentation (including, but not limited to, https://interoperability.blob.core.windows.net/files/MS-DOCX/%5bMS-DOCX%5d.pdf). As long as people keep inventing formats, and as long as these formats keep evolving, any aggregation-based solution will be incomplete. Hence not a one-stop solution -- unless *all* format developers *everywhere* decide to work on the same aggregator platform and stop developing ad-hoc converters for ad-hoc formats (spoiler: despite serious efforts and *some progress*, this is not what happened in the past 50 years: in academia and among developers, we see fragmentation and conventions widely used within their own communities but not much beyond them -- as in the SW; and in industry, limited interoperability is actively used as a tool against competitors).
>>>
>>>> And a second question: what does "raw RDF representation" mean? I suppose the result of a purely mechanical translation, derived from item names, but I am not sure.
>>>
>>> A raw RDF representation of, say, XML can just encode the XML data structure in RDF, i.e., create a my:Element for every element, a my:name for its name, a my:Attribute for every attribute, a my:child for every child, a my:next for every sibling, etc. (my: is an invented namespace; replace it with whatever prefix you want for XML data structures.) That is trivial and can be done with a few dozen lines of XSL (e.g., https://github.com/acoli-repo/LLODifier/blob/master/tei/tei2llod.xsl), and it will convert any XML document into RDF. But this is not a meaningful representation, and it is too verbose to be practical, because you easily create thousands of triples for pieces of information that could be expressed in just a few, as you encode the complete structure of the XML file. In order to get the semantics out of the jungle of XML data, you need to filter and aggregate, and that can be (more or less) effectively done with SPARQL (for example).
>>>
>>> Best,
>>> Christian
>>>
>>>> Thank you in advance, kind regards - Hans-Jürgen
>>>>
>>>> On Fri, 25 Feb 2022 at 14:05, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>>>>
>>>>>> Speaking of ways of thinking, integration means among other things that a graceful transition between tree and graph representation is a natural thing for us, almost as natural as an arithmetic operation, or the validation of a document against a schema, or conducting a query. If there is an abundance of tools, this is alarming enough; even more alarming is the common viewpoint that people may have to write custom code. For tasks of a fundamental nature we should have fundamental answers, which means declarative tools - *if* this is possible. By declarative I mean: allowing the user to describe the desired result and not care about how to achieve it.
>>>>>
>>>>> Well, that's not quite true: you need to at least partially describe three parts - the desired result, the expected input, and the relation between them. It seems you want something more than declarative in the traditional sense: a model that is not procedural, i.e., one that doesn't require the handling of any internal state, but is a plain mapping.
>>>>>
>>>>> XSLT <3.0 falls under this definition as long as you don't do recursion over named templates (because it doesn't let you update the values of variables) -- and for your use case you wouldn't need that. Likewise, SPARQL CONSTRUCT qualifies (not updates, because then you could iterate), and I guess the same holds for most query languages. However: it is generally considered a strength that both languages are capable of doing iterations, and if these are in fact required by a use case, a non-procedural formalization of the mapping would simply no longer be applicable.
>>>>>
>>>>>> If well done and not incurring a significant loss of flexibility, the gain in efficiency and the increase in reliability are obvious. (Imagine people driving self-made cars.)
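As a concrete aside, here is a minimal sketch of the filter-and-aggregate step Christian describes above: a SPARQL CONSTRUCT query over a raw XML-as-RDF encoding. The my: vocabulary mirrors his invented namespace (plus an assumed my:text property for text content, which he does not list); the person/name element names, the ex: terms and the IRI scheme are likewise invented for illustration only.

  PREFIX my: <http://example.org/xml-structure#>
  PREFIX ex: <http://example.org/vocab#>

  # Collapse the structural encoding of <person><name>...</name></person>
  # into one concise triple per person.
  CONSTRUCT {
    ?person ex:name ?name .
  }
  WHERE {
    ?el      a my:Element ;
             my:name "person" ;
             my:child ?nameEl .
    ?nameEl  my:name "name" ;
             my:text ?name .
    BIND (IRI(CONCAT("http://example.org/person/", ENCODE_FOR_URI(?name))) AS ?person)
  }

Applied to the thousands of structural triples a generic converter emits, a query of this kind keeps the few triples that carry meaning and discards the tree scaffolding.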
>>>>>
>>>>> There are query languages that claim to be Turing-complete (e.g., https://info.tigergraph.com/gsql), and in the general sense of separating computation from control flow (which is the entire point of a query language), they are declarative; but as they provide unlimited capabilities for iteration or recursion, they would not be declarative under your definition. The lesson here is that there is a level of irreducible complexity as soon as you need to iterate and/or update internal state. If you feel that these are not necessary, you can *make* any query-based mapping language compliant with your criteria simply by eliminating the parts of the language that deal with iteration and with the update (not the binding) of variables. That basically requires you to write your own validator to filter out functionalities that you don't want to support, nothing more, as the rest is handled by off-the-shelf technology. And for some languages (say, SPARQL), the choice of query operator (CONSTRUCT / SELECT) already does that for you.
>>>>>
>>>>> So, a partial answer to your question seems to be: *any query language* (minus a few of its functionalities) would do. Beyond that, the selection criteria are no longer a matter of functionality but of verbosity and entry bias, and the choice depends on the kind of source data you have. Overall, there seem to be basically three types of transformations:
>>>>>
>>>>> (a) source to RDF (specific to the source format[s]; there *cannot* be a one-stop solution because the sources are heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts and YARRRML are of that kind, and except for being physically integrated with the source data, RDFa is, too)
>>>>> (b) source to SPARQL variable bindings + SPARQL CONSTRUCT (as in TARQL; this is a one-stop solution in the sense that you can apply one language to *all* input formats -- however, only those formats supported by your tool; the difference from the first group is that the mapping language itself is SPARQL, so it is probably more easily applicable for an occasional user of SW/LD technology than any special-purpose or source-specific formalism)
>>>>> (c) source to a raw RDF representation + SPARQL CONSTRUCT (this is an extension of (a) and the idea behind some of the *software* solutions suggested, but parts of the mapping effort are shifted from the source-specific converters into the query, as in (b); this could also be more portable than (a), as the format-/domain-specific part can be covered by generic converters/mappings)
>>>>>
>>>>> A fourth type, (d) source to raw RDF + SPARQL Update, would fall outside your classification -- but all of them would normally be considered declarative.
>>>>>
>>>>> Best,
>>>>> Christian
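To make option (b) concrete: TARQL runs SPARQL directly over CSV, binding each row's cells to variables named after the column headers, so the entire mapping is a single CONSTRUCT query. A minimal sketch, assuming a hypothetical books.csv whose header row is title,author; the IRI scheme is invented for illustration.

  # Run with something like: tarql mapping.sparql books.csv
  PREFIX dct: <http://purl.org/dc/terms/>

  CONSTRUCT {
    ?book dct:title   ?title ;
          dct:creator ?author .
  }
  WHERE {
    # ?title and ?author are bound from the CSV header names, once per row
    BIND (IRI(CONCAT("http://example.org/book/", ENCODE_FOR_URI(?title))) AS ?book)
  }

The query itself is plain SPARQL 1.1; what remains tool-specific, as Christian notes, is which input formats the bindings-producing front end supports.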
Received on Saturday, 26 February 2022 00:43:30 UTC