- From: Martynas Jusevičius <martynas@atomgraph.com>
- Date: Sat, 26 Feb 2022 07:39:39 +0100
- To: Hans-Jürgen Rennau <hjrennau@gmail.com>
- Cc: Christian Chiarcos <christian.chiarcos@web.de>, Paul Tyson <phtyson@sbcglobal.net>, semantic-web@w3.org
- Message-ID: <CAE35Vmx8Sev=5_B9e4hGN6D8iVPdQqU38DS4zcL26LbiXtHntQ@mail.gmail.com>
On Sat, 26 Feb 2022 at 01:43, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:

> Absolutely agree - to me, that's the idea behind rml.io.
>
> Your concern about optimization is a puzzle to me. A declarative
> mapping language à la RML is not optimized for anything, except for a
> clean and clear statement of the intended result - the what, not the
> how. That's the art of it. Optimization comes later, and the clearer
> our thought, and the more well-structured and intuitive its capturing,
> the larger the scope for optimization behind the scenes. And the more
> sustainable and enduring the result of the time spent, because a
> mapping defined today with an investment of, say, eight hours may be
> processed slowly today, faster in a month, and much faster in a year,
> without me spending another minute on it. So a key benefit is the
> potential return on investment. And I mention in passing - how much
> cheaper is the *maintenance* of 100 simple artifacts (say, YARRRML
> documents) that do not bother with optimization and are backed by a
> sophisticated implementation, compared with 100 artifacts twice or ten
> times as complex in their quest for speed.
>
> Did I really understand you correctly - in spite of its amazing
> generality, the RML approach is not promising because it is not
> concerned with optimization?

I have no use for RML, because applying data model-specific
transformations (instead of one general language) has not been a problem
in my experience. SPARQL and XSLT implementations will be more explicit,
very likely more performant if done right, and, as I wrote before, far
better standardized and more widely supported.
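For the tabular case, a TARQL-style query (plain SPARQL CONSTRUCT applied
to CSV rows as variable bindings) is one sketch of what this can look
like; the column names (id, name) and the ex: vocabulary are invented for
illustration:

```sparql
# Hypothetical TARQL mapping, run e.g. as: tarql people.rq people.csv
# TARQL binds each CSV row's columns to variables named after the
# headers, here ?id and ?name.
PREFIX ex: <http://example.org/vocab#>

CONSTRUCT {
  ?person a ex:Person ;
          ex:name ?name .
}
WHERE {
  # Mint an IRI per row from the id column.
  BIND (IRI(CONCAT("http://example.org/person/", ?id)) AS ?person)
}
```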
> With kind regards - Hans-Jürgen
>
> On Fri, 25 Feb 2022 at 21:58, Martynas Jusevičius
> <martynas@atomgraph.com> wrote:
>
>> On Fri, 25 Feb 2022 at 21:49, Hans-Jürgen Rennau <hjrennau@gmail.com>
>> wrote:
>>
>>> Thank you, Christian, that's helpful!
>>>
>>> We must take care not to stumble over differences of terminology.
>>> When I say "formats", I mean syntax types: XML, JSON, HTML, CSV, TSV,
>>> ... So there are not two XML formats; XML is a format, JSON is a
>>> format, etc. It is not important here whether this is a fortunate
>>> choice, as long as we avoid a misunderstanding.
>>>
>>> What you call "formats" is something different - something I usually
>>> call "document types": a vocabulary or, in a narrower sense, an
>>> explicit or implicit model of structure, names and meanings. For
>>> example, two different web service messages, say FooRequest and
>>> BarResponse, are two document types. They certainly require two
>>> different mappings. We need not waste time discussing the fact that
>>> every document type requires its own custom mapping to RDF. It is
>>> obvious, and therefore I would not speak of the absence of a one-stop
>>> solution. We have to map non-RDF documents to RDF again and again,
>>> again and again having to deal with different document types. The
>>> permanent need to speak (to map) is the very reason one may ask for a
>>> language (a dedicated mapping language).
>>>
>>> To summarize my position: the necessity to perform custom mapping has
>>> nothing to do with the state of technology, but with the nature of
>>> things. It is from this perspective that the goal of a uniform
>>> mapping language, applicable to all or many formats (syntax types),
>>> becomes interesting. As we appreciate the benefits of a uniform
>>> styling language (CSS), a uniform document transformation language
>>> (XSLT and XQuery), a uniform modeling language (UML), a uniform
>>> locator model (URI), etc., we might also appreciate a uniform to-RDF
>>> mapping language. (Mind you - uniform does not mean that it is always
>>> the best choice, but often.)
>>>
>>> Concerning the raw RDF: perhaps an appropriate approach in some
>>> scenarios, but of little interest from the point of view of a unified
>>> mapping language, as one is thrown back on a generic transformation
>>> task - which is exactly the baseline from which to depart.
>>
>> I think that is the idea behind https://rml.io/.
>>
>>> Another misunderstanding concerns the term "declarative"; I'll return
>>> to that later.
>>>
>>> Kind regards, Hans-Jürgen
>>>
>>> PS: I wonder which crosses you would make:
>>> A uniform to-RDF mapping language?
>>> o Too unclear what it means
>>> o Not feasible
>>> o Not useful
>>> o Pointless because: ______
>>
>> I would cross all of the above. You can't have a mapping language that
>> is equally optimized for both tabular and tree data. For example,
>> streaming transformation of CSV is trivial, but streaming
>> transformation of XML is complex. You might succeed in creating a
>> general mapping language, but it would be very shallow and hard to
>> optimize.
>>
>>> On Fri, 25 Feb 2022 at 17:24, Christian Chiarcos
>>> <christian.chiarcos@web.de> wrote:
>>>
>>>> On Fri, 25 Feb 2022 at 15:44, Hans-Jürgen Rennau
>>>> <hjrennau@gmail.com> wrote:
>>>>
>>>>> Thank you, Christian. Before responding, I have a couple of
>>>>> questions. You write:
>>>>>
>>>>> "(a) source to RDF (specific to the source format[s], there
>>>>> *cannot* be a one-stop solution because the sources are
>>>>> heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML,
>>>>> RML, JSON-LD contexts, YARRRML are of that kind, and except for
>>>>> being physically integrated with the source data, RDFa is, too)"
>>>>>
>>>>> I do not understand - YARRRML *is* a one-stop solution for all
>>>>> source formats included in an extensible list of formats, already
>>>>> now including: RDB, CSV, JSON, XML. So in principle it is a
>>>>> comprehensive solution for a given heterogeneous set of data
>>>>> sources. Could you explain what you mean, saying "there cannot be
>>>>> a one-stop solution"?
>>>>
>>>> It is, in fact, an example of aggregation. So, even if it provides
>>>> a common level of abstraction for different formats, the underlying
>>>> machinery has to be specific to these source formats. Coverage for
>>>> XML is great, of course, but "support for XML" doesn't necessarily
>>>> mean that all XML formats are covered. A generic XML converter is
>>>> not very helpful if your data requires keeping track of
>>>> dependencies between multiple XML files, for example. You can
>>>> convert/extract from DOCX documents with generic XML technology
>>>> (+ZIP), but it's a nightmare that -- if done right -- requires you
>>>> to understand hundreds of pages of documentation (including, but
>>>> not limited to,
>>>> https://interoperability.blob.core.windows.net/files/MS-DOCX/%5bMS-DOCX%5d.pdf).
>>>> As long as people keep on inventing formats, and as long as these
>>>> formats keep on evolving, any aggregation-based solution will be
>>>> incomplete.
>>>> Hence not a one-stop solution -- unless *all* format developers
>>>> *everywhere* decide to work on the same aggregator platform and
>>>> stop developing ad-hoc converters for ad-hoc formats (spoiler:
>>>> despite serious efforts and *some progress*, this is not what
>>>> happened in the past 50 years: in academia and among developers, we
>>>> see fragmentation and conventions widely used within their own
>>>> communities but not so much beyond them -- as in the SW; and in
>>>> industry, limited interoperability is actively used as a tool
>>>> against competitors).
>>>>
>>>>> And a second question: what does "raw RDF representation" mean? I
>>>>> suppose the result of a purely mechanical translation, derived
>>>>> from item names, but I am not sure.
>>>>
>>>> A raw RDF representation of, say, XML in RDF can just encode the
>>>> XML data structure in RDF, i.e., create a my:Element for every
>>>> element, a my:name for its name, a my:Attribute for every
>>>> attribute, a my:child for every child, a my:next for every sibling,
>>>> etc. (my: is an invented namespace; replace it with whatever prefix
>>>> you want for XML data structures.) That's trivial, can be done with
>>>> a few dozen lines of XSLT (e.g.,
>>>> https://github.com/acoli-repo/LLODifier/blob/master/tei/tei2llod.xsl),
>>>> and it will convert any XML document into RDF. But this is not a
>>>> meaningful representation, and it's too verbose to be practical,
>>>> because you easily create thousands of triples for pieces of
>>>> information that could be expressed in just a few, as you encode
>>>> the complete structure of the XML file. In order to get the
>>>> semantics out of the jungle of XML data you need to filter and
>>>> aggregate, and that can be (more or less) effectively done with
>>>> SPARQL (for example).
>>>>
>>>> Best,
>>>> Christian
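A minimal sketch of such a lifting query, assuming the invented my:
namespace above for the raw encoding and an invented ex: target
vocabulary (the raw input triples shown in the comment are likewise
hypothetical):

```sparql
# Raw input (one possible encoding of <person><name>Ada</name></person>):
#   _:e1 a my:Element ; my:name "person" ; my:child _:e2 .
#   _:e2 a my:Element ; my:name "name" ; my:text "Ada" .
PREFIX my: <http://example.org/xml#>
PREFIX ex: <http://example.org/vocab#>

CONSTRUCT {
  ?p a ex:Person ;
     ex:name ?name .
}
WHERE {
  # Filter the structural jungle down to the few triples that matter.
  ?p a my:Element ; my:name "person" ; my:child ?n .
  ?n a my:Element ; my:name "name" ; my:text ?name .
}
```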
>>>>> Thank you in advance, kind regards - Hans-Jürgen
>>>>>
>>>>> On Fri, 25 Feb 2022 at 14:05, Christian Chiarcos
>>>>> <christian.chiarcos@web.de> wrote:
>>>>>
>>>>>>> Speaking of ways of thinking, integration means among other
>>>>>>> things that a graceful transition between tree and graph
>>>>>>> representation is a natural thing for us, almost as natural as
>>>>>>> an arithmetic operation, or the validation of a document against
>>>>>>> a schema, or conducting a query. If there is an abundance of
>>>>>>> tools, this is alarming enough; even more alarming is the common
>>>>>>> viewpoint that people may have to write custom code. For tasks
>>>>>>> of a fundamental nature we should have fundamental answers,
>>>>>>> which means declarative tools - *if* this is possible. By
>>>>>>> declarative I mean: allowing the user to describe the desired
>>>>>>> result and not care about how to achieve it.
>>>>>>
>>>>>> Well, that's not quite true: you need to at least partially
>>>>>> describe three parts - the desired result, the expected input,
>>>>>> and the relation between them. It seems you want something more
>>>>>> than declarative in the traditional sense: a model that is not
>>>>>> procedural, i.e., that doesn't require the handling of any
>>>>>> internal state, but is a plain mapping.
>>>>>>
>>>>>> XSLT < 3.0 falls under this definition as long as you don't do
>>>>>> recursion over named templates (because it didn't let you update
>>>>>> the values of variables) -- and for your use case you wouldn't
>>>>>> need that. Likewise, SPARQL CONSTRUCT does (but not SPARQL
>>>>>> Update, because then you could iterate), and I guess the same
>>>>>> holds for most query languages. However, it is generally
>>>>>> considered a strength that both languages are capable of doing
>>>>>> iterations, and if these are in fact required by a use case, a
>>>>>> non-procedural formalization of the mapping would simply no
>>>>>> longer be applicable.
>>>>>>
>>>>>>> If well done and not incurring a significant loss of
>>>>>>> flexibility, the gain in efficiency and the increase in
>>>>>>> reliability are obvious. (Imagine people driving self-made
>>>>>>> cars.)
>>>>>>
>>>>>> There are query languages that claim to be Turing-complete (e.g.,
>>>>>> https://info.tigergraph.com/gsql), and in the general sense of
>>>>>> separating computation from control flow (which is the entire
>>>>>> point of a query language) they are declarative, but as they
>>>>>> provide unlimited capabilities for iteration or recursion, they
>>>>>> would not be under your definition. The lesson here is that there
>>>>>> is a level of irreducible complexity as soon as you need to
>>>>>> iterate and/or update internal states. If you feel that these are
>>>>>> not necessary, you can *make* any query-based mapping language
>>>>>> compliant with your criteria by simply eliminating the parts of
>>>>>> the language that deal with iteration and the update (not the
>>>>>> binding) of variables. That basically requires you to write your
>>>>>> own validator to filter out functionality that you don't want to
>>>>>> support, nothing more, as the rest is handled by off-the-shelf
>>>>>> technology. And for some languages (say, SPARQL), the choice of
>>>>>> query operator (CONSTRUCT / SELECT) already does that for you.
>>>>>>
>>>>>> So, a partial answer to your question seems to be: *any query
>>>>>> language* (minus a few of its functionalities) would do. Beyond
>>>>>> that, selection criteria are no longer a matter of functionality
>>>>>> but of verbosity and barriers to entry, and the choice depends on
>>>>>> the kind of source data you have. Overall, there seem to be
>>>>>> basically three types of transformations:
>>>>>>
>>>>>> (a) source to RDF (specific to the source format[s]; there
>>>>>> *cannot* be a one-stop solution because the sources are
>>>>>> heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML,
>>>>>> RML, JSON-LD contexts and YARRRML are of that kind, and, except
>>>>>> for being physically integrated with the source data, RDFa is,
>>>>>> too)
>>>>>> (b) source to SPARQL variable bindings + SPARQL CONSTRUCT (as in
>>>>>> TARQL; this is a one-stop solution in the sense that you can
>>>>>> apply one language to *all* input formats, however, only those
>>>>>> formats supported by your tool; the difference to the first group
>>>>>> is that the mapping language itself is SPARQL, so it is probably
>>>>>> more easily applicable for an occasional user of SW/LD technology
>>>>>> than any special-purpose or source-specific formalism)
>>>>>> (c) source to a raw RDF representation + SPARQL CONSTRUCT (this
>>>>>> is an extension of (a) and the idea of some of the *software*
>>>>>> solutions suggested, but parts of the mapping effort are shifted
>>>>>> from the source-specific converters into the query, as in (b);
>>>>>> this could also be more portable than (a), as the
>>>>>> format-/domain-specific parts can be covered by generic
>>>>>> converters/mappings)
>>>>>>
>>>>>> A fourth type, (d) source to raw RDF + SPARQL Update, would fall
>>>>>> outside your classification -- but all of them would normally be
>>>>>> considered declarative.
>>>>>>
>>>>>> Best,
>>>>>> Christian
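A minimal sketch of why (d) is different, with an invented ex:
vocabulary: a CONSTRUCT reads one graph and emits another, so running it
twice over the same input yields the same output, whereas an Update like
the one below rewrites the store in place, so each application advances
its state:

```sparql
# Hypothetical example: each execution increments ex:step by one, i.e.
# the store carries state between runs. No single CONSTRUCT over the
# original input can express this, which is what moves (d) outside a
# "plain mapping" in the sense discussed above.
PREFIX ex: <http://example.org/vocab#>

DELETE { ex:job ex:step ?s }
INSERT { ex:job ex:step ?s2 }
WHERE {
  ex:job ex:step ?s .
  BIND (?s + 1 AS ?s2)
}
```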
Received on Saturday, 26 February 2022 06:40:07 UTC