- From: Hans-Jürgen Rennau <hjrennau@gmail.com>
- Date: Sat, 26 Feb 2022 01:43:03 +0100
- To: Martynas Jusevičius <martynas@atomgraph.com>
- Cc: Christian Chiarcos <christian.chiarcos@web.de>, Paul Tyson <phtyson@sbcglobal.net>, semantic-web@w3.org
- Message-ID: <CA+H2zTCb=YeAbJ4cJFxFy=4jX7gWV8Y-50bcN6csPVWSA5GDAA@mail.gmail.com>
Absolutely agree - to me, that's the idea behind rml.io. Your concern about optimization puzzles me. A declarative mapping language à la RML is not optimized for anything except a clean and clear statement of the intended result - the what, not the how. That is the art of it. Optimization comes later, and the clearer our thought, and the more well-structured and intuitive its capture, the larger the scope for optimization behind the scenes. And the more sustainable and enduring the result of the time spent, because a mapping defined today, investing, say, eight hours, may be processed slowly today, faster in a month, and much faster in a year, without me spending another minute on it. So a key benefit is the potential return on investment.

And I mention in passing: how much cheaper is the *maintenance* of 100 simple artifacts (say, YARRRML documents) that do not bother with optimization and are backed by a sophisticated implementation, compared with 100 artifacts twice or ten times as complex in their quest for speed?

Did I really understand you correctly - in spite of its amazing generality, the RML approach is not promising because it is not concerned with optimization?

With kind regards - Hans-Jürgen

On Fri, 25 Feb 2022 at 21:58, Martynas Jusevičius <martynas@atomgraph.com> wrote:

>
> On Fri, 25 Feb 2022 at 21:49, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:
>
>> Thank you, Christian, that's helpful!
>>
>> We must take care not to stumble over differences of terminology. When I say "formats", I mean syntax types: XML, JSON, HTML, CSV, TSV, ... So there are not two XML formats; rather, XML is a format, JSON is a format, etc. Whether this is a fortunate choice of words is not important here, as long as we avoid a misunderstanding.
>>
>> What you call "formats" is something different, something I usually call "document types": a vocabulary or, in a narrower sense, an explicit or implicit model of structure, names and meanings. For example, two different web service messages, say FooRequest and BarResponse, are two document types. They certainly require two different mappings. We need not waste time discussing the fact that every document type requires its own custom mapping to RDF. It is obvious, and therefore I would not speak of the absence of a one-stop solution. We have to map non-RDF documents to RDF again and again, again and again having to deal with different document types. The permanent need to speak (to map) is the very reason one may ask for a language (a dedicated mapping language).
>>
>> To summarize my position: the necessity to perform custom mapping has nothing to do with the state of technology, but with the nature of things. It is from this perspective that the goal of a uniform mapping language, applicable to all or many formats (syntax types), becomes interesting. Just as we appreciate the benefits of a uniform styling language (CSS), a uniform document transformation language (XSLT and XQuery), a uniform modeling language (UML), a uniform locator model (URI), etc., we might also appreciate a uniform to-RDF mapping language. (Mind you - uniform does not mean that it is always the best choice, but it often is.)
>>
>> Concerning the raw RDF: perhaps an appropriate approach in some scenarios, but of little interest from the point of view of a unified mapping language, as one is thrown back on a generic transformation task. Which is exactly the baseline from which to depart.
>
> I think that is the idea behind https://rml.io/.
>
>> Another misunderstanding concerns the term "declarative"; I'll return to that later.
>>
>> Kind regards, Hans-Jürgen
>>
>> PS: I wonder which crosses you would make:
>> A uniform to-RDF mapping language?
>> o Too unclear what it means
>> o Not feasible
>> o Not useful
>> o Pointless because: ______
>
> I would cross all of the above. You can't have a mapping language that is equally optimized for both tabular and tree data. For example, streaming transformation of CSV is trivial, but streaming transformation of XML is complex. You might succeed in creating a general mapping language, but it would be very shallow and un-optimizable.
>
>> On Fri, 25 Feb 2022 at 17:24, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>>
>>> On Fri, 25 Feb 2022 at 15:44, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:
>>>
>>>> Thank you, Christian. Before responding, I have a couple of questions. You write:
>>>>
>>>> "(a) source to RDF (specific to the source format[s]; there *cannot* be a one-stop solution because the sources are heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts and YARRRML are of that kind, and except for being physically integrated with the source data, RDFa is, too)"
>>>>
>>>> I do not understand - YARRRML *is* a one-stop solution for all source formats included in an extensible list of formats, already now comprising RDB, CSV, JSON and XML. So in principle it is a comprehensive solution for a given heterogeneous set of data sources. Could you explain what you mean when you say "there cannot be a one-stop solution"?
>>>
>>> It is, in fact, an example of aggregation. So, even if it provides a common level of abstraction for different formats, the underlying machinery has to be specific to these source formats. Coverage for XML is great, of course, but "support for XML" doesn't necessarily mean that all XML formats are covered. A generic XML converter is not very helpful if your data requires keeping track of dependencies between multiple XML files, for example. You can convert/extract from DOCX documents with generic XML technology (+ZIP), but it's a nightmare that -- if done right -- requires you to understand hundreds of pages of documentation (including, but not limited to, https://interoperability.blob.core.windows.net/files/MS-DOCX/%5bMS-DOCX%5d.pdf). As long as people keep inventing formats, and as long as these formats keep evolving, any aggregation-based solution will be incomplete. Hence not a one-stop solution -- unless *all* format developers *everywhere* decide to work on the same aggregator platform and stop developing ad-hoc converters for ad-hoc formats (spoiler: despite serious efforts and *some progress*, this is not what happened in the past 50 years: in academia and among developers, we see fragmentation and conventions widely used within their own communities but not much beyond them -- as in the SW; and in industry, limited interoperability is actively used as a tool against competitors).
>>>
>>>> And a second question: what does "raw RDF representation" mean? I suppose the result of a purely mechanical translation, derived from item names, but I am not sure.
>>>
>>> A raw RDF representation of, say, XML can just encode the XML data structure in RDF, i.e., create a my:Element for every element, a my:name for its name, a my:Attribute for every attribute, a my:child for every child, a my:next for every sibling, etc. (my: is an invented namespace; replace it with whatever prefix you want for XML data structures.) That is trivial and can be done with a few dozen lines of XSL (e.g., https://github.com/acoli-repo/LLODifier/blob/master/tei/tei2llod.xsl), and it will convert any XML document into RDF. But this is not a meaningful representation, and it is too verbose to be practical, because you easily create thousands of triples for pieces of information that could be expressed in just a few, as you encode the complete structure of the XML file. In order to get the semantics out of the jungle of XML data, you need to filter and aggregate, and that can be (more or less) effectively done with SPARQL (for example).
>>>
>>> Best,
>>> Christian
>>>
>>>> Thank you in advance, kind regards - Hans-Jürgen
>>>>
>>>> On Fri, 25 Feb 2022 at 14:05, Christian Chiarcos <christian.chiarcos@web.de> wrote:
>>>>
>>>>>> Speaking of ways of thinking, integration means among other things that a graceful transition between tree and graph representation is a natural thing for us, almost as natural as an arithmetic operation, or the validation of a document against a schema, or conducting a query. If there is an abundance of tools, this is alarming enough; even more alarming is the common viewpoint that people may have to write custom code. For tasks of a fundamental nature we should have fundamental answers, which means declarative tools - *if* this is possible. By declarative I mean: allowing the user to describe the desired result and not care about how to achieve it.
>>>>>
>>>>> Well, that's not quite true: you need to at least partially describe three parts - the desired result, the expected input, and the relation between them. It seems you want something more than declarative in the traditional sense: a model that is not procedural, i.e., one that doesn't require the handling of any internal state, but is a plain mapping.
>>>>>
>>>>> XSLT <3.0 falls under this definition as long as you don't do recursion over named templates (because it doesn't let you update the values of variables) -- and for your use case you wouldn't need that. Likewise, SPARQL CONSTRUCT qualifies (not updates, because then you could iterate), and I guess the same holds for most query languages. However: it is generally considered a strength that both languages are capable of doing iterations, and if these are in fact required by a use case, a non-procedural formalization of the mapping would simply no longer be applicable.
>>>>>
>>>>>> If well done and not incurring a significant loss of flexibility, the gain in efficiency and the increase in reliability are obvious. (Imagine people driving self-made cars.)
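As a concrete aside, here is a minimal sketch of the filter-and-aggregate step Christian describes above: a SPARQL CONSTRUCT query over a raw XML-as-RDF encoding. The my: vocabulary mirrors his invented namespace (plus an assumed my:text property for text content, which he does not list); the person/name element names, the ex: terms and the IRI scheme are likewise invented for illustration only.

  PREFIX my: <http://example.org/xml-structure#>
  PREFIX ex: <http://example.org/vocab#>

  # Collapse the structural encoding of <person><name>...</name></person>
  # into one concise triple per person.
  CONSTRUCT {
    ?person ex:name ?name .
  }
  WHERE {
    ?el      a my:Element ;
             my:name "person" ;
             my:child ?nameEl .
    ?nameEl  my:name "name" ;
             my:text ?name .
    BIND (IRI(CONCAT("http://example.org/person/", ENCODE_FOR_URI(?name))) AS ?person)
  }

Applied to the thousands of structural triples a generic converter emits, a query of this kind keeps the few triples that carry meaning and discards the tree scaffolding.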
>>>>>
>>>>> There are query languages that claim to be Turing-complete (e.g., https://info.tigergraph.com/gsql), and in the general sense of separating computation from control flow (which is the entire point of a query language), they are declarative; but as they provide unlimited capabilities for iteration or recursion, they would not be declarative under your definition. The lesson here is that there is a level of irreducible complexity as soon as you need to iterate and/or update internal state. If you feel that these are not necessary, you can *make* any query-based mapping language compliant with your criteria simply by eliminating the parts of the language that deal with iteration and with the update (not the binding) of variables. That basically requires you to write your own validator to filter out functionalities that you don't want to support, nothing more, as the rest is handled by off-the-shelf technology. And for some languages (say, SPARQL), the choice of query operator (CONSTRUCT / SELECT) already does that for you.
>>>>>
>>>>> So, a partial answer to your question seems to be: *any query language* (minus a few of its functionalities) would do. Beyond that, the selection criteria are no longer a matter of functionality but of verbosity and entry bias, and the choice depends on the kind of source data you have. Overall, there seem to be basically three types of transformations:
>>>>>
>>>>> (a) source to RDF (specific to the source format[s]; there *cannot* be a one-stop solution because the sources are heterogeneous, we can only try to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts and YARRRML are of that kind, and except for being physically integrated with the source data, RDFa is, too)
>>>>> (b) source to SPARQL variable bindings + SPARQL CONSTRUCT (as in TARQL; this is a one-stop solution in the sense that you can apply one language to *all* input formats -- however, only those formats supported by your tool; the difference from the first group is that the mapping language itself is SPARQL, so it is probably more easily applicable for an occasional user of SW/LD technology than any special-purpose or source-specific formalism)
>>>>> (c) source to a raw RDF representation + SPARQL CONSTRUCT (this is an extension of (a) and the idea behind some of the *software* solutions suggested, but parts of the mapping effort are shifted from the source-specific converters into the query, as in (b); this could also be more portable than (a), as the format-/domain-specific part can be covered by generic converters/mappings)
>>>>>
>>>>> A fourth type, (d) source to raw RDF + SPARQL Update, would fall outside your classification -- but all of them would normally be considered declarative.
>>>>>
>>>>> Best,
>>>>> Christian
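To make option (b) concrete: TARQL runs SPARQL directly over CSV, binding each row's cells to variables named after the column headers, so the entire mapping is a single CONSTRUCT query. A minimal sketch, assuming a hypothetical books.csv whose header row is title,author; the IRI scheme is invented for illustration.

  # Run with something like: tarql mapping.sparql books.csv
  PREFIX dct: <http://purl.org/dc/terms/>

  CONSTRUCT {
    ?book dct:title   ?title ;
          dct:creator ?author .
  }
  WHERE {
    # ?title and ?author are bound from the CSV header names, once per row
    BIND (IRI(CONCAT("http://example.org/book/", ENCODE_FOR_URI(?title))) AS ?book)
  }

The query itself is plain SPARQL 1.1; what remains tool-specific, as Christian notes, is which input formats the bindings-producing front end supports.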
Received on Saturday, 26 February 2022 00:43:30 UTC