Re: semsheets from Martynas Jusevičius on 2022-02-25 (semantic-web@w3.org from February 2022)

From: Martynas Jusevičius <martynas@atomgraph.com>
Date: Fri, 25 Feb 2022 21:58:01 +0100
To: Hans-Jürgen Rennau <hjrennau@gmail.com>
Cc: Christian Chiarcos <christian.chiarcos@web.de>, Paul Tyson <phtyson@sbcglobal.net>, semantic-web@w3.org
Message-ID: <CAE35VmwC96o5e0XcNSgLsCq8CQX0NoWOf=V+TH=kFhsqAMe1fA@mail.gmail.com>
On Fri, 25 Feb 2022 at 21.49, Hans-Jürgen Rennau <hjrennau@gmail.com> wrote:

> Thank you, Christian, that's helpful!
>
> We must take care not to stumble over differences of terminology. When I
> say "formats", I mean syntax types: XML, JSON, HTML, CSV, TSV, ... So there
> are not two XML formats, but XML is a format, JSON is a format, etc. Not
> important here if this is a fortunate choice, as long as we avoid a
> misunderstanding.
>
> What you call "formats" is something different; something which I usually
> call "document types". A vocabulary or, in a narrower sense, an explicit or
> implicit model of structure, names and meanings. For example, two different
> web service messages, say FooRequest and BarResponse, are two document
> types. They certainly require two different mappings. We need not waste
> time on discussion about the fact that every document type requires its own
> custom mapping to RDF. It is obvious, and therefore I would not speak about
> the absence of a one stop. We have to map non-RDF documents to RDF again
> and again, again and again having to deal with different document types.
> The permanent need to speak (to map) is the very reason one may ask for a
> language (a dedicated mapping language).
>
> To summarize my position: the necessity to perform custom-mapping has
> nothing to do with the state of technology, but with the nature of things.
> It is from this perspective that the goal of a uniform mapping language,
> applicable to all or many formats (syntax types) becomes interesting. As we
> appreciate the benefits of a uniform styling language (CSS), a uniform
> document transformation language (XSLT and XQuery), a uniform modeling
> language (UML), a uniform locator model (URI), etc. - we might also
> appreciate a uniform to-RDF mapping language. (Mind you - uniform does not
> mean that it's always the best choice, but often.)
>
> Concerning the raw RDF: perhaps an appropriate approach in some scenarios,
> but with little interest from the point of view of a unified mapping
> language, as one is thrown back on a generic transformation task. Which is
> exactly the baseline from where to depart.
>

I think that is the idea behind https://rml.io/.

> Another misunderstanding concerns the term "declarative", I'll return to
> that later.
>
> Kind regards, Hans-Jürgen
>
> PS: I wonder which crosses you would make:
> A uniform to-RDF mapping language?
> o Too unclear what it means
> o Not feasible
> o Not useful
> o Pointless because: ______
>

I would cross all of the above. You can’t have a mapping language that is
equally optimized for both tabular and tree data. For example, streaming
transformation of CSV is trivial, but streaming transformation of XML is
complex.
You might succeed in creating a general mapping, but it would be very
shallow and un-optimizable.
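
To make the streaming point concrete, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not anyone's actual mapping engine: the sample data, the ex: prefix and the function names are made up. A CSV row is self-contained and can be mapped and forgotten, while the XML version has to wait for the end of each <person> subtree and release it by hand.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample inputs carrying the same two facts.
CSV_DATA = "id,name\n1,Alice\n2,Bob\n"
XML_DATA = (
    "<people>"
    "<person id='1'><name>Alice</name></person>"
    "<person id='2'><name>Bob</name></person>"
    "</people>"
)


def stream_csv(fh):
    """Map one row at a time; no state survives between rows."""
    for row in csv.DictReader(fh):
        yield (f"ex:person/{row['id']}", "ex:name", row["name"])


def stream_xml(fh):
    """Wait for the end of each <person>, map it, then free the subtree."""
    for _event, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag == "person":
            yield (f"ex:person/{elem.get('id')}", "ex:name", elem.findtext("name"))
            elem.clear()  # drop the finished subtree to keep memory bounded


csv_triples = list(stream_csv(io.StringIO(CSV_DATA)))
xml_triples = list(stream_xml(io.BytesIO(XML_DATA.encode("utf-8"))))
```

Both functions yield the same triples here, but the XML one already has to know which element marks a record boundary, and anything that refers across records would need extra buffering.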

> On Fri, 25 Feb 2022 at 17:24, Christian Chiarcos <
> christian.chiarcos@web.de> wrote:
>
>> On Fri, 25 Feb 2022 at 15:44, Hans-Jürgen Rennau <
>> hjrennau@gmail.com> wrote:
>>
>>> Thank you, Christian. Before responding, I have a couple of questions.
>>> You write:
>>>
>>> " (a) source to RDF (specific to the source format[s], there *cannot* be
>>> a one-stop solution because the sources are heterogeneous, we can only try
>>> to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts, YARRRML are of
>>> that kind, and except for being physically integrated with the source data,
>>> RDFa is, too)"
>>>
>>> I do not understand - YARRRML *is* a one-stop solution for all source
>>> formats included in an extensible list of formats, already now including:
>>> RDB, CSV, JSON, XML. So in principle it is a comprehensive solution for a
>>> given heterogeneous set of data sources. Could you explain what you mean,
>>> saying "there cannot be a one-stop solution"?
>>>
>>
>> It is, in fact, an example of aggregation. So, even if providing a common
>> level of abstraction for different formats, the underlying machinery has to
>> be specific to these source formats. Coverage for XML is great, of course,
>> but "support for XML" doesn't necessarily mean that all XML formats are
>> covered. A generic XML converter is not very helpful if your data requires
>> you to keep track of dependencies between multiple XML files, for example. You
>> can convert/extract from DOCX documents with generic XML technology (+ZIP),
>> but it's a nightmare that -- if done right -- requires you to understand
>> hundreds of pages of documentation (including, but not limited to
>> https://interoperability.blob.core.windows.net/files/MS-DOCX/%5bMS-DOCX%5d.pdf).
>> As long as people keep on inventing formats, and as long as these formats
>> keep on evolving, any aggregation-based solution will be incomplete. Hence
>> not a one-stop-solution -- unless *all* format developers *everywhere*
>> decide to work on the same aggregator platform and stop developing ad-hoc
>> converters for ad-hoc formats (spoiler: despite serious efforts and *some
>> progress*, this is not what happened in the past 50 years: In academia and
>> among developers, we see fragmentation and conventions widely used within
>> their own communities but not so much beyond them -- as in the SW; and in
>> the industry, limited interoperability is actively used as a tool against
>> competitors).
>>
>>
>>> And a second question: what does "raw RDF representation" mean? I
>>> suppose the result of a purely mechanical translation, derived from item
>>> names, but I am not sure.
>>>
>>
>> A raw RDF representation of, say, XML in RDF can just encode the XML data
>> structure in RDF, i.e., create a my:Element for every element, my:name for
>> its name, a my:Attribute for every attribute, my:child for every child,
>> my:next for every sibling, etc. (my: is an invented namespace, replace by
>> whatever prefix you want for XML data structures.) That's trivial and can
>> be done with a few dozen lines in XSL (e.g.,
>> https://github.com/acoli-repo/LLODifier/blob/master/tei/tei2llod.xsl).
>> And it will convert any XML document into RDF. But this is not a meaningful
>> representation and it's too verbose to be practical, because you easily
>> create thousands of triples for pieces of information that can be expressed
>> in just a few, as you encode the complete structure of the XML file. In
>> order to get the semantics out of the jungle of XML data you need to filter
>> and aggregate, and that can be (more or less) effectively done with SPARQL
>> (for example).
>>
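As a rough illustration of the "raw RDF" encoding described above, here is a minimal Python sketch (standard library only, triples kept as plain tuples rather than a real RDF graph). The names my:Element, my:name, my:Attribute, my:child and my:next come from the description; my:attribute, my:value, my:text, the blank-node labels and the sample document are assumptions added for the example.

```python
import xml.etree.ElementTree as ET
from itertools import count


def raw_triples(xml_text):
    """Encode the XML tree structure itself as (subject, predicate, object) tuples."""
    ids = count(1)
    triples = []

    def visit(elem):
        node = f"_:e{next(ids)}"
        triples.append((node, "rdf:type", "my:Element"))
        triples.append((node, "my:name", elem.tag))
        for attr_name, attr_value in elem.attrib.items():
            attr = f"_:a{next(ids)}"
            triples.append((node, "my:attribute", attr))  # assumed property
            triples.append((attr, "rdf:type", "my:Attribute"))
            triples.append((attr, "my:name", attr_name))
            triples.append((attr, "my:value", attr_value))  # assumed property
        if elem.text and elem.text.strip():
            triples.append((node, "my:text", elem.text.strip()))  # assumed property
        previous = None
        for child in elem:
            child_node = visit(child)
            triples.append((node, "my:child", child_node))
            if previous is not None:
                triples.append((previous, "my:next", child_node))
            previous = child_node
        return node

    visit(ET.fromstring(xml_text))
    return triples


# Hypothetical document: one person with a name and a city.
doc = "<person id='1'><name>Alice</name><city>Vilnius</city></person>"
triples = raw_triples(doc)
```

For this three-element document the sketch produces 15 triples, although the document only states two facts about one person. That is the verbosity being described, and why a filtering and aggregating step is needed afterwards.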
>> Best,
>> Christian
>>
>>>
>>> Thank you in advance, kind regards - Hans-Jürgen
>>>
>>> On Fri, 25 Feb 2022 at 14:05, Christian Chiarcos <
>>> christian.chiarcos@web.de> wrote:
>>>
>>>>> Speaking of ways of thinking, integration means among other things that
>>>>> a graceful transition between tree and graph representation is a natural
>>>>> thing for us, almost as natural as an arithmetic operation, or the
>>>>> validation of a document against a schema, or conducting a query. If there
>>>>> is an abundance of tools, this is alarming enough; even more alarming is
>>>>> the common view point that people may have to write custom code. For tasks
>>>>> of a fundamental nature we should have fundamental answers, which means
>>>>> declarative tools - *if* this is possible. With declarative I mean:
>>>>> allowing the user to describe the desired result and not care about how to
>>>>> achieve it.
>>>>>
>>>>
>>>> Well, that's not quite true, you need to at least partially describe
>>>> three parts, the desired result, the expected input and the relation
>>>> between them. It seems you want something more than declarative in the
>>>> traditional sense, but you want a model that is not procedural, i.e.,
>>>> doesn't require the handling of any internal states, but a plain mapping.
>>>>
>>>> XSLT <3.0 falls under this definition as long as you don't do recursion
>>>> over named templates (because it didn't let you update the values of
>>>> variables) -- and for your use case you wouldn't need that.
>>>> Likewise, SPARQL CONSTRUCT does (but not SPARQL Update, because then you
>>>> could iterate), and I guess the same holds for most query languages.
>>>> However: It is generally considered a strength that both languages are
>>>> capable of doing iterations, and if these are in fact required by a use
>>>> case, a non-procedural formalization of the mapping would just not be
>>>> applicable anymore.
>>>>
>>>>>
>>>>> If well done and not incurring a significant loss of flexibility, the
>>>>> gain of efficiency and the increase of reliability is obvious. (Imagine
>>>>> people driving self-made cars.)
>>>>>
>>>>
>>>> There are query languages that claim to be Turing-complete [e.g.,
>>>> https://info.tigergraph.com/gsql], and in the general sense of
>>>> separating computation from control flow (which is the entire point of a
>>>> query language), they are declarative, but as they provide unlimited
>>>> capabilities for iteration or recursion, they would not be under your
>>>> definition. The lesson here is that there is a level of irreducible
>>>> complexity as soon as you need to iterate and/or update internal states. If
>>>> you feel that these are not necessary, you can *make* any query-based
>>>> mapping language compliant to your criteria if you just eliminate the parts
>>>> of the language that deal with iteration and the update (not the binding)
>>>> of variables. That basically requires you to write your own validator to
>>>> filter out functionalities that you don't want to support, nothing more, as
>>>> the rest is handled by off-the-shelf technology. And for some languages
>>>> (say, SPARQL), already the choice of query operator (CONSTRUCT / SELECT)
>>>> does that for you.
>>>>
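A validator of the kind mentioned above could start out as small as the following Python sketch. It is deliberately naive and rests on assumptions: it scans bare words instead of parsing SPARQL, so a keyword that appears inside a literal or as a local name would be misjudged, and the function name is made up. It accepts a query only if its first query form is CONSTRUCT and no SPARQL Update keyword occurs.

```python
import re

# SPARQL 1.1 Update operations that would allow state changes or iteration.
UPDATE_KEYWORDS = {
    "INSERT", "DELETE", "LOAD", "CLEAR", "DROP", "CREATE", "COPY", "MOVE", "ADD",
}
QUERY_FORMS = {"SELECT", "CONSTRUCT", "ASK", "DESCRIBE"}


def is_plain_mapping(query: str) -> bool:
    """Naive check: no update keywords, and the outermost query form is CONSTRUCT."""
    words = [w.upper() for w in re.findall(r"[A-Za-z]+", query)]
    if any(w in UPDATE_KEYWORDS for w in words):
        return False
    forms = [w for w in words if w in QUERY_FORMS]
    return bool(forms) and forms[0] == "CONSTRUCT"


# Hypothetical queries for illustration.
mapping = (
    "PREFIX ex: <http://example.org/> "
    "CONSTRUCT { ?s ex:name ?n } WHERE { ?s ex:label ?n }"
)
update = "DELETE { ?s ex:old ?o } INSERT { ?s ex:new ?o } WHERE { ?s ex:old ?o }"

accepted = is_plain_mapping(mapping)
rejected = is_plain_mapping(update)
```

A real validator would work on the parsed query rather than on words, but the division of labour is the same: the check is a thin filter, and the accepted queries run on an off-the-shelf engine.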
>>>> So, a partial answer to your question seems to be: *Any query language*
>>>> (minus a few of its functionalities) would do. Beyond that, selection
>>>> criteria are no longer a matter of functionality but of verbosity and entry
>>>> bias, and the choice is up to the kind of source data you have. Overall,
>>>> there seem to be basically three types of transformations:
>>>>
>>>> (a) source to RDF (specific to the source format[s], there *cannot* be
>>>> a one-stop solution because the sources are heterogeneous, we can only try
>>>> to aggregate -- GRDDL+XSL, R2RML, RML, JSON-LD contexts, YARRRML are of
>>>> that kind, and except for being physically integrated with the source data,
>>>> RDFa is, too)
>>>> (b) source to SPARQL variable bindings (+ SPARQL CONSTRUCT, as in
>>>> TARQL; this is a one-stop solution in the sense that you can apply one
>>>> language to *all* input formats, however, only those formats supported by
>>>> your tool; the difference to the first group is that the mapping language
>>>> itself is SPARQL, so it is probably more easily applicable for an
>>>> occasional user of SW/LD technology than any special-purpose or
>>>> source-specific formalism)
>>>> (c) source to a raw RDF representation + SPARQL CONSTRUCT (this is an
>>>> extension of (a) and the idea of some of the *software* solutions
>>>> suggested, but parts of the mapping effort are shifted from the
>>>> source-specific converters into the query, as in (b); this could also be
>>>> more portable than (a), as the format-/domain-specific parts can be covered by
>>>> generic converters/mappings)
>>>>
>>>> A fourth type, (d) source to raw RDF + SPARQL Update would fall out of
>>>> your classification -- but all of them would normally be considered
>>>> declarative.
>>>>
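The idea behind type (b) can be sketched in plain Python without any SPARQL engine. This is a stand-in under stated assumptions, not TARQL itself: each CSV row becomes one set of variable bindings, and a CONSTRUCT-like template is instantiated once per row. The template syntax (Python format placeholders instead of ?variables), the ex: prefix and the sample data are all invented for the example.

```python
import csv
import io

# CONSTRUCT-like template: {id} and {name} play the role of SPARQL variables.
TEMPLATE = [
    ("ex:person/{id}", "rdf:type", "ex:Person"),
    ("ex:person/{id}", "ex:name", "{name}"),
]


def construct(rows, template):
    """Instantiate every template triple once per row of variable bindings."""
    for bindings in rows:
        for s, p, o in template:
            yield (s.format(**bindings), p.format(**bindings), o.format(**bindings))


# Hypothetical tabular source; each row is one solution (set of bindings).
rows = csv.DictReader(io.StringIO("id,name\n1,Alice\n2,Bob\n"))
triples = list(construct(rows, TEMPLATE))
```

Only the reader that produces the bindings is specific to the source format. The template stays the same if the rows come from JSON or a database instead, which is what makes this family a one-stop solution within the set of formats a tool supports.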
>>>> Best,
>>>> Christian
>>>>
>>>
Received on Friday, 25 February 2022 20:58:28 UTC