Re: Blank nodes must DIE! [ was Re: Blank nodes semantics - existential variables?] from Eric Prud'hommeaux on 2020-07-24 (semantic-web@w3.org from July 2020)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Fri, 24 Jul 2020 23:22:18 +0200
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Cc: semantic-web@w3.org, Maxime Lefrançois <maxime.lefrancois@emse.fr>
Message-ID: <20200724212218.GC428186@w3.org>
On Fri, Jul 24, 2020 at 08:23:10PM +0200, Antoine Zimmermann wrote:
> This is a frequently asked question. When I mention cdt:ucum to semantic
> web folks, this suggestion gets made quite often. I had to justify the
> choice multiple times. Every time I have an argument with someone about
> this, I get out of the debate with an even stronger conviction that our
> design is, by far, the best option. I respect those who disagree, though.
> 
> Note that I mentioned this work in a thread that talks about making RDF more
> approachable by people. If you say to an engineer that they have to write
> ten kilowatt like this: "10 kW", you get a pretty strong point against an
> approach that tells them to write "10, and you have to understand it as a
> number of kilowatts". And if it's one hundred watts? "100 W". And a million
> kilowatts? "1e9 W" (or "1e6 kW", or "1 GW"). This is simple for anyone who's
> using physical quantities in their work. They just write it the way they are
> used to.

Fair point. This way you get to leverage UCUM's social engineering in
picking names and conventions that align as closely as possible with
existing practice.
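
To make the "1e9 W" / "1e6 kW" / "1 GW" example concrete, here is a
minimal Python sketch of reading such lexical forms and normalizing
them to watts. The TO_WATTS table and the to_watts function are
hypothetical and hand-written for illustration; a real system would
hand the unit code to one of the existing UCUM implementations rather
than a four-entry dictionary.

```python
import re
from decimal import Decimal

# Hypothetical, hand-picked table: factor converting each unit to watts.
# This is NOT a UCUM implementation, only enough for the example.
TO_WATTS = {
    "W": Decimal(1),
    "kW": Decimal(10) ** 3,
    "MW": Decimal(10) ** 6,
    "GW": Decimal(10) ** 9,
}

# A number with an optional exponent, then whitespace, then a unit code.
QUANTITY = re.compile(
    r"^([+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s+(\S+)$"
)


def to_watts(lexical: str) -> Decimal:
    """Parse a lexical form such as "1e6 kW" and return its value in watts."""
    match = QUANTITY.match(lexical)
    if match is None:
        raise ValueError(f"not a quantity: {lexical!r}")
    number, unit = match.groups()
    if unit not in TO_WATTS:
        raise ValueError(f"unit not in the example table: {unit!r}")
    return Decimal(number) * TO_WATTS[unit]


# The three spellings from the message denote the same power.
same_power = to_watts("1e9 W") == to_watts("1e6 kW") == to_watts("1 GW")
```

The engineer writes whichever spelling is natural; the comparison
happens on the normalized values.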


> On a more conceptual level, it is also important to understand that power,
> length, weight, mass are not numbers. They truly are not.
> There is no "10" in ten inches. In "ten inches", there are 0.254 meters.

A math pedant might argue that you are multiplying a one inch basis
vector by the ratio "10/1" or "10".
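
The pedant's reading can be written out as a short Python sketch: the
unit is a fixed length and the number is a plain ratio that scales it.
The names INCH_IN_METERS and scale are made up for illustration; the
only outside fact used is that one inch is defined as exactly 0.0254 m.

```python
from fractions import Fraction

# One inch is defined as exactly 0.0254 m.
INCH_IN_METERS = Fraction(254, 10000)


def scale(ratio: Fraction, unit_in_meters: Fraction) -> Fraction:
    """Multiply a unit (expressed in meters) by a dimensionless ratio."""
    return ratio * unit_in_meters


# "ten inches" as the ratio 10/1 applied to the one-inch basis.
ten_inches = scale(Fraction(10, 1), INCH_IN_METERS)  # 0.254 m, held exactly
```

Both readings agree on the result; they differ only in whether the
"10" is treated as a separable part of the quantity.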


> Also, with an infinite number of IRIs, you have to parse the IRI to
> understand what we are talking about. It is not in the spirit of RDF to have
> to parse IRIs to understand what they mean. It truly is the spirit of RDF
> datatypes to have to parse the lexical form to understand what a literal
> means.

I agree that it's distasteful to parse IRIs but I find it equally or
more awkward to have to parse units out of a literal. As RDF
conventions have progressed to date, we've been able to both match
URLs against an enumerable list and parse scalar values out of lexical
forms. Unless we add another slot, one of those comfortable
assumptions will have to be bent.


> The other problem is that you have tons of IRIs that are equivalent. Even
> the same datatype. ucum:N.m and ucum:J would represent the same datatype
> (same lexical space, same value space, same lexical-to-value mapping). This
> leads to complicated consequences in terms of reasoning.
> 
> Concerning your argument (in a later email) that you need microparsing
> anyway, I could return the argument: if you turn "100 W"^^cdt:ucum into
> "100"^^ucum:W to avoid parsing the "100 W", you still have to parse the
> ucum:W. So why not just get rid of all the infinite set of IRIs and embrace
> cdt:ucum as it is? (plus, parsing UCUM codes is done by just importing one
> of the many implementations of UCUM).

You're already normalizing before you do entailment and comparison.
Operationally, does it matter if you microparse the end of a URL
instead of the end of a lexical form? I concede that the current
D-entailment definitions might work more easily with a pre-digested
value space.
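
A rough Python sketch of what I mean by the two micro-parses. Both
functions and the UCUM_NS namespace are hypothetical (no such ucum:
datatype namespace is published as far as this thread goes); the point
is only that either encoding yields the same (number, unit) pair before
any normalization happens.

```python
from decimal import Decimal
from typing import Tuple

# Hypothetical namespace for units-in-datatype IRIs, for illustration only.
UCUM_NS = "http://example.org/ucum#"


def from_lexical(lexical: str) -> Tuple[Decimal, str]:
    """Split a cdt:ucum style lexical form, e.g. "100 W", into (100, "W")."""
    number, unit = lexical.split(maxsplit=1)
    return Decimal(number), unit


def from_datatype(lexical: str, datatype_iri: str) -> Tuple[Decimal, str]:
    """Split a units-in-datatype literal, e.g. "100"^^ucum:W, into (100, "W")."""
    if not datatype_iri.startswith(UCUM_NS):
        raise ValueError(f"not a ucum datatype: {datatype_iri!r}")
    return Decimal(lexical), datatype_iri[len(UCUM_NS):]


# Either way the unit code still has to go to a UCUM parser afterwards.
same_pair = from_lexical("100 W") == from_datatype("100", UCUM_NS + "W")
```

What this does not settle is Antoine's point about equivalent IRIs:
ucum:N.m and ucum:J would still be distinct datatype IRIs until the
unit codes themselves are normalized.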


> Of course, all of this does not *prove* that our approach is better. It is
> just my very strong conviction that it is.

I am equally convinced that I've not proven my point. So there!

Aesthetically, I prefer units-in-datatype but I think it would be
very interesting to implement both and see if one posed more of a
challenge than the other. (Now you respond with "no one's stopping
you" and I counter with "there's this conspiracy between my employers,
my family and the length of a day...".)


> --AZ
> 
> 
> Le 24/07/2020 à 00:06, Eric Prud'hommeaux a écrit :
> > On Tue, Jul 21, 2020 at 02:35:02PM +0200, Antoine Zimmermann wrote:
> > > Regarding physical quantities, such as "5 inches", etc., my colleague Maxime
> > > Lefrançois and myself coauthored a specification for a datatype for physical
> > > quantities [1]. It is quite simple: we reuse the Unified Code for Units of
> > > Measurement (UCUM), a standard that is used in many scientific applications,
> > > and combine it with a number:
> > > 
> > > <QUANTITY> ::= <NUMBER> <SPACES> <UCUMCODE>
> > > <NUMBER> ::= xsd:decimal(('e'|'E')xsd:integer)?
> > > 
> > > Since UCUM has a well defined semantics, so does our datatype. Better, since
> > > UCUM is implemented in many programming languages, my colleague Maxime could
> > > easily integrate it into Jena and its SPARQL engine [2].
> > > 
> > > So, with our Jena fork, one can write:
> > > 
> > > SELECT ?planet WHERE {
> > >    ?planet a ex:Planet;
> > >      ex:diameter ?s .
> > >    FILTER(?s > "2e11 mm"^^cdt:ucum)
> > > }
> > 
> > I applaud the work to extend XSD's numeric types so that RDF can have standard measurement types. But why not leverage your work by adding SPARQL support for UCUM types? e.g.
> > 
> > SELECT ?planet WHERE {
> >    ?planet a ex:Planet;
> >      ex:diameter ?s .
> >    FILTER(?s > "2e11"^^ucum:mm)
> > }
> > 
> > It feels cleaner to me to embed the entire type of the data in the literal's datatype rather than spreading it across an aggregator type (cdt:ucum) and the lexical value (" mm").
> > 
> > In either case we probably have a union type in the lexical value so we'd have to micro-parse doubles, decimals and integers, but the parsing is easier if the measurement unit is broken out into the end of the datatype URL.
> > 
> > There are a few UCUM units that aren't viable localnames (e.g. "m/s.s"), but I think we can encode around that (e.g. "m_s.s") in a way that still makes ucum: a practical namespace for datatypes.
> > 
> > 
> > > This works if the size of the planet is encoded as a cdt:ucum, no matter
> > > what unit one is using. One can even use "link for Gunter's chain" (unit
> > > "[lk_us]"), or "cubic meters per acre" (unit "m3/[acr_us]") [3], which are
> > > both units of length.
> > > 
> > > With some of our industrial partners, we are using this for energy data, and
> > > they seem to be very pleased with this approach, compared to an
> > > ontology-based approach.
> > > 
> > > 
> > > [1] https://w3id.org/lindt/custom_datatypes#ucum
> > > [2] You can try it at https://ci.mines-stetienne.fr/lindt/playground.html
> > > [3] Try this query in the playground:
> > > 
> > > """
> > > PREFIX iter: <http://w3id.org/sparql-generate/iter/>
> > > PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > > PREFIX cdt: <http://w3id.org/lindt/custom_datatypes#>
> > > PREFIX ex: <http://example.org/>
> > > 
> > > SELECT ?length ?normalized
> > > 
> > > WHERE{
> > > 
> > >    VALUES ?position { "2.7e3 m3/[acr_us]"^^cdt:ucum }
> > >    # convert to meters
> > >    BIND("0 m"^^cdt:ucum + ?position AS ?normalized )
> > > 
> > > }
> > > """
> > > 
> > > --AZ
> > > 
> > > Le 17/07/2020 à 01:57, Cox, Simon (L&W, Clayton) a écrit :
> > > > Yeah, the atomicity of the chunk is the point. This even applies to
> > > > quantities. 25.4mm is *identical* to 1” – they are the same thing. Any
> > > > engine that operates with quantities needs to understand that. ’25.4’
> > > > and ‘mm’ cannot be separated. Coordinates are slightly more complex but
> > > > it comes down to the same thing. A single element within a set of
> > > > coordinates that describes a position in space is not independent of the
> > > > other numbers in the tuple, or of the coordinate reference system within
> > > > which they are expressed. One value should *never* be used independent
> > > > of the others. Exactly the same position on the earth will be denoted by
> > > > three different numbers if embedded in a different coordinate reference
> > > > system. You can only ‘reason’ over them as a group, not individually.
> > > > 
> > > > *From:*Dan Brickley <danbri@danbri.org>
> > > > *Sent:* Thursday, 16 July, 2020 23:58
> > > > *To:* Jeen Broekstra <jeen@fastmail.com>
> > > > *Cc:* Semantic Web <semantic-web@w3.org>
> > > > *Subject:* Re: Blank nodes must DIE! [ was Re: Blank nodes semantics -
> > > > existential variables?]
> > > > 
> > > > …
> > > > 
> > > > I believe the big appeal of putting it all into the zone we call
> > > > "literals" is that you get a kind of atomicity; that chunk of data is
> > > > either there, or not there; it is asserted, or not asserted. With a
> > > > triples-based (description of a ) data structure you have to be
> > > > constantly on your guard that every subset of the full graph pattern is
> > > > at least sensible and harmless, even when subsetting these chunks is
> > > > often confusing or misleading for data consumers. I can't help wondering
> > > > whether notions of graph shapes from shacl, shex (and sparql) could be
> > > > exploited to create an RDF-based data format which had atomicity at the
> > > > level of entire shapes.
> > > > 
> > > > Dan
> > > > 
> > > >      Jeen
> > > > 
> > > 
> > > -- 
> > > Antoine Zimmermann
> > > Institut Henri Fayol
> > > École des Mines de Saint-Étienne
> > > 158 cours Fauriel
> > > CS 62362
> > > 42023 Saint-Étienne Cedex 2
> > > France
> > > Tél:+33(0)4 77 42 66 03
> > > Fax:+33(0)4 77 42 66 66
> > > http://www.emse.fr/~zimmermann/
> > > Member of team Connected Intelligence, Laboratoire Hubert Curien
> > > 
>
Received on Friday, 24 July 2020 21:22:25 UTC