Re: RDF *already* supports literal subjects - a thought experiment from Pat Hayes on 2010-07-12 (semantic-web@w3.org from July 2010)

From: Pat Hayes <phayes@ihmc.us>
Date: Mon, 12 Jul 2010 18:11:35 -0500
To: Graham Klyne <GK@ninebynine.org>
Cc: Sandro Hawke <sandro@w3.org>, Semantic Web <semantic-web@w3.org>
Message-Id: <0B41E202-BE5F-498B-AFC0-9376FBEC3B7A@ihmc.us>
On Jul 12, 2010, at 4:11 PM, Graham Klyne wrote:

> Sandro Hawke wrote:
>>> Hi Graham
>>>
>>>> So far, all this should lead to intended-literals in subject   
>>>> position that can
>>>> be read by any existing RDF/XML consuming application.
>>>>
>>>> What I'm less sure about is fixing the semantics:  as it stands,  
>>>> the  RDF
>>>> semantics is expressed in terms of allowing arbitrary   
>>>> interpretations --
>>>> mappings to things in the domain of discourse -- for all URI  
>>>> nodes  in a graph.
>>>> Would it be unreasonable or problematic to say that, for this   
>>>> particular form of
>>>> URI, the  denotation is fixed by the same general rules that  
>>>> govern  the
>>>> denotation of literals?
>>>
>>> No, but it would be a semantic extension to RDF, so the folk who  
>>> have  invested so much into implementing RDF as of 2004 will not  
>>> support it.  So if this is standardized, their engines will not  
>>> work properly  without changing some code. So they will not be  
>>> happy, for the same  reasons they are not happy with the current  
>>> suggestion.  Like most such suggestions along these lines, it  
>>> will produce problems of its  own, the most obvious being that we  
>>> would then have two syntactically  distinct but semantically  
>>> equivalent ways to write every literal in  the places where  
>>> literals are permitted, requiring engines to check  for all these  
>>> different forms all the time (in fact, to check *every*  URI in  
>>> any RDF just in case it is a hidden literal.)  In the case of   
>>> plain literals, we would actually have four such ways to write  
>>> them  instead of the two we have now.
>>>
>>> Although it's ingenious, I think this is laying land-mines for  
>>> future  developers.
>> Still, it might be a good way to grandfather old systems and old
>> syntaxes, at some point.   The duplication could be avoided just by
>> saying don't do that.  (That is: never serialize as a
>> data-uri-literal when you can syntactically use a real literal
>> instead.)
>
> Hi Pat, Sandro,
>
> I think Sandro's response crystalizes what I was trying to suggest.
>
> To rewind a little, one of the biggest problems of standards  
> deployment, once one has an installed base, is to plot a suitable  
> migration path.  That is, deployment of a new feature should not  
> break old systems.
>
> Maybe my view is limited, but my perception is that most deployed  
> software toolkits don't actually implement the formal semantics.  (I  
> don't mean to imply the formal semantics are not important - I think  
> they are but, at the current state of development, more of a guide  
> to developers and data model designers than enforced in software.)   
> With such a view, a change in the formal semantics to fix (as in  
> constrain, not repair) a family of URIs would have little if any  
> practical effect on deployed software.
>
> Taking a slightly different approach:  introducing the data: URIs as  
> suggested and not changing the RDF semantics would be entirely  
> consistent with today's RDF semantics; some of the intended  
> inferences would not be required by current semantics, though would  
> not be disallowed or inconsistent.  Thus, completeness of RDF  
> semantics based inferences with respect to the intended semantics  
> would be sacrificed, but soundness would not.
>
> ...
>
> So, if one truly does feel a need to introduce literals-as-subjects  
> into RDF' (RDF-prime), how is one to deal with existing RDF  
> processing systems.  Providing a URI-compatible form for literals  
> seems a reasonable bridging option.  But how does one minimize the  
> cost of alternate forms for literals?
>
> I think the answer may lie in avoiding alternative forms in the  
> abstract syntax (with respect to which the formal semantics is  
> defined).  Thus, in the abstract syntax, the suggested data: URIs  
> would be singled out for prohibition, to be replaced by the  
> corresponding literals (a stronger version of Sandro's "Don't do  
> that").  Software elements that need to apply the formal semantics  
> would be required to deal with only the literal node forms.  And  
> each serialization syntax would have its own mapping to the abstract  
> syntax, permitting data: URIs or literals or both, as befits the  
> circumstances.
>
> Jeremy noted that many of the potential costs are associated with
> user interfaces that have been built on an assumption of
> subjects-as-URIs (or bNodes).  I can't see the full range of
> problems here, but from my experience, many of these interfaces are
> set up to use rdfs:label values to represent such nodes - an
> approach that could
> apply just as well to data: URIs, with the added possibility of  
> "inferring" a suitable rdfs:label property (which IIRC is  
> semantically void) for any data: URI.  A harder problem here, maybe,  
> is that data: URIs don't in general lend themselves to presentation  
> as qnames, which are commonly used for presenting URIs compactly  
> (which also restricts their possible use as predicates in RDF/XML).
>
> ...
>
> In summary, what Sandro said:  the suggested data: URIs should be
> used as a transitional measure, with their use restricted to
> particular RDF serialization forms, and mapped to a common abstract
> syntax so that they don't pollute future generations of RDF
> representation and processing software.
>
> #g

Let me try to state as crisply as possible what I see as wrong with  
this idea. In sum, it is about as bad an idea as anyone could propose,  
IMO: it does not solve the problem, it creates more confusion and  
complexity to work around a bug that should never have been allowed to  
happen in the first place, and it won't actually work, in practice,  
for utterly predictable social reasons. (Sorry, Graham, and nothing  
personal.)

First, as others have noted, we already have a workable, if ugly,
way to state what anyone might need to state with a literal subject in
RDF: instead of writing the obvious

<literal> :p :o .

one can write

_:x :same <literal> .
_:x :p :o .

using whatever form of :same one prefers, such as owl:sameAs. So we
don't need another complicated work-around. The point of allowing
literals as subjects was to avoid having to use a work-around, not to
invent a new one; and also, in fact, to simplify RDF and make it more
elegant, which is not a purpose served by yet another work-around
either. So this idea doesn't really help.
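
As a rough illustration of the existing work-around, here is a minimal Python sketch that rewrites any triple with a literal subject into the two-triple blank-node form shown above. The term classes, the function name expand_literal_subjects, and the choice to reuse one blank node per distinct literal are all assumptions made for this example; owl:sameAs is just one possible choice for :same.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str
    datatype: str = ""  # empty string stands for a plain literal (assumption)


# owl:sameAs is one possible choice for :same
SAME = IRI("http://www.w3.org/2002/07/owl#sameAs")


def expand_literal_subjects(triples, same=SAME):
    """Rewrite '<literal> :p :o' as '_:x :same <literal>' plus '_:x :p :o'."""
    fresh = count()
    bnode_for = {}  # one blank node per distinct literal (a design choice)
    out = []
    for s, p, o in triples:
        if isinstance(s, Literal):
            if s not in bnode_for:
                bnode_for[s] = BNode(f"x{next(fresh)}")
                # Tie the blank node to the literal, which is now an object.
                out.append((bnode_for[s], same, s))
            s = bnode_for[s]
        out.append((s, p, o))
    return out
```

Triples whose subject is already an IRI or blank node pass through unchanged, so the output contains literals only in object position.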

But worse, it creates a whole new set of awkwardnesses. While having  
something which is syntactically a URI but semantically a literal does  
sneak the literal past the parsers, it does not get it past any  
inference engines that might be waiting at the other side. And those  
engines now have a truly awful task. Some of the URIs they are looking  
at are actually literals in disguise, and those have to be treated  
specially, differently from other URIs. In fact they have to be  
treated like literals, because they are literals in disguise (LIDs).  
But which of the many URIs are LIDs? The only way to find out is to  
micro-parse the URIs themselves, and so you have to do that to all of  
them (in subject or object position). And when you do find a LID, what  
do you do? It's impossible to completely exhaust all the inferences  
that might be relevant to these LID things at one time, as new  
information might crop up later; and in any case, the same value might  
also occur as actual literals in object positions, and the engine  
needs to be smart enough to do to LIDs anything that it can do to a  
typed literal. So you have to somehow mark them as being LIDs with a  
literal value, and record that value in a form that allows  
interoperation with literals. In fact, the smartest thing to do would  
probably be to just replace them with the corresponding literal. Which  
gets us back to a familiar issue, one might recall.
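
To make the cost concrete, here is a minimal Python sketch of what an engine would end up doing: micro-parse every URI in subject or object position and replace each LID with the corresponding literal. The encoding used here (the LID_PREFIX string and the layout of datatype and lexical form) is entirely hypothetical and invented for this example; the thread does not fix an exact data: URI form. URIs are modelled as plain strings and literals as (lexical, datatype) tuples, which is also just an assumption of the sketch.

```python
from urllib.parse import quote, unquote

# Hypothetical LID encoding, invented for illustration only:
#   data:text/x-rdf-literal;dt=<pct-encoded datatype IRI>,<pct-encoded lexical form>
LID_PREFIX = "data:text/x-rdf-literal;dt="


def encode_lid(lexical, datatype):
    """Build a LID under the assumed encoding (safe="" also escapes ',' and '/')."""
    return LID_PREFIX + quote(datatype, safe="") + "," + quote(lexical, safe="")


def decode_lid(uri):
    """Return (lexical, datatype) if uri is a LID under the assumed encoding, else None."""
    if not uri.startswith(LID_PREFIX):
        return None
    header, sep, body = uri[len(LID_PREFIX):].partition(",")
    if not sep:
        return None  # malformed: no comma separating datatype from lexical form
    return (unquote(body), unquote(header))


def replace_lids(triples):
    """Check *every* URI in subject or object position and swap LIDs for literals."""
    def fix(term):
        if isinstance(term, str):  # a URI; tuples are already literals
            literal = decode_lid(term)
            if literal is not None:
                return literal
        return term

    return [(fix(s), p, fix(o)) for s, p, o in triples]
```

Note that the output of replace_lids can contain a literal in subject position, which is exactly the situation the encoding was meant to avoid.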

Worse still, this proposal drives a truck through the RDF model and  
semantics. The basic model of RDF is that URI references (IRIs, now)  
are basically names. Each of them identifies something, and that is  
all that they do. Then all the RDF meaning is defined by what they  
identify, and that is how the interpretation-based semantics works.  
This is entirely conventional and based on nice, standard, classical  
theory all out of the old textbooks. But if we allow the meanings of
some (but not all) of these names to be determined by their
micro-lexical syntax, this completely changes the game. Those LIDs aren't  
just names any more. I'm not saying it cannot be done - it can - but  
it would require re-writing (and re-thinking) the entire RDF syntax  
and semantic model from the ground up. This is WAY worse than just  
allowing literals in subject position, which is really almost no  
change at all to RDF itself (even if it does break some existing  
software.)

Finally, just on social engineering grounds, Sandro's "Don't do that"  
idea is guaranteed to fail. People will do that (and who can blame  
them?) and then you will get RDF infected with these LIDs all over the  
place, and other RDF with actual literals, and then some mixed RDF  
with LIDs and literals. And (as folk will no doubt tell you, with some  
asperity) the semantics treats these as interchangeable and  
equivalent, so what is the problem? So actual deployed engines will,  
in fact, find it necessary to handle both kinds of literal forms in  
all kinds of positions, just to be able to survive in a world of
real-life RDF.

We have already been here, in fact, in a small way. RDF cognoscenti  
should be reminded at this point of the issue that we already had when  
we allowed plain RDF literals to be exactly equivalent to typed  
literals with the xsd:string datatype. This seemed harmless when we  
did it, and semantically it is trivial; but it gives all kinds of  
problems to inference engines, and so has turned out to be a nightmare  
of such horror that it has been seriously proposed to back-engineer  
RDF so that plain literals are considered to be retrospectively typed  
with a special RDF plain: datatype. But that raises problems of its  
own, as it violates the RDF specs in subtle but important ways. And  
there has to be a syntactic marker for the enclosed strings, and....  
Blech. That was ONE datatype. Now amplify this mess by at least a  
dozen, going on a hundred, datatypes, all with what will be, in  
effect, literals in two forms, syntactically incompatible but  
semantically identical. If I were an RDF developer, I would go look  
for a different job at this point.
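
For the one-datatype case, a minimal Python sketch of the bookkeeping involved might look like this. Literals are modelled as (lexical, datatype, language) tuples, and the name canonical is an assumption of the example, not anything from the RDF specs. Without the extra step, two literals that denote the same string simply compare as different terms.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def canonical(literal):
    """Map an xsd:string-typed literal to the plain form so both spellings compare equal.

    A literal is a (lexical, datatype_or_None, language_or_None) tuple.
    Language-tagged plain literals are left alone: they are not
    equivalent to xsd:string literals.
    """
    lexical, datatype, language = literal
    if datatype == XSD_STRING and language is None:
        return (lexical, None, None)
    return literal


plain = ("abc", None, None)
typed = ("abc", XSD_STRING, None)

# Syntactically distinct, so a naive term comparison misses the match ...
naive_match = plain == typed
# ... and every comparison, index lookup and join has to canonicalize first.
careful_match = canonical(plain) == canonical(typed)
```

The LID proposal would call for a comparable normalization step for every datatype, not just this one.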

Pat



------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Monday, 12 July 2010 23:12:38 UTC