Re: JJC's take on I18N concerns from Martin Duerst on 2003-08-15 (w3c-rdfcore-wg@w3.org from August 2003)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 15 Aug 2003 14:47:25 -0400
To: pat hayes <phayes@ihmc.us>
Cc: Jeremy Carroll <jjc@hpl.hp.com>, w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org
Message-Id: <4.2.0.58.J.20030815134640.0608f3d8@localhost>

Hello Pat,

Many thanks for your participation in this discussion.


At 10:16 03/08/15 -0700, pat hayes wrote:

>>Hello Jeremy,
>>
>>Just to make sure, here are some responses:
>>
>>At 21:32 03/08/13 +0300, Jeremy Carroll wrote:
>
>I agree with Jeremy; further comments below.
>
>>
>>>This is a reply to
>>>
>>>http://lists.w3.org/Archives/Member/w3c-i18n-ig/2003Aug/0022
>>>
>>>which was (AFAICT) endorsed by I18N at their recent telecon.
>>>
>>>I am also copying the recipients of
>>>http://lists.w3.org/Archives/Member/w3c-archive/2003Aug/0027
>>>(other than those who I believe are already on the To lists)

>>>[[
>>>2. I18N feels that the currently proposed implementation is overly
>>>complicated for the user, and that this will introduce a strong risk
>>>that users do not implement language information properly.
>>>]]
>>>
>>>RDF Core had feedback against other implementations on grounds of their
>>>complexity. This was a tradeoff decision.
>>
>>I think it is complexity for the user (somebody writing RDF (RDF/XML
>>or otherwise) or scraping,...) vs. complexity for implementers of
>>core software.
>
>You seem to be operating under assumptions about RDF/XML which are 
>inappropriate. We do not anticipate that the typical RDF user will ever 
>write or even see RDF/XML. In the RDF design it plays the role of a 
>machine-oriented data transmission notation; it will be as remote from 
>user experience as Javascript code is from the typical user of a web 
>browser. The readability of RDF/XML is relevant only to implementers who 
>are concerned with debugging their software.

I'm sorry that the definition of 'user' was not completely clear.
We don't mean the ultimate end user. We mean all the people scraping
data from the Web, and so on. The example of Javascript is probably
about right: there are a lot of people writing Javascript, and it would
be difficult to get most of these people to do the right thing.



>>>[[
>>>3.  The current approach assumes the existence of constructs to describe
>>>language and carry language information in the native markup associated
>>>with a fragment.  Such constructs may not exist, in which case it seems
>>>impossible to ascribe such information at a meta level.
>
>It is certainly not impossible, though it may indeed be ad-hoc. For 
>example, one could use a construction like this:
>
>_:x  ex:textIs "somepieceofXMLscrapedfromsomewhere"^^rdf:XMLLiteral .
>_:x  ex:language "fr" .
>
>where the lang tag is coded as a plain literal.

The point referred to constructs in some markup language.

Your argument is that there are mechanisms outside XML Literals that
would be able to indicate the language of an XML Literal.
We agree with this, but note that unless this is the solution
adopted by the specification, this kind of proposal is ad-hoc
(as you say) and therefore does not meet our requirement that
language information be easy to pick up by generic tools.
It would be similar to saying that in XML, there is no need for
xml:lang because each markup language can define its own
mechanism (e.g. attribute) to indicate language.



>>>[[
>>>4. It seems to I18N that it will be difficult to convert rdf created
>>>using the old syntax to the new syntax. Where legacy documents simply
>>>declared xml:lang at the top of the file, they will now have to declare
>>>it for every XML literal.  Also, there is no provision for automatic
>>>conversion from the old to the new syntax.
>>>]]
>>>
>>>Old style was vague, no indication that xhtml namespace needed declaring
>>>(predated xhtml?). Not really useable because of such problems, certainly
>>>not in a portable fashion. The old spec is sufficiently bad to make this
>>>problem a non-starter since it is not clear what old style xml literals
>>>are supposed to mean, particularly the treatment of namespaces. Also old
>>>spec was somewhat unclear how language was supposed to be treated.
>>
>>There definitely were some vaguenesses, but we agreed on what these
>>were in the area of language tagging at the Technical Plenary in Cannes.
>>The lastcall draft has clarified these.
>
>But there seems to be a basic assumption lying behind your point, that the 
>current spec must be backwards compatible with M&S. This is impossible: 
>indeed, our WG charter is to clarify and rationalize the M&S spec, which 
>was known to be confused when the WG was founded.

I agree with 'clarify'; this was explained to us at the joint meeting
in Cannes. I don't remember 'rationalize'. It doesn't sound like a bad
thing to do, but I don't think this affects the argument.

The basic assumption behind my argument is not backwards compatibility
(although this is also an issue separately). It is that it is not
productive in inter-WG discussions to have a joint meeting and agree
(without a need for long discussion) on what exactly the real unclarities
are (and how to address them), and to come back much later claiming
that everything is unclear anyway.


>>>[[
>>>5. I18N considers that it should be possible to conclude that a plain
>>>literal and an XML literal without markup are the same text.
>
>They are clearly the same *text*; the literal strings will be identical. 
>So, taken literally, your point here is already satisfied. If you meant to 
>say that a plain literal, and an XML literal without markup, could denote 
>the same value,

Yes, that's what we mean.


>then this raises a more complex issue concerning datatyping.  I would 
>suggest that you raise this matter with the XML Schema group, since they 
>have devoted considerable time and effort to considerations of identity 
>between items in the value spaces of datatypes.  The consensus in XML 
>Schema 1.0 was that all such value spaces should be considered to be 
>disjoint. I gather that the story in 1.1 will be more complex, and involve 
>a rather subtle distinction between equality and identity.

To the extent that the basic equality/identity for plain literals and
XML Literals without language information is concerned, we are satisfied
with the solution and text that you and Jeremy have worked on recently.

[I will take note of the issue and try to get back to XML
Schema in due time.]


>>>Introducing
>>>language markup as proposed in the current solution makes this
>>>impossible, since it is never clear whether the markup was in the
>>>original text or not.
>
>That seems to be a weak argument. For example, the use of a distinctive 
>wrapper (as suggested by Jeremy at one point) would probably suffice to 
>keep the distinction clear, in practice.

If the RDF/XML spec requires such a wrapper, then that might satisfy
us. If the RDF/XML parser adds such a wrapper when parsing, that will
satisfy us. Just having applications maybe add a wrapper doesn't help.


>But in any case, surely this would apply equally well to XML literals with 
>language tags?

If language information is handled in the same way for plain literals
and XML literals, then this satisfies our concerns, because then it
will be as easy to compare these two if they have language information
as it will be to compare them if they don't have language information.


>>>[[
>>>6. I18N has not been convinced that either of the alternative proposals
>>>for including language information are problematic, and feels they are
>>>more intuitive and workable than the current proposal because they do
>>>not entail the problems cited above.
>>>]]
>>>I think this is answered by Sandro
>>>http://lists.w3.org/Archives/Public/www-archive/2003Aug/0004
>>>[[
>>>There's a serious concern that people who don't care about XML won't
>>>bother to implement these bits if they are bolted onto the side
>>>like that.  As just another datatype, it fits in smoothly, with no
>>>particular extra work required.  (except for that language tag...)
>>>Would you rather many implementations not support XML at all?
>>>(Perhaps not really a fair question....)
>>>]]
>>
>>Implementing XML Literals right is basically just a combination of
>>plain literals and datatyped literals. So it's not that difficult
>>to implement.
>
>It is trickier than you might think when your implementation has to 
>interoperate with other SW standards, such as OWL. For example, does the 
>owl:Class containing
>
>"nomarkuphere"
>"nomarkuphere"^^rdf:XMLLiteral
>
>have one or two items in it?

My understanding, based on the very recently approved changes,
would be that in a standard interpretation, it does have two
items, but in an interpretation that decides to treat text-only
XML Literals as equivalent to plain literals, it would have one
item.
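
To make the two readings concrete, here is a minimal Python sketch. It is
not taken from any spec or implementation. It models a literal as a
(lexical form, datatype) pair, with None standing in for a plain literal,
and it assumes for simplicity that a lexical form containing neither '<'
nor '&' counts as text-only. The names denoted_value and class_size are
hypothetical.

```python
XML_LITERAL = "rdf:XMLLiteral"


def denoted_value(lexical, datatype, merge_text_only):
    """Return a key standing for the value a literal denotes.

    With merge_text_only=True, an XML literal whose lexical form has no
    '<' and no '&' (simplifying assumption) is mapped to the same key as
    the plain literal with the same characters.
    """
    text_only = "<" not in lexical and "&" not in lexical
    if datatype == XML_LITERAL and merge_text_only and text_only:
        return (lexical, None)
    return (lexical, datatype)


def class_size(members, merge_text_only=False):
    """Count the distinct values among the given literals."""
    return len({denoted_value(lex, dt, merge_text_only) for lex, dt in members})


# The two literals from the question above.
MEMBERS = [("nomarkuphere", None), ("nomarkuphere", XML_LITERAL)]

standard = class_size(MEMBERS)                       # standard reading
merged = class_size(MEMBERS, merge_text_only=True)   # text-only merged
```

Under these assumptions, standard is 2 and merged is 1, matching the two
answers given above.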


>How about the class containing
>
>"nomarkuphere"
>"nomarkuphere"^^ex:someDatatype
>
>? You must provide an answer.

As far as I understand OWL, this would depend on whether the
ex:someDatatype is supported, and how it is defined.



>Does the answer change when you later discover that
>
>ex:someDatatype owl:sameAs rdf:XMLLiteral .
>
>? It should not change.

If ex:someDatatype is supported and defined as being the same as
rdf:XMLLiteral, then the implementation will do the right thing (and
ex:someDatatype owl:sameAs rdf:XMLLiteral .
will be information it already knows).

If ex:someDatatype is supported and defined as being different from
rdf:XMLLiteral, then the implementation will do the right thing (and
ex:someDatatype owl:sameAs rdf:XMLLiteral .
is a wrong statement that will produce a contradiction).

If ex:someDatatype is not supported, then I have absolutely
no idea how any OWL implementation could be able to predict
ex:someDatatype owl:sameAs rdf:XMLLiteral .
(or ex:someDatatype owl:differentFrom rdf:XMLLiteral .
assuming owl:differentFrom exists).
If OWL implementations are able to make such predictions,
I guess OWL will be a real killer application for predictions
in all kinds of fields.
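
The three cases above can be sketched as follows. This is only an
illustration under stated assumptions: the registry `definitions`, the
function name, and the three result labels are hypothetical, and no real
OWL implementation is being described.

```python
XML_LITERAL = "rdf:XMLLiteral"


def assess_same_as_xml_literal(datatype, definitions):
    """Classify the statement '<datatype> owl:sameAs rdf:XMLLiteral'.

    `definitions` is a hypothetical registry mapping each supported
    datatype name to the name of the datatype it is defined to be.
    """
    if datatype not in definitions:
        # Unsupported datatype: nothing can be predicted either way.
        return "undetermined"
    if definitions[datatype] == XML_LITERAL:
        # Supported and defined as the same: the statement adds nothing.
        return "already known"
    # Supported and defined as different: the statement is wrong.
    return "contradiction"


# Hypothetical registry of supported datatypes.
definitions = {
    "rdf:XMLLiteral": XML_LITERAL,
    "ex:aliasOfXml": XML_LITERAL,
    "ex:otherType": "ex:otherType",
}
```

Only a datatype present in the registry yields a definite answer; an
unknown name such as ex:someDatatype stays "undetermined" until its
definition is available.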


>These are the kinds of question that are settled by a formal semantics; so 
>in order to make sure that a feature is implementable it is not sufficient 
>to implement one app; you have to provide a formal semantics for the feature.

If the formal semantics require RDF or OWL to do anything serious
with unknown datatypes, then it seems to me that they either
possess very interesting predictive powers, or that something
is somewhat wrong.


>BTW, there is a real problem in having entailments depend on the detailed 
>syntactic form of literal strings (which they would if you want XML 
>literals without markup to be identical to plain literals). The whole 
>point of datatyping is that each datatype provides criteria for identity 
>between lexical forms which can be checked efficiently and locally.

This can be done. The function checking for the identity between
plain literals and XML Literals without markup is slightly more
complicated than strcmp(), but still extremely easy. There are only
two things to add, both on the XML side:
- If a '<' is seen, return not equal (to be equivalent to strcmp,
   one would need to decide whether markup is smaller than any
   character, or larger, but this is irrelevant here)
- If a '&' is seen, parse for &amp;, &lt;, &gt; and friends.

If you need the actual function, I can send it to you.
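
A minimal sketch of such a check, in Python rather than C, is given here
for illustration; it is not the function offered above. It assumes that
only the five predefined XML entities and numeric character references
need decoding, and it does not validate well-formedness. The function
name is hypothetical.

```python
# The five entities predefined by XML.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}


def xml_text_equals_plain(xml_lexical, plain):
    """Return True if the XML lexical form contains no markup and, after
    decoding character and entity references, equals the plain string."""
    decoded = []
    i = 0
    while i < len(xml_lexical):
        ch = xml_lexical[i]
        if ch == "<":
            # Any markup at all means the two cannot be equal.
            return False
        if ch == "&":
            end = xml_lexical.find(";", i)
            if end == -1:
                return False  # unterminated reference
            name = xml_lexical[i + 1:end]
            try:
                if name in PREDEFINED:
                    decoded.append(PREDEFINED[name])
                elif name.startswith("#x"):
                    decoded.append(chr(int(name[2:], 16)))
                elif name.startswith("#"):
                    decoded.append(chr(int(name[1:])))
                else:
                    return False  # entity this sketch does not know
            except (ValueError, OverflowError):
                return False  # malformed numeric reference
            i = end + 1
            continue
        decoded.append(ch)
        i += 1
    return "".join(decoded) == plain
```

For example, xml_text_equals_plain("a &amp; b", "a & b") holds, while any
input containing '<' is reported as not equal without further parsing.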


>If the criteria vary depending on the form of the string, then one has 
>effectively created multiple datatypes. If we were to say that XML strings 
>without markup were identical to simple character strings, but those with 
>markup were a distinct category, then this amounts to having two 
>datatypes, rdf:XMLLiteralWithMarkup and rdf:XMLLiteralWithoutMarkup, the 
>latter being a subclass of xsd:string and the former being disjoint from 
>it. We could have done this, but (apart from the obvious artificiality) it 
>seems to be of little utility, since one could use simple literals or 
>xsd:string to type the lexical forms which have no markup and gotten the 
>very same entailments, in this case.

You are right that there is now a subset relationship between the two
types. And you are right that it would be possible to define a separate
name for the part of the superset that is not in the subset (i.e.
rdf:XMLLiteralWithMarkup). But it's not clear why this would be necessary.

As for just using xsd:string (or plain literals), yes, that's an option,
but discussions with Jeremy Carroll and Jim Hendler, who both use
XML Literals for documentation purposes, have revealed that implementers
often prefer to just write out everything as XML Literals.


>I confess to finding this entire issue of XML compliance rather boring, 
>but I have always understood, from the WG discussions on this issue, that 
>the whole point of XML literal typing was to distinguish text which came 
>from an XML 'source' from text which simply happened to be like XML in 
>some way, but was not itself genuine XML. Your insistence, that genuine 
>XML should be indistinguishable from faux XML in the case that neither 
>contains XML markup, seems to be fundamentally at odds with this basic 
>idea of distinguishing 'real XML' from mere text.

This may seem so at first sight. But seeing pieces of XML as sequences
of characters with potentially interspersed markup easily explains things.
(This is similar to having boxes of apples and boxes of apples and oranges.
A box with apples and oranges that happens by chance to contain only apples
doesn't seem to be different from a box with apples.)


Regards,    Martin.
Received on Friday, 15 August 2003 14:47:44 UTC