Re: JJC's take on I18N concerns from pat hayes on 2003-08-22 (w3c-rdfcore-wg@w3.org from August 2003)

From: pat hayes <phayes@ihmc.us>
Date: Fri, 22 Aug 2003 07:26:05 -0700
To: Martin Duerst <duerst@w3.org>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p06001a05bb64d49a31a4@[10.0.1.2]>
>Hello Pat,
>
>Many thanks for your participation in this discussion.
>
>
>At 10:16 03/08/15 -0700, pat hayes wrote:
>
>>>Hello Jeremy,
>>>
>>>Just to make sure, here are some responses:
>>>
>>>At 21:32 03/08/13 +0300, Jeremy Carroll wrote:
>>
>>I agree with Jeremy; further comments below.
>>
>>>
>>>>This is a reply to
>>>>
>>>>http://lists.w3.org/Archives/Member/w3c-i18n-ig/2003Aug/0022
>>>>
>>>>which was (AFAICT) endorsed by I18N at their recent telecon.
>>>>
>>>>I am also copying the recipients of
>>>>http://lists.w3.org/Archives/Member/w3c-archive/2003Aug/0027
>>>>(other than those who I believe are already on the To lists)
>
>>>>[[
>>>>2. I18N feels that the currently proposed implementation is overly
>>>>complicated for the user, and that this will introduce a strong risk
>>>>that users do not implement language information properly.
>>>>]]
>>>>
>>>>RDF Core had feedback against other implementations on grounds of their
>>>>complexity. This was a tradeoff decision.
>>>
>>>I think it is complexity for the user (somebody writing RDF (RDF/XML
>>>or otherwise) or scraping,...) vs. complexity for implementers of
>>>core software.
>>
>>You seem to be operating under assumptions 
>>about RDF/XML which are inappropriate. We do 
>>not anticipate that the typical RDF user will 
>>ever write or even see RDF/XML. In the RDF 
>>design it plays the role of a machine-oriented 
>>data transmission notation; it will be as 
>>remote from user experience as Javascript code 
>>is from the typical user of a web browser. The 
>>readability of RDF/XML is relevant only to 
>>implementers who are concerned with debugging 
>>their software.
>
>I'm sorry that the definition of 'user' was not completely clear.
>We don't mean the total end user. We mean all the people scraping
>data from the Web,... The example of Javascript is probably about
>right, there are a lot of people writing Javascript, and it would
>be difficult to get most of these people to do the right thing.
>
>>>>[[
>>>>3.  The current approach assumes the existence of constructs to describe
>>>>language and carry language information in the native markup associated
>>>>with a fragment.  Such constructs may not exist, in which case it seems
>>>>impossible to ascribe such information at a meta level.
>>
>>It is certainly not impossible, though it may 
>>indeed be ad-hoc. For example, one could use a 
>>construction like this:
>>
>>_:x  ex:textIs "somepieceofXMLscrapedfromsomewhere"^^rdf:XMLLiteral .
>>_:x  ex:language "fr" .
>>
>>where the lang tag is coded as a plain literal.
>
>The point referred to constructs in some markup language.
>
>Your argument is that there are mechanisms outside XML Literals that
>would be able to indicate the language of an XML Literal.
>We agree with this, but note that unless this is the solution
>adopted by the specification, this kind of proposal is ad-hoc
>(as you say) and therefore does not meet our requirement that
>language information be easy to pick up by generic tools.

But this 'requirement' seems unreasonable to me. 
Language information is one kind of information 
which is relevant to one (very special) kind of 
topic, viz. assertions about pieces of NL text. 
This is not a centrally important issue for RDF, 
and will be completely irrelevant to many RDF 
applications. RDF is a generic description 
language - basically a simple logic - which can 
be used in many ways. It is neither feasible nor 
desirable for the RDF spec to impose particular 
representational 'styles' on the user community; 
we expect that the business of writing 
ontologies in RDF will be done by others, and 
that standards will emerge in techniques and 
styles for particular purposes. The Dublin Core 
ontology, for example, for all its many faults, 
is already emerging as a kind of standard 
vocabulary for many purposes, and OWL can be seen 
as a style of RDF. If you (or others) would like 
to propose a recommended technique for 
representing language information in RDF, you are 
free to do so. But that is your affair, not ours.

Now, if what you had said was true, that it 
really was *impossible* to express language 
information in RDF, then you would have a 
legitimate complaint. But it is not true: there 
are many ways, of which the one I suggested is 
the most obvious, to express this information in 
RDF. In fact, it is *easy* to express it in RDF: 
compared to other expressive tasks, this one is 
relatively trivial.  So your complaint amounts to 
a request that RDF provide a special syntactic 
form just to express this kind of information. 
But the same case can be made for many other 
kinds of information: some people feel that 
temporal information is centrally important, for 
example, and regret that RDF does not have 
special syntax for encoding transaction times; 
other people feel that provenance information is 
central, and regret that RDF provides no way for 
one ontology to indicate a relationship with 
another (the use of owl:imports is one step in 
that direction).  If we were to accede to all 
these desires and regrets, RDF would have a 
syntax with the complexity of a major programming 
language; this kind of feature-bloat is 
completely at odds with the design goals of RDF.
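
To make the two-triple construction quoted above concrete, here is a 
minimal sketch in Python (not a language used in this thread). The 
property names ex:textIs and ex:language come from the example; the 
tuple layout, the sample French text and the helper language_of are 
assumptions made purely for illustration, not part of any RDF API.

```python
# Each triple is (subject, predicate, (lexical_form, datatype)).
# A datatype of None stands for a plain literal.
XML_LITERAL = "rdf:XMLLiteral"

triples = [
    ("_:x", "ex:textIs", ("<em>bonjour</em>", XML_LITERAL)),
    ("_:x", "ex:language", ("fr", None)),  # lang tag coded as a plain literal
]


def language_of(node, graph):
    """Return the ex:language plain-literal value attached to node, or None."""
    for subject, predicate, (lexical, datatype) in graph:
        if subject == node and predicate == "ex:language" and datatype is None:
            return lexical
    return None
```

A generic tool would only find the tag if it knew to look for 
ex:language, which is the sense in which the approach is ad hoc 
rather than impossible.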

>It would be similar to say that in XML, there is no need for
>xml:lang because each markup language can define its own
>mechanism (e.g. attribute) to indicate language.

If XML is seen as a generic structure-describing 
language, I think this is actually a good 
argument. Of course if XML is seen instead as a 
generic text-markup notation, then language 
information may indeed be reasonably considered 
to be a central issue.

>>>>[[
>>>>4. It seems to I18N that it will be difficult to convert rdf created
>>>>using the old syntax to the new syntax. Where legacy documents simply
>>>>declared xml:lang at the top of the file, they will now have to declare
>>>>it for every XML literal.  Also, there is no provision for automatic
>>>>conversion from the old to the new syntax.
>>>>]]
>>>>
>>>>Old style was vague, no indication that xhtml namespace needed declaring
>>>>(predated xhtml?). Not really useable because of such problems, certainly
>>>>not in a portable fashion. The old spec is sufficiently bad to make this
>>>>problem a non-starter since it is not clear what old style xml literals
>>>>are supposed to mean, particularly the treatment of namespaces. Also old
>>>>spec was somewhat unclear how language was supposed to be treated.
>>>
>>>There definitely were some vaguenesses, but we agreed on what these
>>>were in the area of language tagging at the Technical Plenary in Cannes.
>>>The lastcall draft has clarified these.
>>
>>But there seems to be a basic assumption lying 
>>behind your point, that the current spec must 
>>be backwards compatible with M&S. This is 
>>impossible: indeed, our WG charter is to 
>>clarify and rationalize the M&S spec, which was 
>>known to be confused when the WG was founded.
>
>I agree with clarify, this was explained to us at the joint meeting
>in Cannes. I don't remember rationalize. It doesn't sound like a bad
>thing to do, but I don't think this affects the argument.
>
>The basic assumption behind my argument is not backwards compatibility
>(although this is also an issue separately). It is that it is not
>productive in inter-WG discussions to have a joint meeting and agree
>(without a need for long discussion) on what exactly the real unclarities
>are (and how to address them), and to come back much later claiming
>that everything is unclear anyway.

As I was not at the Cannes meeting, I had better not comment further on this.

>>>>[[
>>>>5. I18N considers that it should be possible to conclude that a plain
>>>>literal and an XML literal without markup are the same text.
>>
>>They are clearly the same *text*; the literal 
>>strings will be identical. So, taken literally, 
>>your point here is already satisfied. If you 
>>meant to say that a plain literal, and an XML 
>>literal without markup, could denote the same 
>>value,
>
>Yes, that's what we mean.

OK, although I fail to see what possible reason 
there could be for wanting this to be the case.

>>then this raises a more complex issue 
>>concerning datatyping.  I would suggest that 
>>you raise this matter with the XML Schema 
>>group, since they have devoted considerable 
>>time and effort to considerations of identity 
>>between items in the value spaces of datatypes. 
>>The consensus in XML Schema 1.0 was that all 
>>such value spaces should be considered to be 
>>disjoint. I gather that the story in 1.1 will 
>>be more complex, and involve a rather subtle 
>>distinction between equality and identity.
>
>To the extent that the basic equality/identity for plain literals and
>XML Literals without language information is concerned, we are satisfied
>with the solution and text that you and Jeremy have worked on recently.

?? But that solution makes it impossible for 
plain literals and XML literals to ever denote 
the same entities.

>[I will take note of the issue and try to get back to XML
>Schema in due time.]
>
>>>>Introducing
>>>>language markup as proposed in the current solution makes this
>>>>impossible, since it is never clear whether the markup was in the
>>>>original text or not.
>>
>>That seems to be a weak argument. For example, 
>>the use of a distinctive wrapper (as suggested 
>>by Jeremy at one point) would probably suffice 
>>to keep the distinction clear, in practice.
>
>If the RDF/XML spec requires such a wrapper, then that might satisfy
>us. If the RDF/XML parser adds such a wrapper when parsing, that will
>satisfy us. Just having applications maybe add a wrapper doesn't help.
>
>>But in any case, surely this would apply 
>>equally well to XML literals with language tags?
>
>If language information is handled in the same way for plain literals
>and XML literals, then this satisfies our concerns, because then it
>will be as easy to compare these two if they have language information
>as it will be to compare them if they don't have language information.
>
>>>>[[
>>>>6. I18N has not been convinced that either of the alternative proposals
>>>>for including language information are problematic, and feels they are
>>>>more intuitive and workable than the current proposal because they do
>>>>not entail the problems cited above.
>>>>]]
>>>>I think this is answered by Sandro
>>>>http://lists.w3.org/Archives/Public/www-archive/2003Aug/0004
>>>>[[
>>>>There's a serious concern that people who don't care about XML won't
>>>>bother to implement these bits if they are bolted on to the side
>>>>like that.  As just another datatype, it fits in smoothly, with no
>>>>particular extra work required.  (except for that language tag...)
>>>>Would you rather many implementations not support XML at all?
>>>>(Perhaps not really a fair question....)
>>>>]]
>>>
>>>Implementing XML Literals right is basically just a combination of
>>>plain literals and datatyped literals. So it's not that difficult
>>>to implement.
>>
>>It is trickier than you might think when your 
>>implementation has to interoperate with other 
>>SW standards, such as OWL. For example, does 
>>the owl:Class containing
>>
>>"nomarkuphere"
>>"nomarkuphere"^^rdf:XMLLiteral
>>
>>have one or two items in it?
>
>My understanding based on the very recently approved changes
>would be that in a standard interpretation, it does have two
>items, but in an interpretation that decides to treat text-only
>XML Literals equivalent with plain literals, it would have one
>item.

Er....that does not make sense. If the answer can 
vary between interpretations then the inference 
is unknown; and it should not be unknown. 
Identity of literals (using known datatypes, and 
rdf:XMLLiteral is always a known datatype) is 
determined by the forms of the literals 
themselves, so does not vary with interpretation. 
With the current version of the semantics, the 
answer will always be 2: that is, there are no 
interpretations in which an XML literal can 
denote the same as a plain literal.
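
A minimal Python sketch of that outcome follows. It models a literal 
as a (lexical form, datatype) pair and treats structural equality as 
a stand-in for "denotes the same value"; that modelling choice, the 
Literal type and its field names are assumptions for illustration 
only, and the sketch does not reproduce the model theory itself.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    lexical: str
    datatype: Optional[str] = None  # None models a plain literal


plain = Literal("nomarkuphere")
xml = Literal("nomarkuphere", "rdf:XMLLiteral")

# The class from the example, modelled as a set of its members.
members = {plain, xml}
```

Because the datatype is part of the literal's form, the two members 
stay distinct whatever else is later learned, so len(members) is 2 
here, matching the answer given above.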

>
>>How about the class containing
>>
>>"nomarkuphere"
>>"nomarkuphere"^^ex:someDatatype
>>
>>? You must provide an answer.
>
>As far as I understand OWL, this would depend on whether the
>ex:someDatatype is supported, and how it is defined.

OK, fair enough.

>>Does the answer change when you later discover that
>>
>>ex:someDatatype owl:sameAs rdf:XMLLiteral .
>>
>>? It should not change.
>
>If ex:someDatatype is supported and defined as being the same as
>rdf:XMLLiteral

It is not *defined* that way, but a reasoner 
might be able to infer this, eg by discovering 
that the cardinality of a class containing 
{ex:someDatatype, rdf:XMLLiteral} was 1.

>, then the implementation will do the right thing

Well, will it? That is precisely the question. 
The point is that being 'supported' means 
something like 'having some information about', 
but WHAT information is not known ahead of time 
and may vary as inferences are made.

>  (and
>ex:someDatatype owl:sameAs rdf:XMLLiteral .
>will be information it already knows).
>
>If ex:someDatatype is supported and defined as being different from
>rdf:XMLLiteral, then the implementation will do the right thing

Again, that is not obvious. 

>  (and
>ex:someDatatype owl:sameAs rdf:XMLLiteral .
>is a wrong statement that will produce a contradiction).
>
>If ex:someDatatype is not supported, then I have absolutely
>no idea how any OWL implementation could be able to predict
>ex:someDatatype owl:sameAs rdf:XMLLiteral .
>(or ex:someDatatype owl:differentFrom rdf:XMLLiteral .
>assuming owl:differentFrom exists).
>If OWL implementations are able to make such predictions,
>I guess OWL will be a real killer application for predictions
>in all kinds of fields.

It is quite possible for a datatype to be 
'partially supported' in the sense that an 
inference engine knows some information about it 
but not all the information about it. Under these 
circumstances, we expect that any inferences 
which are valid under this partial information 
will remain valid when further information is 
discovered, possibly as a result of other 
inference processes. The ways these processes can 
interact are complex and very hard to predict in 
detail; it is usually only possible to define a 
notion of semantic validity and require that 
implementations preserve it.

>
>
>>These are the kinds of question that are 
>>settled by a formal semantics; so in order to 
>>make sure that a feature is implementable it is 
>>not sufficient to implement one app; you have 
>>to provide a formal semantics for the feature.
>
>If the formal semantics require RDF or OWL to do anything serious
>with unknown datatypes, then it seems to me that they either
>possess very interesting predictive powers, or that something
>is somewhat wrong.

They are required to at least not break, and not 
to make assumptions which might be falsified by 
further information. Taken together these are a 
reasonably serious requirement.

>>BTW, there is a real problem in having 
>>entailments depend on the detailed syntactic 
>>form of literal strings (which they would if 
>>you want XML literals without markup to be 
>>identical to plain literals). The whole point 
>>of datatyping is that each datatype provides 
>>criteria for identity between lexical forms 
>>which can be checked efficiently and locally.
>
>This can be done. The function checking for the identity between
>plain literals and XML Literals without markup is slightly more
>complicated than strcmp(), but still extremely easy. There are only
>two things to add, both on the XML side

They have to be done for all strings, since in one case the inference

aaa ppp "sss" .
|-
aaa ppp "sss"^^rdf:XMLLiteral .

is valid, but in the other it is not. So it is 
not sufficient to use strcmp() even on simple 
literals, or even on xsd:strings, if we were to 
adopt this convention.

>:
>- If a '<' is seen, return not equal (to be equivalent to strcmp,
>   one would need to decide whether markup is smaller than any
>   character, or larger, but this is irrelevant here)
>- If a '&' is seen, parse for &amp;, &lt;, &gt; and friends.
>
>If you need the actual function, I can send it to you.

I am aware of the function, but I am also aware 
of the disastrous effect it would have on
inferential search processes. As the advantages 
of this change seem to me to be largely 
aesthetic, I do not feel that it is a good choice.
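
For reference, the comparison described in the quoted text can be 
sketched roughly as below, in Python. This is not the function that 
was offered; the name xml_text_equals_plain, the helper names and 
the handling of numeric character references are assumptions. 
"&amp;, &lt;, &gt; and friends" is read here as the five predefined 
XML entities.

```python
import re

# The five entities predefined by XML; anything else is left unexpanded.
_PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
_REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def _expand(match):
    """Expand one entity or character reference found by _REFERENCE."""
    name = match.group(1)
    if name.startswith("#x"):
        return chr(int(name[2:], 16))
    if name.startswith("#"):
        return chr(int(name[1:]))
    return _PREDEFINED.get(name, match.group(0))


def xml_text_equals_plain(xml_lexical, plain):
    """Compare an XML literal's lexical form with a plain literal's string."""
    if "<" in xml_lexical:  # any markup at all: never equal
        return False
    return _REFERENCE.sub(_expand, xml_lexical) == plain
```

The function itself is short. The concern raised above is a 
different one: under such a convention the test would have to run 
on every string before an inference involving it could be trusted.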

>>If the criteria vary depending on the form of 
>>the string, then one has effectively created 
>>multiple datatypes. If we were to say that XML 
>>strings without markup were identical to simple 
>>character strings, but those with markup were a 
>>distinct category, then this amounts to having 
>>two datatypes, rdf:XMLLiteralWithMarkup and 
>>rdf:XMLLiteralWithoutMarkup, the latter being a 
>>subclass of xsd:string and the former being 
>>disjoint from it. We could have done this, but 
>>(apart from the obvious artificiality) it seems 
>>to be of little utility, since one could use 
>>simple literals or xsd:string to type the 
>>lexical forms which have no markup and gotten 
>>the very same entailments, in this case.
>
>You are right that there is now a subset relationship between the two
>types. And you are right that it would be possible to define a separate
>name for the part of the superset that is not in the subset (i.e.
>rdf:XMLLiteralWithMarkup). But it's not clear why this would be necessary.
>
>As for just using xsd:string (or plain literals), yes, that's an option,
>but discussions with Jeremy Carroll and Jim Hendler, who both use
>XML Literals for documentation purposes, have revealed that implementers
>often prefer to just write out everything as XML Literals.
>
>>I confess to finding this entire issue of XML 
>>compliance rather boring, but I have always 
>>understood, from the WG discussions on this 
>>issue, that the whole point of XML literal 
>>typing was to distinguish text which came from 
>>an XML 'source' from text which simply happened 
>>to be like XML in some way, but was not itself 
>>genuine XML. Your insistence, that genuine XML 
>>should be indistinguishable from faux XML in 
>>the case that neither contains XML markup, 
>>seems to be fundamentally at odds with this 
>>basic idea of distinguishing 'real XML' from 
>>mere text.
>
>This may seem so at first sight. But seeing pieces of XML as sequences
>of characters with potentially interspersed markup easily explains things.
>(This is similar to having boxes of apples and boxes of apples and oranges.
>A box with apples and oranges that happens by chance to contain only apples
>doesn't seem to be different from a box with apples.)

I can see that this way of thinking is coherent 
and has a rationale. So does the other way of 
thinking, however. And the other way of thinking 
has several other technical merits: it conforms 
to the basic rationale of 
datatyping-as-classification as embodied in the 
XSD structure, and it allows for a natural 
round-tripping between RDF and other XML 
processors.

BTW, I don't agree that the complexities of XML 
are 'easily explained' by thinking of it as text 
with markup. That view fails, for example, to 
account for the centrally important fact that 
text with markup is in fact text, and can be 
processed by text-handling tools and described 
using text-describing vocabulary with perfect 
accuracy.

Pat

-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Saturday, 23 August 2003 05:38:51 UTC