Re: Fwd "a comment on NFC" from Martin Duerst on 2003-10-02 (w3c-rdfcore-wg@w3.org from October 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 02 Oct 2003 10:16:56 -0400
To: Brian McBride <bwm@hplb.hpl.hp.com>, Jeremy Carroll <jjc@hplb.hpl.hp.com>
Cc: w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org
Message-Id: <4.2.0.58.J.20031002094312.04a2fb30@localhost>

At 11:00 03/10/02 +0100, Brian McBride wrote:

>Well, the overnight developments I had hoped for aren't going to happen.

If that's referring to a response from our side, sorry.
I was busy working on a paper yesterday.


>First a context-setting ramble, then two concrete suggestions.
>
>I believe that it is Peter's intention to formally object to the
>current RDF handling of normal form C.
>
>I feel I don't really understand the issue very well, but I'll try to
>summarize my understanding, such as it is.  Please correct my
>misunderstandings.
>
>RDFCore is following CharMod and I18N advice in requiring literals to
>be in normal form C.  XML 1.0 and XSD datatypes do not require this. Thus
>there are legal fragments of XML 1.0 that are not in normal form C, legal
>xsd:string's that are not in normal form C and legal xsd:anyURI's that are
>not in normal form C, and these cannot be used in an RDF graph.

It is my understanding that there are other small discrepancies,
in both directions, between XML 1.0 and RDF, but I don't know the
details. Looking at these details may help put this issue
into perspective.


>I think that the issue arises rarely in practice, e.g. when a string or
>XML fragment contains a combining character with nothing to combine with.

I agree with 'rarely in practice'. NFC was designed to align with current
practice where possible.

The example you give is not exactly typical; here are some examples:
- A string contains a base character and a combining character, and there
   is a precomposed character for this combination
- A string contains a character (e.g. Angstrom) that is a canonical
   equivalent (i.e. exact copy) of another (A-ring).
- A string starts with a combining character with nothing to combine with
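A minimal sketch of these three cases, assuming Python's standard
unicodedata module; the helper names is_nfc and starts_with_combining
are made up for illustration. One caveat: plain NFC normalization leaves
a leading combining character unchanged, so the third case is, as far as
I understand it, caught only by the stricter Charmod notion of
normalization. It is approximated here by a combining-class check, which
is an assumption and not the Charmod definition itself.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is unchanged by Unicode NFC normalization."""
    return unicodedata.normalize("NFC", s) == s


def starts_with_combining(s: str) -> bool:
    """Rough stand-in for a 'leading combining character' check:
    a non-zero canonical combining class on the first code point."""
    return bool(s) and unicodedata.combining(s[0]) != 0


# Case 1: base + combining character where a precomposed form exists.
decomposed = "e\u0301"      # 'e' + COMBINING ACUTE ACCENT
# Case 2: a character that is a canonical equivalent of another.
angstrom = "\u212B"         # ANGSTROM SIGN, equivalent to U+00C5 (A-ring)
# Case 3: a string that starts with a combining character.
leading_mark = "\u0301abc"

cases = [
    ("decomposed", decomposed),
    ("angstrom", angstrom),
    ("leading_mark", leading_mark),
]
for name, s in cases:
    nfc_form = unicodedata.normalize("NFC", s)
    code_points = [f"U+{ord(c):04X}" for c in nfc_form]
    print(name, is_nfc(s), starts_with_combining(s), code_points)
```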


>One issue of concern to Peter is that the current specs prohibit us from
>saying in, say, OWL that some string (which is not in normal form C) is not
>in normal form C.  I think this is wrong, in that it is possible to invent a
>datatype whose lexical space consists of strings in normal form C, but
>whose value space is not, that would allow the representation of all
>strings.  The same could be done for XML fragments, though we would then
>lose the benefit of the parseType="Literal" convenience syntax.
>Thus whilst the RDF specs would not be providing a standard way of
>representing non-NFC strings, they would not be preventing their expression.

I'm a bit confused here, but I'll try to use my own words.

RDF would always be able to represent non-NFC strings, e.g. by
defining them as a collection/sequence of integers represented
by a graph. As I understand it, there is nothing one can or
should do to prevent that if somebody really wants to
do it.

What you propose above seems to be somewhat different, i.e.
normalized strings would represent unnormalized strings.
But this would run into all kinds of problems, because there
are potentially a large number of unnormalized strings for
a given normalized string, and it would be difficult to indicate
which unnormalized string is denoted.
(If the relationship between normalized and unnormalized
strings were simply 1-to-1, we would have a much simpler
life in the first place.)
So I would like to see something like a 'proof by construction'
for the idea of "it is possible to invent a datatype whose lexical
space consists of strings in normal form C, but whose value
space is not".
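The many-to-one relationship can be illustrated with a small sketch,
assuming Python's standard unicodedata module (the variable names are
only for illustration). Several distinct code point sequences collapse
to one NFC string, so the NFC string alone cannot say which of them was
meant.

```python
import unicodedata

# Three different code point sequences for "A with ring above".
variants = [
    "\u00C5",    # LATIN CAPITAL LETTER A WITH RING ABOVE (already NFC)
    "A\u030A",   # 'A' + COMBINING RING ABOVE
    "\u212B",    # ANGSTROM SIGN
]
normalized = {unicodedata.normalize("NFC", v) for v in variants}
print(len(variants), "inputs ->", len(normalized), "NFC string")

# Reordering combining marks is another source of multiple inputs:
# dot above (class 230) and dot below (class 220) on the same base.
marks_a = "q\u0307\u0323"
marks_b = "q\u0323\u0307"
same_after_nfc = (
    unicodedata.normalize("NFC", marks_a)
    == unicodedata.normalize("NFC", marks_b)
)
print(marks_a == marks_b, same_after_nfc)
```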



>That said, it does seem odd to me that we are precluding RDF from
>representing some legal fragments of XML 1.0 as XML Literals.  Please
>interpret "odd" as massive English understatement.
>
>This situation has arisen because we have been striving to be good
>citizens, especially with respect to internationalization, and have adopted
>good practice earlier than some other specs.  This does not play well when
>we embed fragments of language conforming to those other specs in our
>language.  This is a situation where one has to consider the wisdom of
>trying to be "ahead of the pack".

I think for RDF specifically, doing matching to build up the graph,
and having this very clearly defined, was one of the basic requirements
and one of the reasons for 'early' adoption.
From a user's point of view, differently normalized forms of the same
text should match. But from a machine's point of view, this can be a
lot of work.
Requiring clean data to start with is what we have proposed, and
you have adopted. A possible alternative would be to not strictly
require clean data, but to clearly place responsibility for any
matching problems on the side providing the dirty data.
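The trade-off can be sketched as follows, assuming Python's standard
unicodedata module. The function names match_strict, match_normalizing
and dirty_inputs are hypothetical and are not taken from any RDF
specification or toolkit. Strict matching is a cheap code point
comparison that relies on clean input; normalizing matching does the
extra work on every comparison; the third helper only reports which
side supplied non-NFC data.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def match_strict(a: str, b: str) -> bool:
    """Code point comparison; correct only if both inputs are clean."""
    return a == b


def match_normalizing(a: str, b: str) -> bool:
    """Normalize both sides first; more work per comparison."""
    return nfc(a) == nfc(b)


def dirty_inputs(**literals: str) -> list:
    """Names of the inputs that are not already in NFC."""
    return [name for name, value in literals.items() if nfc(value) != value]


clean = "caf\u00E9"     # precomposed e-acute
dirty = "cafe\u0301"    # 'e' + COMBINING ACUTE ACCENT

print(match_strict(clean, dirty))
print(match_normalizing(clean, dirty))
print(dirty_inputs(publisher=clean, consumer=dirty))
```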


Regards,    Martin.



>I am tempted by an idea I will attribute to pfps, though I'm not sure he
>is advocating it, that we should report to I18N the difficulties we have
>encountered trying to deploy Charmod, and seek their advice on
>managing the transition, specifically given that we embed fragments of
>non-conforming languages in ours.
>
>I also wonder whether this issue might be addressed by toning down the
>language from MUST to SHOULD, e.g.
>
>[...]
>
>>which includes the additional following para:
>>[[
>>The string in both plain and typed literals is required to
>>be in Unicode Normal Form C [NFC]. This requirement is motivated
>>by [Charmod] particularly section 4 Early Uniform Normalization.
>>]]
>
>becomes something like
>
>[[
>The string in both plain and typed literals SHOULD be in Unicode Normal
>Form C [NFC].  This is motivated by anticipation that [Charmod],
>particularly section 4 Early Uniform Normalization, will become
>standardized practice.  Implementations SHOULD accept strings which are
>not in Normal Form C and MAY issue a warning in such circumstances.
>]]
>
>Brian
>
Received on Thursday, 2 October 2003 10:17:32 UTC