Re: Fwd "a comment on NFC" from Jeremy Carroll on 2003-10-02 (w3c-rdfcore-wg@w3.org from October 2003)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Thu, 02 Oct 2003 16:51:27 +0100
To: Brian McBride <bwm@hplb.hpl.hp.com>
Cc: Martin Duerst <duerst@w3.org>, "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>, w3c-i18n-ig@w3.org, w3c-rdfcore-wg@w3.org
Message-ID: <3F7C497F.6090303@hplb.hpl.hp.com>
I will try to propose what it would take in about 18 hours' time (i.e. 
tomorrow morning in Europe).

It seems that Martin wanted a bit more than Brian's text: discussion of 
the matching problem, etc. I suspect a "MUST NOT normalize" might help too.

Jeremy



Brian McBride wrote:

> 
> 
> Martin Duerst wrote:
> 
>>
>> At 11:00 03/10/02 +0100, Brian McBride wrote:
>>
>>> Well, the overnight developments I had hoped for aren't going to happen.
>>
>>
>>
>> If that's referring to a response from our side, sorry.
> 
> 
> It wasn't.  I'm sorry that wasn't clear, Martin.  I was not casting any 
> aspersions about timeliness in your direction, and in fact, I'm grateful 
> for this quick response.
> 
> [...]
> 
>>
>>
>> I agree with 'rarely in practice'. NFC was designed to align with current
>> practice where possible.
>>
>> The example you give is not exactly typical; here are some that are:
>> - A string contains a base character and a combining character, and there
>>   is a precomposed character for this combination
>> - A string contains a character (e.g. Angstrom) that is a canonical
>>   equivalent (i.e. exact copy) of another (A-ring).
>> - A string starts with a combining character with nothing to combine with
> 
> 
> Thanks Martin, that's helpful.
> 
>>
>>
>>> One issue of concern to Peter is that the current specs prohibit us 
>>> saying in, say, OWL that some string (which is not in normal form C) is 
>>> not in normal form C.  I think this is wrong, in that it is possible 
>>> to invent a datatype whose lexical space consists of strings in 
>>> normal form C, but whose value space is not, that would allow the 
>>> representation of all strings.  The same could be done for XML 
>>> fragments, though we would then lose the benefit of the 
>>> parseType="Literal" convenience syntax.
>>> Thus whilst the RDF specs would not be providing a standard way of 
>>> representing non-NFC strings, it would not be preventing their 
>>> expression.
>>
>>
>>
>> I'm a bit confused here, but I'll try to use my own words.
>>
>> RDF would always be able to represent non-NFC strings, e.g. by
>> defining them as a collection/sequence of integers represented
>> by a graph. There is in my understanding nothing one can or should
>> do or be able to do to prevent that if somebody really wants to
>> do that.
> 
> 
> Right.  I think that's the essential point - there are other ways of 
> representing non-nfc strings if you really want to.
> 
>>
>> What you propose above seems to be somewhat different, i.e.
>> normalized strings would represent unnormalized strings.
>> But this would run into all kinds of problems, 
> 
> 
> Then I won't pursue it - I merely meant it as an example of one 
> approach.  If that doesn't work, no matter, as the one you have suggested 
> does.
> 
> [...]
> 
>>
>>> That said, it does seem odd to me that we are precluding RDF from 
>>> representing some legal fragments of XML 1.0 as XML Literals.  Please 
>>> interpret "odd" as massive English understatement.
>>>
>>> This situation has arisen because we have been striving to be good 
>>> citizens, especially with respect to internationalization and have 
>>> adopted good practice earlier than some other specs.  This does not 
>>> play well when we embed fragments of language conforming to those 
>>> other specs in our language.  This is a situation when one has to 
>>> consider the wisdom of trying to be "ahead of the pack".
>>
>>
>>
>> I think for RDF specifically, doing matching to build up the graph,
>> and having this very clearly defined, was one of the basic requirements,
>> and reasons for 'early' adoption.
>> From a user point of view, different normalizations should match.
>> But from a machine point of view, this can be a lot of work.
>> Requiring clean data to start with is what we have proposed, and
>> you have adopted.
> 
> 
> I see.
> 
> 
>> A possible alternative would be to not strictly
>> require clean data, but to clearly place any responsibility for
>> matching problems on the side providing the dirty data.
> 
> 
> That looks like a possible compromise - language of the form "SHOULD be 
> in NFC" rather than "MUST be in NFC", as I suggested later in my email:
> 
> [...]
> 
>>>
>>> I also wonder whether this issue might be addressed by toning down 
>>> the language from MUST to SHOULD, e.g.
>>>
>>> [...]
>>>
>>>> which includes the additional following para:
>>>> [[
>>>> The string in both plain and typed literals is required to
>>>> be in Unicode Normal Form C [NFC]. This requirement is motivated
>>>> by [Charmod] particularly section 4 Early Uniform Normalization.
>>>> ]]
>>>
>>>
>>>
>>> becomes something like
>>>
>>> [[
>>> The string in both plain and typed literals SHOULD be in Unicode 
>>> Normal Form C [NFC].  This is motivated by anticipation that 
>>> [Charmod], particularly section 4 Early Uniform Normalization will 
>>> become standardized practice.  Implementations SHOULD accept strings 
>>> which are not in Normal Form C and MAY issue a warning in such 
>>> circumstances.
>>> ]]
>>
> 
> I think I heard you say that you think such an approach would be 
> acceptable to I18N.  Right?
> 
> Peter, would it work for you?
> 
> Brian
> 
>
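
Editorial illustration (not part of the original message): Martin's three
example strings, and the lenient "SHOULD be in NFC ... MAY issue a warning"
behaviour Brian proposes, can be sketched with Python's standard unicodedata
module. The helper names is_nfc, starts_with_combining and accept_literal are
hypothetical, chosen only for this sketch. Note that plain NFC normalization
leaves a string with a leading combining character unchanged, so the third
case needs a separate check; treating it as a warning condition is an
assumption here, not something the thread specifies.

```python
import unicodedata
import warnings


def is_nfc(s: str) -> bool:
    """True if NFC normalization would leave the string unchanged."""
    return unicodedata.normalize("NFC", s) == s


def starts_with_combining(s: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(s) and unicodedata.combining(s[0]) != 0


def accept_literal(s: str) -> str:
    """Hypothetical lenient policy: accept the string, warn, never rewrite it."""
    if not is_nfc(s) or starts_with_combining(s):
        warnings.warn(
            "literal may cause matching problems "
            "(not in NFC, or starts with a combining character)",
            stacklevel=2,
        )
    return s  # returned unchanged: the sketch does not normalize


# Case 1: base character + combining character with a precomposed equivalent.
decomposed = "e\u0301"       # 'e' followed by COMBINING ACUTE ACCENT
# Case 2: a character that is a canonical equivalent of another.
angstrom = "\u212b"          # ANGSTROM SIGN, equivalent to U+00C5 (A-ring)
# Case 3: a string that starts with a combining character.
leading_mark = "\u0301abc"

for label, text in [("decomposed", decomposed),
                    ("angstrom", angstrom),
                    ("leading mark", leading_mark)]:
    print(label, "is_nfc:", is_nfc(text),
          "starts_with_combining:", starts_with_combining(text))
```

Returning the input unchanged reflects Jeremy's remark that a "MUST NOT
normalize" might help: the data provider, not the consumer, stays responsible
for supplying clean data.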
Received on Thursday, 2 October 2003 12:06:23 UTC