Re: Fwd "a comment on NFC"

Martin Duerst wrote:
> 
> At 11:00 03/10/02 +0100, Brian McBride wrote:
> 
>> Well, the overnight developments I had hoped for aren't going to happen.
> 
> 
> If that's referring to a response from our side, sorry.

It wasn't.  I'm sorry that wasn't clear, Martin.  I was not casting any 
aspersions about timeliness in your direction, and in fact, I'm grateful 
for this quick response.

[...]

> 
> 
> I agree with 'rarely in practice'. NFC was designed to align with current
> practice where possible.
> 
> The example you give is not exactly typical, here are some:
> - A string contains a base character and a combining character, and there
>   is a precomposed character for this combination
> - A string contains a character (e.g. Angstrom) that is a canonical
>   equivalent (i.e. exact copy) of another (A-ring).
> - A string starts with a combining character with nothing to combine with

Thanks Martin, that's helpful.
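
To make the three cases above concrete, here is a minimal sketch using 
Python's standard unicodedata module.  The helper names is_nfc and 
starts_with_combining are my own, purely for illustration.  Note the 
assumption in the third case: plain NFC normalization leaves a leading 
combining character unchanged, so that case needs a separate check rather 
than a comparison against the normalized form.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if normalizing s to NFC leaves it unchanged."""
    return unicodedata.normalize("NFC", s) == s


def starts_with_combining(s: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(s) and unicodedata.combining(s[0]) != 0


# Case 1: base character + combining character with a precomposed form.
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
# Case 2: a character that is a canonical equivalent of another.
angstrom = "\u212B"      # ANGSTROM SIGN, equivalent to U+00C5 (A-ring)
# Case 3: a string that starts with a combining character.
leading = "\u0301a"      # nothing precedes the accent to combine with

for label, s in [("decomposed", decomposed),
                 ("angstrom", angstrom),
                 ("leading", leading)]:
    print(label, is_nfc(s), starts_with_combining(s))
```

The first two strings change under NFC normalization; the third does not, 
and is only caught by the combining-class check.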
> 
> 
>> One issue of concern to Peter is that the current specs prohibit us 
>> saying in say Owl that some string (which is not in normal form C) is 
>> not in normal form C.  I think this is wrong, in that it is possible 
>> to invent a datatype whose lexical space consists of strings in normal 
>> form C, but whose value space is not, that would allow the 
>> representation of all strings.  The same could be done for XML 
>> fragments, though would then lose the benefit of the 
>> parseType="Literal" convenience syntax.
>> Thus whilst the RDF specs would not be providing a standard way of 
>> representing non-NFC strings, it would not be preventing their 
>> expression.
> 
> 
> I'm a bit confused here, but I'll try to use my own words.
> 
> RDF would always be able to represent non-NFC strings, e.g. by
> defining them as a collection/sequence of integers represented
> by a graph. There is in my understanding nothing one can or should
> do or be able to do to prevent that if somebody really wants to
> do that.

Right.  I think that's the essential point - there are other ways of 
representing non-NFC strings if you really want to.
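
As a rough illustration of the "sequence of integers" idea, the sketch 
below turns a non-NFC string into a list of code points and back without 
normalizing anything.  The function names are hypothetical, and how such a 
sequence would actually be laid out as a graph is deliberately left out.

```python
def to_code_points(s: str) -> list[int]:
    """Represent a string as its sequence of Unicode code points."""
    return [ord(ch) for ch in s]


def from_code_points(cps: list[int]) -> str:
    """Rebuild the string from its code points, with no normalization."""
    return "".join(chr(cp) for cp in cps)


non_nfc = "e\u0301"                 # 'e' + COMBINING ACUTE ACCENT, not in NFC
cps = to_code_points(non_nfc)       # two integers: 0x65 and 0x301
assert from_code_points(cps) == non_nfc
```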

> 
> What you propose above seems to be somewhat different, i.e.
> normalized strings would represent unnormalized strings.
> But this would run into all kinds of problems, 

Then I won't pursue it - I merely meant it as an example of one 
approach.  If that doesn't work, no matter, as the one you have suggested 
does.

[...]

> 
>> That said, it does seem odd to me that we are precluding RDF from 
>> representing some legal fragments of XML 1.0 as XML Literals.  Please 
>> interpret "odd" as massive English understatement.
>>
>> This situation has arisen because we have been striving to be good 
>> citizens, especially with respect to internationalization and have 
>> adopted good practice earlier than some other specs.  This does not 
>> play well when we embed fragments of language conforming to those 
>> other specs in our language.  This is a situation when one has to 
>> consider the wisdom of trying to be "ahead of the pack".
> 
> 
> I think for RDF specifically, doing matching to build up the graph,
> and having this very clearly defined, was one of the basic requirements,
> and reasons for 'early' adoption.
> From a user point of view, different normalizations should match.
> But from a machine point of view, this can be a lot of work.
> Requiring clean data to start with is what we have proposed, and
> you have adopted.

I see.

> A possible alternative would be to not strictly
> require clean data, but to clearly blame any responsibility for
> matching problems on the side providing the dirty data.

That looks like a possible compromise - language of the form "SHOULD be 
in NFC" rather than "MUST be in NFC", as I suggested later in my email:

[...]

>>
>> I also wonder whether this issue might be addressed by toning down the 
>> language from MUST to SHOULD e.g.
>>
>> [...]
>>
>>> which includes the additional following para:
>>> [[
>>> The string in both plain and typed literals is required to
>>> be in Unicode Normal Form C [NFC]. This requirement is motivated
>>> by [Charmod] particularly section 4 Early Uniform Normalization.
>>> ]]
>>
>>
>> becomes something like
>>
>> [[
>> The string in both plain and typed literals SHOULD be in Unicode 
>> Normal Form C [NFC].  This is motivated by anticipation that 
>> [Charmod], particularly section 4 Early Uniform Normalization will 
>> become standardized practice.  Implementations SHOULD accept strings 
>> which are not in Normal Form C and MAY issue a warning in such 
>> circumstances.
>> ]]
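
For what it's worth, the SHOULD/MAY wording above might look something 
like this in an implementation.  This is only a sketch under my own 
assumptions: accept_literal is a made-up name, and using Python's warnings 
module is just one way of issuing the optional warning.

```python
import unicodedata
import warnings


def accept_literal(s: str) -> str:
    """Accept any string; warn, rather than reject, when it is not in NFC."""
    if unicodedata.normalize("NFC", s) != s:
        # MAY issue a warning; the string is still accepted unchanged.
        warnings.warn("literal is not in Unicode Normal Form C", UserWarning)
    return s


accept_literal("\u00e9")    # already in NFC, accepted silently
accept_literal("e\u0301")   # not in NFC, accepted with a warning
```

The string is returned as given in both cases; the implementation does not 
normalize it on the sender's behalf.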

I think I heard you say that you think such an approach would be 
acceptable to I18N.  Right?

Peter, would it work for you?

Brian

Received on Thursday, 2 October 2003 11:06:35 UTC