Re: Fwd "a comment on NFC"

From: Brian McBride <bwm@hplb.hpl.hp.com>
Subject: Re: Fwd "a comment on NFC"
Date: Thu, 02 Oct 2003 15:43:13 +0100

> Martin Duerst wrote:
> > 
> > At 11:00 03/10/02 +0100, Brian McBride wrote:

[...]

> >> One issue of concern to Peter is that the current specs prohibit us 
> >> saying in, say, OWL that some string (which is not in normal form C) is 
> >> not in normal form C.  I think this is wrong, in that it is possible 
> >> to invent a datatype whose lexical space consists of strings in normal 
> >> form C, but whose value space is not, that would allow the 
> >> representation of all strings.  The same could be done for XML 
> >> fragments, though one would then lose the benefit of the 
> >> parseType="Literal" convenience syntax.
> >> Thus whilst the RDF specs would not be providing a standard way of 
> >> representing non-NFC strings, it would not be preventing their 
> >> expression.
> > 
> > 
> > I'm a bit confused here, but I'll try to use my own words.
> > 
> > RDF would always be able to represent non-NFC strings, e.g. by
> > defining them as a collection/sequence of integers represented
> > by a graph. There is in my understanding nothing one can or should
> > do or be able to do to prevent that if somebody really wants to
> > do that.
> 
> Right.  I think that's the essential point - there are other ways of 
> representing non-NFC strings if you really want to.

Hmm.  I am not aware of any other way of representing strings in RDF
besides untyped literals and typed literals with datatypes related to
xsd:string.  None of these methods provides any way of representing non-NFC
strings.  

In any case, this seems to be a rather silly way of representing non-NFC
strings.  Why should anyone who wants to represent a non-NFC Unicode string
be forbidden to do so?  Yes, in many circumstances this is a bad thing, but
RDF is not about forbidding people from doing such bad things.  
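
For concreteness, here is a minimal Python sketch of what "not in NFC" means
for a string.  It is only an illustration: the helper name is_nfc and the
sample strings are assumptions of this sketch, not anything taken from the
RDF specs.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in Unicode Normal Form C."""
    # A string is in NFC exactly when normalizing it to NFC leaves it unchanged.
    return unicodedata.normalize("NFC", s) == s


# U+00E9 is the precomposed "e with acute accent".
composed = "caf\u00e9"
# "e" followed by U+0301 (combining acute accent) is the decomposed spelling.
decomposed = "cafe\u0301"

print(is_nfc(composed))
print(is_nfc(decomposed))
# The two spellings are different code point sequences until one is normalized.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The decomposed spelling is a perfectly good Unicode string; it is just not
in NFC, and that is the kind of string the current wording rules out.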

[...]

> > A possible alternative would be to not strictly
> > require clean data, but to clearly blame any responsibility for
> > matching problems on the side providing the dirty data.
> 
> That looks like a possible compromise - language of the form "SHOULD be 
> in NFC" rather than "MUST be in NFC", as I suggested later in my email:
> 
> [...]
> 
> >>
> >> I also wonder whether this issue might be addressed by toning down the 
> >> language from MUST to SHOULD e.g.
> >>
> >> [...]
> >>
> >>> which includes the additional following para:
> >>> [[
> >>> The string in both plain and typed literals is required to
> >>> be in Unicode Normal Form C [NFC]. This requirement is motivated
> >>> by [Charmod] particularly section 4 Early Uniform Normalization.
> >>> ]]
> >>
> >>
> >> becomes something like
> >>
> >> [[
> >> The string in both plain and typed literals SHOULD be in Unicode 
> >> Normal Form C [NFC].  This is motivated by anticipation that 
> >> [Charmod], particularly section 4 Early Uniform Normalization, will 
> >> become standardized practice.  Implementations SHOULD accept strings 
> >> which are not in Normal Form C and MAY issue a warning in such 
> >> circumstances.
> >> ]]
> 
> I think I heard you say that you think such an approach would be 
> acceptable to I18N.  Right?
> 
> Peter, would it work for you?

I see no problems with this approach.   
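
As a rough sketch of how an implementation might follow the proposed
wording (accept strings that are not in NFC, optionally warn), something
like the following Python would do.  The function name accept_literal and
the warning text are hypothetical; no RDF toolkit is being described here.

```python
import unicodedata
import warnings


def accept_literal(lexical_form: str) -> str:
    """Accept a literal string unchanged, warning if it is not in NFC."""
    if unicodedata.normalize("NFC", lexical_form) != lexical_form:
        # "MAY issue a warning": the string is reported but not rejected.
        warnings.warn(
            "literal string is not in Unicode Normal Form C", UserWarning
        )
    # "SHOULD accept": the string is returned exactly as given, with no
    # silent normalization.
    return lexical_form


# The decomposed spelling is accepted, with a warning; the composed one
# is accepted silently.
accept_literal("cafe\u0301")
accept_literal("caf\u00e9")
```

Note that the sketch does not normalize the string on the way in, so the
non-NFC string is what actually ends up being represented.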

> Brian

peter

Received on Thursday, 2 October 2003 17:52:31 UTC