Re: Fwd "a comment on NFC"

>At 11:00 03/10/02 +0100, Brian McBride wrote:



>>One issue of concern to Peter is that the current specs prohibit us 
>>from saying in, say, OWL that some string (which is not in normal form C) 
>>is not in normal form C.  I think this is wrong, in that it is 
>>possible to invent a datatype whose lexical space consists of 
>>strings in normal form C, but whose value space is not, that would 
>>allow the representation of all strings.  The same could be done 
>>for XML fragments, though we would then lose the benefit of the 
>>parseType="Literal" convenience syntax.
>>Thus whilst the RDF specs would not be providing a standard way of 
>>representing non-NFC strings, it would not be preventing their 
>>expression.
>
>I'm a bit confused here, but I'll try to use my own words.
>
>RDF would always be able to represent non-NFC strings, e.g. by
>defining them as a collection/sequence of integers represented
>by a graph.

RDF and OWL are not programming languages which can invent their own 
encodings. They are descriptive languages which describe things 
denoted by the expressions they use, according to semantic rules. So 
the point here is that if RDF syntax makes non-normal-form-C (NC) 
strings syntactically illegal, and if (as is the case) strings are 
denoted by themselves, then there is no RDF expression 
which is capable of *denoting* an NC string, and hence no such legal 
OWL expression (since OWL, in the interests of compatibility, uses 
RDF encoding for strings); unless the denotation is provided by some 
mechanism external to RDF/OWL, as Brian suggested.

>There is in my understanding nothing one can or should
>do or be able to do to prevent that if somebody really wants to
>do that.

It is not a matter of prevention: there is no way it could be done in 
OWL. OWL is not expressive enough to *describe* alternative encodings 
for strings, for example.

>What you propose above seems to be somewhat different, i.e.
>normalized strings would represent unnormalized strings.
>But this would run into all kinds of problems, because there
>are potentially a large number of unnormalized strings for
>a given normalized string, and it would be difficult to indicate
>which unnormalized string is denoted.

It could be done, for example, by writing the Unicode code point of 
every character in the %UUUU style, which yields a normalized string. 
I think this is not really an adequate response to Peter's point, 
however.
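
A rough sketch of such a construction, in Python, may make the idea 
concrete. Everything named here is an illustrative assumption rather 
than anything defined by the RDF or OWL specs: the function names are 
hypothetical, and the escape uses six hex digits instead of the four 
in %UUUU so that code points above U+FFFF fit in a fixed width.

```python
import string
import unicodedata

ESCAPE_WIDTH = 6  # hex digits per code point (assumption; covers up to U+10FFFF)


def encode_lexical(value: str) -> str:
    """Write each code point of `value` as '%' plus six uppercase hex digits."""
    return "".join(f"%{ord(ch):0{ESCAPE_WIDTH}X}" for ch in value)


def decode_lexical(lexical: str) -> str:
    """Map a lexical form back to the string it denotes."""
    step = ESCAPE_WIDTH + 1
    if len(lexical) % step != 0:
        raise ValueError("lexical form must be a sequence of %XXXXXX escapes")
    chars = []
    for i in range(0, len(lexical), step):
        marker, digits = lexical[i], lexical[i + 1:i + step]
        if marker != "%" or any(d not in string.hexdigits for d in digits):
            raise ValueError(f"bad escape at offset {i}")
        chars.append(chr(int(digits, 16)))
    return "".join(chars)


def is_nfc(text: str) -> bool:
    """True when `text` is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# Two canonically equivalent spellings of "é"; only the first is in NFC.
composed = "\u00e9"
decomposed = "e\u0301"
```

Under these assumptions every lexical form is plain ASCII and so is 
already in normal form C, while decode_lexical(encode_lexical(decomposed)) 
gives back the unnormalized string. The two spellings above also get 
different lexical forms, so there is no ambiguity about which 
unnormalized string is denoted.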

>(If the relationship between normalized and unnormalized
>strings were simply 1-to-1, we would have a much simpler
>life in the first place.)
>So I would like to see something like a 'proof by construction'
>for the idea of "it is possible to invent a datatype whose lexical
>space consists of strings in normal form C, but whose value
>space is not".
>
>
>>That said, it does seem odd to me that we are precluding RDF from 
>>representing some legal fragments of XML 1.0 as XML Literals. 
>>Please interpret "odd" as massive English understatement.
>>
>>This situation has arisen because we have been striving to be good 
>>citizens, especially with respect to internationalization and have 
>>adopted good practice earlier than some other specs.  This does not 
>>play well when we embed fragments of language conforming to those 
>>other specs in our language.  This is a situation when one has to 
>>consider the wisdom of trying to be "ahead of the pack".
>
>I think for RDF specifically, doing matching to build up the graph,
>and having this very clearly defined, was one of the basic requirements,
>and reasons for 'early' adoption.
>From a user point of view, different normalizations should match.
>But from a machine point of view, this can be a lot of work.
>Requiring clean data to start with is what we have proposed, and
>you have adopted. A possible alternative would be to not strictly
>require clean data, but to clearly blame any responsibility for
>matching problems on the side providing the dirty data.

I agree that seems like the best kind of compromise.
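
One way to picture that compromise is a check that accepts a string 
whether or not it is normalized, and only warns when it is not. This 
is a minimal sketch in Python; the function name and the warning text 
are hypothetical and do not come from any RDF implementation.

```python
import unicodedata
import warnings


def accept_literal(text: str) -> str:
    """Accept any string; warn, rather than reject, when it is not in NFC."""
    if unicodedata.normalize("NFC", text) != text:
        warnings.warn(
            "literal is not in Unicode Normal Form C; matching may fail",
            UserWarning,
            stacklevel=2,
        )
    # The string is returned unchanged: the receiver does not repair it.
    return text
```

Because the string is passed through as-is, "e\u0301" and "\u00e9" 
would still be different labels in the graph; the warning only records 
that the mismatch is the responsibility of whoever supplied the 
unnormalized data.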

Pat

>
>
>Regards,    Martin.
>
>
>>I am tempted by an idea I will attribute to pfps, though I'm not 
>>sure he is advocating it, that we should report these difficulties 
>>we have encountered trying to deploy charmod to I18N and seek their 
>>advice on managing the transition, specifically given that we embed 
>>fragments of non-conforming languages in ours.
>>
>>I also wonder whether this issue might be addressed by toning down 
>>the language from MUST to SHOULD e.g.
>>
>>[...]
>>
>>>which includes the additional following para:
>>>[[
>>>The string in both plain and typed literals is required to
>>>be in Unicode Normal Form C [NFC]. This requirement is motivated
>>>by [Charmod] particularly section 4 Early Uniform Normalization.
>>>]]
>>
>>becomes something like
>>
>>[[
>>The string in both plain and typed literals SHOULD be in Unicode 
>>Normal Form C [NFC].  This is motivated by the anticipation that 
>>[Charmod], particularly section 4 Early Uniform Normalization, will 
>>become standardized practice.  Implementations SHOULD accept 
>>strings which are not in Normal Form C and MAY issue a warning in 
>>such circumstances.
>>]]
>>
>>Brian


-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes

Received on Thursday, 2 October 2003 12:41:04 UTC