- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Wed, 26 Sep 2001 18:46:15 +0100
- To: w3c-rdfcore-wg@w3.org
Dan Connolly wrote: > > Er... whoa! I'm having trouble following this thread... > I doubt I'll be prepared to decide on this by Friday. I had misunderstood the earlier silence. > > I need direct implementation experience with whatever > proposal we come up with in order to be comfortable. > (test cases are really important for that, btw.) I think we need to agree something, and then review it in light of implementation experience. Without some initial agreement it is very hard to know what we should be getting experience of. URI or lang tag =============== I had a little trouble understanding Dan's comments about the (string, URI) pair. What the text was trying to say is that we were still considering that possibility and for now we agree that at least we have a (string, lang-tag) pair, and that since lang-tags can be given URI's like in Graham's rfc draft, we see that work we do now based on (string, lang-tag) is sufficiently likely to be extensible to (string, URI) pair. I personally didn't feel able to do the (string,URI) pair idea justice, and would like someone else to have a stab at it. For now, I am suggesting we agree on a (string, lang-tag) baseline without ruling out, the (string,URI) extension. Perhaps paras [2] and [3] need rewording to emphasize that possibility. If you feel that would improve this aspect of the document sufficiently, I can have a go. Datatypes ========= > > [3] > > NOTE: The RDF Core Working Group has yet to consider whether > > such an approach would be useful for integrating XML schema > > datatyping with RDF. > > I'm not comfortable with that either (as I've said). We do need to consider how XML Schema datatypes fits, and the literals document deliberately does not. This is simply to partition the issues. Once we have a proposal for XML Schema datatypes and RDF we will need to review the Literal representation. On the other hand, I *am* trying to rule out ideas like the Literal is really a complex datatype and not a string. Jan, if I have understood correctly, would like a Literal to be almost an arbitrary object, in an OO sort of way (sorry if I misrepresent that position). I think it has been considered whether an xml literal value should be represented in RDF by its infoset (not a string). I have tried to argue the "string" corner here. We can represent a complex object like an xml nodeset as a string, we just need to agree on canonicalization rules. I hope, naively, that this same approach can be extended to other datatypes, so that from an RDF point of view all Literals are just strings, with one other piece of information (e.g. the URI) that allows us to interpret them appropriately in an application setting. rdf:parseType="Literal" etc. ======================= DanC: > > For the rest, I'd have to have my implementation-source-code > open in the other window to review it carefully. No > time for that just now. (I'm in another telcon :-( ). It would be very useful, the intent with rdf:parseType="Literal" is that, modulo bugs, any reasonable interpretation of M&S should remain legal. It was quite hard to say that and constrain the RDF processor at all! W3C & NFC Normalization (last item - message can be abandoned when you get bored with this bit!) ======================= > > > [1a] > > The Unicode String in an RDF Literal is normalized according > > to Unicode Normalization Form C [NFC, NFC-Corrigendum], using > > a framework of early uniform normalization. > > Yikes! I haven't seen enough motivation to change the RDF 1.0 > spec in that way. (maybe the motivation is there and I > just haven't read it.) I will start with two quotes from charmod about early normalization: "Current receiving components (browsers, XML parsers, etc.) implicitly assume early normalization by not performing normalization themselves. This is a vast legacy." and "Not all components of the Web that implement functions such as string matching can reasonably be expected to do normalization. This, in particular, applies to very small components and components in the lower layers of the architecture." I think RDF is one such component (lower layer) and the stuff I wrote on normalization has the intent of being that RDF implementators do not do any normalization, since it is all done elsewhere. I am not deeply wedded to para [1a], it is however a consequence of the rest of the document. I'll replay the argument. The short version is: + Under charmod document producers must produce W3C-Normalized documents, i.e. in NFC when transcoded and after charcater reference expansion. + More simply, the documents MUST be W3C-Normalized (this is implicitly a requirement on the producer). + A W3C-Normalized document is, when in Unicode, in NFC. + An NFC RDF document will necessrily only contain NFC literals. Here's the long version: In the old M&S there is a reference to the difficulty of equality for unicode strings. vis: (P217) Two RDF strings are deemed to be the same if their ISO/IEC 10646 representations match. Each RDF application must specify which one of the following definitions of 'match' it uses: (P218) the two representations are identical, or (P219) the two representations are canonically equivalent as defined by The Unicode Standard [Unicode]. Note: The W3C I18N WG is working on a definition for string identity matching. This definition will most probably be based on canonical equivalences according to the Unicode standard and on the principle of early uniform normalization. Users of RDF should not rely on any applications matching using the canonical equivalents, but should try to make sure that their data is in the normalized form according to the upcoming definitions. I feel that my text is an attempt at clarifying this. In particular (P219) above is deleted, and the Note upgraded to being normative. I admit that putting Normal Form C right at the beginning makes it look like a burden on implementators when in fact it is less burdensome than M&S. 2: charmod contains the definition of equality referred to, this definition is based on a (proposed) basic underlying principle of a world-wide web architecture: "Early Uniform Normalization". Quick intro to normalization ---------------------------- The example given in charmod about unicode normalization concerns precomposed or decomposed characters, vis: http://www.w3.org/International/Group/charmod-edit#sec-normalization-examples (charmod section 4.2) > The string "suçon", expressed as the sequence of five characters > U+0073 U+0075 U+00E7 U+006F U+006E and encoded in a Unicode encoding > form, is both Unicode-normalized and W3C-normalized. > The string "suçon", expressed as the sequence of six characters U+0073 > U+0075 U+0063 U+0327 U+006F U+006E (U+0327 is the COMBINING CEDILLA) > and encoded in a Unicode encoding form, is [not] Unicode-normalized > (since the combining sequence U+0063 U+0327 is replaceable by the > precomposed U+00E7) W3C normalization is expressed as concerning the role of charcter references etc in the document. For RDF literal values I believe this concept needs extension since "suc<!-- A comment separating two composable characters -->̧on" is W3C-normalized according to the current draft, but comment stripping makes it unnormalized. Early uniform normalization --------------------------- http://www.w3.org/International/Group/charmod-edit#sec-Normalization The key idea is that as soon as an text is converted into unicode (any UCS-based encoding) then it should be normalized, and that any application that does not *create* new unicode text does not need to do much more than treat unicode string as uninterpreted binary. The motivations for this *early* normalization are given in charmod (section 4.1) Normalization responsibilities ------------------------------ Charmod section 4.3. In trying to understand the relevance of this section to RDF we have to equate terms like producer, recipient and proxy with the elements of RDF processing. I treat an RDF/XML processor (in my view the parser) as a recipient, and my [11] RDF/XML processors MUST NOT normalize input from an XML document that is encoded in a UCS-based encoding. c.f. [CHARMOD] for rationale. follows from "The recipients of text data [...] MUST NOT normalize it." in charmod. The requirement to use a normalizing transcoder, my para [10] and [14] [10] When parsing RDF/XML the XML processor, if necessary, converts the XML document to the UCS character domain. When doing this from any encoding that is not UCS-based this conversion SHOULD use Unicode Normalization Form C [NFC, NFC-Corrigendum]. [14] Summary of text normalization for RDF/XML processors. RDF/XML processors MUST use a normalizing transcoder from non-UCS-based encodings. RDF/XML processors MUST NOT do any other text normalization. (cf. http://www.w3.org/TR/charmod/#def-normalizing-transcoder ) follows from charmod: "Recipients which transcode text data from a legacy encoding to a Unicode encoding form MUST use a normalizing-transcoder." My paras [9] and [12] [9] RDF/XML documents MUST be W3C-normalized. An early uniform normalization framework is used. See [CHARMOD] for definition. [12] RDF/XML documents SHOULD be W3C-normalized as specified in [CHARMOD]. Moreover, after the stripping of comments and processing instructions an RDF/XML document SHOULD still be W3C-normalized. It is the responsibility of the document creator to fulfil this requirement. RDF/XML processors MUST NOT correct input that is not W3C-normalized. (RDF/XML processors may detect lack of W3C-normalization in an input document, and issue a diagnostic). follow from charmod's "Producers MUST produce text data in normalized form." Having got to [9] that RDF/XML documents must be W3C-normalized (which was Graham step), I noted that then the Literals in them are also Unicode Normalized form C, and it seemed sensible to propogate this up to the top level. It does leave the problem of then specifying that rdf/xml processor must not normalize unnormalized input, and prohibiting unnormlized input, which is ugly. Jeremy
Received on Wednesday, 26 September 2001 13:42:09 UTC