Re: 2001-09-07#5 Literals

Dan Connolly wrote:
> 
> Er... whoa! I'm having trouble following this thread...
> I doubt I'll be prepared to decide on this by Friday.

I had misunderstood the earlier silence.

> 
> I need direct implementation experience with whatever
> proposal we come up with in order to be comfortable.
> (test cases are really important for that, btw.)

I think we need to agree something, and then review it in light of
implementation experience.

Without some initial agreement it is very hard to know what we should be
getting experience of.


URI or lang tag
===============

I had a little trouble understanding Dan's comments about the (string,
URI) pair.

What the text was trying to say is that we were still considering that
possibility and for now we agree that at least we have a (string,
lang-tag) pair, and that since lang-tags can be given URI's like in
Graham's rfc draft, we see that work we do now based on (string,
lang-tag) is sufficiently likely to be extensible to (string, URI) pair.

I personally didn't feel able to do the (string,URI) pair idea justice,
and would like someone else to have a stab at it. For now, I am
suggesting we agree on a (string, lang-tag) baseline without ruling out,
the (string,URI) extension.

Perhaps paras [2] and [3] need rewording to emphasize that possibility.
If you feel that would improve this aspect of the document sufficiently,
I can have a go.

Datatypes
=========
> > [3]
> >    NOTE: The RDF Core Working Group has yet to consider whether
> >    such an approach would be useful for integrating XML schema
> >    datatyping with RDF.
> 
> I'm not comfortable with that either (as I've said).

We do need to consider how XML Schema datatypes fits, and  the literals
document deliberately does not.
This is simply to partition the issues. Once we have a proposal for XML
Schema datatypes and RDF we will need to review the Literal
representation.

On the other hand, I *am* trying to rule out ideas like the Literal is
really a complex datatype and not a string. Jan, if I have understood
correctly, would like a Literal to be almost an arbitrary object, in an
OO sort of way (sorry if I misrepresent that position). I think it has
been considered whether an xml literal value should be represented in
RDF by its infoset (not a string).

I have tried to argue the "string" corner here. We can represent a
complex object like an xml nodeset as a string, we just need to agree on
canonicalization rules.

I hope, naively, that this same approach can be extended to other
datatypes, so that from an RDF point of view all Literals are just
strings, with one other piece of information (e.g. the URI) that allows
us to interpret them appropriately in an application setting.


rdf:parseType="Literal" etc.
=======================
DanC:
> 
> For the rest, I'd have to have my implementation-source-code
> open in the other window to review it carefully. No
> time for that just now. (I'm in another telcon :-( ).

It would be very useful, the intent with rdf:parseType="Literal" is
that, modulo bugs, any reasonable interpretation of M&S should remain
legal. It was quite hard to say that and constrain the RDF processor at
all!


W3C & NFC Normalization  (last item - message can be abandoned when you
get bored with this bit!)
=======================

> 
> > [1a]
> > The Unicode String in an RDF Literal is normalized according
> > to Unicode Normalization Form C [NFC, NFC-Corrigendum], using
> > a framework of early uniform normalization.
> 
> Yikes! I haven't seen enough motivation to change the RDF 1.0
> spec in that way. (maybe the motivation is there and I
> just haven't read it.)

I will start with two quotes from charmod about early normalization:

"Current receiving components (browsers, XML parsers, etc.) implicitly
assume early normalization by not performing normalization themselves.
This is a vast legacy."

and

"Not all components of the Web that implement functions such as string
matching can reasonably be expected to do normalization. This, in
particular, applies to very small components and components in the lower
layers of the architecture."

I think RDF is one such component (lower layer) and the stuff I wrote on
normalization has the intent of being that RDF implementators do not do
any normalization, since it is all done elsewhere.

I am not deeply wedded to para [1a], it is however a consequence of the
rest of the document.
I'll replay the argument.

The short version is:
+ Under charmod document producers  must produce W3C-Normalized
documents, i.e. in NFC when transcoded and after charcater reference
expansion.
+ More simply, the documents MUST be W3C-Normalized (this is implicitly
a requirement on the producer).
+ A W3C-Normalized document is, when in Unicode, in NFC.
+ An NFC RDF document will necessrily only contain NFC literals.

Here's the long version:

In the old M&S there is a reference to the difficulty of equality for
unicode strings. vis:

   (P217) Two RDF strings are deemed to be the same if their ISO/IEC 
   10646 representations match. Each RDF application must specify 
   which one of the following definitions of 'match' it uses:

    (P218) the two representations are identical, or 
    (P219) the two representations are canonically equivalent 
           as defined by The Unicode Standard [Unicode]. 

    Note: The W3C I18N WG is working on a definition for string identity 
    matching. This definition will most probably be based on canonical 
    equivalences according to the Unicode standard and on the principle 
    of early uniform normalization. Users of RDF should not rely on any 
    applications matching using the canonical equivalents, but should
try 
    to make sure that their data is in the normalized form according to 
    the upcoming definitions. 

I feel that my text is an attempt at clarifying this. In particular
(P219) above is deleted, and the Note upgraded to being normative. I
admit that putting Normal Form C right at the beginning makes it look
like a burden on implementators when in fact it is less burdensome than
M&S.

2: charmod contains the definition of equality referred to, this
definition is based on a (proposed) basic underlying principle of a
world-wide web architecture: "Early Uniform Normalization".

Quick intro to normalization
----------------------------
The example given in charmod about unicode normalization concerns
precomposed or decomposed characters, vis: 


http://www.w3.org/International/Group/charmod-edit#sec-normalization-examples

(charmod section 4.2)
> The string "suçon", expressed as the sequence of five characters 
> U+0073 U+0075 U+00E7 U+006F U+006E and encoded in a Unicode encoding 
> form, is both Unicode-normalized and W3C-normalized. 

> The string "suçon", expressed as the sequence of six characters U+0073 
> U+0075 U+0063 U+0327 U+006F U+006E (U+0327 is the COMBINING CEDILLA) 
> and encoded in a Unicode encoding form, is [not] Unicode-normalized 
> (since the combining sequence U+0063 U+0327 is replaceable by the 
> precomposed U+00E7) 


W3C normalization is expressed as concerning the role of charcter
references etc in the document. For RDF literal values I believe this
concept needs extension since
"suc<!-- A comment separating two composable characters -->&#x327;on" 
is W3C-normalized according to the current draft, but comment stripping
makes it unnormalized.


Early uniform normalization
---------------------------
  http://www.w3.org/International/Group/charmod-edit#sec-Normalization

The key idea is that as soon as an text is converted into unicode (any
UCS-based encoding) then it should be normalized, and that any
application that does not *create* new unicode text does not need to do
much more than treat unicode string as uninterpreted binary.

The motivations for this *early* normalization are given in charmod
(section 4.1)

Normalization responsibilities
------------------------------
  
Charmod section 4.3.

In trying to understand the relevance of this section to RDF we have to
equate terms like producer, recipient and proxy with the elements of RDF
processing.

I treat an RDF/XML processor (in my view the parser) as a recipient, and
my

  [11]
  RDF/XML processors MUST NOT normalize input from an XML document 
  that is encoded in a UCS-based encoding. c.f. [CHARMOD] for 
  rationale.

follows from

   "The recipients of text data [...] MUST NOT normalize it." 

in charmod.

The requirement to use a normalizing transcoder, my para [10] and [14] 


  [10]
  When parsing RDF/XML the XML processor, if necessary, converts
  the XML document to the UCS character domain. When doing this
  from any encoding that is not UCS-based this conversion SHOULD
  use Unicode Normalization Form C [NFC, NFC-Corrigendum].

  [14]
  Summary of text normalization for RDF/XML processors.
  RDF/XML processors MUST use a normalizing transcoder
  from non-UCS-based encodings.
  RDF/XML processors MUST NOT do any other text normalization.
  (cf. http://www.w3.org/TR/charmod/#def-normalizing-transcoder )

follows from charmod:

"Recipients which transcode text data from a legacy encoding to a
Unicode encoding form MUST use a normalizing-transcoder."


My paras [9] and [12] 
  [9]
  RDF/XML documents MUST be W3C-normalized. 
  An early uniform normalization framework is used.
  See [CHARMOD] for definition.

  [12]
  RDF/XML documents SHOULD be W3C-normalized as specified in
  [CHARMOD]. Moreover, after the stripping of comments and
  processing instructions an RDF/XML document SHOULD still be
  W3C-normalized. It is the responsibility of the document
  creator to fulfil this requirement. RDF/XML processors MUST NOT
  correct input that is not W3C-normalized. (RDF/XML processors 
  may detect lack of W3C-normalization in an input document, and 
  issue a diagnostic).

follow from charmod's 
  "Producers MUST produce text data in normalized form." 


Having got to [9] that RDF/XML documents must be W3C-normalized (which
was Graham step), I noted that then the Literals in them are also
Unicode Normalized form C, and it seemed sensible to propogate this up
to the top level.

It does leave the problem of then specifying that rdf/xml processor must
not normalize unnormalized input, and prohibiting unnormlized input,
which is ugly.




Jeremy

Received on Wednesday, 26 September 2001 13:42:09 UTC