Full normalization from Jeremy Carroll on 2001-10-02 (www-i18n-comments@w3.org from October 2001)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Tue, 2 Oct 2001 12:11:00 +0100
To: <www-i18n-comments@w3.org>
Message-ID: <JAEBJCLMIFLKLOJGMELDAEEECCAA.jjc@hplb.hpl.hp.com>

Background
==========

I saw your request not to review the current draft of charmod, but took it
to mean not to review it *specifically*. I had a few new comments against
the previous draft that still seem pertinent.

My comments are based on "implementation experience", i.e. trying to
contribute bits of text to the RDF Core WG designed to go into a spec
conforming with (the previous version of) charmod. I will make these
comments in the language of the new version. If you prefer, I would be happy
to repost these comments during the next public comment period.

I split separate issues under separate headings in this e-mail.

FYI my attempt to explore RDF literals and charmod conformance is currently
at:

http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0341.html

[In the RDF Core WG, I currently have an action to separate out the
charmod-and-literals issues from the other literals issues; I'll be happy to
post a pointer to that to this list when I have done it. Please e-mail me if
that would be helpful.]

Issue XML comments
==================

Example:  "suc<!-- comment -->&#x0327;on"

By charmod, "suc<!-- comment -->&#x0327;on" is fully normalized. An XML
processor that strips comments then ends up with a non-normalized string
(the "c" is now immediately followed by the combining cedilla U+0327),
which appears to require further normalization. Early uniform normalization
may be better represented by including comments in the definition of full
normalization.
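
A minimal sketch of the example above in Python. It assumes that the
standard-library xml.etree.ElementTree parser (which discards comments by
default) can stand in for "an XML processor that strips comments", and that
a plain NFC check is an adequate proxy for charmod's notion of
normalization. The wrapping <p> element and all variable names are
illustrative only.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Raw markup from the example, wrapped in a hypothetical <p> element
# so that it forms a well-formed document.
source = "<p>suc<!-- comment -->&#x0327;on</p>"


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# The markup itself is pure ASCII, so it is trivially in NFC.
markup_is_nfc = is_nfc(source)

# The default parser drops the comment and expands the character
# reference, leaving "c" directly in front of U+0327.
stripped = "".join(ET.fromstring(source).itertext())
stripped_is_nfc = is_nfc(stripped)

# NFC composes "c" + U+0327 into U+00E7 (c with cedilla).
renormalized = unicodedata.normalize("NFC", stripped)

print(markup_is_nfc, stripped_is_nfc, renormalized == "su\u00e7on")
```

Under these assumptions the markup passes the check while the
comment-stripped text does not, which is the gap described above.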


Issue XPath string-value
========================

More generally, for the many XML specs based around an XPath Nodeset data
model, the XPath string-value is the crucial representation of strings from
the XML document.

For XML elements this is defined at:
http://www.w3.org/TR/xpath.html#element-nodes
as "the concatenation of the string-values of all text node descendants of
the element node in document order." Unfortunately, requiring all of these
string-values to be in NFC may be burdensome on document authoring tools.
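
To illustrate why this is a burden, here is a rough Python sketch in which
every text node is in NFC on its own but the element's string-value is not.
It uses the standard-library xml.etree.ElementTree and approximates the
XPath string-value by joining itertext(). The sample document, the <p> and
<b> element names, and the variable names are made up for illustration, and
plain NFC is used rather than charmod's stricter definition.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# Hypothetical document: the combining cedilla U+0327 is the first
# character of the text node inside <b>.
element = ET.fromstring("<p>suc<b>&#x0327;on</b></p>")

# Text node descendants in document order, then their concatenation,
# which approximates the XPath string-value of <p>.
text_nodes = list(element.itertext())
string_value = "".join(text_nodes)

each_node_nfc = all(is_nfc(node) for node in text_nodes)
string_value_nfc = is_nfc(string_value)

print(text_nodes, each_node_nfc, string_value_nfc)
```

An authoring tool that checks text nodes one at a time would accept this
document, so guaranteeing NFC string-values means checking across element
boundaries for every ancestor element.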

Issue Full Normalization as document syntax dependent
=====================================================

The second note in subsection "4.2.2. Fully Normalized Text" acknowledges
that "Full normalization is specified against the context of a markup
language". I wonder whether this should be upgraded to a requirement on
specifications: if a specification defines a class of documents, it should
define full normalization for those documents. (For example, a new syntax
brings its own full normalization, and a new interpretation may also stress
certain string concatenations, which are then included in the definition of
full normalization.) For RDF/XML, for instance, the formation of literals
(which are a particular XPath string-value) is a stressed concatenation,
whereas other XPath string-values are unimportant and can safely be left
unnormalized.

There is also merit in defining full normalization once and for all for XML.
Notice my lack of commitment one way or the other.

Issue Full Normalization as a Web Content requirement
=====================================================

See particularly:
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Sep/0347.html

where Graham Klyne says:
> OK.  If it's important, then why not "documents MUST be W3C-normalized"?

I had no answer to that; it does seem to summarise early uniform
normalization more concisely than "4.3 Responsibility for Normalization",
which I had used as my template.

This turns the requirements on recipients and producers [I] into equivalent
requirements on specifications of documents and on documents themselves
[S][C].



Aside: I find the [S] [I] [C] labels a very significant improvement to
charmod.


Issue Responsibilities "Proxy" versus "Recipient"
=================================================

Regarding section 4.3, Responsibility for Normalization: when considering
an RDF Processor (whatever that is), I have difficulty in deciding whether
the "proxy" rules or the "recipient" rules apply. In particular, the
requirement that proxies MUST NOT reject un-normalized data forces a
decision as to the role of a component, which may be unnatural. (Consider a
web site mirror, for example, which could be considered either a proxy or a
recipient and a producer. One sort of RDF processor may be quite like a
mirror, except that it picks up document *fragments* from around the web and
merges them into a single document.)

I think that "Proxies MAY reject unnormalized data" would be consistent with
the early uniform normalization framework, and resolve this issue.



----

Congratulations on your latest working draft.
I have found your work genuinely helpful.

Jeremy Carroll
Received on Tuesday, 2 October 2001 07:11:45 UTC