Comments on Character Model from Cliff Schmidt on 2002-06-06 (www-i18n-comments@w3.org from June 2002)

From: Cliff Schmidt <cschmidt@microsoft.com>
Date: Wed, 5 Jun 2002 18:56:33 -0700
To: <www-i18n-comments@w3.org>
Message-ID: <56791EC22E349D428D0C05C89EAB8A2405DCD1FA@red-msg-05.redmond.corp.microsoft.com>
OVERVIEW

Most of the substantive issues described below share a common concern:

In the effort to improve interoperability of text exchange across open
applications on the Web, the Character Model should not restrict the
ability of closed systems to leverage the Web and Web-based
technologies.  The term "closed system", as used in this document,
refers to a system designed to support organizations communicating among
themselves based on a contract into which all parties have explicitly
entered.

The background of this spec states that "the Web may be seen as a
single, very large application...rather than as a collection of small
independent applications."  Based on this premise, it is understandable
why the CharMod spec chooses to require early normalization in a single
canonical form. However, the Web and technologies that have developed to
support the Web have also provided enormous value to closed systems,
including intranet and extranet scenarios.  The relationship between the
evolution of the World Wide Web and its use in closed/private systems
has been a mutually beneficial one.  Private systems have benefited from
the efficiencies of applying Web-developed standards and tools, which
has in turn increased the demand and support for these Web-enabling
components.  The current Character Model spec threatens to break this
relationship by forcing restrictions on tools that are commonly used in
closed systems, in order to exclusively support the goals of the open
system Web. 

It is apparent that the I18N WG has solid reasons for preferring
Normalization Form C as an interchange format for the Web; however, it
is not likely to be the optimal choice for all applications.  There are
many legacy systems (both applications and operating systems) that use a
decomposed character normalization.  It will be difficult for
organizations to justify why they should adopt CharMod-based
technologies (such as XML 1.1 over XML 1.0), which require transcoding
to a less optimal normalization form with no benefit for their closed
system.  This is likely to lead to fractured use of technologies such as
XML 1.0/1.1.    
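
To make the transcoding cost concrete, here is a minimal Python sketch
of the round trip a decomposed-form system would have to perform.  It
uses only the standard unicodedata module; the sample string and the
helper names to_interchange/from_interchange are illustrative
assumptions, not taken from any particular legacy system.

```python
import unicodedata

# Hypothetical sample text: "résumé" written with precomposed U+00E9.
sample = "r\u00e9sum\u00e9"

nfc = unicodedata.normalize("NFC", sample)  # composed form
nfd = unicodedata.normalize("NFD", sample)  # decomposed form: e + U+0301


def to_interchange(native_nfd: str) -> str:
    """Transcode natively decomposed text to NFC before sending it."""
    return unicodedata.normalize("NFC", native_nfd)


def from_interchange(received_nfc: str) -> str:
    """Transcode received NFC text back to the native decomposed form."""
    return unicodedata.normalize("NFD", received_nfc)


round_tripped = from_interchange(to_interchange(nfd))

print(len(nfc), len(nfd))    # 6 8: same text, different code point counts
print(round_tripped == nfd)  # True: two extra passes, no change in content
```

In a closed system where every party already stores decomposed text,
both passes are pure overhead.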

Finally, XML plays an important role as a data-interchange format in
scalable, loosely coupled systems.  The Character Model reduces XML to a
format applicable only to natural language communication in one
particular normalization form.  This is unfortunate considering that
vastly more bytes of machine-to-machine XML are transmitted than are
people-to-people or people-machine bytes.

The restrictions mandated by the Character Model limit the use of the
Web and Web-based technologies for a large base of users.  While
supporting the vision for the Web as a "single, very large application",
the limitations on other uses of the Web do not appear to support the
Character Model's goal to "facilitate the use of the Web by all
people".




ISSUES LIST

ISSUE 1(a): Inconsistent/Redundant Requirements for W3C Spec Conformance

IMPACT: Editorial
SECTION: 2 Conformance
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Conformance)
------------------------------------------------------------------------
----
"[S] Every W3C specification MUST:
1. conform to the requirements applicable to specifications,
2. specify that implementations MUST conform to the requirements
applicable to software, and 
3. specify that content created according to that specification MUST
conform to the requirements applicable to content.

[S] If an existing W3C specification does not conform to the
requirements in this document, then the next version of that
specification SHOULD be modified in order to conform."
------------------------------------------------------------------------
----
CONCERN:
Stating that all specs "MUST" conform, but that non-conforming specs
"SHOULD" be modified appears to be inconsistent.  

RECOMMENDATION:
The conformance model should only apply to future specs (including
future versions of current specs), instead of specifying different
conformance levels for existing and future versions.  


ISSUE 1(b): W3C Spec Conformance 
IMPACT: Substantive
SECTION: 2 Conformance
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Conformance)
------------------------------------------------------------------------
----
(same section as above)
------------------------------------------------------------------------
----
CONCERN:
The CharMod's requirement that all specs and related implementations
conform to the entire CharMod spec will force non-NFC based applications
to perform round trip transcoding to/from NFC in order to use the Web,
even in closed system scenarios (e.g. extranets).  This will also affect
intranet scenarios as corporate systems are forced to jump through hoops
in order to satisfy text processors (such as XML parsers) that are
required to reject non-NFC text.  The costs certainly outweigh the
benefits for closed systems.  However, it is clear that a recommended
conformance level would improve open system interoperability. 

RECOMMENDATION:
Replace conformance paragraph and included list with the following
sentence: "[S] Future W3C specifications (including future versions of
existing specifications) MUST reference this specification as W3C
recommended guidance for interoperable Web applications."



ISSUE 2: Full Range of Unicode Code Points Not Allowed in XML
IMPACT: Editorial
SECTION: 3.5 Reference Processing Model
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefProcModel)
------------------------------------------------------------------------
----
"[S] Specifications SHOULD allow the use of the full range of Unicode
code points from U+0000 to U+10FFFF inclusive; code points above
U+10FFFF MUST NOT be used."
------------------------------------------------------------------------
----
CONCERN:
If this is truly a goal for text on the Web, users should understand why
XML is unable to achieve it.  Because XML is a high-profile W3C spec,
readers are likely to notice the inconsistent message.  Does this mean
that I18N believes that XML (1.1 or some later version) should support
the characters 0x0-0x1F?

RECOMMENDATION:
If XML 1.1 is unable to achieve this goal, the Character Model spec
should either remove this requirement or explain the discrepancy.
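
The discrepancy can be illustrated with a short Python sketch that
encodes the Char production of XML 1.0 and lists the code points in
0x0-0x1F that it excludes.  The function name is_xml10_char is an
assumption made for this example; the ranges are those of the XML 1.0
Recommendation.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Control characters in 0x0-0x1F that XML 1.0 does not allow.
excluded_controls = [cp for cp in range(0x00, 0x20) if not is_xml10_char(cp)]

print(len(excluded_controls))  # 29: everything except tab, LF and CR
print(is_xml10_char(0x0))      # False, although CharMod's range starts here
```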



ISSUE 3: Definition of "Fully-Normalized"
IMPACT:  Editorial
SECTION: 4.2.3 Fully-normalized text
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-FullyNormalized)
------------------------------------------------------------------------
----
"Text is fully-normalized if: 
1. the text is in a Unicode encoding form, is include-normalized and
none of the constructs comprising the text begin with a composing
character or a character escape representing a composing character; or 
2. the text is in a legacy encoding and, if it were transcoded to a
Unicode encoding form by a normalizing transcoder, the resulting text
would satisfy clause 1 above."
------------------------------------------------------------------------
----
CONCERN:
Based on previous definitions, "Unicode-normalized" may be a more
precise term than "Unicode encoding form" (if the implication is that
full normalization requires include-normalization, which requires
Unicode normalization as defined in 4.2.1,
http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UnicodeNormalized).

RECOMMENDATION: 
Refer to text that is "Unicode-normalized" (possibly linked to the
definition at
http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UnicodeNormalized),
instead of "Unicode encoding form".
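
As a rough illustration of how the quoted definition layers on top of
Unicode normalization, the Python sketch below checks a list of
constructs.  It rests on two loudly stated assumptions: "composing
character" is approximated by a non-zero canonical combining class,
which may not match the draft's definition exactly, and character
escapes and inclusions are ignored.  The function names are
hypothetical.

```python
import unicodedata
from typing import Sequence


def starts_with_combining(construct: str) -> bool:
    """Approximation: first character has a non-zero combining class."""
    return bool(construct) and unicodedata.combining(construct[0]) != 0


def looks_fully_normalized(constructs: Sequence[str]) -> bool:
    """Each construct must be in NFC and must not begin with a
    combining character (escapes and inclusions are not handled)."""
    for construct in constructs:
        if unicodedata.normalize("NFC", construct) != construct:
            return False  # not Unicode-normalized
        if starts_with_combining(construct):
            return False  # would combine with whatever precedes it
    return True


print(looks_fully_normalized(["caf\u00e9", "au lait"]))  # True
print(looks_fully_normalized(["cafe\u0301"]))            # False: not NFC
print(looks_fully_normalized(["\u0301x"]))               # False: leading accent
```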



ISSUE 4: Mandating NFC for All Web Content
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
------------------------------------------------------------------------
----
"[C] In order to conform to this specification, all text content on the
Web MUST be in include-normalized form and SHOULD be in fully-normalized
form. 
 [S] Specifications of text-based formats and protocols MUST, as part of
their syntax definition, require that the text be in normalized form."
------------------------------------------------------------------------
----
CONCERN:
This restriction is currently applied to "all text content on the Web"
when only content intended for interoperability in open systems will
necessarily benefit from it.  Although the first requirement for content
producers ([C]) could be interpreted to have no impact on intranet
scenarios, the second requirement for specifications ([S]) will impact
the tools that intranet scenarios have been depending on.  

RECOMMENDATION: 
Replace above text with text similar to:
"[C] In order to conform to this specification, all text content on the
Web intended for consumption by foreign systems MUST be in
include-normalized form and SHOULD be in fully-normalized form. 
 [S] Specifications of text-based formats and protocols MUST, as part of
their syntax definition, reference the above requirement."



ISSUE 5: Text-Processors MUST Perform Normalization Checking
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
------------------------------------------------------------------------
----
"[S] [I] A text-processing component that receives suspect text MUST NOT
perform any normalization-sensitive operations unless it has first
confirmed through inspection that the text is in normalized form, and
MUST NOT normalize the suspect text.  Private agreements MAY, however,
be created within private systems which are not subject to these rules,
but any externally observable results MUST be the same as if the rules
had been obeyed."
------------------------------------------------------------------------
----
CONCERN:
This requirement will force technologies such as XML parsers to be tied
to the latest list of NFC disallowed diacritic characters in order to
check normalization.  Additionally, in some cases ("MAYBE" cases) NFC
checks require text processors to scan backwards through a text stream
in order to confirm normalization status.  This will require major
architectural changes for any processors designed to break a text stream
into separate smaller windows for efficient processing, because no
previously processed buffer can be thrown away until it is no longer
needed to confirm the validity of any diacritic code points at the start
of the next buffer.  Text processors that expand character entities
today at least have the ability to note the '&' flag.  It is also worth
noting that optimizers of normalization checks will observe that all
code points < 0x341 are always allowable.  This would result in
non-English based texts being disproportionately impacted by
normalization checks.  Finally, this requirement forces the redefinition
of XML to allow for only NFC text.
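
The windowing problem can be shown with a minimal Python sketch.  The
sample text, the split position and the helper name is_nfc are
assumptions chosen for illustration; the check itself simply compares
the text with its NFC form using the standard unicodedata module.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Naive check: the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# "cafe" followed by COMBINING ACUTE ACCENT (U+0301), then more text.
stream = "cafe\u0301 au lait"

# Hypothetical buffer boundary falling between "e" and the accent.
window1, window2 = stream[:4], stream[4:]

print(is_nfc(window1), is_nfc(window2))  # True True: each window passes
print(is_nfc(window1 + window2))         # False: the stream as a whole fails
```

A processor that validates each window independently would accept this
stream, so a conforming checker has to retain the tail of the previous
window until the start of the next one has been examined.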

RECOMMENDATION:
Replace the above text with text similar to:
"[S] [I] Text-processing components MAY include an option to verify that
suspect text is in normalized form.  Text-processing components MUST NOT
normalize the suspect text without specific direction."



ISSUE 6: Content Producers and Proxies
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
------------------------------------------------------------------------
----
"NOTE: As an optimization, it is perfectly acceptable for a system to
define the producer to be the actual producer (e.g. a small device)
together with a remote component (e.g. a server serving as a kind of
proxy) to which normalization is delegated. In such a case, the
communications channel between the device and proxy server is considered
to be internal to the system, not part of the Web.  Only data normalized
by the proxy server is to be exposed to the Web at large, as shown in
the illustration below:"
------------------------------------------------------------------------
----
CONCERN:
Although this note seems to allow closed systems to define their own
boundaries, these systems will still be prevented from leveraging
technologies based on CharMod, without first normalizing.

RECOMMENDATION:
If the CharMod spec continues to mandate that all text processors must
check normalization, this note should point out that such processors
could not be used between devices and their proxies.



ISSUE 7: Web Repositories 
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
------------------------------------------------------------------------
----
"A similar case would be that of a Web repository receiving content from
a user and noticing that the content is not properly normalized. If the
user so requests, it would certainly be proper for the repository to
normalize the content on behalf of the user, the repository becoming
effectively part of the producer for the duration of that operation."
------------------------------------------------------------------------
----
CONCERN:
As noted in Issue 6, "Content Producers and Proxies", this scenario is not
possible if XML is to be used between the user and the Web repository.
This scenario also seems to imply that users may very liberally
interpret the boundaries of content production. Could one also claim
that the transfer from one repository to another repository was also
contained within the realm of content production? What if the second
repository was the final destination for the content; does this mean the
content never in fact needed to be normalized? 

This scenario seems to encourage users to get around normalization
requirements where impractical or inappropriate, yet leaves them with a
confusing message and no legal tools to work with (since tools such as
XML parsers will only accept normalized text anyway).

RECOMMENDATION:
Delete this paragraph.



ISSUE 8: IRIs
IMPACT: Substantive
SECTION: 8 Character Encoding in URI References
(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-URIs)
------------------------------------------------------------------------
----
"[S] W3C specifications that define protocol or format elements (e.g.
HTTP headers, XML attributes, etc.) which are to be interpreted as URI
references (or specific subsets of URI references, such as absolute URI
references, URIs, etc.) SHOULD use Internationalized Resource
Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof)."
------------------------------------------------------------------------
----
CONCERN:
Although other W3C specifications support goals similar to those of the
IRI proposal, we hesitate to endorse this section of the CharMod spec
until the IRI draft has undergone further review.  

RECOMMENDATION:
Considering the W3C practice of keeping the maturity level of a technical
report within one level of any technical report on which it depends, the
Character Model should not be considered for Recommendation until the
IRI proposal has reached RFC status.  Although the precedent has
typically referred to W3C dependencies, it seems reasonable that any
dependency on a spec outside the W3C should be judged by criteria at
least as strong as those imposed on the W3C.
Received on Wednesday, 5 June 2002 21:57:06 UTC