- From: Cliff Schmidt <cschmidt@microsoft.com>
- Date: Wed, 5 Jun 2002 18:56:33 -0700
- To: <www-i18n-comments@w3.org>
OVERVIEW

Most of the substantive issues described below share a common concern: in the effort to improve interoperability of text exchange across open applications on the Web, the Character Model should not restrict the ability of closed systems to leverage the Web and Web-based technologies. The term "closed system", as used in this document, refers to a system designed to support organizations communicating among themselves based on a contract into which all parties have explicitly entered.

The background of this spec states that "the Web may be seen as a single, very large application...rather than as a collection of small independent applications." Based on this premise, it is understandable why the CharMod spec chooses to require early normalization in a single canonical form. However, the Web and technologies that have developed to support the Web have also provided enormous value to closed systems, including intranet and extranet scenarios.

The relationship between the evolution of the World Wide Web and its use in closed/private systems has been a mutually beneficial one. Private systems have benefited from the efficiencies of applying Web-developed standards and tools, which has in turn increased the demand and support for these Web-enabling components. The current Character Model spec threatens to break this relationship by forcing restrictions on tools that are commonly used in closed systems, in order to exclusively support the goals of the open system Web.

It is apparent that the I18N WG has solid reasons for preferring Normalization Form C as an interchange format for the Web; however, it is not likely to be the optimal choice for all applications. There are many legacy systems (both applications and operating systems) that use a decomposed character normalization.
It will be difficult for organizations to justify why they should adopt CharMod-based technologies (such as XML 1.1 over XML 1.0), which require transcoding to a less optimal normalization form with no benefit for their closed system. This is likely to lead to fractured use of technologies such as XML 1.0/1.1.

Finally, XML plays an important role as a data-interchange format in scalable, loosely coupled systems. The Character Model reduces XML to a format applicable only to natural language communication in one particular normalization form. This is unfortunate considering that vastly more bytes of machine-to-machine XML are transmitted than are people-to-people or people-machine bytes.

The restrictions mandated by the Character Model limit the use of the Web and Web-based technologies for a large base of users. While supporting the vision for the Web as a "single, very large application", the limitations on other uses of the Web do not appear to support the Character Model's goal to "facilitate the use of the Web by all people".

ISSUES LIST

ISSUE 1(a): Inconsistent/Redundant Requirements for W3C Spec Conformance
IMPACT: Editorial
SECTION: 2 Conformance (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Conformance)
----------------------------------------------------------------------------
"[S] Every W3C specification MUST:
  1. conform to the requirements applicable to specifications,
  2. specify that implementations MUST conform to the requirements applicable to software, and
  3. specify that content created according to that specification MUST conform to the requirements applicable to content.

[S] If an existing W3C specification does not conform to the requirements in this document, then the next version of that specification SHOULD be modified in order to conform."
----------------------------------------------------------------------------
CONCERN: Stating that all specs "MUST" conform, but that non-conforming specs "SHOULD" be modified appears to be inconsistent.

RECOMMENDATION: The conformance model should only apply to future specs (including future versions of current specs), instead of specifying different conformance levels for existing and future versions.

ISSUE 1(b): W3C Spec Conformance
IMPACT: Substantive
SECTION: 2 Conformance (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Conformance)
----------------------------------------------------------------------------
(same section as above)
----------------------------------------------------------------------------
CONCERN: The CharMod's requirement that all specs and related implementations conform to the entire CharMod spec will force non-NFC based applications to perform round trip transcoding to/from NFC in order to use the Web, even in closed system scenarios (e.g. extranets). This will also affect intranet scenarios as corporate systems are forced to jump through hoops in order to satisfy text processors (such as XML parsers) that are required to reject non-NFC text. The costs certainly outweigh the benefits for closed systems. However, it is clear that a recommended conformance level would improve open system interoperability.

RECOMMENDATION: Replace conformance paragraph and included list with the following sentence:

"[S] Future W3C specifications (including future versions of existing specifications) MUST reference this specification as W3C recommended guidance for interoperable Web applications."
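
The round trip described in the concern for Issue 1(b) can be sketched as follows. This is only an illustration, assuming Python's standard unicodedata module; the sample string and the variable names (stored, on_the_wire, received) are hypothetical and are not taken from CharMod or from any particular legacy system.

```python
import unicodedata

# Hypothetical closed-system record kept in decomposed form (NFD):
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
stored = "Cafe\u0301"

# Step 1: transcode to NFC so that a processor which rejects
# non-NFC text will accept the data.
on_the_wire = unicodedata.normalize("NFC", stored)

# Step 2: transcode back to the system's native decomposed form
# once the data arrives at the other end of the closed system.
received = unicodedata.normalize("NFD", on_the_wire)

# Code point counts before, during, and after the round trip.
print(len(stored), len(on_the_wire), len(received))
# The data ends up where it started; both conversions were overhead.
print(received == stored)
```

In this sketch both endpoints prefer the decomposed form, so the two conversions exist only to satisfy the processor in between.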

ISSUE 2: Full Range of Unicode Code Points Not Allowed in XML
IMPACT: Editorial
SECTION: 3.5 Reference Processing Model (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefProcModel)
----------------------------------------------------------------------------
"[S] Specifications SHOULD allow the use of the full range of Unicode code points from U+0000 to U+10FFFF inclusive; code points above U+10FFFF MUST NOT be used."
----------------------------------------------------------------------------
CONCERN: If this is truly a goal for text on the Web, users should understand why XML is unable to achieve this. Because XML is a high-profile W3C spec, readers are likely to notice the inconsistent message. Does this mean that I18N believes that XML (1.1 or some later version) should support the characters 0x0-0x1F?

RECOMMENDATION: If XML 1.1 is unable to achieve this goal, the Character Model spec should either remove this requirement or explain the discrepancy.

ISSUE 3: Definition of "Fully-Normalized"
IMPACT: Editorial
SECTION: 4.2.3 Fully-normalized text (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-FullyNormalized)
----------------------------------------------------------------------------
"Text is fully-normalized if:
  1. the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character; or
  2. the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above."
----------------------------------------------------------------------------
CONCERN: Based on previous definitions, "Unicode-normalized" may be a more precise term than "Unicode encoding form" (if the implication is that full normalization requires include-normalization, which requires Unicode normalization as defined in 4.2.1 (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UnicodeNormalized)).

RECOMMENDATION: Refer to text that is "Unicode-normalized" (possibly linked to the definition at http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UnicodeNormalized), instead of "Unicode encoding form".

ISSUE 4: Mandating NFC for All Web Content
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
----------------------------------------------------------------------------
"[C] In order to conform to this specification, all text content on the Web MUST be in include-normalized form and SHOULD be in fully-normalized form.

[S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, require that the text be in normalized form."
----------------------------------------------------------------------------
CONCERN: This restriction is currently applied to "all text content on the Web" when only content intended for interoperability in open systems will necessarily benefit from it. Although the first requirement for content producers ([C]) could be interpreted to have no impact on intranet scenarios, the second requirement for specifications ([S]) will impact the tools that intranet scenarios have been depending on.

RECOMMENDATION: Replace above text with text similar to:

"[C] In order to conform to this specification, all text content on the Web intended for consumption by foreign systems MUST be in include-normalized form and SHOULD be in fully-normalized form.

[S] Specifications of text-based formats and protocols MUST, as part of their syntax definition, reference the above requirement."

ISSUE 5: Text-Processors MUST Perform Normalization Checking
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
----------------------------------------------------------------------------
"[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."
----------------------------------------------------------------------------
CONCERN: This requirement will force technologies such as XML parsers to be tied to the latest list of NFC disallowed diacritic characters in order to check normalization.

Additionally, in some cases ("MAYBE" cases) NFC checks require text processors to scan backwards through a text stream in order to confirm normalization status. This will require major architectural changes for any processors designed to break a text stream into separate smaller windows for efficient processing, because no previously processed buffer can be thrown away until it is no longer needed to confirm the validity of any diacritic code points at the start of the next buffer. Text processors that expand character entities today at least have the ability to note the '&' flag.

It is also worth noting that optimizers of normalization checks will observe that all code points < 0x341 are always allowable. This would result in non-English based texts being disproportionately impacted by normalization checks.
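
The window-boundary problem can be illustrated with a minimal sketch, assuming Python's standard unicodedata module. The helper is_nfc and the two window strings are hypothetical; they do not represent how any XML parser is actually implemented.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


# One text stream split into two hypothetical processing windows.
# The second window begins with U+0301 COMBINING ACUTE ACCENT.
first_window = "cafe"
second_window = "\u0301 au lait"

# Checked in isolation, each window looks acceptable.
print(is_nfc(first_window))
print(is_nfc(second_window))

# Checked together, the stream is not in NFC, because the trailing
# "e" of the first window composes with the leading accent of the
# second window.
print(is_nfc(first_window + second_window))
```

A processor working window by window therefore cannot release the first window until it has looked at the start of the second, which is the architectural cost described above.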

Finally, this requirement forces the redefinition of XML to allow for only NFC text.

RECOMMENDATION: Replace the above text with text similar to:

"[S] [I] Text-processing components MAY include an option to verify that suspect text is in normalized form. Text-processing components MUST NOT normalize the suspect text without specific direction."

ISSUE 6: Content Producers and Proxies
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
----------------------------------------------------------------------------
"NOTE: As an optimization, it is perfectly acceptable for a system to define the producer to be the actual producer (e.g. a small device) together with a remote component (e.g. a server serving as a kind of proxy) to which normalization is delegated. In such a case, the communications channel between the device and proxy server is considered to be internal to the system, not part of the Web. Only data normalized by the proxy server is to be exposed to the Web at large, as shown in the illustration below:"
----------------------------------------------------------------------------
CONCERN: Although this note seems to allow closed systems to define their own boundaries, these systems will still be prevented from leveraging technologies based on CharMod, without first normalizing.

RECOMMENDATION: If the CharMod spec continues to mandate that all text processors must check normalization, this note should point out that such processors could not be used between devices and their proxies.

ISSUE 7: Web Repositories
IMPACT: Substantive
SECTION: 4.4 Responsibility for Normalization (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication)
----------------------------------------------------------------------------
"A similar case would be that of a Web repository receiving content from a user and noticing that the content is not properly normalized.
If the user so requests, it would certainly be proper for the repository to normalize the content on behalf of the user, the repository becoming effectively part of the producer for the duration of that operation."
----------------------------------------------------------------------------
CONCERN: As noted in Issue 6, "Content Producers and Proxies", this scenario is not possible if XML is to be used between the user and the Web repository.

This scenario also seems to imply that users may very liberally interpret the boundaries of content production. Could one also claim that the transfer from one repository to another repository was also contained within the realm of content production? What if the second repository was the final destination for the content; does this mean the content never in fact needed to be normalized?

This scenario seems to encourage users to get around normalization requirements where impractical or inappropriate, yet leaves them with a confusing message and no legal tools to work with (since tools such as XML parsers will only accept normalized text anyway).

RECOMMENDATION: Delete this paragraph.

ISSUE 8: IRIs
IMPACT: Substantive
SECTION: 8 Character Encoding in URI References (http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-URIs)
----------------------------------------------------------------------------
"[S] W3C specifications that define protocol or format elements (e.g. HTTP headers, XML attributes, etc.) which are to be interpreted as URI references (or specific subsets of URI references, such as absolute URI references, URIs, etc.) SHOULD use Internationalized Resource Identifiers (IRI) [I-D IRI] (or an appropriate subset thereof)."
----------------------------------------------------------------------------
CONCERN: Although other W3C specifications support goals similar to those of the IRI proposal, we hesitate to endorse this section of the CharMod spec until the IRI draft has undergone further review.

RECOMMENDATION: Considering the W3C practice of keeping the maturity level of a technical report within one level of any technical report on which it depends, the Character Model should not be considered for Recommendation until the IRI proposal has reached RFC status. Although the precedent has typically referred to W3C dependencies, it seems reasonable that any dependency on a spec outside the W3C should be judged by criteria at least as strong as those imposed on the W3C.

Received on Wednesday, 5 June 2002 21:57:06 UTC