Text-Processors MUST Perform Normalization Checking from Cliff Schmidt on 2002-06-06 (www-i18n-comments@w3.org from June 2002)

From: Cliff Schmidt <cschmidt@microsoft.com>
Date: Fri, 7 Jun 2002 01:31 +0900
To: www-i18n-comments@w3.org
Cc: cschmidt@microsoft.com (Cliff Schmidt)
Message-Id: <20020606163111.05BF91403@toro.w3.mag.keio.ac.jp>
This is a last call comment from Cliff Schmidt (cschmidt@microsoft.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Cliff Schmidt (cschmidt@microsoft.com)
Submitted on behalf of (maybe empty): Microsoft
Comment type: substantive
Chapter/section the comment applies to: 4.4 Responsibility for Normalization
The comment will be visible to: public
Comment title: Text-Processors MUST Perform Normalization Checking
Comment:
--------------------------------------------------------------------------
"[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text.  Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."
--------------------------------------------------------------------------
CONCERN:
This requirement will force technologies such as XML parsers to be tied to the latest list of NFC disallowed diacritic characters in order to check normalization.  Additionally, in some cases ("MAYBE" cases) NFC checks require text processors to scan backwards through a text stream in order to confirm normalization status.  This will require major architectural changes for any processors designed to break a text stream into separate smaller windows for efficient processing, because no previously processed buffer can be thrown away until it is no longer needed to confirm the validity of any diacritic code points at the start of the next buffer.  Text processors that expand character entities today at least have the ability to note the ‘&’ flag.  It is also worth noting that optimizers of normalization checks will observe that all code points < 0x341 are always allowable.  This would result in non-English-based texts being disproportionately impacted by normalization checks.  Finally, this requirement forces the redefinition of XML to allow for only NFC text.
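
The buffer-boundary concern above can be illustrated with a minimal sketch. It assumes Python 3.8 or later, whose standard unicodedata module provides is_normalized; the helper name chunks_look_normalized and the window size of 4 are hypothetical choices made only for this illustration. It is not taken from any particular parser.

```python
import unicodedata


def chunks_look_normalized(text, size):
    """Naive check: test each fixed-size window independently of its neighbours."""
    return all(
        unicodedata.is_normalized("NFC", text[i:i + size])
        for i in range(0, len(text), size)
    )


# "e" followed by COMBINING ACUTE ACCENT (U+0301).
# NFC would compose this pair into the single code point U+00E9.
text = "cafe\u0301"

# Checking the whole string detects that it is not in NFC.
whole_ok = unicodedata.is_normalized("NFC", text)

# With a window of 4 the split falls between "e" and U+0301, so the
# windows are "cafe" and "\u0301"; each one is NFC when viewed alone.
windowed_ok = chunks_look_normalized(text, 4)

print(whole_ok, windowed_ok)
```

Each window passes on its own while the concatenated text does not, so a processor working in windows would have to retain the tail of the previous buffer until the start of the next one has been examined.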

RECOMMENDATION:
Replace the above text with text similar to:
"[S] [I] Text-processing components MAY include an option to verify that suspect text is in normalized form.  Text-processing components MUST NOT normalize the suspect text without specific direction."



Structured version of the comment:

<lc-comment
  visibility="public" status="pending"
  decision="pending" impact="substantive">
  <originator email="cschmidt@microsoft.com" represents="Microsoft"
      >Cliff Schmidt</originator>
  <charmod-section href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication'
    >4.4</charmod-section>
  <title>Text-Processors MUST Perform Normalization Checking</title>
  <description>
    <comment>
      <dated-link date="2002-06-06"
        >Text-Processors MUST Perform Normalization Checking</dated-link>
      <para>--------------------------------------------------------------------------
"[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text.  Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."
--------------------------------------------------------------------------
CONCERN:
This requirement will force technologies such as XML parsers to be tied to the latest list of NFC disallowed diacritic characters in order to check normalization.  Additionally, in some cases ("MAYBE" cases) NFC checks require text processors to scan backwards through a text stream in order to confirm normalization status.  This will require major architectural changes for any processors designed to break a text stream into separate smaller windows for efficient processing, because no previously processed buffer can be thrown away until it is no longer needed to confirm the validity of any diacritic code points at the start of the next buffer.  Text processors that expand character entities today at least have the ability to note the ‘&amp;’ flag.  It is also worth noting that optimizers of normalization checks will observe that all code points &lt; 0x341 are always allowable.  This would result in non-English-based texts being disproportionately impacted by normalization checks.  Finally, this requirement forces the redefinition of XML to allow for only NFC text.

RECOMMENDATION:
Replace the above text with text similar to:
"[S] [I] Text-processing components MAY include an option to verify that suspect text is in normalized form.  Text-processing components MUST NOT normalize the suspect text without specific direction."
</para>
    </comment>
  </description>
</lc-comment>
Received on Thursday, 6 June 2002 12:31:13 UTC