
Prohibition against normalizing suspect text

From: Jim Melton <jim.melton@acm.org>
Date: Fri, 31 May 2002 18:55 +0900
To: www-i18n-comments@w3.org
Cc: jim.melton@acm.org (Jim Melton)
Message-Id: <20020531095503.32D721421@toro.w3.mag.keio.ac.jp>

This is a last call comment from Jim Melton (jim.melton@acm.org) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Jim Melton (jim.melton@acm.org)
Submitted on behalf of (maybe empty): W3C XML Query Working Group
Comment type: substantive
Chapter/section the comment applies to: 4.4 Responsibility for Normalization
The comment will be visible to: public
Comment title: Prohibition against normalizing suspect text
Comment:


In Section 4.4, "Responsibility for Normalization", another requirement (on implementations and on specifications) states: "[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."

We do not object to the observation that normalization-sensitive operations are best performed on normalized text.  However, the requirement (as stated) clearly prohibits a consuming application from normalizing non-normalized text that it receives.  We are strongly opposed to this prohibition, for several reasons.

One problem is that the requirement does not make clear what the consuming application's behavior must be, although it seems reasonable to conclude that the consuming application must reject the un-normalized text.  An application that silently rejects such text is unlikely to be considered user-friendly, so we might guess that an error can be raised in some manner.  But that makes the application very unhelpful in general, since users (e.g., of web browsers) often wish to access text regardless of how rigidly it conforms to the Character Model's requirements.
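
To make the two behaviors concrete, here is a minimal Python sketch of a consuming component.  It rests on assumptions that are not in the Character Model: it uses plain Unicode Normalization Form C as a stand-in for the Character Model's own (stricter) definition of normalized text, and the names SuspectTextError, receive_strict and receive_lenient are hypothetical, chosen only for illustration.

```python
import unicodedata


class SuspectTextError(ValueError):
    """Hypothetical error for text that fails the normalization inspection."""


def is_nfc(text: str) -> bool:
    # Inspection only: compare the text with its NFC form without changing it.
    return unicodedata.normalize("NFC", text) == text


def receive_strict(text: str) -> str:
    # Our reading of the requirement: inspect, and reject rather than normalize.
    if not is_nfc(text):
        raise SuspectTextError("received text is not in NFC; refusing to process it")
    return text


def receive_lenient(text: str) -> str:
    # The behavior we would like certain classes of applications to be allowed:
    # normalize the suspect text on receipt, then carry on.
    return unicodedata.normalize("NFC", text)
```

Given "e" followed by U+0301 COMBINING ACUTE ACCENT, receive_strict raises an error, while receive_lenient returns the single precomposed character U+00E9.  The first is what the requirement appears to demand; the second is what users of most applications would presumably expect.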

The "Private agreements" clause starts off in a promising manner, but then requires that the results of such agreements remain unhelpful to users of the applications, since the application is not allowed to produce "observable results" based on handling un-normalized text.

So, what can be done to support applications (and specifications) that must deal with text that cannot always be guaranteed to be normalized?  We very much want certain classes of applications to be allowed to normalize un-normalized text, and we are willing to participate in discussions to identify those classes of applications.

A rather cynical way out of this dilemma would be for an application (e.g., a database management system) to "read" suspect text and then "create" brand new normalized text that just happens to be character-for-character identical to the un-normalized text it received.  That obviously means degenerating into games just to get around conformance requirements; instead, we must fix the specifications and requirements themselves.





Structured version of the comment:

<lc-comment
  visibility="public" status="pending"
  decision="pending" impact="substantive">
  <originator email="jim.melton@acm.org" represents="W3C XML Query Working Group"
      >Jim Melton</originator>
  <charmod-section href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication'
    >4.4</charmod-section>
  <title>Prohibition against normalizing suspect text</title>
  <description>
    <comment>
      <dated-link date="2002-05-31"
        >Prohibition against normalizing suspect text</dated-link>
      <para>

In Section 4.4, "Responsibility for Normalization", another requirement (on implementations and on specifications) states: "[S] [I] A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed."

We do not object to the observation that normalization-sensitive operations are best performed on normalized text.  However, the requirement (as stated) clearly prohibits a consuming application from normalizing non-normalized text that it receives.  We are strongly opposed to this prohibition, for several reasons.

One problem is that the requirement does not make clear what the consuming application's behavior must be, although it seems reasonable to conclude that the consuming application must reject the un-normalized text.  An application that silently rejects such text is unlikely to be considered user-friendly, so we might guess that an error can be raised in some manner.  But that makes the application very unhelpful in general, since users (e.g., of web browsers) often wish to access text regardless of how rigidly it conforms to the Character Model's requirements.

The "Private agreements" clause starts off in a promising manner, but then requires that the results of such agreements remain unhelpful to users of the applications, since the application is not allowed to produce "observable results" based on handling un-normalized text.

So, what can be done to support applications (and specifications) that must deal with text that cannot always be guaranteed to be normalized?  We very much want certain classes of applications to be allowed to normalize un-normalized text, and we are willing to participate in discussions to identify those classes of applications.

A rather cynical way out of this dilemma would be for an application (e.g., a database management system) to "read" suspect text and then "create" brand new normalized text that just happens to be character-for-character identical to the un-normalized text it received.  That obviously means degenerating into games just to get around conformance requirements; instead, we must fix the specifications and requirements themselves.


</para>
    </comment>
  </description>
</lc-comment>
Received on Friday, 31 May 2002 05:55:46 GMT
