Reliability of character encoding identification from C. M. Sperberg-McQueen on 2002-07-12 (www-i18n-comments@w3.org from July 2002)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Sat, 13 Jul 2002 02:50 +0900
To: www-i18n-comments@w3.org
Cc: cmsmcq@acm.org (C. M. Sperberg-McQueen)
Message-Id: <20020712175049.9F34F64C@toro.w3.mag.keio.ac.jp>
This is a last call comment from C. M. Sperberg-McQueen (cmsmcq@acm.org) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: C. M. Sperberg-McQueen (cmsmcq@acm.org)
Submitted on behalf of: XML Schema WG
Comment type: substantive
Chapter/section the comment applies to: 3.6 Choice and Identification of Character Encodings
The comment will be visible to: public
Comment title: Reliability of character encoding identification
Comment:
Section 3.6 specifies that "[S] Specifications MUST either specify a 
unique encoding, or provide character encoding identification mechanisms 
such that the encoding of text can always be reliably identified."

The XML Schema WG believes that this requirement, as formulated, is not
met by any existing specifications and is unlikely ever to be met by
any.  Document producers, software implementors, and server administrators,
working alone or in concert, have innumerable opportunities to render 
character-set labels false out of malice, ignorance, or indifference;
if character-set labels are false, the encoding of the text can only
rarely be reliably identified.
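
A minimal sketch of the problem, assuming Python and a hypothetical
sample string (neither comes from the Character Model draft): text
written as UTF-8 but labelled ISO-8859-1 decodes without any error,
so the receiver gets no signal that the label is false.

```python
# Hypothetical illustration: a false label is accepted silently.
text = "café"
data = text.encode("utf-8")  # the bytes the producer actually wrote


def decode_with_label(payload: bytes, label: str) -> str:
    """Decode payload trusting the declared label, as a receiver would."""
    return payload.decode(label)


correct = decode_with_label(data, "utf-8")          # true label
mislabeled = decode_with_label(data, "iso-8859-1")  # false label, no exception

print(correct)
print(mislabeled)  # differs from the original text, yet decoding "succeeded"
```

Every byte sequence is valid ISO-8859-1 as far as Python's decoder is
concerned, which is why the mislabelled case cannot be caught by the
decoder alone.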

The word "always" seems to suggest that encoding identification mechanisms
must function even in the case of hostile users or misconfigured servers;
that's not possible.  Either the i18n WG should lower its expectations
or it should express its expectations more clearly.

We believe a more correct standard would be to require that specifications
provide mechanisms to ensure that it is POSSIBLE to get things right,
or to ensure that, with correct operation or under normal circumstances,
character encodings are reliably and correctly identified.

N.B. This comment is substantially similar to comment C157
(http://www.w3.org/International/Group/2002/charmod-lc/Overview.html#C157)
and to comment 3.13 of our comments on the previous last-call draft
(http://www.w3.org/XML/Group/2002/03/charmodel.annotated.html#ab1b3b3c17c14).



Structured version of the comment:

<lc-comment
  visibility="public" status="pending"
  decision="pending" impact="substantive">
  <originator email="cmsmcq@acm.org" represents="XML Schema WG"
      >C. M. Sperberg-McQueen</originator>
  <charmod-section href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Encodings'
    >3.6</charmod-section>
  <title>Reliability of character encoding identification</title>
  <description>
    <comment>
      <dated-link date="2002-07-12"
        >Reliability of character encoding identification</dated-link>
      <para>Section 3.6 specifies that "[S] Specifications MUST either specify a 
unique encoding, or provide character encoding identification mechanisms 
such that the encoding of text can always be reliably identified."

The XML Schema WG believes that this requirement, as formulated, is not
met by any existing specifications and is unlikely ever to be met by
any.  Document producers, software implementors, and server administrators,
working alone or in concert, have innumerable opportunities to render 
character-set labels false out of malice, ignorance, or indifference;
if character-set labels are false, the encoding of the text can only
rarely be reliably identified.

The word "always" seems to suggest that encoding identification mechanisms
must function even in the case of hostile users or misconfigured servers;
that's not possible.  Either the i18n WG should lower its expectations
or it should express its expectations more clearly.

We believe a more correct standard would be to require that specifications
provide mechanisms to ensure that it is POSSIBLE to get things right,
or to ensure that, with correct operation or under normal circumstances,
character encodings are reliably and correctly identified.

N.B. This comment is substantially similar to comment C157
(http://www.w3.org/International/Group/2002/charmod-lc/Overview.html#C157)
and to comment 3.13 of our comments on the previous last-call draft
(http://www.w3.org/XML/Group/2002/03/charmodel.annotated.html#ab1b3b3c17c14).
</para>
    </comment>
  </description>
</lc-comment>
Received on Friday, 12 July 2002 13:50:53 UTC