charmod vs. UTF-16/32 from Markus Scherer on 2004-02-27 (www-i18n-comments@w3.org from February 2004)

From: Markus Scherer <duerst@w3.org>
Date: Fri, 27 Feb 2004 18:20:40 -0500
To: www-i18n-comments@w3.org
Message-Id: <4.2.0.58.J.20040227182022.054ed248@localhost>
This is a last call comment from Markus Scherer (markus.scherer@jtcsv.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Markus Scherer (markus.scherer@jtcsv.com)
Submitted on behalf of (maybe empty):
Comment type: substantive
Chapter/section the comment applies to: Overall
The comment will be visible to: public
Comment title: charmod vs. UTF-16/32
Comment:
Comments on charmod:

- The names UTF-16 and UTF-32 are each used for an encoding form and an 
encoding scheme. charmod should mention this, and mention that the encoding 
scheme versions use Byte Order Marks (BOMs) while the encoding forms don't.

- It should be explicitly permissible to recognize that a document uses the 
UTF-16 encoding scheme by its BOM, if it is present. This is common 
practice for HTML and XML and has proven valuable because these encoding 
schemes are not compatible with ASCII byte streams.

- There are BOM-like signature byte sequences for other Unicode encodings 
as well, such as UTF-32 and SCSU. Justification as before; UTF-8 is not 
always the most desirable encoding.

- charmod C051/C052 prefers code point indexing (called "character string 
indexing"). This will lead to inefficiencies because most implementations 
will use UTF-16 strings. It would be better to recommend UTF-16 code unit 
indexing. (See UTN #12 http://www.unicode.org/notes/tn12/)
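
To make the BOM and indexing points above concrete, here is a minimal sketch in Python. It is an illustration only, not text proposed for charmod: the names SIGNATURES, sniff_signature and utf16_code_units are hypothetical, and the returned labels are informal descriptions rather than registered charset names. The signature bytes are the standard ones (0E FE FF for SCSU).

```python
import codecs

# Signature byte sequences, longest first, so that the UTF-32 little-endian
# signature (FF FE 00 00) is not mistaken for the UTF-16 one (FF FE).
# Caveat: FF FE 00 00 could also be UTF-16LE text starting with U+0000;
# this sketch assumes that case does not occur in practice.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),   # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (b"\x0e\xfe\xff", "SCSU"),           # 0E FE FF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
]


def sniff_signature(data: bytes):
    """Return (label, signature length), or (None, 0) if no signature is found."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


def utf16_code_units(text: str) -> int:
    """Number of UTF-16 code units needed to represent text."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # "<" encoded as UTF-16 little-endian, preceded by a BOM.
    print(sniff_signature(b"\xff\xfe<\x00"))
    # No signature: the caller has to fall back to other information.
    print(sniff_signature(b"<?xml"))

    # One supplementary character (U+10400) between two ASCII letters:
    # code point indexing and UTF-16 code unit indexing disagree from here on.
    sample = "a\U00010400b"
    print(len(sample), utf16_code_units(sample))
```

Python strings are indexed by code point, so len(sample) counts U+10400 once, while a UTF-16 string stores it as a surrogate pair. Code point indexing over UTF-16 storage therefore needs a scan or a side table, which is the inefficiency the last item refers to.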

Best regards,
markus



Structured version of the comment:

<lc-comment
   visibility="public" status="pending"
   decision="pending" impact="substantive" id="LC-">
   <originator email="markus.scherer@jtcsv.com"
       >Markus Scherer</originator>
   <represents email=""
       >-</represents>
   <charmod-section href='http://www.w3.org/TR/2004/WD-charmod-20040225/'
     >Overall</charmod-section>
   <title>charmod vs. UTF-16/32</title>
   <description>
     <comment>
       <dated-link date="2004-02-27"
          href="http://www.w3.org/mid/791355868.20040227220245@toro.w3.mag.keio.ac.jp"
         >charmod vs. UTF-16/32</dated-link>
       <para>Comments on charmod:

- The names UTF-16 and UTF-32 are each used for an encoding form and an 
encoding scheme. charmod should mention this, and mention that the encoding 
scheme versions use Byte Order Marks (BOMs) while the encoding forms 
don&#x27;t.

- It should be explicitly permissible to recognize that a document uses the 
UTF-16 encoding scheme by its BOM, if it is present. This is common 
practice for HTML and XML and has proven valuable because these encoding 
schemes are not compatible with ASCII byte streams.

- There are BOM-like signature byte sequences for other Unicode encodings 
as well, such as UTF-32 and SCSU. Justification as before; UTF-8 is not 
always the most desirable encoding.

- charmod C051/C052 prefers code point indexing (called &#x22;character 
string indexing&#x22;). This will lead to inefficiencies because most 
implementations will use UTF-16 strings. It would be better to recommend 
UTF-16 code unit indexing. (See UTN #12 http://www.unicode.org/notes/tn12/)

Best regards,
markus
</para>
     </comment>
   </description>
</lc-comment>
Received on Saturday, 28 February 2004 06:07:16 UTC