- From: Markus Scherer <duerst@w3.org>
- Date: Fri, 27 Feb 2004 18:20:40 -0500
- To: www-i18n-comments@w3.org
This is a last call comment from Markus Scherer (markus.scherer@jtcsv.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).
Semi-structured version of the comment:
Submitted by: Markus Scherer (markus.scherer@jtcsv.com)
Submitted on behalf of (maybe empty):
Comment type: substantive
Chapter/section the comment applies to: Overall
The comment will be visible to: public
Comment title: charmod vs. UTF-16/32
Comment:
Comments on charmod:
- The names UTF-16 and UTF-32 are each used for an encoding form and an
encoding scheme. charmod should mention this, and mention that the encoding
scheme versions use Byte Order Marks (BOMs) while the encoding forms don't.
- It should be explicitly permissible to recognize that a document uses the
UTF-16 encoding scheme by its BOM, if it is present. This is common
practice for HTML and XML and has proven valuable because these encoding
schemes are not compatible with ASCII byte streams.
- There are BOM-like signature byte sequences for other Unicode encodings
as well, such as UTF-32 and SCSU. Justification as before; UTF-8 is not
always the most desirable encoding.
- charmod C051/C052 prefers code point indexing (called "character string
indexing"). This will lead to inefficiencies because most implementations
will use UTF-16 strings. It would be better to recommend UTF-16 code unit
indexing. (See UTN #12 http://www.unicode.org/notes/tn12/)
Best regards,
markus
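
The two technical points above (recognizing an encoding scheme by its signature, and code point versus UTF-16 code unit indexing) can be illustrated with a minimal Python sketch. It is not part of the original comment, and the helper names `sniff_signature`, `utf16_units`, and `code_point_to_unit_index` are hypothetical. The BOM byte values come from Python's standard `codecs` module; the SCSU signature `0E FE FF` is assumed here as the commonly recommended one. Note that `FF FE 00 00` is inherently ambiguous (UTF-32LE, or a UTF-16LE BOM followed by U+0000); this sketch simply prefers the longer match.

```python
import codecs

# Signatures are checked longest-first so that the UTF-32LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),  # assumed SCSU signature, see lead-in
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_signature(data):
    """Return (encoding_name, signature_length), or (None, 0) if no signature."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")  # explicit byte order, so no BOM is emitted
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def code_point_to_unit_index(text, cp_index):
    """Convert a code point index into the equivalent UTF-16 code unit index.

    Each supplementary code point (above U+FFFF) occupies two code units,
    so the conversion has to scan the string from its start.
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:cp_index])


sample = "a\U00010000b"  # 3 code points, 4 UTF-16 code units
unit_index_of_b = code_point_to_unit_index(sample, 2)
```

The linear scan in `code_point_to_unit_index` is the kind of cost the last bullet refers to: when strings are stored as UTF-16, an index expressed in code points has to be translated on each access, whereas a code unit index can be used directly.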
Structured version of the comment:
<lc-comment
visibility="public" status="pending"
decision="pending" impact="substantive" id="LC-">
<originator email="markus.scherer@jtcsv.com"
>Markus Scherer</originator>
<represents email=""
>-</represents>
<charmod-section href='http://www.w3.org/TR/2004/WD-charmod-20040225/'
>Overall</charmod-section>
<title>charmod vs. UTF-16/32</title>
<description>
<comment>
<dated-link date="2004-02-27"
href="http://www.w3.org/mid/791355868.20040227220245@toro.w3.mag.keio.ac.jp"
>charmod vs. UTF-16/32</dated-link>
<para>Comments on charmod:
- The names UTF-16 and UTF-32 are each used for an encoding form and an
encoding scheme. charmod should mention this, and mention that the encoding
scheme versions use Byte Order Marks (BOMs) while the encoding forms
don't.
- It should be explicitly permissible to recognize that a document uses the
UTF-16 encoding scheme by its BOM, if it is present. This is common
practice for HTML and XML and has proven valuable because these encoding
schemes are not compatible with ASCII byte streams.
- There are BOM-like signature byte sequences for other Unicode encodings
as well, such as UTF-32 and SCSU. Justification as before; UTF-8 is not
always the most desirable encoding.
- charmod C051/C052 prefers code point indexing (called "character
string indexing"). This will lead to inefficiencies because most
implementations will use UTF-16 strings. It would be better to recommend
UTF-16 code unit indexing. (See UTN #12 http://www.unicode.org/notes/tn12/)
Best regards,
markus
</para>
</comment>
</description>
</lc-comment>
Received on Saturday, 28 February 2004 06:07:16 UTC