Re: BOCU-1, SCSU, etc. from Frank Ellermann on 2008-01-26 (public-html-comments@w3.org from January 2008)

From: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Date: Sat, 26 Jan 2008 16:59:05 +0100
To: "Henri Sivonen" <hsivonen@iki.fi>
Cc: <public-html-comments@w3.org>
Message-ID: <00d101c86034$63707760$1ea0b43e@xyzzy>

Henri Sivonen wrote:

> Disclaimer:  This email does not cite a WG decision and is not an
> official WG response.  Moreover, I'm not an editor of the spec.

Okay, in the IETF anybody claiming to speak for a WG who is neither
a (co-) Chair nor the responsible area director would be shot... :-)

Minus the "minor" points that I have no idea who the Chair here is,
or how different W3C rules actually are, I'm used to the idea that
folks only speak for themselves, or clearly indicate the rare cases
when that's not the case.

> in order to support existing content, new browsers will in
> practice have to support more than just UTF-8 and Windows-1252.
> I believe the list of encodings that are needed for existing
> content is pretty close to the contents of the encoding menu at
> http://validator.nu/

Maybe, but you are free to add more if and when that makes sense
from your POV for some users of your validator.  

I recall about three mails in five years on the W3C validator list
where folks wanted something in the direction of 437/850/858.  In
theory I liked the idea (as a user of an 858 box), but IIRC only Lynx
got it right, and Lynx also supported UTF-8, so I considered it a
(mildly) pointless proposal - forcing windows-1252 allows validating
such pages if they really exist.
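
To illustrate why forcing windows-1252 is good enough for checking
markup, here is a minimal sketch.  It assumes Python's built-in
"cp858" and "cp1252" codecs; the sample page and the helper name
tags_when_forced_to_cp1252 are made up for this example.  The ASCII
range is shared, so the tags survive, while the non-ASCII text turns
into different characters.

```python
import re

def tags_when_forced_to_cp1252(raw: bytes) -> list[str]:
    """Decode raw bytes as windows-1252 and return the tags found.

    errors="replace" is used because Python's cp1252 codec leaves a
    few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined.
    """
    text = raw.decode("cp1252", errors="replace")
    # Tag names are plain ASCII, which cp858 and windows-1252 share.
    return re.findall(r"</?[a-z]+>", text)

# A hypothetical page fragment as a code page 858 editor would save it.
page = "<p>café 5€</p>".encode("cp858")

print(tags_when_forced_to_cp1252(page))
print(page.decode("cp858"))    # the text as the author wrote it
print(page.decode("cp1252"))   # same bytes, wrong characters
```

The markup can be checked this way, but the text content is mojibake,
so this only helps with validation, not with display.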

I'm not going to create BOCU-1 XHTML pages, and I won't touch SCSU
without a long UTN #14 stick, but other folks have other needs.
It's at least possible that new "legacy" charsets for scripts not
covered by existing non-Unicode charsets are developed.  

For my own purposes, with a legacy text editor supporting any SBCS
I care about, I've found a Latin-1 friendly "UTF-4" (*1).  After
that I was sure that UTF-8 will be the only charset at some point
in time, and that "UTF-4" XHTML pages would be complete nonsense,
but not "harmful", about the same level as UTF-32.

> However, encoding proliferation is a  problem.

The prediction in RFC 2277 that we can't expect legacy charsets to
go away in less than 50 years is IMO decent.  That still leaves 40
years to go (at minimum) from today.  I bet on "Harald got it right".

> supporting other encodings just wastes developer time, we should
> endeavor to put a stop to encoding proliferation and say a firm
> MUST NOT to encodings that are pure proliferation and not needed
> for existing content.

Sorry, "wasting developer time" does not justify a MUST NOT; for a
MUST NOT something has to break, burn, or crash.  Developers are
*free* to waste their time (or not) as it pleases them.  The opposite
is the problem: developers *forced* to waste time by pointless
MUSTard.

> UTF-32 wastes developer time. It has a non-zero opportunity cost.  
> Wasting developer time is not harmless.

I didn't propose to add a MUST for UTF-32, I asked what the idea
behind the SHOULD NOT in the draft is, because I fear it can be wrong:
 
>> Depending on what happens in future Unicode versions banning
>> UTF-32 could backfire.
 
> If Unicode ever reverses the promise not to expand outside UTF-16,  
> UTF-8 can expand to six-byte sequences.

Nobody can change STD 63 if the IETF does not want this.  I am not
worried that aliens could visit us for the sole purpose of needing
more than 16 planes in Unicode for their alien glyphs... ;-)
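
For reference, a minimal sketch of the sequence lengths involved.  It
assumes the four-byte limit of STD 63 (RFC 3629, code points up to
U+10FFFF) and, for comparison, the six-byte scheme of its predecessor
RFC 2279 (values up to 0x7FFFFFFF).  The function name and the
"legacy" flag are made up for this example, and surrogates are not
checked.

```python
# Upper bound of each sequence length under STD 63 (RFC 3629).
STD63_LIMITS = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3), (0x10FFFF, 4)]

# The older RFC 2279 scheme with five- and six-byte sequences.
RFC2279_LIMITS = [
    (0x7F, 1), (0x7FF, 2), (0xFFFF, 3),
    (0x1FFFFF, 4), (0x3FFFFFF, 5), (0x7FFFFFFF, 6),
]

def utf8_length(code_point: int, legacy: bool = False) -> int:
    """Return the number of bytes UTF-8 needs for code_point.

    With legacy=False the STD 63 limit applies and anything above
    U+10FFFF is rejected; with legacy=True the RFC 2279 table is used.
    """
    if code_point < 0:
        raise ValueError("negative code point")
    for upper, length in (RFC2279_LIMITS if legacy else STD63_LIMITS):
        if code_point <= upper:
            return length
    raise ValueError(f"out of range: {code_point:#x}")

print(utf8_length(0x10FFFF))                # last code point of plane 16
print(utf8_length(0x7FFFFFFF, legacy=True)) # largest RFC 2279 value
```

Under STD 63, utf8_length(0x110000) raises ValueError: there is
nothing beyond plane 16 to encode.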

But for some scripts and applications UTF-32 could be more
straightforward than UTF-16.  Nothing is technically wrong with
UTF-32; in comparison UTF-16 (or the similar "UTF-4") is only a hack,
and at the end of the day UTF-8 is *THE* one and only charset.
Nevertheless it is trivial to implement UTF-32; "wasting developer
time" doesn't enter the picture, because you need the LE vs. BE
logic for UTF-16 anyway.
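
As a rough sketch of how little is needed: one fixed-width unit per
code point plus the byte-order check.  The function name decode_utf32
and its default_big_endian parameter are made up for this example;
defaulting to big-endian when there is no BOM is an assumption taken
from the Unicode encoding scheme rules, not something the HTML5 draft
says.

```python
import struct

def decode_utf32(data: bytes, default_big_endian: bool = True) -> str:
    """Decode UTF-32 bytes, honouring a leading byte order mark."""
    if len(data) % 4:
        raise ValueError("UTF-32 input must be a multiple of 4 bytes")

    # Same LE vs. BE decision as for UTF-16, only with 4-byte units.
    if data[:4] == b"\xff\xfe\x00\x00":
        order, data = "<", data[4:]      # little-endian BOM
    elif data[:4] == b"\x00\x00\xfe\xff":
        order, data = ">", data[4:]      # big-endian BOM
    else:
        order = ">" if default_big_endian else "<"

    # Every code point is exactly one unsigned 32-bit unit.
    units = struct.unpack(f"{order}{len(data) // 4}I", data)
    for cp in units:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point: {cp:#x}")
    return "".join(map(chr, units))

sample = "héllo €𝄞"
print(decode_utf32(sample.encode("utf-32")) == sample)
```

There are no surrogate pairs and no variable-length sequences to
handle; the only checks are the range and the surrogate exclusion.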
 
>> Please make sure that all *unregistered* charsets are SHOULD NOT.
>> Yes, I know the consequences for some proprietary charsets, they
>> are free to register them or to be ignored (CHARMOD C022).
 
> FWIW, Validator.nu already treats those as SHOULD NOT for HTML5 by  
> assuming Charmod is normative where not specifically overridden by  
> HTML 5.

Good.  Simple normative references instead of reinventing the wheel
would be nice, but admittedly I didn't manage that for the "unicode-
escapes" RFC, and maybe the HTML5 folks had similar difficulties (?)

 Frank
-- 
*1: see <http://purl.net/xyzzy/home/test/utf-4.xml> and utf-8.xml
Received on Saturday, 26 January 2008 15:59:00 UTC