Re: BOCU-1, SCSU, etc. from Henri Sivonen on 2008-01-26 (public-html-comments@w3.org from January 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sat, 26 Jan 2008 20:03:00 +0200
To: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: <public-html-comments@w3.org>
Message-Id: <416E8736-CAAE-4771-B208-E797CA407335@iki.fi>
On Jan 26, 2008, at 17:59, Frank Ellermann wrote:

> Henri Sivonen wrote:
>
>> Disclaimer:  This email does not cite a WG decision and is not an
>> official WG response.  Moreover, I'm not an editor of the spec.
>
> Okay, in the IETF anybody claiming to speak for a WG who is neither
> a (co-) Chair nor the responsible area director would be shot... :-)
>
> Minus the "minor" points that I have no idea who the Chair here is,
> or how different W3C rules actually are, I'm used to the idea that
> folks only speak for themselves, or clearly indicate the rare cases
> when that's not the case.

The chairs are Chris Wilson and Dan Connolly. Dan Connolly
specifically instructed people who reply to emails on
public-html-comments to include a disclaimer.

The disclaimer still applies.

>> in order to support existing content, new browsers will in
>> practice have to support more than just UTF-8 and Windows-1252.
>> I believe the list of encodings that are needed for existing
>> content is pretty close to the contents of the encoding menu at
>> http://validator.nu/
>
> Maybe, but you are free to add more if and when that makes sense
> from your POV for some users of your validator.

The main concern of the spec is what kind of encoding support in
browsers is necessary and good for the Web. The set of encodings that
makes sense as supported encodings for a validator is a subset (not
necessarily a proper subset) of the set of encodings that make sense for
browsers.

> I recall about three mails in five years on the W3C validator list
> where folks wanted something in the direction of 437/850/858.  In
> theory I liked the idea (as user of a 858-box), but IIRC only Lynx
> got it right, and Lynx also supported UTF-8, and so I considered
> it as a (mildly) pointless proposal - forcing windows-1252 allows
> to validate such pages if they really exist.

I'd omit "(mildly)".

> I'm not going to create BOCU-1 XHTML pages, and I won't touch SCSU
> without a long UTN #14 stick, but other folks have other needs.

Responding to their alleged needs is not their private matter,
however, since what browsers support affects other people, too.

> It's at least possible that new "legacy" charsets for scripts not  
> covered by existing non-Unicode charsets are developed.

It is possible, but I think that developing such encodings is the  
wrong thing to do. UTF-8 can express all Unicode characters, so new  
encodings will be incompatible with existing software with no  
improvements in Unicode expressiveness.
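
To illustrate the point about expressiveness, here is a minimal Python
sketch (my own illustration, not anything from the spec). It checks
that code points at the upper boundary of each UTF-8 sequence length,
up to the Unicode maximum U+10FFFF, round-trip through both UTF-8 and
UTF-32. The names `samples`, `utf8_lengths` and `round_trips` are
hypothetical helpers made up for this example.

```python
# Boundary code points: the last value that fits in 1, 2 and 3 UTF-8
# bytes, plus the highest Unicode code point (which needs 4 bytes).
samples = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]


def utf8_lengths(code_points):
    """Return the UTF-8 encoded length in bytes of each code point."""
    return [len(chr(cp).encode("utf-8")) for cp in code_points]


def round_trips(cp):
    """True if the code point survives UTF-8 and UTF-32 encode/decode."""
    ch = chr(cp)
    via_utf8 = ch.encode("utf-8").decode("utf-8")
    via_utf32 = ch.encode("utf-32-be").decode("utf-32-be")
    return via_utf8 == ch and via_utf32 == ch


if __name__ == "__main__":
    print(utf8_lengths(samples))
    print(all(round_trips(cp) for cp in samples))
```

Both encodings cover the same range, so a new encoding cannot add
anything on that axis.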

> For my own purposes, with a legacy text editor supporting any SBCS
> I care about, I've found a Latin-1 friendly "UTF-4" (*1).  After
> that I was sure that UTF-8 will be the only charset at some point
> in time, and that "UTF-4" XHTML pages would be complete nonsense,
> but not "harmful", about the same level as UTF-32.

Supporting complete nonsense is harmful, because it takes developer
time that could have been spent on something else.

>> However, encoding proliferation is a  problem.
>
> The prediction in RFC 2277 that we can't expect legacy charsets to
> go away in less than 50 years is IMO decent.  That means still 40
> years to go (minimum) from today.  I bet on "Harald got it right".

Supporting character encodings used in actual legacy content is very  
different from adding new encodings that are not part of the legacy.

>> supporting other encodings just wastes developer time, we should
>> endeavor to put a stop to encoding proliferation and say a firm
>> MUST NOT to encodings that are pure proliferation and not needed
>> for existing content.
>
> Sorry, "wasting developer time" does not justify a MUST NOT, for a
> MUST NOT it has to break, burn, and crash.  Developers are *free*
> to waste their time (or not) as it pleases them.

Developers are free to waste their time on encodings when they do  
things in the RAM space of their own applications. Communications on  
the public Web affect other people, so developers who implement  
pointless stuff waste the time of other developers as well when they  
need to interoperate with the pointlessness.

>> UTF-32 wastes developer time. It has a non-zero opportunity cost.
>> Wasting developer time is not harmless.
>
> I didn't propose to add a MUST for UTF-32, I asked what the idea
> of the SHOULD NOT in the draft is, because I fear it can be wrong:

I think the fear is unjustified. It is clear that UTF-8 and UTF-32 can  
express the same value range but UTF-8 always makes more sense as a  
delivery encoding over the network.

>>> Depending on what happens in future Unicode versions banning
>>> UTF-32 could backfire.
>
>> If Unicode ever reverses the promise not to expand outside UTF-16,
>> UTF-8 can expand to six-byte sequences.
>
> Nobody can change STD 63 if the IETF does not want this.  I am not
> worried that aliens could visit us for the sole purpose of needing
> more than 16 planes in Unicode for their alien glyphs... ;-)

And UCS-2 was never supposed to turn into UTF-16. ;-)
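
For reference, the six-byte form mentioned above is the original UTF-8
layout from RFC 2279, which covered 31-bit values. A rough Python
sketch of that bit layout follows. The function name
`encode_extended_utf8` is hypothetical, and the sketch deliberately
does not reject surrogates or values above U+10FFFF, so it is not a
conforming modern UTF-8 encoder.

```python
def encode_extended_utf8(cp):
    """Encode an integer using the RFC 2279 style 1 to 6 byte layout.

    Illustration only: no surrogate check and no U+10FFFF ceiling.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, total bytes, lead byte prefix)
    layouts = [
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ]
    for limit, length, prefix in layouts:
        if cp < limit:
            tail = []
            # Peel off six bits at a time for the continuation bytes.
            for _ in range(length - 1):
                tail.append(0x80 | (cp & 0x3F))
                cp >>= 6
            # The remaining high bits go into the lead byte.
            return bytes([prefix | cp] + tail[::-1])
    raise ValueError("value does not fit in 31 bits")
```

Within the current Unicode range this produces the same bytes as an
ordinary UTF-8 encoder; the five- and six-byte forms only come into
play for values that Unicode does not assign today.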

> But for some scripts and applications UTF-32 could be more straight
> forward than UTF-16.

In some cases UTF-32 might be preferable in RAM. UTF-32 is never  
preferable as an encoding for transferring over the network. HTML5  
encoded as UTF-8 is *always* more compact than the same document  
encoded as UTF-16 or UTF-32 regardless of the script of the content.  
Converting from UTF-32 to UTF-8 for IO is so straightforward that any
perceived straightforwardness benefit of UTF-32 is moot.
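
As a rough illustration of both points, here is a small Python sketch.
The sample document is made up for this example (ASCII markup with a
few Japanese characters as content), and `encoded_sizes` and
`utf32_to_utf8` are hypothetical helper names. It measures one sample
only; it is not a proof of the general claim.

```python
# Hypothetical sample: ASCII markup around a little non-Latin content.
# The escapes spell a short Japanese title and greeting.
SAMPLE = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>\u30c6\u30b9\u30c8</title></head>"
    "<body><p>\u3053\u3093\u306b\u3061\u306f\u4e16\u754c</p></body></html>"
)


def encoded_sizes(text):
    """Return byte counts for UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf32_to_utf8(data):
    """Transcode little-endian UTF-32 bytes to UTF-8 for IO."""
    return data.decode("utf-32-le").encode("utf-8")


if __name__ == "__main__":
    print(encoded_sizes(SAMPLE))
    in_ram = SAMPLE.encode("utf-32-le")  # stand-in for an in-memory buffer
    print(utf32_to_utf8(in_ram) == SAMPLE.encode("utf-8"))
```

The markup is ASCII, so it costs one byte per character in UTF-8 but
two in UTF-16 and four in UTF-32. An application that prefers UTF-32
in RAM can still do the conversion at the IO boundary in one step.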

> Nothing is technically wrong with UTF-32, in
> comparison UTF-16 (or similar "UTF-4") is only a hack - at the end
> of the day UTF-8 is *THE* one and only charset.  Nevertheless it is
> trivial to implement UTF-32, "wasting developer time" doesn't enter
> the picture,

As a developer who has faced the issue of UTF-32, I assure you that  
the time taken by UTF-32 is non-zero but the benefit is  
indistinguishable from zero for practical purposes.

Also, the spec requirements about UTF-32 took their current form in  
response to a real developer request:
http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-May/011310.html

> because you have the LE vs. BE logic anyway for UTF-16.

Sending UTF-16 over the network is a bad idea, too, but it is part of  
the legacy.

>>> Please make sure that all *unregistered* charsets are SHOULD NOT.
>>> Yes, I know the consequences for some proprietary charsets, they
>>> are free to register them or to be ignored (CHARMOD C022).
>
>> FWIW, Validator.nu already treats those as SHOULD NOT for HTML5 by
>> assuming Charmod is normative where not specifically overridden by
>> HTML 5.
>
> Good.  Simple normative references instead of reinventing the wheel
> would be nice, but admittedly I didn't manage that for the "unicode-
> escapes" RFC, and maybe the HTML5 folks had similar difficulties (?)


Normative references aren't done yet, because the referents are moving  
targets.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Saturday, 26 January 2008 18:03:21 UTC