Re: BOCU-1, SCSU, etc. from Frank Ellermann on 2008-01-26 (public-html-comments@w3.org from January 2008)

From: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Date: Sat, 26 Jan 2008 22:14:19 +0100
To: "Henri Sivonen" <hsivonen@iki.fi>
Cc: <public-html-comments@w3.org>
Message-ID: <010501c86060$6c76a1a0$1ea0b43e@xyzzy>

Henri Sivonen wrote:

 [disclaimer]
> The chairs are Chris Wilson and Dan Connolly. Dan Connolly  
> specifically instructed people who reply to emails on 
> public-html-comments to include a disclaimer.

Okay, W3C magic, I'm not going to check what it is good for.
In the IETF they sometimes "hum" in a desperate attempt to
avoid anything that could be misinterpreted as "voting".

> The main concern of the spec is what kind of encoding
> support in browsers is necessary and good for the Web.

Necessary is MUST, good is SHOULD.  We are mainly talking
about SHOULD NOT at the opposite end of the spectrum.  

>> I considered it as a (mildly) pointless proposal
[...]
> I'd omit "(mildly)".

Forcing "windows-1252" for something else is only a crude
workaround, the effects for "show source" (or for showing
errors) must be odd.  Validators and (X)HTML aside, I'd
expect that browsers can display text/plain "850" (858).

<a type="text/plain" charset="PC-Multilingual-850+euro"
 etc. is about as much as I can do in this case (as user,
as server admin I would do more), with an "850" icon for
human visitors explaining what "850" is.

IOW I have a bunch of "850" text/plain files, and I have
links to them in XHTML files.

> UTF-8 can express all Unicode characters

Nobody wants or understands *all* Unicode characters, it's
a cute reference charset.  When we are down to send short
messages from a mobile device with 1 MB RAM at a cost of
10 cents per 50 octets UTF-8 is not necessarily the first
choice for a hypothetical poor Shavian community.  

That is admittedly beside the point for HTML5, but your
statement was apparently general, not limited to HTML5.
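
To put rough numbers on the Shavian case, here is a minimal
sketch, assuming Python 3 and an arbitrary three-letter sample.
The helper name `encoded_sizes` is made up for this illustration.

```python
# Three Shavian letters (U+10450..U+10452), picked arbitrarily.
sample = "\U00010450\U00010451\U00010452"


def encoded_sizes(text):
    """Return the octet count of `text` in each encoding form (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


sizes = encoded_sizes(sample)
print(sizes)  # {'utf-8': 12, 'utf-16': 12, 'utf-32': 12}
```

Shavian sits outside the BMP (U+10450 to U+1047F), so all three
forms need four octets per letter.  A dedicated 8-bit charset,
if one existed, would need one octet per letter.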

> Communications on the public Web affect other people,
> so developers who implement pointless stuff waste the
> time of other developers as well when they need to
> interoperate with the pointlessness.

What's a waste of time for you might be a feature for 
others, and vice versa.  E.g. I considered all HTML 4
tags which did not work with older browsers as pointless,
harmful, and a waste of time (two extreme examples:
thead + ins were okay, tfoot + del were ugly).
 
> And UCS2 was never supposed to turn into UTF-16. ;-)

Okay, some folks don't believe in "ought to be enough
for everybody".  ("UTF-4" could do 15*4=60 bits, don't
try that with UTF-8, 16, 32 :-)

> In some cases UTF-32 might be preferable in RAM. UTF-32
> is never preferable as an encoding for transferring
> over the network.

I would not dare say never (for this point).

> HTML5 encoded as UTF-8 is *always* more compact than
> the same document encoded as UTF-16 or UTF-32 regardless
> of the script of the content.

In a mathematical sense we could force "more compact" as
near to zero as we want if we agree on some use for the
code points where UTF-8 needs four octets.  And as a human
user used to hex. (but not modulo 64) I'd prefer the pure
UTF-32 for code points that anyway make no sense for me.

It's easy to determine the number of UTF-32 code points
in a given string, "compact" is not always everything.
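
A minimal sketch of that difference, assuming Python 3,
well-formed input, and no BOM.  The function names are made up
for this illustration.

```python
def count_utf32(data: bytes) -> int:
    """Code points in UTF-32: a fixed four octets each, so just divide."""
    if len(data) % 4:
        raise ValueError("truncated UTF-32 data")
    return len(data) // 4


def count_utf8(data: bytes) -> int:
    """Code points in UTF-8: count octets that are not continuation
    octets (10xxxxxx), which means looking at every octet."""
    return sum(1 for octet in data if octet & 0xC0 != 0x80)


# One character each from the 1-, 2-, 3-, and 4-octet UTF-8 ranges.
text = "a\u00e4\u20ac\U00010450"
print(count_utf32(text.encode("utf-32-be")))  # 4
print(count_utf8(text.encode("utf-8")))       # 4
```

The UTF-32 count is a single division, while the UTF-8 count is
a scan over the whole string; the UTF-8 form is still shorter
here (10 octets against 16).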

JFTR, we completely agree that it is usually a bad idea,
a SHOULD NOT for XHTML producers as in HTML5 3.7.5.4 is
okay, although I fail to see why that's limited to HTML5;
this could be more general advice, also for XHTML 1 etc.

  Digression:  I don't believe for a second that HTML5
  can "dictate" what other XHTML versions or XML do.

If 3.7.5.4 is a "SHOULD NOT generate", then 8.2.2.2 is
apparently a "SHOULD NOT accept", and that's IMO wrong.

> the spec requirements about UTF-32 took their current
> form in response to a real developer request:
> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-May/011310.html

Confusing a UTF-32 with a UTF-16 BOM is certainly bad.
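
The mix-up is easy to produce, because the UTF-32LE BOM
(FF FE 00 00) starts with the UTF-16LE BOM (FF FE).  A minimal
sketch of a sniffer that avoids it, assuming Python 3; the name
`sniff_bom` is made up here, and this is not the algorithm of
any particular spec or browser.

```python
import codecs

# Longest signatures first: testing for the UTF-16LE BOM before the
# UTF-32LE BOM would misreport UTF-32LE input as UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF32_LE + "x".encode("utf-32-le")))  # utf-32-le
print(sniff_bom(codecs.BOM_UTF16_LE + "x".encode("utf-16-le")))  # utf-16-le
```

One ambiguity remains: FF FE 00 00 is also a valid UTF-16LE BOM
followed by U+0000, so the sniffer has to pick one reading.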

But I'm no fan of a BOM outside of text/plain; UTF-32BE
and UTF-32LE have no BOM.  If the HTML5 WG is unable to
fix the broken table in 4.9.1 (3) it should be disbanded,
in the spirit of "don't waste the time of developers".

 Frank
Received on Saturday, 26 January 2008 21:14:15 UTC