Re: BOCU-1, SCSU, etc.

Disclaimer:
This email does not cite a WG decision and is not an official WG  
response. Moreover, I'm not an editor of the spec.

On Jan 25, 2008, at 16:35, Frank Ellermann wrote:

> What you need as a "minimum" for new browsers is UTF-8, US-ASCII
> (as popular proper subset of UTF-8), ISO-8859-1 (as HTML legacy),
> and windows-1252 for the reasons stated in the draft, supporting
> Latin-1 but not windows-1252 would be stupid.

Actually, in order to support existing content, new browsers will in
practice have to support more than just UTF-8 and Windows-1252. I
believe the encoding menu at http://validator.nu/ comes pretty close
to the list of encodings that are needed for existing content.

> BTW, I'm not aware that windows-1252 is a violation of CHARMOD,
> I asked a question about it and C049 in a Last Call of CHARMOD.

This has been mentioned to i18n.

> Please s/but may support more/but should support more/ - the
> minimum is only that, the minimum.

It is clear that browsers need to support more encodings in order to  
support existing content. However, encoding proliferation is a  
problem. Given that UTF-8 can express all of Unicode and supporting  
other encodings just wastes developer time, we should endeavor to put  
a stop to encoding proliferation and say a firm MUST NOT to encodings  
that are pure proliferation and not needed for existing content. Test  
suites should not count as existing content.

> | User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
> | encodings
>
> I can see a MUST NOT for UTF-7 and CESU-8.  And IMO the only good
> excuse for legacy charsets is backwards compatibility.

Indeed.

> But that is at worst a "SHOULD NOT" for BOCU-1, as you have it for  
> UTF-32.

BOCU-1 is not supported by browsers today, but UTF-32 is. That's why
it is easier to prohibit BOCU-1 unconditionally at this point.

> I refuse to discuss SCSU, but MUST NOT is rather harsh, isn't it ?

In my opinion, the proliferation problem needs MUST NOTs. Harsh or not.

> In 3.7.5.4 you say:
>
> | Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> | based on EBCDIC.  Authors should not use UTF-32.
>
> What's the logic behind these recommendations ?  Of course EBCDIC
> is rare (as far as HTML is concerned I've never seen it), but it's
> AFAIK not worse than codepage 437, 850, 858, or similar charsets.

EBCDIC isn't needed for existing Web content, so it is easy to say
that authors shouldn't start using it now. The same goes for UTF-32.

JIS_X0212-1990 and x-JIS0208 are not rough supersets of ASCII in the  
HTML 5 sense.
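
To illustrate what "rough superset of ASCII" means here, below is a
minimal Python sketch. It is a simplified check of my own, not the
exact definition from the HTML 5 draft: the function name and the
sample byte set are assumptions chosen for illustration. It asks
whether each sampled byte, decoded on its own, yields the same
character as in US-ASCII.

```python
import string

# Assumption: a small sample of bytes that markup relies on (letters,
# digits, space and a few delimiters). The HTML 5 draft has its own,
# more precise byte list; this is only an approximation of the idea.
SAMPLE = (string.ascii_letters + string.digits + ' <>="/').encode("ascii")


def is_rough_ascii_superset(encoding: str) -> bool:
    """Return True if every sampled byte decodes, as a single-byte
    sequence, to the same character it represents in US-ASCII."""
    for byte in SAMPLE:
        try:
            decoded = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            # The byte is not a complete sequence in this encoding
            # (this happens with UTF-16 and UTF-32, for example).
            return False
        if decoded != chr(byte):
            return False
    return True
```

With Python's built-in codecs, "utf-8" and "cp1252" pass this check,
while "cp037" (an EBCDIC code page) and "utf-16-le" do not.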

> And UTF-32 is relatively harmless, not much worse than UTF-16, it
> belongs to the charsets recommended in CHARMOD.

UTF-32 wastes developer time. It has a non-zero opportunity cost.  
Wasting developer time is not harmless.

> Depending on what happens in future Unicode versions banning UTF-32  
> could backfire.

If Unicode ever reverses the promise not to expand beyond the range
that UTF-16 can address, UTF-8 can expand to sequences of up to six
bytes.
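
For the curious, here is a rough Python sketch of how the UTF-8 bit
pattern extends to six bytes, following the layout of the original
UTF-8 definition (RFC 2279), which covered values up to 0x7FFFFFFF.
The function name is made up for this example, and today's UTF-8
(RFC 3629) stops at four bytes and U+10FFFF, so the five-byte and
six-byte forms are not valid UTF-8 as currently specified.

```python
def encode_utf8_extended(cp: int) -> bytes:
    """Encode an integer using the UTF-8 bit layout extended to six
    bytes. Hypothetical helper: it does not reject surrogates or
    values above U+10FFFF, unlike a conforming UTF-8 encoder."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in six-byte UTF-8")
    if cp < 0x80:
        return bytes([cp])  # one byte, identical to US-ASCII

    # (largest value, lead-byte prefix, number of continuation bytes)
    forms = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    for limit, lead, n_cont in forms:
        if cp <= limit:
            # The lead byte carries the highest bits of the value.
            out = [lead | (cp >> (6 * n_cont))]
            # Each continuation byte is 10xxxxxx with six payload bits.
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")
```

For values up to U+10FFFF (surrogates aside) the result matches what
a regular UTF-8 encoder produces; only larger values need the longer
forms.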

> There are lots of other charsets starting with UTF-1 that could be
> listed as SHOULD NOT or even MUST NOT.  Whatever you pick, state
> what your reasons are, not only the (apparently) arbitrary result.

I agree.

> Please make sure that all *unregistered* charsets are SHOULD NOT.
> Yes, I know the consequences for some proprietary charsets, they
> are free to register them or to be ignored (CHARMOD C022).


FWIW, Validator.nu already treats those as SHOULD NOT for HTML 5 by
assuming that Charmod is normative wherever HTML 5 does not
specifically override it.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Saturday, 26 January 2008 09:27:33 UTC