[Bug 23646] "us-ascii" should not be an alias for "windows-1252" from bugzilla@jessica.w3.org on 2013-10-27 (www-international@w3.org from October to December 2013)

From: <bugzilla@jessica.w3.org>
Date: Sun, 27 Oct 2013 17:38:50 +0000
To: www-international@w3.org
Message-ID: <bug-23646-4285-vTGERdfBXO@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=23646

--- Comment #4 from Addison Phillips <addison@lab126.com> ---
John, I agree generally.

The problem here is that when a document declares an encoding, and/or when one
has not detected UTF-8, one can instantiate one and only one encoder/decoder to
handle the text.

Latin-1 is the more obvious one here. If you're decoding a page declared as
"iso-8859-1" and you see a byte like 0x80, you *could* treat it as a C1 control
character. But the C1 controls add no value to the page. It's very likely that
byte is actually U+20AC (EURO SIGN). In fact, browsers and major websites
already make that assumption and have done for quite a while. Hence the alias
appearing in this document.
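
To make the difference concrete, here is a minimal sketch using Python's
built-in codecs. The sample bytes are made up for illustration, and Python's
"cp1252" codec is only a stand-in for the windows-1252 decoder a browser would
use.

```python
# Hypothetical page bytes: 0x80 sits where the author meant a euro sign.
raw = b"Price: \x80 5"

as_latin1 = raw.decode("latin-1")  # true ISO 8859-1: byte 0x80 -> U+0080 (C1)
as_cp1252 = raw.decode("cp1252")   # windows-1252:   byte 0x80 -> U+20AC (euro)

print(f"U+{ord(as_latin1[7]):04X}")  # U+0080
print(f"U+{ord(as_cp1252[7]):04X}")  # U+20AC
```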

US-ASCII is a little different. It is, after all, a subset of virtually all
encodings on the Web. But if you have a page declared in US-ASCII and
instantiate a true US-ASCII-7 transcoder, you have to do something with the
bytes from 0x80 to 0xFF. Making lots of U+FFFD is not very useful. Using
Latin-1 as the converter for US-ASCII makes sense then.
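
A rough illustration of that trade-off, again with Python's built-in codecs and
made-up input bytes. The "ascii" codec with errors="replace" plays the part of
a true 7-bit decoder here, and "cp1252" plays the part of the wider fallback.

```python
# Hypothetical bytes from a page labelled "us-ascii" that is not pure ASCII.
raw = b"caf\xe9 \x93quoted\x94"

# A true 7-bit decoder can only substitute U+FFFD for every high byte.
strict = raw.decode("ascii", errors="replace")

# Treating the label as windows-1252 yields likely printable characters.
lenient = raw.decode("cp1252")

print(strict.count("\ufffd"))  # 3
print(lenient)
```

One caveat: Python's "cp1252" codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and
0x9D undefined, so it will raise an error on those in strict mode. A browser's
windows-1252 decoder maps them to the matching C1 controls instead.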

It might make more sense, in that case, if US-ASCII used the *true* ISO 8859-1
converter, since that encoding's mapping to Unicode simply round-trips each
byte with the first 256 Unicode code points. That, in fact, is a common enough
trick for data of unknown origin and encoding where you don't want to lose the
original byte values. But for a Web page this isn't very useful. The C1
controls are still invisible or tofu junk. Converting to likely printable
characters is more useful. If it's wrong, at least you can see the mojibake,
and there is a reasonable likelihood that it'll be the right way to interpret
the bytes.
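
The round-trip trick mentioned above can be sketched like this in Python, whose
"latin-1" codec is a true ISO 8859-1 mapping. The variable names are only
illustrative.

```python
# Every possible byte value, standing in for data of unknown encoding.
original = bytes(range(256))

# True ISO 8859-1 maps byte N to code point U+00NN, so nothing is lost.
text = original.decode("latin-1")
code_points = [ord(c) for c in text]

# Encoding back recovers the exact original bytes.
restored = text.encode("latin-1")

print(code_points == list(range(256)))  # True
print(restored == original)             # True
```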

Still, this does call out one thing: a true transcoder implementation (think
iconv or what have you) really *DOES* need to distinguish each of these
encodings. If
you use the "Latin-1 encoding trick" I mention above but your transcoder treats
the bytes as windows-1252, you'll be downright unhappy and annoyed (I know I'd
be furious). But in a Web page, I want the browser to produce likely visible
characters and C1 controls are (almost) always wrong.
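
Here is a small sketch of how the trick breaks when the decoder quietly
substitutes windows-1252. It uses Python's "cp1252" codec as the stand-in and
two arbitrary sample bytes.

```python
# Two bytes we hoped to carry through a string unchanged.
saved = b"\x80\x93"

# A decoder that treats "latin-1" as windows-1252 remaps them.
text = saved.decode("cp1252")
code_points = [ord(c) for c in text]  # no longer 0x80 and 0x93

# Getting the bytes back through true ISO 8859-1 now fails outright.
try:
    text.encode("latin-1")
    round_trip_ok = True
except UnicodeEncodeError:
    round_trip_ok = False

print([f"U+{cp:04X}" for cp in code_points])  # ['U+20AC', 'U+201C']
print(round_trip_ok)                          # False
```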


Received on Sunday, 27 October 2013 17:38:51 UTC