- From: <bugzilla@jessica.w3.org>
- Date: Sun, 27 Oct 2013 17:38:50 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23646

--- Comment #4 from Addison Phillips <addison@lab126.com> ---

John, I agree generally. The problem here is that when a document declares an encoding, and/or UTF-8 has not been detected, one can instantiate one and only one encoder/decoder to handle the text.

Latin-1 is the more obvious case here. If you're decoding a page declared as "iso-8859-1" and you see a byte like 0x80, you *could* treat it as a C1 control character. But the C1 controls add no value to the page. It's very likely that byte is actually U+20AC (EURO SIGN). In fact, browsers and major websites already make that assumption and have done for quite a while. Hence the alias appearing in this document.

US-ASCII is a little different. It is, after all, a subset of virtually all encodings on the Web. But if you have a page declared in US-ASCII and instantiate a true US-ASCII-7 transcoder, you have to do something with the bytes from 0x80 to 0xFF. Making lots of U+FFFD is not very useful. Using Latin-1 as the converter for US-ASCII makes sense, then.

It might make more sense, in that case, if US-ASCII used the *true* ISO 8859-1 converter, since that encoding's mapping to Unicode simply round-trips each byte to the corresponding one of the first 256 Unicode characters. That, in fact, is a common enough trick for data of unknown origin and encoding where you don't want to lose the original byte values. But for a Web page this isn't very useful. The C1 controls are still invisible or tofu junk. Converting to likely printable characters is more useful. If it's wrong, at least you can see the mojibake, and there is a reasonable likelihood that it'll be the right way to interpret the bytes.

Still, that does call out one point: a true transcoder implementation (think iconv or what have you) really *DOES* need to distinguish each of these encodings.
If you use the "Latin-1 encoding trick" I mention above but your transcoder treats the bytes as windows-1252, you'll be downright unhappy and annoyed (I know I'd be furious). But in a Web page, I want the browser to produce likely visible characters, and C1 controls are (almost) always wrong.
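
To make the distinction concrete, here is a minimal sketch in Python, not part of the original comment. It assumes Python's codec names: "iso-8859-1" there is the true ISO 8859-1 mapping (not the Web alias), and "cp1252" stands in for windows-1252. The helper names (decode_three_ways, recover_bytes, survives_round_trip) are hypothetical and exist only for illustration.

```python
def decode_three_ways(data: bytes) -> dict[str, str]:
    """Decode the same bytes with three strict, distinct decoders.

    Note: Python's cp1252 raises on the five bytes it leaves undefined
    (0x81, 0x8D, 0x8F, 0x90, 0x9D), so this sketch is meant for bytes
    outside that set, such as 0x80.
    """
    return {
        # True ISO 8859-1: byte N becomes code point U+00NN (C1 controls kept).
        "iso-8859-1": data.decode("iso-8859-1"),
        # windows-1252: 0x80..0x9F mostly map to printable characters.
        "windows-1252": data.decode("cp1252"),
        # Strict 7-bit ASCII: anything >= 0x80 becomes U+FFFD.
        "us-ascii": data.decode("ascii", errors="replace"),
    }


def recover_bytes(text: str) -> bytes:
    """Undo the "Latin-1 trick": map U+0000..U+00FF back to single bytes."""
    return text.encode("iso-8859-1")


def survives_round_trip(data: bytes, decoder: str) -> bool:
    """Report whether bytes decoded with `decoder` can be recovered intact."""
    try:
        return recover_bytes(data.decode(decoder)) == data
    except UnicodeError:
        # Either the decoder rejected a byte, or it produced a character
        # above U+00FF that cannot be mapped back to one byte.
        return False


if __name__ == "__main__":
    for name, text in decode_three_ways(b"\x80").items():
        print(f"{name:>12}: U+{ord(text):04X}")
    sample = b"A\x80\x9f\xe9"
    print("iso-8859-1 round trip:", survives_round_trip(sample, "iso-8859-1"))
    print("windows-1252 round trip:", survives_round_trip(sample, "cp1252"))
```

Under these assumptions, the byte 0x80 comes out as U+0080, U+20AC, and U+FFFD respectively, and only the true ISO 8859-1 decoder lets the original byte values be recovered afterwards.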
Received on Sunday, 27 October 2013 17:38:51 UTC