- From: <bugzilla@jessica.w3.org>
- Date: Mon, 30 Jun 2014 19:34:38 +0000
- To: www-international@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=23646

--- Comment #27 from Paul Eggert <eggert@cs.ucla.edu> ---
(In reply to Paul Eggert from comment #23)
> If you have a document that declares "us-ascii", but, in fact, contains non-ASCII byte values, what should happen to those byte values when the document is interpreted?

If the byte values are UTF-8 text, they should be interpreted as UTF-8. We use UTF-8 for our other text files, and occasionally the UTF-8 inadvertently leaks into the HTML, so treating it as UTF-8 would be the most useful for us. I don't think we're alone in this.

I realize that in this context many browsers interpret non-ASCII bytes using a unibyte encoding for legacy reasons, but some newer browsers do treat them as UTF-8. I just now tried eww (which will be part of the next GNU Emacs release; see <http://www.emacswiki.org/emacs/eww>) and that's how it works. The standard should allow this behavior. More generally, the standard should allow the browser to heuristically decode invalid bytes in ways appropriate for the current user and context.

So I guess I am asking for a change to the standard after all. Here's a proposed change, inspired by your wording.

* In section 4.2 step 2, change "the corresponding encoding" to "a corresponding encoding".

* In section 4.2's table, add the "us-ascii" label to the utf-8 encoding.

* Append the following text after section 4.2's Note:

  In practice document authors tend to be imprecise in identifying the correct label, and the following table gives decoders advice and some leeway when dealing with incorrectly labeled documents. For example, because the "iso8859-1" and "us-ascii" labels both correspond to the windows-1252 encoding, a user-agent given a document with either label can treat the document as if it were windows-1252. Conversely, because the "us-ascii" label corresponds to both the utf-8 and the windows-1252 encodings, a user-agent given a document labeled "us-ascii" can decode it as either utf-8 or as windows-1252, depending on user preferences or other heuristics.

Assuming that the above suggestion is acceptable, I suppose we could also add other labels to superset encodings as appropriate, e.g., add "us-ascii" to "euc-jp". This is not needed for my use case, though.
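The leeway proposed above, where one label may correspond to more than one encoding, could be sketched roughly as follows. This is only an illustration under stated assumptions, not text from the Encoding Standard or from any browser: the table CANDIDATES and the function decode_labeled are hypothetical names, and "try strict UTF-8 first, then fall back to windows-1252" is just one possible heuristic among those the proposal would permit.

```python
from typing import Dict, List, Tuple

# Hypothetical label table (an assumption for illustration): each label maps
# to an ordered list of candidate encodings rather than to exactly one.
CANDIDATES: Dict[str, List[str]] = {
    "us-ascii": ["utf-8", "windows-1252"],
    "iso8859-1": ["windows-1252"],
    "utf-8": ["utf-8"],
}


def decode_labeled(data: bytes, label: str) -> Tuple[str, str]:
    """Decode data using the first candidate encoding that fits.

    Returns the decoded text and the name of the encoding actually used.
    Raises KeyError for a label that is not in the table.
    """
    candidates = CANDIDATES[label.strip().lower()]
    # Try every candidate except the last one strictly; a decoding error
    # means the bytes are not valid in that encoding, so move on.
    for encoding in candidates[:-1]:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # The last candidate is the fallback: decode leniently so that the
    # function always returns some text.
    fallback = candidates[-1]
    return data.decode(fallback, errors="replace"), fallback
```

With this sketch, a "us-ascii" document whose stray bytes happen to be valid UTF-8 is decoded as UTF-8, while one containing a lone 0xE9 byte falls back to windows-1252. A real user agent might instead consult user preferences or other heuristics, as the proposed text allows.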
Received on Monday, 30 June 2014 19:34:39 UTC