[Bug 23646] "us-ascii" should not be an alias for "windows-1252" from bugzilla@jessica.w3.org on 2014-06-30 (www-international@w3.org from April to June 2014)

From: <bugzilla@jessica.w3.org>
Date: Mon, 30 Jun 2014 19:34:38 +0000
To: www-international@w3.org
Message-ID: <bug-23646-4285-bu254pU6Xn@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=23646

--- Comment #27 from Paul Eggert <eggert@cs.ucla.edu> ---
(In reply to Paul Eggert from comment #23)
> If you have a document that declares "us-ascii", but, in fact, contains non-ASCII byte values, what should happen to those byte values when the document is interpreted?

If the byte values are UTF-8 text, they should be interpreted as UTF-8.  We use
UTF-8 for our other text files, and occasionally the UTF-8 inadvertently leaks
into the HTML, so treating it as UTF-8 would be the most useful for us.  I
don't think we're alone in this.

I realize that in this context many browsers interpret non-ASCII bytes using a
unibyte encoding for legacy reasons, but some newer browsers do treat them as
UTF-8.  I just now tried eww (which will be part of the next GNU Emacs release;
see <http://www.emacswiki.org/emacs/eww>) and that's how it works.  The
standard should allow this behavior.  More generally, the standard should allow
the browser to heuristically decode invalid bytes in ways appropriate for the
current user and context.
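
To make the kind of heuristic concrete, here is a minimal sketch in Python.  The function name decode_labeled_ascii is hypothetical, and "try strict UTF-8 first, then fall back to windows-1252" is only one possible heuristic; it is not something the current standard specifies, and it is not how any particular browser is known to work.

```python
def decode_labeled_ascii(data):
    """Decode the bytes of a document labeled "us-ascii".

    Hypothetical helper: returns a (text, encoding_used) pair.
    """
    try:
        # Strict UTF-8 decoding rejects most legacy single-byte text,
        # so success is a strong hint that the bytes really are UTF-8.
        # Pure ASCII input also takes this branch, with identical results.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Legacy fallback.  Python's cp1252 codec leaves a few bytes
        # (for example 0x81) undefined, so errors="replace" turns them
        # into U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace"), "windows-1252"
```

With this sketch, the bytes b"caf\xc3\xa9" decode as UTF-8, while b"caf\xe9" is not valid UTF-8 and falls back to windows-1252; both yield the same four-character string.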

So I guess I am asking for a change to the standard after all.  Here's a
proposed change, inspired by your wording.

* In section 4.2 step 2, change "the corresponding encoding" to "a
corresponding encoding".

* In section 4.2's table, add the "us-ascii" label to the utf-8 encoding.

* Append the following text after section 4.2's Note:

In practice document authors tend to be imprecise in identifying the correct
label, and the following table gives decoders advice and some leeway when
dealing with incorrectly labeled documents.  For example, because the
"iso8859-1" and "us-ascii" labels both correspond to the windows-1252 encoding,
a user-agent given a document with either label can treat the document as if it
were windows-1252.  Conversely, because the "us-ascii" label corresponds to
both the utf-8 and the windows-1252 encodings, a user-agent given a document
labeled "us-ascii" can decode it as either utf-8 or as windows-1252, depending
on user preferences or other heuristics.
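
As an illustration only, the proposed one-label-to-several-encodings table could be modeled as follows in Python.  The names CANDIDATES, get_encodings and choose_encoding are hypothetical, the table holds just three sample labels, and the tie-breaking rule (user preference first, otherwise the first listed encoding) is an assumption rather than part of the proposed wording.

```python
# Hypothetical excerpt of a label table in which one label may
# correspond to more than one encoding, in order of default preference.
CANDIDATES = {
    "iso8859-1": ["windows-1252"],
    "us-ascii": ["utf-8", "windows-1252"],
    "utf-8": ["utf-8"],
}

ASCII_WHITESPACE = "\t\n\f\r "


def get_encodings(label):
    """Return every encoding the label corresponds to (possibly none)."""
    # Labels are matched after trimming ASCII whitespace and ignoring case.
    key = label.strip(ASCII_WHITESPACE).lower()
    return CANDIDATES.get(key, [])


def choose_encoding(label, preferred=None):
    """Pick "a corresponding encoding" for the label.

    The user's preferred encoding wins if the label allows it;
    otherwise the first candidate is used.  Returns None for an
    unknown label.
    """
    candidates = get_encodings(label)
    if not candidates:
        return None
    if preferred in candidates:
        return preferred
    return candidates[0]
```

Under these assumptions, choose_encoding("us-ascii") gives "utf-8", choose_encoding("us-ascii", preferred="windows-1252") gives "windows-1252", and a preference that the label does not allow, as in choose_encoding("iso8859-1", preferred="utf-8"), is ignored.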


Assuming that the above suggestion is acceptable, I suppose we could also add
other labels to superset encodings as appropriate, e.g., add "us-ascii" to
"euc-jp".  This is not needed for my use case, though.


Received on Monday, 30 June 2014 19:34:39 UTC