[whatwg] Encoding Standard (mostly complete) from Glenn Maynard on 2012-04-18 (public-whatwg-archive@w3.org from April 2012)

From: Glenn Maynard <glenn@zewt.org>
Date: Wed, 18 Apr 2012 17:34:12 -0500
Message-ID: <CABirCh9bmpR+4TW4iOp2fiC-xVjy=hh44qwDYicicQbRC2iYxw@mail.gmail.com>

On Wed, Apr 18, 2012 at 12:12 PM, Anne van Kesteren <annevk at opera.com> wrote:

>> "If code point is equal or greater than lower boundary" is more naturally
>> "greater than or equal to" (and "less than or equal to").  That said,
>> this would be much clearer with interval syntax:
>>
>> "If code point is in the range [*lower boundary*, 0x10FFFF] and is not in
>> the range [0xD800, 0xDFFF], emit code point (and continue)."
>>
>> which I think is easier to read, and also makes it clear that the "0xD800
>> to 0xDFFF" is a closed interval (0xD800 and 0xDFFF are included).
>>
>
> Then we'd first have to introduce interval syntax to the English language.
> We could do that I suppose in the Terminology section if you think it would
> be better.

It would also apply to
http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#index-gb18030-code-point,
and it could apply to "select" ranges (e.g. 7.1 step 5: "[0,0x7f]").  Maybe
it's not enough to be worth figuring out how to define it.
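
As a rough illustration of what the interval wording would mean in practice,
here is a minimal Python sketch of the closed-interval check. The function
names (in_closed_range, should_emit) are hypothetical and not taken from the
spec; lower_boundary stands in for the decoder's "lower boundary" variable.

```python
def in_closed_range(value, low, high):
    """Return True if low <= value <= high (both endpoints included)."""
    return low <= value <= high


def should_emit(code_point, lower_boundary):
    """Sketch of the proposed wording: the code point is in
    [lower_boundary, 0x10FFFF] and is not in [0xD800, 0xDFFF]."""
    return (in_closed_range(code_point, lower_boundary, 0x10FFFF)
            and not in_closed_range(code_point, 0xD800, 0xDFFF))
```

Because both intervals are closed, 0xD800 and 0xDFFF themselves are rejected,
and a code point exactly equal to the lower boundary is accepted.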

>> Encoding form data, at least, doesn't abort on the first error; any
>> unrepresentable codepoints are encoded as &x1234;.  (It would sure be
>> nice if encoding to non-Unicode-based encodings would just *always* use
>> that syntax for non-ASCII, so the encoders could be dropped, but I guess
>> that would trigger bugs in pages that are currently masked...)  Is there
>> any encoding path in browsers that does give up on the first error?
>>
>
> It has been proposed for the API.
>
> And in URLs you do not get "&#...;" (though in WebKit you do) but you get
> "?" (IE at the network layer, Opera earlier on) or the utf-8 representation
> (Gecko is totally weird).
>

I was testing with POST, which (at least in Gecko) uses HTML escapes for
unrepresentable characters.
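
The behaviour is roughly analogous to Python's built-in "xmlcharrefreplace"
error handler, sketched below. This is only an analogy, not browser code: the
helper name encode_with_escapes is made up for illustration, and Python emits
decimal references, which may not match the exact escape form a given browser
sends.

```python
def encode_with_escapes(text, encoding):
    """Encode text, replacing characters the target encoding cannot
    represent with numeric character references instead of failing
    on the first error."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# U+00E9 is representable in windows-1252 and is encoded directly;
# U+1234 is not, so it becomes a decimal reference (0x1234 == 4660).
sample = encode_with_escapes("caf\u00e9 \u1234", "windows-1252")
```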

(It would be pretty neat if that could be changed to *always* using HTML
escapes for non-ASCII, except when encoding to UTF-8, since that's not
introducing anything new--you can already receive &x1234; escapes in POST
data--and it would alleviate the "form submit encoding depends on the
source page's encoding" problem.  I guess this must break pages somehow, or
vendors would have done this long ago.)
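
A minimal sketch of that idea, again in Python and again only as an
illustration: encode_form_value is a hypothetical helper, and treating every
non-UTF-8 target as "ASCII plus escapes" is the assumption being proposed
here, not something any browser is known to do.

```python
import codecs


def encode_form_value(text, encoding):
    """Hypothetical rule: UTF-8 is encoded normally; for any other
    target encoding, every non-ASCII character becomes a numeric
    character reference, so the legacy encoder is never consulted."""
    if codecs.lookup(encoding).name == "utf-8":
        return text.encode("utf-8")
    # Output no longer depends on which legacy encoding the page used.
    return text.encode("ascii", errors="xmlcharrefreplace")
```

Under this rule the submitted bytes are identical for every non-UTF-8 page
encoding, which is the property that would sidestep the dependence on the
source page's encoding.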

-- 
Glenn Maynard

Received on Wednesday, 18 April 2012 15:34:12 UTC