Re: New full Unicode for ES6 idea

Anne van Kesteren wrote:
> On Sun, 19 Feb 2012 21:29:48 +0100, David Bruant <bruant.d@gmail.com> 
> wrote:
>> I think a CSP-like solution should be explored.
>
> FWIW, the feedback on CORS (CSP-like) thus far has been that it's 
> quite hard to set up custom headers.

I've heard this for years, can believe it in old-school big-company 
settings, but have a not-to-be-shattered hope that with Node.js etc. it 
is easier for content authors to configure headers. Go on, break my heart!

> So for something as commonly used as JavaScript I'm not sure we'd want 
> to require that. And although more difficult, if we want <meta> it can 
> be made to work, it's just more complicated than simply defining a 
> name and a value. But maybe it should be something simpler, e.g.
>
> <html unicode>
>
> in the top-level browsing context's document.

That's pretty but is it misleading? This is the big-red-switch-for-JS, 
not for the whole doc. In particular what is the Content-Type, with what 
charset parameter, and how does this attribute interact? Perhaps it's 
just misnamed.

> What are libraries supposed to do by the way, check the length of "😁" 
> and adjust code accordingly?

Most JS libraries (I'd love to see counterexamples) do not process 
surrogate pairs at all. They too live in the '90s.
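
To make concrete what "processing surrogate pairs" would ask of a library 
today, here is a minimal sketch. The helper name countCodePoints is 
hypothetical, not taken from any existing library, and it assumes today's 
semantics where length counts 16-bit units:

```javascript
// Hypothetical helper: count code points instead of 16-bit code units.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    // A leading surrogate followed by a trailing surrogate is one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the trailing half of the pair
      }
    }
    count++;
  }
  return count;
}

var grin = "\uD83D\uDE01";           // U+1F601 written as a surrogate pair
var units = grin.length;             // 2
var points = countCodePoints(grin);  // 1
```

A lone surrogate is counted as one code point in this sketch; a stricter 
library might reject it instead.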

> As far as the DOM and Web IDL are concerned, I think we would need two 
> definitions for "code unit". One that means 16-bit code unit and one 
> that means "Unicode code unit"

I'm not a Unicode expert but I believe the latter is called "character".

> or some such. Looking at 
> http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata 
> the rest should follow quite naturally.
>
> What happens with surrogate code points in these new strings? I think 
> we do not want to change that each unit is an integer of some kind and 
> can be set to any value. And if that is the case, will it hold values 
> greater than U+10FFFF?

JS must keep the "\uXXXX" notation for uint16 storage units, and one can 
create invalid Unicode strings already. This hazard does not go away, we 
keep compatibility, but the BRS adds no new hazards and in practice, if 
well-used, should reduce the incidence of invalid-Unicode-string bugs.
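
As an illustration of the existing hazard, a minimal sketch of how one can 
create and detect an invalid Unicode string with "\uXXXX" today. The 
function name isWellFormed is hypothetical and only checks surrogate 
pairing:

```javascript
// Hypothetical check: every leading surrogate must be followed by a
// trailing surrogate, and no trailing surrogate may stand alone.
function isWellFormed(s) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired lead
      i++; // valid pair, skip the trailing half
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // unpaired trail
    }
  }
  return true;
}

var lone = "\uD83D";          // a legal JS string, but not valid Unicode text
var pair = "\uD83D\uDE01";    // a properly paired U+1F601
var loneOk = isWellFormed(lone);  // false
var pairOk = isWellFormed(pair);  // true
```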

The "\u{...}" notation is independent and should work whatever the BRS 
setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. 
In "UTF-16" setting, it makes only characters. And of course in the 
latter case indexing and length count characters.
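
A small sketch of the default ("UCS-2") case, assuming an engine that 
accepts the "\u{...}" notation. The BRS setting itself is only a proposal 
here, so the other case is described in a comment rather than run:

```javascript
// Default setting: "\u{...}" above U+FFFF expands to a surrogate pair.
var viaBraces = "\u{1F601}";
var viaPair = "\uD83D\uDE01";

var sameString = viaBraces === viaPair;  // true
var unitLength = viaBraces.length;       // 2
var firstUnit = viaBraces.charCodeAt(0); // 0xD83D, the leading surrogate

// Under the proposed "UTF-16" setting, the same literal would be a single
// character, so length would be 1 and index 0 would be U+1F601 itself.
```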

/be

Received on Sunday, 19 February 2012 22:16:18 UTC