Re: New full Unicode for ES6 idea from Wes Garland on 2012-02-22 (public-script-coord@w3.org from January to March 2012)

From: Wes Garland <wes@page.ca>
Date: Wed, 22 Feb 2012 09:09:55 -0500
To: Norbert Lindenberg <ecmascript@norbertlindenberg.com>
Cc: Brendan Eich <brendan@mozilla.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com, es-discuss <es-discuss@mozilla.org>
Message-ID: <CAHB0tE6FwucSUR_62NcThTm40Lfhf+8W=N=X6=vHVkRq0dsJRw@mail.gmail.com>

Interesting scenarios, Norbert -- well-thought-through.

The final goal (for me, at least) is to be able to tell my developers to
"Just write code" and forget about the details about how the characters in
strings are encoded. Your point about the bidi library is an important one,
but I think if we could somehow survey the web, we would find that the
vast majority of applications do The Wrong Thing now and that flipping the
BRS would magically fix a lot of them.  I think any group that is "with it"
w.r.t. Unicode in JS today will find a way to embrace BRS-on as long as there
is a reasonable path to follow.

Some day, I hope developers will simply start all documents with something
like <!DOCTYPE HTML UNICODE> and never worry about character encoding
details again.  That is when we will start to see benefits, and these
benefits will snowball as organizations start to do this.

Of course, to get there, we have to somehow manage the transition. I think
your point about the static rejection of four-byte Unicode escapes is
really important.  During the transitional period, we need a way to write
JS libraries that can run with BRS on or off.

If four-byte escapes are statically rejected in BRS-on, we have a problem
-- we should be able to use old code that runs in either mode unchanged
when said code only uses characters in the BMP.

Accepting both 4 and 6 byte escapes is a problem, though -- what is
"\u123456".length?  1 or 3?

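To make the ambiguity concrete, here is a minimal sketch in plain
JavaScript. The helper lengthWithEscapeWidth is a hypothetical name used
only for illustration; it is not part of any proposal. It counts how many
characters the literal body would hold if \u took a fixed number of hex
digits, assuming each complete escape yields exactly one character.

```javascript
// Hypothetical helper (illustration only): counts the characters a literal
// body would contain if every \u escape consumed exactly `digits` hex digits.
function lengthWithEscapeWidth(body, digits) {
  // Matches a backslash, a "u", and exactly `digits` hex digits.
  var escape = new RegExp("\\\\u[0-9A-Fa-f]{" + digits + "}", "g");
  // Stand in one placeholder character for each complete escape.
  return body.replace(escape, "X").length;
}

// The raw text between the quotes: backslash, u, 1, 2, 3, 4, 5, 6.
var source = "\\u123456";

console.log(lengthWithEscapeWidth(source, 4)); // 3: \u1234, then "5" and "6"
console.log(lengthWithEscapeWidth(source, 6)); // 1: one six-digit escape
```

The same source text gives two different lengths, so a parser that accepts
both widths needs some other signal to choose between them.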
If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today
in ES5 with "\u123".length===4 -- we give developers a way to feature-test
and conditionally execute code, allowing libraries to run with BRS-on and
BRS-off.
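
A feature test along these lines might look like the sketch below. It
assumes the behaviour described in this message (a short escape degrades
to its literal characters under BRS-on); hasFullUnicode and describeMode
are hypothetical names, and no shipping engine implements the BRS-on
branch.

```javascript
// Sketch only: BRS-on is a proposal, so the "=== 5" branch is an assumption.
// BRS-off (today's engines): "\u1234" is a complete escape, one character.
// BRS-on as discussed above: the escape would be two digits short and would
// fall back to the five literal characters u, 1, 2, 3, 4.
var hasFullUnicode = "\u1234".length === 5;

// Hypothetical helper: a library could branch on the result.
function describeMode() {
  return hasFullUnicode ? "BRS-on" : "BRS-off";
}

console.log(describeMode());
```

In an engine without the switch the test evaluates to false, so a library
would take its existing BRS-off code path unchanged.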

It's awkward, though: there is no way to recover static strings
programmatically since the \ has been eaten by the JS compiler.  And users
*will* want to programmatically convert arrays of strings (think gettext).

So, it seems that for a good migration path we somehow need to mark string
literals so that the parser knows how to deal with them.  And we need to do
it in a way that "just works" in ES5 while preserving natural syntax with
BRS-on.

*Idea*: can we add a per-script attribute which allows a transitional
parsing scheme for string literals when BRS-on?  This transitional scheme
would parse string literals like BRS-off, *unless* the string literal had a
leading U.

Having a per-script attribute lets module system developers deal with the
problem easily when using DOM SCRIPT tag injection to load modules.  It
also allows users switching BRS-on to load old content from foreign sites,
which I believe is necessary for widespread BRS-on adoption.

Sample program demonstrating how this might work:

<!DOCTYPE HTML UNICODE>
<html>
  <script>
    var i;
    var a = [];
    a.push("\u1234");
  </script>
  <script parser="unicodeTransitional">
    a.push("\u1234");
    a.push(U"\u1234");
    a.push(U"\u123456");
  </script>
  <script>
    a.push("\u123456");
    for (i=0; i < a.length; i++) {
      console.log(i + " -> " + a[i].length);
    }
  </script>
</html>

Output:

0 -> 5
1 -> 1
2 -> 5
3 -> 1
4 -> 1

I think this is a sustainable solution that gives developers just enough
tools to retrofit without going off into la-la land by adding a bunch of extra
types and helper methods.

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Received on Wednesday, 22 February 2012 14:10:27 UTC