[whatwg] Byte-wise tokenization algorithm from Edward Z. Yang on 2008-12-21 (public-whatwg-archive@w3.org from December 2008)

From: Edward Z. Yang <edwardzyang@thewritingpot.com>
Date: Sun, 21 Dec 2008 11:35:53 -0500
Message-ID: <494E7069.2020407@thewritingpot.com>

Ian Hickson wrote:
> Yes. (At least, that's the intent; if you find anything that contradicts 
> that, please let me know.)

Great. I'll be sure to ping you if I find out otherwise.

> Looking just at parsing, yes, probably...

I suppose the big pivot point is "as if". A byte-wise implementation
would globally replace "character" with "byte", and any U+xxxx
designation with its UTF-8 encoded byte sequence. HTML 5 dictates end
behavior, not the actual algorithm implementation, no?
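
To illustrate the substitution, here is a minimal sketch in Python. The
helper names utf8_bytes and find_tag_open are hypothetical and come from
neither the spec nor any existing implementation. It relies on one
property of UTF-8: every byte of a multi-byte sequence is 0x80 or above,
so a scan for an ASCII byte such as 0x3C ("<") cannot match in the
middle of another character.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence for a U+xxxx designation."""
    return chr(code_point).encode("utf-8")


def find_tag_open(data: bytes) -> list[int]:
    """Return the byte offsets of every '<' (0x3C) in UTF-8 input.

    Comparing raw bytes is safe here because lead and continuation
    bytes of multi-byte UTF-8 sequences are always >= 0x80.
    """
    lt = utf8_bytes(0x003C)[0]  # U+003C LESS-THAN SIGN is the single byte 0x3C
    return [i for i, b in enumerate(data) if b == lt]


# U+FFFD REPLACEMENT CHARACTER becomes a three-byte pattern to match or emit.
REPLACEMENT = utf8_bytes(0xFFFD)
```

A byte-wise tokenizer would compare against such precomputed byte
patterns wherever the spec text compares against a code point.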

> But an HTML5 implementation, 
> according to the spec, must at a minimum support the UTF-8 and 
> Windows-1252 encodings, so the overall implementation might not depending 
> on exactly how this is done.

The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.
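
A rough sketch of that pre-conversion step, using Python's built-in
codecs as a stand-in for iconv (an assumption; iconv's own substitution
behavior may differ). The function name to_utf8 is hypothetical, and
replacing undecodable bytes with U+FFFD is only one possible policy, not
necessarily the one the spec requires.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Transcode input to UTF-8 before handing it to a byte-wise tokenizer.

    Bytes the source codec cannot map are replaced with U+FFFD
    (errors="replace"); this policy is an assumption of the sketch.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


# 0x93 and 0x94 are curly quotes in Windows-1252.
converted = to_utf8(b"\x93hi\x94")
```

Whether the result is conforming then depends on how closely the
converter's handling of unmapped bytes matches what the spec expects.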

Received on Sunday, 21 December 2008 08:35:53 UTC