
[whatwg] Byte-wise tokenization algorithm

From: Edward Z. Yang <edwardzyang@thewritingpot.com>
Date: Sun, 21 Dec 2008 11:35:53 -0500
Message-ID: <494E7069.2020407@thewritingpot.com>
Ian Hickson wrote:
> Yes. (At least, that's the intent; if you find anything that contradicts 
> that, please let me know.)

Great. I'll be sure to ping you if I find out otherwise.

> Looking just at parsing, yes, probably...

I suppose the big pivot point is "as if". A byte-wise implementation
would replace "character" with "byte" throughout, and any U+xxxx
designation with its UTF-8 encoded byte sequence. HTML 5 dictates end
behavior, not the actual algorithm implementation, no?
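
To make the substitution concrete, here is a rough sketch in Python (not
the language of the actual implementation). The names utf8_bytes,
TAG_OPEN, REPLACEMENT and find_tag_opens are made up for illustration.
It relies on the fact that in UTF-8 an ASCII byte never appears inside a
multi-byte sequence, so scanning for the byte 0x3C finds exactly the
U+003C characters.

```python
from typing import List


def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 byte sequence for one Unicode code point."""
    return chr(code_point).encode("utf-8")


# Hypothetical constants: each U+xxxx in the spec text becomes a byte string.
TAG_OPEN = utf8_bytes(0x003C)      # "<" is a single byte
REPLACEMENT = utf8_bytes(0xFFFD)   # the replacement character is three bytes


def find_tag_opens(data: bytes) -> List[int]:
    """Return the byte offsets of every U+003C in UTF-8 encoded input."""
    offsets = []
    pos = data.find(TAG_OPEN)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(TAG_OPEN, pos + 1)
    return offsets


if __name__ == "__main__":
    sample = "é<b>ü</b>".encode("utf-8")
    print(find_tag_opens(sample))
```

The offsets are byte offsets rather than character offsets, which only
matters if something downstream reports positions to the user.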

> But an HTML5 implementation, 
> according to the spec, must at a minimum support the UTF-8 and 
> Windows-1252 encodings, so the overall implementation might not, depending 
> on exactly how this is done.

The plan is to convert Windows-1252 into UTF-8 before processing; with a
reasonably good iconv implementation, support for lots of encodings is
possible. The implementation might not be fully conforming if iconv
doesn't perform the proper (possibly context-sensitive; I haven't
checked) substitution when it doesn't recognize a character, but it
should be close.
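
Roughly, the transcoding step would look like the sketch below. It is
written in Python and uses Python's built-in codecs as a stand-in for
iconv, so it is an illustration under that assumption, not the real
code; to_utf8 is a made-up name. Python's cp1252 codec leaves a few byte
values (0x81, for example) undefined, and errors="replace" turns those
into U+FFFD. Whether that is the substitution the spec wants is exactly
the part I haven't checked.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Transcode raw bytes to UTF-8 before handing them to the tokenizer.

    Bytes that the source codec cannot map are replaced with U+FFFD.
    This is an assumed policy, not one taken from the spec.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    # 0x93 and 0x94 are curly quotes in Windows-1252.
    print(to_utf8(b"\x93hi\x94"))
    # 0x81 is undefined in Python's cp1252 codec and becomes U+FFFD.
    print(to_utf8(b"a\x81b"))
```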
Received on Sunday, 21 December 2008 08:35:53 UTC
