[whatwg] Byte-wise tokenization algorithm from Geoffrey Sneddon on 2008-12-21 (public-whatwg-archive@w3.org from December 2008)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Sun, 21 Dec 2008 17:19:39 +0000
Message-ID: <596332F0-BDD9-49BB-ACA1-9F5E6794B78A@googlemail.com>

On 21 Dec 2008, at 16:35, Edward Z. Yang wrote:

> I suppose the big pivot point is "as if". A byte-wise implementation
> would replace character globally with byte, and any U+xxxx designation
> with the UTF-8 encoded byte version. HTML 5 dictates end behavior, not
> the actual algorithm implementation, no?

It states that what is done must be wholly equivalent to the given  
algorithm.
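
To make the "as if" point concrete, here is a rough sketch (in Python, and not anything taken from the spec). It relies on one property of UTF-8 that I am sure of: ASCII byte values never occur inside a multi-byte sequence, so looking for the byte 0x3C finds exactly the places where a character-wise scan finds U+003C. The function names are made up for the example.

```python
def tag_open_offsets_bytes(data: bytes) -> list[int]:
    """Byte offsets of every 0x3C ('<') byte in UTF-8 input."""
    return [i for i, b in enumerate(data) if b == 0x3C]


def tag_open_offsets_chars(text: str) -> list[int]:
    """Byte offsets of every U+003C, found by walking decoded characters."""
    offsets = []
    pos = 0  # running byte position of the current character
    for ch in text:
        if ch == "<":
            offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    return offsets


sample = "caf\u00e9 <b>\u65e5\u672c</b>"
byte_view = tag_open_offsets_bytes(sample.encode("utf-8"))
char_view = tag_open_offsets_chars(sample)
```

Both views agree on this input, which is the sense in which a byte-wise scanner can behave as if it were working on characters. This only covers a single ASCII delimiter; it says nothing about the rest of the tokenizer.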

>> But an HTML5 implementation,
>> according to the spec, must at a minimum support the UTF-8 and
>> Windows-1252 encodings, so the overall implementation might not
>> depending on exactly how this is done.
>
> The plan is to convert Windows-1252 into UTF-8 before processing; with
> a reasonably good iconv implementation, support for lots of encodings is
> possible. The implementation might not be fully conforming if iconv
> doesn't perform the proper (possibly context-sensitive; I haven't
> checked) substitution when it doesn't recognize a character, but it
> should be close.

I've never seen any way of getting iconv (at least via PHP) to do what
HTML 5 requires (i.e., replacing invalid bytes with U+FFFD). It is,
however, possible using mbstring (which also has the advantage of not
being system dependent), as well as with PHP6's Unicode support.
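
For illustration only, this is the kind of behaviour I mean, sketched in Python rather than PHP because its codecs make the substitution explicit. The helper name to_utf8 is hypothetical, and I am not claiming the output matches the HTML 5 rules in every case (for instance, how many U+FFFD a truncated multi-byte sequence produces); it only shows undecodable bytes turning into U+FFFD instead of being dropped or raising an error.

```python
def to_utf8(data: bytes, encoding: str) -> bytes:
    """Decode `data` from `encoding`, substituting U+FFFD for any bytes
    that cannot be decoded, then re-encode the result as UTF-8."""
    text = data.decode(encoding, errors="replace")
    return text.encode("utf-8")


# 0xFF can never appear in well-formed UTF-8, so it becomes U+FFFD.
bad_utf8 = to_utf8(b"a\xffb", "utf-8")

# Windows-1252 curly quotes (0x93, 0x94) map to U+201C and U+201D.
from_cp1252 = to_utf8(b"\x93hi\x94", "cp1252")
```

After this step the tokenizer only ever sees UTF-8, whatever the input encoding was.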


--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Sunday, 21 December 2008 09:19:39 UTC