[whatwg] Byte-wise tokenization algorithm

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Sun, 21 Dec 2008 09:07:40 +0000
Message-ID: <8BDF784E-314E-4DCB-870D-4CED64282163@googlemail.com>

On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

Indeed it is possible (or at least it certainly was a year and a half  
ago, but I have seen nothing change that would stop it).
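
To illustrate why this works, here is a minimal sketch in Python (not
taken from the spec or from any implementation mentioned in this
thread). It relies on one property of UTF-8: every byte of a multi-byte
sequence is 0x80 or higher, so an ASCII byte such as 0x3C ("<") in valid
UTF-8 can only ever be that character. The function name split_chunks is
hypothetical, and splitting on "<" and ">" is a drastic simplification
of the real tokenizer states.

```python
def split_chunks(data: bytes) -> list[bytes]:
    """Split valid UTF-8 input into text runs and "<...>" runs.

    Only the ASCII bytes "<" and ">" are compared; nothing is decoded.
    This is an illustration, not the HTML5 tokenization algorithm.
    """
    chunks = []
    pos = 0
    while pos < len(data):
        if data[pos] == 0x3C:  # "<": consume up to and including ">"
            end = data.find(b">", pos)
            end = len(data) if end == -1 else end + 1
        else:  # text: consume up to, but not including, the next "<"
            end = data.find(b"<", pos)
            end = len(data) if end == -1 else end
        chunks.append(data[pos:end])
        pos = end
    return chunks
```

Because the split points always fall on ASCII bytes, every chunk is
itself valid UTF-8, even when the text contains non-ASCII characters.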

>> 2. Would such an implementation be conforming?
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not depending
> on exactly how this is done.

That should be no problem: just convert Windows-1252 to UTF-8 using
strtr() (as it is a single-byte character set (SBCS), this is simple
enough; doing the inverse is not); see the attached file. Then all you
need to do is normalize the character set name to match all aliases of
Windows-1252 and UTF-8, as well as mapping ISO-8859-1 and US-ASCII (and
all their aliases) to Windows-1252.
<http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php>
does that (the only dependency is for getting the file via HTTP, which
can be replaced with cURL if you would rather just require that).
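
For readers without the attachment, the two steps can be sketched
roughly as follows. This is a Python analogue of the strtr() approach,
not the contents of the attached PHP file. The names C1_OVERRIDES,
windows_1252_to_utf8, LABELS and normalize_label are hypothetical, the
alias table is deliberately partial, and mapping the five unassigned
bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control characters of the
same value is an assumption of this sketch.

```python
# Bytes 0x80-0x9F are where Windows-1252 differs from ISO-8859-1.
C1_OVERRIDES = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
    0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
    0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}

# One UTF-8 byte string per input byte, playing the role of the
# replacement array passed to strtr(). Bytes without an override map to
# the code point of the same value (ASCII, Latin-1, or a C1 control).
_TABLE = [chr(C1_OVERRIDES.get(b, b)).encode("utf-8") for b in range(256)]


def windows_1252_to_utf8(data: bytes) -> bytes:
    """Convert Windows-1252 bytes to UTF-8 with a per-byte lookup."""
    return b"".join(_TABLE[b] for b in data)


# Partial, illustrative alias table; a real one needs every alias.
LABELS = {
    "utf-8": "utf-8", "utf8": "utf-8",
    "windows-1252": "windows-1252", "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252", "latin1": "windows-1252",
    "us-ascii": "windows-1252", "ascii": "windows-1252",
}


def normalize_label(label: str):
    """Return the canonical encoding name, or None if unsupported."""
    return LABELS.get(label.strip().lower())
```

The forward direction is a plain table lookup because each input byte
stands for exactly one character. The inverse is harder because most
Unicode characters have no Windows-1252 byte at all.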

Geoffrey Sneddon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: windows_1252_to_utf8.php
Type: text/php
Size: 4352 bytes
Desc: not available
URL: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/attachments/20081221/3333feb0/attachment.bin>
Received on Sunday, 21 December 2008 01:07:40 UTC
