- From: Geoffrey Sneddon <foolistbar@googlemail.com>
- Date: Sun, 21 Dec 2008 09:07:40 +0000
On 21 Dec 2008, at 05:41, Ian Hickson wrote:

>> 1. Given an input stream that is known to be valid UTF-8, is it possible
>> to implement the tokenization algorithm with byte-wise operations only?
>> I think it's possible, since all of the character matching parts of the
>> algorithm map to characters in ASCII space.
>
> Yes. (At least, that's the intent; if you find anything that contradicts
> that, please let me know.)

Indeed it is possible (or at least it certainly was a year and a half ago, and I have seen nothing change since that would stop it).

>> 2. Would such an implementation be conforming?
>
> Looking just at parsing, yes, probably... But an HTML5 implementation,
> according to the spec, must at a minimum support the UTF-8 and
> Windows-1252 encodings, so the overall implementation might not depending
> on exactly how this is done.

That should be no problem: just convert Windows-1252 to UTF-8 using strtr(). As Windows-1252 is a single-byte character set (SBCS), this direction is simple enough; doing the inverse is not. See the attached file.

Then all you need to do is normalize the character set name to match all aliases of Windows-1252 and UTF-8, as well as mapping ISO-8859-1 and US-ASCII (and all their aliases) to Windows-1252. <http://bugs.simplepie.org/repositories/entry/sp1/trunk/create.php> does that (the only dependency is for getting the file via HTTP, which can be replaced with cURL if you would rather require only that).

--
Geoffrey Sneddon
<http://gsnedders.com/>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: windows_1252_to_utf8.php
Type: text/php
Size: 4352 bytes
Desc: not available
URL: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/attachments/20081221/3333feb0/attachment.bin>
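The two steps described in the message (normalize the character set name, then convert Windows-1252 to UTF-8 through a fixed byte-to-string table) can be sketched roughly as follows. This is not the scrubbed PHP attachment; it is a minimal Python illustration of the same table-lookup idea that strtr() provides. The names `TABLE`, `LABELS`, `normalize_label`, and `to_utf8` are hypothetical, the label list is a small illustrative subset rather than a full alias table, and mapping the five bytes that Windows-1252 leaves unassigned to the same-numbered control characters is an assumption.

```python
def build_table():
    """Build a 256-entry table: Windows-1252 byte value -> UTF-8 byte string."""
    table = []
    for b in range(256):
        try:
            ch = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
            # Python's cp1252 codec; map them to the same-numbered code point.
            ch = chr(b)
        table.append(ch.encode("utf-8"))
    return table


TABLE = build_table()


def windows_1252_to_utf8(data):
    """Convert Windows-1252 bytes to UTF-8 bytes with one lookup per byte."""
    return b"".join(TABLE[b] for b in data)


# Illustrative subset only (hypothetical table); a real implementation would
# list every alias of each encoding.
LABELS = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    # ISO-8859-1 and US-ASCII are treated as Windows-1252, as in the message.
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
}


def normalize_label(label):
    """Return the canonical encoding name, or None if the label is unknown."""
    return LABELS.get(label.strip().lower())


def to_utf8(data, label):
    """Return data as UTF-8 bytes, given the declared character set label."""
    name = normalize_label(label)
    if name == "utf-8":
        return data  # assumed to be valid UTF-8 already; not validated here
    if name == "windows-1252":
        return windows_1252_to_utf8(data)
    raise ValueError("unsupported character set: %r" % label)
```

Every input byte maps to exactly one output string, so the conversion needs no state. The inverse direction would have to parse multi-byte UTF-8 sequences and decide what to do with characters that have no Windows-1252 equivalent, which is why it is harder.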
Received on Sunday, 21 December 2008 01:07:40 UTC