- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 12 Mar 2006 00:49:03 +0200
On Mar 11, 2006, at 17:10, Henri Sivonen wrote:

> Initialize a character decoder that the bytes 0x20–0x7E (inclusive)
> as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of
> the same (zero-extended) value and maps all other bytes to U+FFFD
> and raises a REWIND flag

On further reflection, it occurred to me that emitting the Windows-1252 characters instead of U+FFFD would be a good optimization for the common case where the encoding later turns out to be Windows-1252 or ISO-8859-1. This would require more than one bookkeeping flag, though.

> If a start tag other than html or head is seen, emit an easy parse
> error. Same with character data.

> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Except that for the ISO-8859-* family, the easy error recovery should be to emit the characters according to the corresponding Windows-* family superset.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
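To illustrate the provisional decoder and the Windows-1252 optimization discussed above, here is a minimal sketch in Python. It is only one possible reading of the proposal, not a definitive implementation. The names provisional_decode, ProvisionalResult, guessed_windows_1252 and needs_rewind are hypothetical, and the split into exactly two flags is an assumption; the message only says that more than one bookkeeping flag would be needed. How bytes below 0x80 outside the listed set should be treated under the optimization is also an assumption (here they still yield U+FFFD).

```python
from dataclasses import dataclass

REPLACEMENT = "\ufffd"

# Bytes that decode to the code point of the same value:
# 0x20..0x7E inclusive, plus tab, LF and CR.
SAFE_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


@dataclass
class ProvisionalResult:
    text: str
    # Assumed flag: some high byte was emitted as a Windows-1252 guess,
    # so the output is only valid if the real encoding turns out to be
    # Windows-1252 or ISO-8859-1.
    guessed_windows_1252: bool
    # Assumed flag: U+FFFD was emitted, so the stream has to be
    # re-decoded once the real encoding is known (the REWIND case).
    needs_rewind: bool


def provisional_decode(data: bytes) -> ProvisionalResult:
    """Decode bytes before the document's encoding is known (sketch)."""
    out = []
    guessed = False
    rewind = False
    for b in data:
        if b in SAFE_BYTES:
            out.append(chr(b))
        elif b >= 0x80:
            try:
                # Optimistic guess: treat the byte as Windows-1252.
                out.append(bytes([b]).decode("cp1252"))
                guessed = True
            except UnicodeDecodeError:
                # Byte is unassigned in Python's cp1252 codec.
                out.append(REPLACEMENT)
                rewind = True
        else:
            # Remaining control bytes and 0x7F.
            out.append(REPLACEMENT)
            rewind = True
    return ProvisionalResult("".join(out), guessed, rewind)
```

A caller would keep the provisional output as long as needs_rewind is false and, if guessed_windows_1252 is true, the encoding later turns out to be Windows-1252 or ISO-8859-1; in any other case it would rewind and decode again with the declared encoding.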
Received on Saturday, 11 March 2006 14:49:03 UTC