[whatwg] Possible bug in the character encoding detection algorithm from James Graham on 2007-03-02 (public-whatwg-archive@w3.org from March 2007)

From: James Graham <jg307@cam.ac.uk>
Date: Fri, 02 Mar 2007 23:02:27 +0000
Message-ID: <45E8AD03.60602@cam.ac.uk>

Given the following line of input:
<a b='c'>
012345678  - byte numbers for reference

I believe the steps in the spec have the following effect:

Match <a
Advance position to 2
Get an attribute
Advance position to 3
Attribute Name = b
Advance position to 4
Jump to step labeled "value"
(Presumably at this point we want to advance to position 5; this is not 
mentioned)
b = '
Advance position to 6
Attribute Value = c
Advance position to 7
Stop looking for an attribute
Get an attribute
Attribute Name = '
Advance position to 8
Stop looking for an attribute
Retract position to 7
Stop looking for an attribute
Get an attribute...

This seems to lead to an infinite loop (IIRC the same thing happens for 
unquoted values). html5lib currently sidesteps the issue by not moving 
the position back one after finding an attribute. However, that fails 
to locate the character encoding in, e.g.:
<meta http-equiv="Content-Type<meta charset="utf-8">
Obviously one possibility is to get all attributes and then, if the 
current byte is an ASCII '<', move the position back one.
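
To make that last suggestion concrete, here is a rough Python sketch. It 
is not the spec's algorithm and not html5lib's code: get_attribute, 
find_meta_charset and the step_back flag are names invented for 
illustration, and it assumes that a '<' inside an attribute aborts the 
attribute scan and leaves the position on that '<'. It only looks at a 
charset attribute and ignores http-equiv/content handling.

```python
SPACE = b"\t\n\x0c\r "


def get_attribute(data, pos):
    """Return ((name, value), new_pos), or (None, new_pos) if there is none.

    Assumption (not spec text): a '<' or '>' ends the scan and pos is left
    pointing at that byte. Whenever an attribute is returned, pos has moved
    forward by at least one byte, so repeated calls cannot loop forever.
    """
    n = len(data)
    while pos < n and data[pos] in SPACE + b"/":
        pos += 1
    if pos >= n or data[pos] in b"<>":
        return None, pos
    start = pos
    while pos < n and data[pos] not in SPACE + b"=/<>":
        pos += 1
    name, value = data[start:pos].lower(), b""
    while pos < n and data[pos] in SPACE:
        pos += 1
    if pos < n and data[pos] == ord("="):
        pos += 1
        while pos < n and data[pos] in SPACE:
            pos += 1
        if pos < n and data[pos] in b"\"'":
            quote = data[pos]
            pos += 1
            start = pos
            # Stop at the closing quote, or at '<' (the assumption above).
            while pos < n and data[pos] not in (quote, ord("<")):
                pos += 1
            value = data[start:pos]
            if pos < n and data[pos] == quote:
                pos += 1  # step past the closing quote
        else:
            start = pos
            while pos < n and data[pos] not in SPACE + b"<>":
                pos += 1
            value = data[start:pos]
    return (name, value), pos


def find_meta_charset(data, step_back=True):
    """Toy scan for <meta charset=...>; step_back toggles the proposed fix."""
    pos, n = 0, len(data)
    while pos < n:
        is_meta = data[pos:pos + 5].lower() == b"<meta"
        if is_meta and pos + 5 < n and data[pos + 5] in SPACE + b"/":
            pos += 5
            charset = None
            while True:  # get ALL attributes first
                attr, pos = get_attribute(data, pos)
                if attr is None:
                    break
                if attr[0] == b"charset" and charset is None:
                    charset = attr[1]
            if charset:
                return charset.decode("ascii", "replace")
            # Proposed fix: retract once, and only if we stopped on '<',
            # so that the "pos += 1" below lands on the '<' again.
            if step_back and pos < n and data[pos] == ord("<"):
                pos -= 1
        pos += 1
    return None
```

With step_back=True the broken example above yields "utf-8"; with 
step_back=False (roughly the current html5lib behaviour as I described 
it) the second <meta is skipped and nothing is found. Because the 
retraction happens at most once per tag, after a forward move of at 
least five bytes, the scan always makes progress.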

-- 
"The universe doesn't care what you believe. The wonderful thing about 
science is that it doesn't ask for your faith, it just asks for your 
eyes" --- http://xkcd.com/c154.html

Received on Friday, 2 March 2007 15:02:27 UTC