
[Bug 12576] Need clarification on tokenization of html 5 doc.

From: <bugzilla@jessica.w3.org>
Date: Wed, 17 Aug 2011 22:19:15 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1QtoSB-0002Wy-2d@jessica.w3.org>

--- Comment #9 from Ian 'Hixie' Hickson <ian@hixie.ch> 2011-08-17 22:19:14 UTC ---
As a general rule, in the future, please file one issue per bug.

> 1) In "Before attribute name state" (section right now), on
> encountering '<', a new attribute is started with '<' as first character.
> Shouldn't this not trigger a new element while reporting a parse error ?
> 5) Comment (1), if valid, affects pre-parser logic too (to find encoding).

It's an error, so exactly what happens doesn't matter so much. I think the current
behaviour is more consistent with widespread legacy implementations and is
mildly more secure when it comes to XSS attacks.

> 2) In "Data state" (section right now), on encountering 'U+0000', the
> current input character is emitted. Everywhere else, it is replaced with
> U+FFFD. Is this on purpose ? Or a typo ?

It's on purpose; the tree construction stage takes care of it for those cases.
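
To illustrate the division of labour, here is a minimal Python sketch. It assumes the tree construction rules drop a U+0000 character token in ordinary HTML body content and substitute U+FFFD in foreign (SVG/MathML) content; the function names and the `in_foreign_content` flag are hypothetical, and real parsers also report a parse error at each step.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def tokenize_data_state(text):
    """Data state sketch: every character, including U+0000, is emitted as-is."""
    return list(text)

def tree_builder_handle(chars, in_foreign_content=False):
    """Tree construction sketch (assumed behaviour, see lead-in).

    U+0000 is dropped in HTML body content and replaced with U+FFFD
    in foreign content; all other characters pass through unchanged.
    """
    out = []
    for ch in chars:
        if ch == "\x00":
            if in_foreign_content:
                out.append(REPLACEMENT)
            continue  # in body content the NUL is simply ignored
        out.append(ch)
    return "".join(out)

tokens = tokenize_data_state("a\x00b")
body_text = tree_builder_handle(tokens)
foreign_text = tree_builder_handle(tokens, in_foreign_content=True)
```

Because the decision depends on where the character ends up in the tree, the tokenizer cannot make it on its own in the data state.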

> 3) In "Bogus comment state" (section right now), it would be good if
> it could be reworded for clarity. As stated, it requires very careful reading
> to decipher its meaning.

Please file a separate bug for this with more detail about exactly what needs
clarifying. In general, very careful reading is to be encouraged. ;-)

> 4) In "Bogus comment state" (section right now), if we encounter an
> EOF, is it not a parse error ? (it delegates to DATA state, where it is not a
> parse error iirc).

Once you hit the bogus comment state you've already hit a parse error so it
doesn't matter.

> 6) In "Determining the character encoding" (section right now), under
> step 5 (the algo to find encoding from html content) :
> Under sub-step 1, case '<meta', point 12 which currently says -
> "If mode is true but got pragma is false, then jump to the second step of the
> overall "two step" algorithm."
> Here, 'mode' is undefined from what I saw : I assume it is supposed to be 'need
> pragma' ?

Fixed; see comment 4.

> 6.1) In point 13 from same snippet from (6) above, we have : 
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too
> ? Or only for 'utf-16' ?

UTF-16LE and UTF-16BE are both UTF-16 encodings.
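
As a rough sketch of that step in Python: the label set below is an illustrative subset, not the spec's full label table, and `normalize_sniffed_charset` is a hypothetical helper name.

```python
# Labels treated here as "a UTF-16 encoding" (illustrative subset only).
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}

def normalize_sniffed_charset(label):
    """Return the charset to use for a label found by the <meta> prescan.

    Any UTF-16 variant is changed to UTF-8; other labels are only
    trimmed and lowercased.
    """
    name = label.strip().lower()
    if name in UTF16_LABELS:
        return "utf-8"
    return name

results = [normalize_sniffed_charset(x)
           for x in ("utf-16", "UTF-16LE", "utf-16be", "windows-1252")]
```

All three UTF-16 labels map to UTF-8; the explicit LE/BE forms are not treated differently.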

> 7) In "get an attribute" (#concept-get-attributes-when-sniffing : section
> algo in main step 5) : currently a value can end on a whitespace or
> '>'. What about '/' ? Currently, the '/' will get added to the value ... This
> is applicable in two places in that algo : step 10 and step 11.

Could you show a concrete example of a Web page that would be processed
differently based on this difference? I don't fully understand the implications.
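
For reference, the behaviour being asked about can be sketched in Python as follows. This is a simplified reading of the unquoted-value part of "get an attribute" as quoted above (a value ends on whitespace or '>'), not the full algorithm; `scan_unquoted_value` is a hypothetical helper name.

```python
from typing import Tuple

WHITESPACE = b"\t\n\x0c\r "  # tab, LF, FF, CR, space
GREATER_THAN = 0x3E          # '>'

def scan_unquoted_value(data: bytes, pos: int) -> Tuple[str, int]:
    """Read an unquoted attribute value starting at pos.

    Stops at whitespace or '>'. A '/' is not a terminator here,
    so it is appended to the value like any other byte.
    """
    value = bytearray()
    while pos < len(data):
        byte = data[pos]
        if byte in WHITESPACE or byte == GREATER_THAN:
            break
        if 0x41 <= byte <= 0x5A:  # ASCII uppercase: fold to lowercase
            byte += 0x20
        value.append(byte)
        pos += 1
    return value.decode("ascii", "replace"), pos

markup = b"<meta charset=UTF-8/>"
value, end = scan_unquoted_value(markup, markup.index(b"=") + 1)
```

With this input the sketch yields the value "utf-8/", trailing slash included, which is the case point 7 describes.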

I'm leaving this bug open for point 7. Please open separate bugs for the other
points if the above is not sufficient resolution.

Received on Wednesday, 17 August 2011 22:19:17 UTC
