- From: <bugzilla@jessica.w3.org>
- Date: Sat, 30 Apr 2011 13:53:00 +0000
- To: public-html@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12576

           Summary: Need clarification on tokenization of html 5 doc.
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: mridul@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org

I was going over the sections related to tokenization in the HTML5 spec at
http://dev.w3.org/html5/spec/Overview.html (the version as of today).
I have the following queries/comments, and clarity on them would be great.

1) In "Before attribute name state" (section 8.2.4.34 right now), on
encountering '<', a new attribute is started with '<' as its first character.
Shouldn't this instead start a new element, while reporting a parse error?

2) In "Data state" (section 8.2.4.1 right now), on encountering U+0000, the
current input character is emitted. Everywhere else, it is replaced with
U+FFFD. Is this on purpose, or a typo?

3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if
the text could be reworded for clarity. As stated, it requires very careful
reading to decipher its meaning.

4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an
EOF, is it not a parse error? (It delegates to the data state, where it is not
a parse error, IIRC.)

5) Comment (1), if valid, affects the pre-parser logic too (the logic used to
find the encoding).

6) In "Determining the character encoding" (section 8.2.2.1 right now), under
step 5 (the algorithm to find the encoding from the HTML content):
Under sub-step 1, case '<meta', point 12 currently says:
"If mode is true but got pragma is false, then jump to the second step of the
overall "two step" algorithm."
Here, 'mode' is undefined as far as I can see. I assume it is supposed to be
'need pragma'?
6.1) In point 13 of the same snippet as in (6) above, we have:
"If charset is a UTF-16 encoding, change the value of charset to UTF-8."
What if it is explicitly set to utf-16LE or utf-16BE? Should it be changed
too, or only for 'utf-16'?

7) In "get an attribute" (#concept-get-attributes-when-sniffing: section
8.2.2.1, the algorithm in main step 5): currently a value can end on
whitespace or '>'. What about '/'? Currently, the '/' will get added to the
value. This applies in two places in that algorithm: step 10 and step 11.

Thanks,
Mridul

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
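
To make point 7 concrete, here is a minimal sketch of the unquoted-value loop as that point describes it: the value ends on whitespace or '>', uppercase ASCII is lowercased, and every other byte (including '/') is appended. It is not the spec's full "get an attribute" algorithm. The names `scan_unquoted_value` and `stop_on_slash` are made up for illustration, and the `stop_on_slash` behaviour is only one possible reading of the question, not something the spec defines.

```python
from typing import Tuple

# Whitespace bytes assumed for this sketch: tab, LF, FF, CR, space.
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def scan_unquoted_value(data: bytes, pos: int = 0,
                        stop_on_slash: bool = False) -> Tuple[str, int]:
    """Collect an unquoted attribute value starting at data[pos].

    Returns the collected value and the position of the byte that ended it.
    stop_on_slash is a hypothetical switch showing the alternative behaviour
    asked about in point 7; it is not part of the spec text.
    """
    value = []
    while pos < len(data):
        byte = data[pos]
        if byte in WHITESPACE or byte == 0x3E:  # whitespace or '>'
            break
        if stop_on_slash and byte == 0x2F:      # '/' (hypothetical rule)
            break
        if 0x41 <= byte <= 0x5A:                # 'A'-'Z': lowercase it
            byte += 0x20
        value.append(chr(byte))
        pos += 1
    return "".join(value), pos


# As described in point 7, the '/' ends up inside the value:
print(scan_unquoted_value(b"UTF-8/>"))                      # ('utf-8/', 6)
# With the hypothetical rule, the value stops before the '/':
print(scan_unquoted_value(b"UTF-8/>", stop_on_slash=True))  # ('utf-8', 5)
```

Note that stopping on every '/' would also cut short any unquoted value that legitimately contains one, so a real fix would probably need a narrower rule.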
Received on Saturday, 30 April 2011 13:53:02 UTC