[Bug 12576] New: Need clarification on tokenization of html 5 doc.


           Summary: Need clarification on tokenization of html 5 doc.
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: mridul@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,

I was going over the tokenization-related sections of the HTML5 spec at
http://dev.w3.org/html5/spec/Overview.html (the version as of today).
I have the following queries/comments, and clarity on them would be helpful:

1) In "Before attribute name state" (section right now), on
encountering '<', a new attribute is started with '<' as its first character.
Shouldn't this instead start a new element, while reporting a parse error?
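
To make the behaviour in question concrete, here is a rough sketch of how I
read the current text. It is my own simplification, not spec text: the
function name is hypothetical, attribute values, '=' and '/' are ignored, and
only attribute names are collected.

```python
SPACE = " \t\n\f\r"

def tokenize_start_tag(src):
    """Return (tag_name, attribute_names, parse_errors) for one start tag.

    Simplified: a '<' seen after the tag name is kept as part of an
    attribute name and counted as a parse error, instead of opening a
    new element.
    """
    assert src.startswith("<")
    i = 1
    name = ""
    # Read the tag name up to whitespace or '>'.
    while i < len(src) and src[i] not in SPACE and src[i] != ">":
        name += src[i].lower()
        i += 1
    attrs = []
    errors = 0
    current = None  # index of the attribute name being built, if any
    while i < len(src) and src[i] != ">":
        ch = src[i]
        if ch in SPACE:
            current = None  # whitespace ends the current attribute name
        else:
            if ch == "<":
                errors += 1  # parse error, but the character is kept
            if current is None:
                attrs.append("")
                current = len(attrs) - 1
            attrs[current] += ch.lower()
        i += 1
    return name, attrs, errors
```

With this reading, tokenize_start_tag("<div <p>") gives a single "div" tag
carrying an attribute named "<p", rather than a "div" followed by a "p".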

2) In "Data state" (section right now), on encountering 'U+0000', the
current input character is emitted. Everywhere else, it is replaced with
U+FFFD. Is this on purpose ? Or a typo ?
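
The asymmetry I mean can be sketched as below. This is only an illustration
of my reading: the function and the state labels are hypothetical, and I am
assuming U+0000 is counted as a parse error in both cases.

```python
REPLACEMENT = "\ufffd"

def emit_chars(text, state):
    """Return (emitted_text, parse_error_count) for a run of characters.

    `state` is a hypothetical label: "data" passes U+0000 through
    unchanged, any other value (e.g. "rcdata") substitutes U+FFFD.
    """
    out = []
    errors = 0
    for ch in text:
        if ch == "\x00":
            errors += 1  # assumed to be a parse error in both cases
            out.append(ch if state == "data" else REPLACEMENT)
        else:
            out.append(ch)
    return "".join(out), errors
```

So the same input "a\x00b" comes out unchanged in the Data state but with
U+FFFD in place of the NUL elsewhere.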

3) In "Bogus comment state" (section right now), it would be good if
it could be reworded for clarity. As stated, it requires very careful reading
to decipher its meaning.

4) In "Bogus comment state" (section right now), if we encounter an
EOF, is it not a parse error? (It delegates to the Data state, where, if I
recall correctly, EOF is not a parse error.)

5) Comment (1), if valid, also affects the pre-parser logic (the part that
finds the encoding).

6) In "Determining the character encoding" (section right now), under
step 5 (the algorithm that finds the encoding from the HTML content):
under sub-step 1, case '<meta', point 12 currently says:
"If mode is true but got pragma is false, then jump to the second step of the
overall "two step" algorithm."
Here, 'mode' is undefined as far as I could see. I assume it is supposed to
be 'need pragma'?

6.1) In point 13 of the same snippet as in (6) above, we have:
"If charset is a UTF-16 encoding, change the value of charset to UTF-8."
What if it is explicitly set to utf-16LE or utf-16BE? Should it be changed
too, or only for 'utf-16'?
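
The two readings I can see are sketched below. Both function names are
hypothetical and neither is taken from the spec; they only show where the
readings disagree.

```python
def normalize_strict(charset):
    # Reading 1: only the bare label "utf-16" is rewritten to UTF-8.
    if charset.strip().lower() == "utf-16":
        return "utf-8"
    return charset

def normalize_family(charset):
    # Reading 2: utf-16, utf-16le and utf-16be are all rewritten to UTF-8.
    if charset.strip().lower() in {"utf-16", "utf-16le", "utf-16be"}:
        return "utf-8"
    return charset
```

Both agree on "utf-16", but for "UTF-16LE" the first leaves the label alone
while the second changes it to "utf-8".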

7) In "get an attribute" (#concept-get-attributes-when-sniffing : section
algo in main step 5): currently a value can end on whitespace or '>'. What
about '/'? Currently, the '/' will get added to the value. This applies in
two places in that algo: step 10 and step 11.
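
A minimal sketch of the unquoted-value case as I read it, to show the effect.
The helper name is hypothetical, and quoted values and the other steps of the
algo are left out.

```python
WHITESPACE = b"\t\n\x0c\r "

def read_unquoted_value(data, pos):
    """Read an unquoted attribute value starting at data[pos].

    The value ends only at whitespace or '>', so a '/' is treated like
    any other byte and appended. Returns (value, new_pos).
    """
    value = bytearray()
    while pos < len(data):
        byte = data[pos:pos + 1]
        if byte in WHITESPACE or byte == b">":
            break
        value += byte.lower()  # ASCII uppercase is lowercased
        pos += 1
    return value.decode("ascii", "replace"), pos
```

For input such as charset=utf-8/> this yields the value "utf-8/" rather than
"utf-8"; with a space before the slash the value is "utf-8".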



Received on Saturday, 30 April 2011 13:53:02 UTC