[Bug 12576] New: Need clarification on tokenization of html 5 doc. from bugzilla@jessica.w3.org on 2011-04-30 (public-html@w3.org from April 2011)

From: <bugzilla@jessica.w3.org>
Date: Sat, 30 Apr 2011 13:53:00 +0000
To: public-html@w3.org
Message-ID: <bug-12576-2495@http.www.w3.org/Bugs/Public/>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12576

           Summary: Need clarification on tokenization of html 5 doc.
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: mridul@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


I was going over the sections related to tokenization in the HTML5 spec at
http://dev.w3.org/html5/spec/Overview.html (the version as of today).
I have the following queries/comments, and clarification on each would be
great.


1) In "Before attribute name state" (section 8.2.4.34 right now), on
encountering '<', a new attribute is started with '<' as its first character.
Shouldn't this instead start a new element, while reporting a parse error?
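
To make the behaviour in (1) concrete, here is a minimal sketch of how I read
that state. The function name, the state strings, and the attribute
representation are all hypothetical and heavily simplified (EOF, U+0000, and
case folding are left out), so treat it as an illustration of my reading and
not as the spec's algorithm.

```python
# Hypothetical, simplified model of the "before attribute name" state as
# read in point (1). None of these names come from the spec.
WHITESPACE = "\t\n\f "


def before_attribute_name(ch, attributes, errors):
    """Consume one character and return the name of the next state."""
    if ch in WHITESPACE:
        return "before attribute name"  # whitespace is skipped
    if ch == "/":
        return "self-closing start tag"
    if ch == ">":
        return "data"  # the current tag token would be emitted here
    if ch == "<":
        # The behaviour being queried: a parse error is recorded, but the
        # '<' still becomes the first character of a new attribute name
        # and no new tag is started.
        errors.append("parse error: '<' before attribute name")
    attributes.append({"name": ch, "value": ""})
    return "attribute name"
```

With this model, input such as `<div <p>` leaves the tokenizer inside the
`div` tag with an attribute whose name begins with '<'.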


2) In "Data state" (section 8.2.4.1 right now), on encountering U+0000, the
current input character is emitted. Everywhere else, it is replaced with
U+FFFD. Is this on purpose, or a typo?


3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if
it could be reworded for clarity. As stated, it requires very careful reading
to decipher its meaning.


4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an
EOF, is it not a parse error? (It delegates to the data state, where it is not
a parse error, IIRC.)


5) Comment (1), if valid, also affects the pre-parser logic (used to find the
encoding).


6) In "Determining the character encoding" (section 8.2.2.1 right now), under
step 5 (the algorithm to find the encoding from the HTML content):
Under sub-step 1, case '<meta', point 12 currently says:
"If mode is true but got pragma is false, then jump to the second step of the
overall "two step" algorithm."
Here, 'mode' is undefined as far as I can see. I assume it is supposed to be
'need pragma'?

6.1) In point 13 of the same snippet from (6) above, we have:
"If charset is a UTF-16 encoding, change the value of charset to UTF-8."
What if it is explicitly set to utf-16LE or utf-16BE? Should it be changed too,
or only for 'utf-16'?
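
The two possible readings of point 13 can be sketched as follows. The function
name and the `include_le_be` flag are hypothetical, introduced only to show
where the ambiguity sits. Which set of labels counts as "a UTF-16 encoding" is
exactly the open question.

```python
def resolve_sniffed_charset(charset, include_le_be=True):
    """Apply the 'UTF-16 becomes UTF-8' override to a sniffed charset label.

    include_le_be is a hypothetical switch between the two readings:
    True treats utf-16le/utf-16be as UTF-16 encodings too, False treats
    only the bare 'utf-16' label that way.
    """
    name = charset.strip().lower()
    utf16_labels = {"utf-16"}
    if include_le_be:
        utf16_labels |= {"utf-16le", "utf-16be"}
    return "utf-8" if name in utf16_labels else name
```

Under the broad reading, `resolve_sniffed_charset("utf-16LE")` gives "utf-8".
Under the narrow one (`include_le_be=False`), it stays "utf-16le".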


7) In "get an attribute" (#concept-get-attributes-when-sniffing: section
8.2.2.1, algorithm in main step 5): currently a value can end on whitespace or
'>'. What about '/'? Currently, the '/' will get added to the value. This
applies in two places in that algorithm: step 10 and step 11.
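
A minimal sketch of the unquoted-value case in (7), assuming the value ends
only on whitespace or '>' as described above. The function name is
hypothetical, and quoted values and case folding are omitted.

```python
WHITESPACE = b"\t\n\x0c\r "


def read_unquoted_value(data, pos):
    """Return (value, new_pos) for an unquoted attribute value at pos.

    Simplified model of the rule in point (7): the value stops only at
    whitespace or '>', so a trailing '/' is kept as part of the value.
    """
    value = bytearray()
    while pos < len(data):
        byte = data[pos:pos + 1]
        if byte in WHITESPACE or byte == b">":
            break
        value += byte
        pos += 1
    return bytes(value), pos


markup = b"<meta charset=utf-8/>"
value, end = read_unquoted_value(markup, markup.index(b"=") + 1)
```

Here `value` comes out as `b"utf-8/"`, with the '/' included, which is the
behaviour I am asking about.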


Thanks,
Mridul


Received on Saturday, 30 April 2011 13:53:02 UTC