- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 10 Mar 2006 21:00:21 +0000 (UTC)
On Sun, 14 Aug 2005, Henri Sivonen wrote:
>
> & must start an NCR or an entity reference as in XML. (Rationale: Lone &
> likely a mistake anyway.)

Agreed.

> &apos; is not considered conforming. (Rationale: Did not exist in HTML4 and is
> not supported by IE)

Disagreed. Consistency with XML seems like a very good thing here.

I've also added AMP, COPY, LT, GT, QUOT and REG for compatibility, and
made them conformant. It seems like those would be useful in all-caps
text.

> Entity references and NCRs have to be terminated explicitly with a
> semicolon. (Rationale: Implicit termination is likely a mistake unless
> the person who wrote the reference is an SGML pedant. Requiring the
> semicolon makes things unambiguous for sure. Also, having an explicit
> delimiter helps in avoiding lookahead/pushback in the parser.)

Agreed.

> Astral non-characters are not banned. (They are not banned in XML 1.0,
> either.)

The only characters that get dropped in the spec are U+0000 and U+000D
(the latter having special processing converting some of them to U+000A).
So I agree, I guess, unless I misunderstood your comment.

> Unescaped < and > in attributes are allowed without warning despite
> folklore that warns about this breaking unspecified legacy UAs.

Agreed.

> Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*,
> which is slightly restrictive in a semi-arbitrary way for implementation
> convenience.

Disagreed. Unquoted attribute value syntax is pretty lax in the spec...
also for implementation convenience. :-)

> The elements script and style are treated as CDATA. The string "</" may
> only occur as part of the end tag. (Rationale: This approach is both
> compatible with SGML and the way browsers work. Also, this avoids
> lookahead/lookback.)

Agreed.

> PIs are banned. As are marked sections.

Agreed. They both end up forming bogus comments.

> Doctypes with the SYSTEM id only are banned.
> The internal subset is banned.
> The HTML5 doctype passes silently.
> The HTML 4.01 Strict and Transitional doctypes cause a warning about the
> HTML5-centric nature of the parser.
> Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned with a
> special message.
> Other doctypes are treated as errors as is the lack of a doctype.
> The lack of a system id in the HTML 4.01 Transitional doctype is treated as an
> error.
> The lack of a system id in the HTML 4.01 Strict doctype causes a warning even
> though the spec says "must" and gives a doctype with a system id.
> Failure to use the canonical system ids causes warnings even though the "must"
> in HTML 4.01 could be interpreted as banning these.

DOCTYPEs other than <!DOCTYPE HTML> (case-insensitive) all cause parse
errors, and may trigger quirks mode.

> The internal character encoding information is not passed to the
> application as content for consistency with the XML declaration, which
> is not exposed through the SAX2 ContentHandler.

Nothing special is done for this.

> The BOM is sniffed.
> The lack of character encoding information (including the BOM) is treated as a
> fatal error.

This part of the spec needs work.

> > But I haven't thought much about this yet. The way parsing is to be
> > defined I expect to just say "parsers should do this, and if they hit
> > this they should do this, and if they hit this it's an error and they
> > should do this", with conformance checkers having to do the same but
> > reporting the errors. If that makes sense.
>
> My parser is (almost) Draconian, so I don't intend to implement the
> elaborate error recovery that is needed for browsers. (I have no
> interest in competing with John Cowan's TagSoup.)

The spec explains how to recover from parse errors, but doesn't require
conformance checkers to recover.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
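
Illustration (not part of the original message): two of the rules stated in the replies above, namely which characters get dropped from the input and which DOCTYPEs avoid a parse error, can be sketched roughly as follows. This is only a sketch under stated assumptions. The function names are hypothetical and come from neither the spec nor either parser; the reading of "some of them" as "CRLF and lone CR both become LF" is an assumption; and tolerating extra whitespace inside the DOCTYPE is likewise an assumption.

```python
import re

# Hypothetical helper names, chosen for illustration only.


def normalize_input(text: str) -> str:
    """Drop U+0000 and fold carriage returns into U+000A.

    Assumption: CRLF collapses to a single LF and a lone CR becomes LF.
    The message only says that "some" U+000D characters are converted.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\x00", "")


# Assumption: whitespace after "DOCTYPE" and before ">" is tolerated.
_HTML5_DOCTYPE = re.compile(r"<!DOCTYPE\s+HTML\s*>", re.IGNORECASE)


def doctype_is_parse_error(doctype: str) -> bool:
    """Return True for any DOCTYPE other than <!DOCTYPE HTML>,
    compared case-insensitively as described in the message."""
    return _HTML5_DOCTYPE.fullmatch(doctype.strip()) is None


if __name__ == "__main__":
    print(repr(normalize_input("a\r\nb\rc\x00d")))
    print(doctype_is_parse_error("<!doctype html>"))
    print(doctype_is_parse_error(
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">'))
```

A conformance checker built along these lines would report the parse error and could stop there, while a browser-style parser would continue and possibly switch to quirks mode.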
Received on Friday, 10 March 2006 13:00:21 UTC