- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 14 Aug 2005 12:50:49 +0300
On Jul 30, 2005, at 00:17, Ian Hickson wrote:

> On Fri, 29 Jul 2005, Henri Sivonen wrote:
>>
>> I would like to add HTML (both 4 and 5) support to
>> http://hsivonen.iki.fi/validator/ .
>
> Great!

I now have an initial version of a (mostly) Draconian HTML5 parser. It does not do tag inference yet, so documents have to be fully tagged. The parser does not attempt to convert BASE into xml:base.

.class and .java files are available:
http://hsivonen.iki.fi/validator-about/htmlparser.jar

The class with main() is fi.iki.hsivonen.htmlparser.test.HtmlParserTestDriver. The test driver requires GNU JAXP in the classpath. The parser itself does not depend on GNU JAXP. The test driver takes files whose names end with ".html" as arguments and *overwrites* the corresponding ".xhtml" files with the conversion result.

And to the spec-related point: I made the following decisions while implementing. Hopefully the document conformance requirements will agree. :-)

& must start an NCR or an entity reference, as in XML. (Rationale: A lone & is likely a mistake anyway.)

&apos; is not considered conforming. (Rationale: It did not exist in HTML4 and is not supported by IE.)

Entity references and NCRs have to be terminated explicitly with a semicolon. (Rationale: Implicit termination is likely a mistake unless the person who wrote the reference is an SGML pedant. Requiring the semicolon makes things unambiguous for sure. Also, having an explicit delimiter helps in avoiding lookahead/pushback in the parser.)

Astral non-characters are not banned. (They are not banned in XML 1.0, either.)

Unescaped < and > in attributes are allowed without warning, despite folklore that warns about this breaking unspecified legacy UAs.

Unquoted attribute values must be of the form [a-zA-Z][a-zA-Z0-9-]*, which is slightly restrictive in a semi-arbitrary way for implementation convenience.

The elements script and style are treated as CDATA. The string "</" may only occur as part of the end tag.
(Rationale: This approach is compatible with both SGML and the way browsers work. Also, this avoids lookahead/lookback.)

PIs are banned, as are marked sections.

Doctypes with the SYSTEM id only are banned. The internal subset is banned.

The HTML5 doctype passes silently. The HTML 4.01 Strict and Transitional doctypes cause a warning about the HTML5-centric nature of the parser. Doctypes whose public id starts with "-//W3C//DTD XHTML " are banned with a special message. Other doctypes are treated as errors, as is the lack of a doctype.

The lack of a system id in the HTML 4.01 Transitional doctype is treated as an error. The lack of a system id in the HTML 4.01 Strict doctype causes a warning even though the spec says "must" and gives a doctype with a system id. Failure to use the canonical system ids causes warnings even though the "must" in HTML 4.01 could be interpreted as banning these.

The internal character encoding information is not passed to the application as content, for consistency with the XML declaration, which is not exposed through the SAX2 ContentHandler.

The BOM is sniffed. The lack of character encoding information (including the BOM) is treated as a fatal error.

>> Assuming that the supported syntax for HTML4 is constrained to exclude
>> minimizations that don't work in browsers, the biggest issue with
>> decoupling the parser from the HTML version seems to be the doctype.
>
> Makes sense. I would recommend treating the following syntax,
> case-insensitive, as being conformant:
>
>    doctype ::= "<!" "doctype" whitespace+ "html" whitespace* ">"

Thanks.

> But I haven't thought much about this yet. The way parsing is to be
> defined I expect to just say "parsers should do this, and if they hit
> this they should do this, and if they hit this it's an error and they
> should do this", with conformance checkers having to do the same but
> reporting the errors. If that makes sense.
My parser is (almost) Draconian, so I don't intend to implement the elaborate error recovery that is needed for browsers. (I have no interest in competing with John Cowan's TagSoup.)

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
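
Three of the lexical rules listed in this message (the unquoted attribute value form, the recommended HTML5 doctype syntax, and semicolon-terminated references) can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not taken from htmlparser.jar. The exact whitespace set and the acceptance of an uppercase X in hexadecimal NCRs are assumptions, since the message does not spell them out.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of htmlparser.jar. */
final class LexicalChecks {

    // Unquoted attribute values must have the form [a-zA-Z][a-zA-Z0-9-]*.
    private static final Pattern UNQUOTED_VALUE =
            Pattern.compile("[a-zA-Z][a-zA-Z0-9-]*");

    // doctype ::= "<!" "doctype" whitespace+ "html" whitespace* ">",
    // matched case-insensitively.
    // Assumption: "whitespace" means space, tab, LF, FF, or CR.
    private static final Pattern HTML5_DOCTYPE = Pattern.compile(
            "<!doctype[ \\t\\n\\f\\r]+html[ \\t\\n\\f\\r]*>",
            Pattern.CASE_INSENSITIVE);

    // Finds an '&' that is NOT followed by a semicolon-terminated NCR
    // or entity name.
    // Assumption: both "&#x" and "&#X" introduce a hexadecimal NCR.
    private static final Pattern BAD_AMPERSAND = Pattern.compile(
            "&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)");

    private LexicalChecks() {}

    /** True if the whole value matches the unquoted attribute value form. */
    static boolean isValidUnquotedAttributeValue(String value) {
        return UNQUOTED_VALUE.matcher(value).matches();
    }

    /** True if the whole string is a doctype in the recommended syntax. */
    static boolean isHtml5Doctype(String doctype) {
        return HTML5_DOCTYPE.matcher(doctype).matches();
    }

    /** True if the text contains a lone '&' or an unterminated reference. */
    static boolean hasBadAmpersand(String text) {
        return BAD_AMPERSAND.matcher(text).find();
    }
}
```

With this sketch, hasBadAmpersand("AT&T") reports a violation while hasBadAmpersand("AT&amp;T") does not. The reference check does not verify that an entity name is actually defined, does not reject &apos;, and is meant only for text outside script and style, which are treated as CDATA.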
Received on Sunday, 14 August 2005 02:50:49 UTC