Re: Parsing HTML

On 5/23/07, Norman Walsh <ndw@nwalsh.com> wrote:

> Indeed. The more places we need it, the more I feel we should keep it
> dead simple. I'm now feeling more strongly in favor of just having a
> boolean option to do the cleanup. If we need more control in V2, we can
> add new options.

I want to be clear that I'm advocating cleanup in the case of HTML documents
or chunks of HTML documents.  Anything that is an XML media type should
be considered XML and parsed as such.   There must be no "cleanup" of
XML.

We could specialize an option such as:

   * "parse-as-html"  with a value of "yes" or "no"

HTML is often malformed and needs to be converted to XHTML, and many
HTML->XHTML parsers (tidy and tagsoup, for example) already do cleanup as
part of that conversion, so we could roll both pieces of functionality into
one option.

If you don't support HTML parsing, you get a dynamic error.

If you don't support malformed HTML handling, you get a dynamic error.
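
A rough sketch of how that single option might behave, in Python.  This is
only an illustration under stated assumptions: parse_document, DynamicError
and the html_parser callable are hypothetical names, not anything from the
draft.  html_parser stands in for whatever tidy- or tagsoup-like converter
an implementation happens to have.

```python
import xml.etree.ElementTree as ET


class DynamicError(Exception):
    """Hypothetical stand-in for a pipeline dynamic error."""


def parse_document(text, parse_as_html="no", html_parser=None):
    """Parse text as XML, or as HTML with cleanup when parse_as_html is "yes".

    html_parser is an assumed callable that turns (possibly malformed) HTML
    into well-formed XHTML text.  None means the implementation does not
    support HTML parsing at all.
    """
    if parse_as_html not in ("yes", "no"):
        raise DynamicError("parse-as-html must be 'yes' or 'no'")

    if parse_as_html == "no":
        # XML is parsed strictly; there is never any "cleanup" of XML.
        return ET.fromstring(text)

    if html_parser is None:
        # No HTML support in this implementation: dynamic error.
        raise DynamicError("HTML parsing is not supported")

    # One option covers both parsing and cleanup, so good and malformed
    # HTML go through the same path and are not distinguished.
    return ET.fromstring(html_parser(text))
```

An implementation with no HTML support would call parse_document(text, "yes")
with html_parser left as None and raise the error; one with support would
pass its converter in.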

The unfortunate bit about having one option is that, if you support parsing
malformed HTML, you can't tell the difference between malformed HTML and
good HTML.  Since you also can't tell the difference between the malformed
HTML that tidy supports and the malformed HTML that tagsoup supports, as they
handle different cases of "malformed-ness", I think that's the best we can do.

To support a "strict" option, an implementor would either need an SGML parser
or a very good HTML parser.  I'm not sure we want to go there.

In some cases (e.g. p:load and p:http-request), you may get a media type
from the resource that tells you that it is HTML.  We could say that if you
support HTML parsing, you should parse it as HTML in that case; if you
don't, you get the same result as for any other non-XML media type.  That
way, we could get away with no "parse-as-html" option on those steps.
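
The media type dispatch could be sketched roughly as follows, again in
Python.  The function names and the returned labels are hypothetical, and
the test for an "XML media type" (text/xml, application/xml, or a "+xml"
suffix) is my assumption about how one would recognize them.

```python
def classify_media_type(content_type):
    """Classify a Content-Type value as "xml", "html" or "other"."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/xml", "application/xml") or media_type.endswith("+xml"):
        return "xml"
    if media_type == "text/html":
        return "html"
    return "other"


def choose_handling(content_type, supports_html):
    """Pick how a step would treat a resource, given its media type."""
    kind = classify_media_type(content_type)
    if kind == "xml":
        # Any XML media type is parsed as XML, with no cleanup.
        return "parse-xml"
    if kind == "html" and supports_html:
        # The media type says HTML, so no explicit option is needed.
        return "parse-html"
    # Without HTML support, HTML is treated like any other non-XML type.
    return "non-xml"
```

Note that application/xhtml+xml lands on the XML path here, which matches
the rule that anything with an XML media type is parsed as XML.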

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Wednesday, 23 May 2007 13:59:53 UTC