Parsing HTML from Alex Milowski on 2007-05-16 (public-xml-processing-model-wg@w3.org from May 2007)

From: Alex Milowski <alex@milowski.org>
Date: Wed, 16 May 2007 07:55:48 -0700
To: "XProc WG" <public-xml-processing-model-wg@w3.org>
Message-ID: <28d56ece0705160755r22fcc360s8a99c76b0c170fb9@mail.gmail.com>

There are at least two steps that may need to deal with HTML content:

   * p:unescape-markup for things like the RSS description element
   * p:http-request for response messages that are HTML

The obvious issue with HTML is malformed content.  There are good
solutions to that in the form of tools such as tidy and tagsoup.
Some of them (e.g. tagsoup) turn HTML into XHTML markup.

I think we will want to have the ability to handle HTML markup and
convert it into XHTML.  Unfortunately, there isn't a well-defined
"what to do with malformed HTML" process out there.

I think we can assume the following:

  * There is a well-defined outcome for taking in HTML and outputting
     XHTML.  HTML5 will codify this into a specification at some point.

  * Many HTML parsers have a "tidy mode" that will handle malformed
     HTML.

  * XHTML is an XML content type and there is no such thing as
     XHTML that isn't well-formed.

That last point is critical.  Firefox and other XHTML-compliant browsers
won't display XHTML that isn't well-formed.  Atom won't allow XHTML
in text constructs without it being well-formed.  So, we don't have
to worry about XHTML because a regular XML parser will work just
fine.
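
As a rough illustration of that point, here is a minimal sketch in Python
using only the standard library's XML parser.  The function name and the
two sample strings are made up for this example; nothing here is proposed
spec text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an ordinary XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Sample inputs (invented for illustration).
xhtml = ('<html xmlns="http://www.w3.org/1999/xhtml">'
         '<body><p>Hello<br/></p></body></html>')
tag_soup = "<html><body><p>Hello<br></body></html>"

print(is_well_formed(xhtml))     # accepted by a plain XML parser
print(is_well_formed(tag_soup))  # rejected: <br> and <p> are never closed
```

Anything that fails this check is simply not XHTML, so only content
labelled as HTML needs special treatment.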

I suggest we use the media type (e.g. text/html) to trigger handling
of HTML.  That allows us to have similar behaviors for other content
types that must be unmarshalled into XML to be processed by an
XML pipeline.  Many of these would be implementation-defined, but you
could expect others to become more standard in the future.

I think we could say that for text/html content types we apply a
conversion process to XHTML.  This conversion process is allowed to
handle malformed HTML in a non-standard and non-interoperable
way to create some kind of well-formed XHTML.
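
To make the idea concrete, here is a small Python sketch of media-type
triggered unmarshalling.  It is only one possible (non-interoperable)
repair strategy built on the standard library's html.parser; the names
unmarshal, html_to_xhtml and the wrapping div element are assumptions of
this example, and it ignores details such as invalid attribute names,
comments and character encodings.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
# HTML elements that never have content or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


def _q(tag):
    """Qualify a tag name with the XHTML namespace."""
    return "{%s}%s" % (XHTML_NS, tag)


class _Tidy(HTMLParser):
    """Builds a well-formed tree from possibly malformed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element(_q("div"))  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") get their name as value.
        attrib = {k: (k if v is None else v) for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], _q(tag), attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Find the nearest open element with this name (never the wrapper).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == _q(tag):
                del self.stack[i:]  # also closes anything left open inside
                return
        # No match: a stray end tag, silently dropped.

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text after a child goes into that child's tail
            current[-1].tail = (current[-1].tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xhtml(html):
    """Convert HTML text into a wrapper element holding XHTML content."""
    parser = _Tidy()
    parser.feed(html)
    parser.close()  # flushes any buffered trailing text
    return parser.root


def unmarshal(content, media_type):
    """Pick a parser based on the media type, ignoring parameters."""
    base = media_type.split(";", 1)[0].strip().lower()
    if base == "text/html":
        return html_to_xhtml(content)
    if base in ("application/xml", "text/xml") or base.endswith("+xml"):
        return ET.fromstring(content)  # XHTML and other XML: plain parser
    raise ValueError("no unmarshaller for media type %r" % media_type)
```

For example, unmarshal("<p>Hello<br>world", "text/html; charset=utf-8")
yields a tree that serializes as well-formed XML, while
application/xhtml+xml content goes straight to the ordinary XML parser.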

Here are a couple of questions to see what our preferences are:

   1. Do we need a "strict" option that forbids conversion of malformed HTML?

   2. Should we have a more general "HTML" section for the steps
        that details how text/html is handled and then specific steps could
        point to that for particular contexts?

In the end, the idea would be that each step that needs to handle HTML
would have some way of:

   * invoking HTML->XHTML conversion
   * having options for controlling strict processing, etc.

We would try to keep these options as consistent as possible.
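
To show what a "strict" option might mean in practice, here is a
hypothetical Python sketch.  It defines "malformed" very narrowly as
"an end tag was missing or had no matching start tag", which is my
simplification for the example (HTML itself permits some omitted end
tags); check_html, MalformedHTMLError and the strict parameter are
invented names, not proposed option names.

```python
from html.parser import HTMLParser
from typing import List

# HTML elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class MalformedHTMLError(ValueError):
    """Raised in strict mode when the input would need repairs."""


class _RepairFinder(HTMLParser):
    """Records every repair a tidy-style converter would have to make."""

    def __init__(self):
        super().__init__()
        self.open = []     # names of elements that are currently open
        self.repairs = []  # human-readable descriptions of repairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closed element: nothing is left open

    def handle_endtag(self, tag):
        if tag not in self.open:
            self.repairs.append("stray end tag </%s>" % tag)
            return
        while self.open[-1] != tag:
            self.repairs.append("unclosed <%s>" % self.open.pop())
        self.open.pop()


def check_html(html: str, strict: bool = False) -> List[str]:
    """Return the list of needed repairs; in strict mode, raise instead."""
    finder = _RepairFinder()
    finder.feed(html)
    finder.close()
    repairs = finder.repairs + [
        "unclosed <%s>" % tag for tag in reversed(finder.open)
    ]
    if strict and repairs:
        raise MalformedHTMLError("; ".join(repairs))
    return repairs
```

With strict left off, a step could log the returned repairs and carry on
with conversion; with strict=True the same input would fail the step.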

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Wednesday, 16 May 2007 14:55:59 UTC