Re: Parsing HTML

/ Alex Milowski <alex@milowski.org> was heard to say:
| There are at least two steps that may need to deal with HTML content:
|
|   * p:unescape-markup for things like the RSS description element

Yeah, fine. You all know how I feel about escaped markup.

|   * p:http-request for response messages that are HTML

Ugh. I hadn't considered this case. Do we have use cases that require
support for non-well-formed messages returned by p:http-request?

I fear if we support it here, then demand for supporting it on p:load
can't be far behind.

| The obvious issue with HTML is the problem of malformed HTML
| content.  There are good solutions to that in the form of tidy, tagsoup,
| and the like.  Some of them (e.g. tagsoup) turn HTML into XHTML
| markup.
|
| I think we will want to have the ability to handle HTML markup and
| convert it into XHTML.  Unfortunately, there isn't a well-defined
| "what to do with malformed HTML" process out there.
|
| I think we can assume the following:
|
|  * There is a well-defined outcome for taking in HTML and outputting
|     XHTML.  HTML5 will codify this into a specification at some point.

I don't think we can make that assumption. If we have to deal with
not-well-formed markup, I would prefer to say that the cleanup is
"implementation dependent". I don't care if not-well-formed crud is not
interoperable.

|  * Many HTML parsers have a "tidy mode" that will handle malformed
|     HTML.

I don't have, nor do I want to get, an "HTML parser". Nothing about
XProc suggests that I should need such a beast.

|  * XHTML is an XML content type and there is no such thing as
|     XHTML that isn't well-formed.

Right.

| I suggested we use the media type (e.g. text/html) to trigger handling
| of HTML.  That allows us to have similar behaviors for other content
| types that must be unmarshalled into XML to be processed by an
| XML pipeline.  Many of these would be implementation defined but you
| could expect others in the future to be more standard.

I don't feel strongly about whether we use a media type mechanism or
a simple boolean flag.
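
For comparison, a minimal sketch of the media-type approach, written in
Python rather than XProc. Everything here is an assumption made for
illustration: the CONVERTERS registry, register(), and unmarshal() are
hypothetical names, not part of any XProc draft, and the text/html
converter is left for the implementation to supply.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry: media type -> function that returns well-formed
# XML text. XML media types pass through unchanged.
CONVERTERS = {
    "application/xml": lambda text: text,
    "application/xhtml+xml": lambda text: text,
}


def register(media_type, converter):
    """Add an implementation-defined converter for one media type."""
    CONVERTERS[media_type.lower()] = converter


def unmarshal(text, content_type):
    """Turn a message body into an XML element, keyed on its media type."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        converter = CONVERTERS[media_type]
    except KeyError:
        raise ValueError(f"no converter for {media_type}") from None
    # Whatever the converter returns must still parse as XML.
    return ET.fromstring(converter(text))
```

A boolean flag would amount to the same lookup with exactly one entry,
text/html, switched on or off.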

| I think we could say that for text/html content types we apply a
| conversion process to XHTML.  This conversion process is allowed to
| handle malformed HTML in a non-standard and non-interoperable
| way to create some kind of well-formed XHTML.
|
| Here are a couple of questions to see what our preferences are:
|
|   1. Do we need a "strict" option that forbids conversion of malformed
|      HTML?

For unescaped markup, definitely not. If we wind up with something on
p:http-request to untangle NWF markup, I want that to be explicit. Not
explicitly asking should cause the step to fail if the data returned
isn't WF.
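
A minimal sketch of that behavior, in Python rather than XProc. The
function name parse_response and its cleanup parameter are hypothetical;
cleanup stands in for whatever implementation-dependent repair an
implementation chooses to provide.

```python
import xml.etree.ElementTree as ET


def parse_response(body, cleanup=None):
    """Parse a response body as XML, failing unless cleanup is requested.

    cleanup is a hypothetical, implementation-dependent callable that takes
    not-well-formed markup and returns well-formed markup.
    """
    try:
        return ET.fromstring(body)
    except ET.ParseError:
        if cleanup is None:
            # The caller did not explicitly ask for repair: the step fails.
            raise
        # Repair was requested; the result must itself be well-formed.
        return ET.fromstring(cleanup(body))
```

Called without cleanup, a not-well-formed body raises ET.ParseError.
Passing a cleanup callable is the explicit request.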

|   2. Should we have a more general "HTML" section for the steps
|      that details how text/html is handled and then specific steps could
|      point to that for particular contexts?

I suppose that will make sense if we wind up using it in more than one
place.

| In the end, the idea would be that each step that needs to handle HTML
| would have some way of:
|
|   * invoking HTML->XHTML conversion
|   * having options for controlling strict processing, etc.
|
| We would try to keep these options as consistent as possible.

Indeed. The more places we need it, the more I feel we should keep it
dead simple. I'm now feeling more strongly in favor of just having a
boolean option to do the cleanup. If we need more control in V2, we can
add new options.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | Impatient people always arrive too
http://nwalsh.com/            | late.--Jean Dutourd

Received on Wednesday, 23 May 2007 12:04:32 UTC