Re: Parsing HTML

/ Alex Milowski <alex@milowski.org> was heard to say:
| On 5/23/07, Norman Walsh <ndw@nwalsh.com> wrote:
|
|> Indeed. The more places we need it, the more I feel we should keep it
|> dead simple. I'm now feeling more strongly in favor of just having a
|> boolean option to do the cleanup. If we need more control in V2, we can
|> add new options.
|
| I want to be clear that I'm advocating cleanup in the case of HTML documents
| or chunks of HTML documents.  Anything that is an XML media type should
| be considered XML and parsed as such.   There must be no "cleanup" of
| XML.

I understood that.

| We could specialize an option such as:
|
|   * "parse-as-html"  with a value of "yes" or "no"
|
| Since HTML is often malformed and has to be converted to XHTML, and since
| many HTML->XHTML parsers (tidy & tagsoup) do cleanup as well, we could
| roll that functionality into one option.

What constitutes "html" in unescaped markup? Is this HTML:

  <p:unescape-markup>
    <p:input port="source">
      <p:inline>
        <foo>
          &lt;book&gt;&lt;title&gt;Book title&lt;/book&gt;
        </foo>
      </p:inline>
    </p:input>
  </p:unescape-markup>

I think the "cleanup" option has to operate on whatever it gets
without attempting to determine if the thing it got was or was not
HTML.
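
As a rough sketch only: assuming a hypothetical boolean option named
"cleanup" (neither the name nor the way it is set below is settled
anywhere), the example above would become something like:

  <p:unescape-markup>
    <p:input port="source">
      <p:inline>
        <foo>
          &lt;book&gt;&lt;title&gt;Book title&lt;/book&gt;
        </foo>
      </p:inline>
    </p:input>
    <!-- hypothetical option; name and syntax are assumptions -->
    <p:option name="cleanup" value="yes"/>
  </p:unescape-markup>

With the option on, the step would run its repair over the unescaped
text as given, whether or not that text looks like HTML.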

| If you don't support HTML parsing, you get a dynamic error.
|
| If you don't support malformed HTML handling, you get a dynamic error.

If support for handling malformed HTML is optional, then I want the
entire step to be optional. I don't want required steps that are not
interoperable.

| In some cases (e.g. p:load and p:http-request), you may get a media type
| from the resource that tells you that it is HTML.  As such, we could say that
| if you support HTML parsing, you should do that.  Otherwise, you get the
| same result for non-XML media types.  As such, we could get away
| with no "parse-as-html" option on those steps.

For file: URIs, I don't think p:load can be relied upon to give you a
media type.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <ndw@nwalsh.com> | If you settle for what they're giving
http://nwalsh.com/            | you, you deserve what you get.

Received on Wednesday, 23 May 2007 14:13:14 UTC