Re: Parsing HTML

On 5/23/07, Norman Walsh <ndw@nwalsh.com> wrote:
> / Alex Milowski <alex@milowski.org> was heard to say:
> | On 5/23/07, Norman Walsh <ndw@nwalsh.com> wrote:
> |
> |> Indeed. The more places we need it, the more I feel we should keep it
> |> dead simple. I'm now feeling more strongly in favor of just having a
> |> boolean option to do the cleanup. If we need more control in V2, we can
> |> add new options.
> |
> | I want to be clear that I'm advocating cleanup in the case of HTML documents
> | or chunks of HTML documents.  Anything that is an XML media type should
> | be considered XML and parsed as such.   There must be no "cleanup" of
> | XML.
>
> I understood that.
>
> | We could specialize an option such as:
> |
> |   * "parse-as-html"  with a value of "yes" and "no"
> |
> | since HTML is often malformed and we need to convert it to XHTML and there
> | are many HTML->XHTML parsers that do cleanup as well (tidy & tagsoup), we
> | could roll that functionality into one option.
>
> What constitutes "html" in unescaped markup? Is this HTML:
>
>   <p:unescape-markup>
>     <p:input port="source">
>       <p:inline>
>         <foo>
>           &lt;book&gt;&lt;title&gt;Book title&lt;/book&gt;
>         </foo>
>       </p:inline>
>     </p:input>
>   </p:unescape-markup>
>
> I think the "cleanup" option has to operate on whatever it gets
> without attempting to determine if the thing it got was or was not
> HTML.

Sure.
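
To make the example above concrete, here is a rough sketch in Python (not XProc, and not anything from the draft). It unescapes the text from the <foo> example and tries to parse it. The names `TagBalancer` and `unescape_markup`, and the `cleanup` flag, are hypothetical, and the tag balancer is only a toy stand-in for what tidy or tagsoup would do. The cleanup never asks whether its input is HTML; it just repairs whatever it is given.

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr


class TagBalancer(HTMLParser):
    """Toy cleanup (hypothetical): closes any elements left open."""

    def __init__(self):
        super().__init__()
        self.out = []     # pieces of the repaired markup
        self.stack = []   # names of currently open elements

    def handle_starttag(self, tag, attrs):
        rendered = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def result(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def unescape_markup(text, cleanup=False):
    """Unescape text into markup and parse it, optionally cleaning it first."""
    markup = html.unescape(text.strip())
    if cleanup:
        balancer = TagBalancer()
        balancer.feed(markup)
        markup = balancer.result()
    # Without cleanup this is a strict XML parse and may raise ParseError.
    return ET.fromstring(markup)


escaped = "&lt;book&gt;&lt;title&gt;Book title&lt;/book&gt;"
root = unescape_markup(escaped, cleanup=True)
```

With `cleanup=False` the same input fails the strict parse, because the unescaped text leaves <title> unclosed.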

> | If you don't support HTML parsing, you get a dynamic error.
> |
> | If you don't support malformed HTML handling, you get a dynamic error.
>
> If supporting handling malformed HTML is optional then I want the
> entire step to be optional. I don't want non-interoperable required
> steps.

We already have this issue for p:load and the 'validate' option.  If you don't
support validation, you get a dynamic error.  Given your preference, we
should change that as well.

Maybe we should separate this as its own set of optional steps for
handling HTML and not do this in our core components.

There is already a way to handle the text/html response from
p:http-request.

The result's p:body can then be processed with the "parse-html"
step into XHTML.

> | In some cases (e.g. p:load and p:http-request), you may get a media type
> | from the resource that tells you that it is HTML.  As such, we could say that
> | if you support HTML parsing, you should do that.  Otherwise, you get the
> | same result for non-XML media types.  As such, we could get away
> | with no "parse-as-html" option on those steps.
>
> For file: URIs, I don't think p:load can be relied upon to give you a
> media type.

Right.  Maybe p:load just fails if it isn't XML and you should use
the p:http-request + "parse-html" step sequence instead.
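
As a rough illustration of the media-type rule discussed above, again in Python rather than XProc: XML media types get a strict parse with no cleanup, text/html is handed to an HTML parser only if one is supported, and everything else fails. `DynamicError`, `parse_body`, and the `html_parser` argument are hypothetical names for this sketch, not anything defined in the draft.

```python
import xml.etree.ElementTree as ET


class DynamicError(Exception):
    """Stand-in (hypothetical) for a pipeline dynamic error."""


def base_type(media_type):
    """Strip parameters such as charset and normalize case."""
    return media_type.split(";", 1)[0].strip().lower()


def is_xml(media_type):
    base = base_type(media_type)
    return base in ("text/xml", "application/xml") or base.endswith("+xml")


def parse_body(body, media_type, html_parser=None):
    """Parse a response body according to its media type."""
    if is_xml(media_type):
        # XML is parsed as XML: no cleanup, malformed input is an error.
        return ET.fromstring(body)
    if base_type(media_type) == "text/html" and html_parser is not None:
        # HTML parsing is optional; use it only when it is supported.
        return html_parser(body)
    # Without HTML support, text/html is treated like any non-XML type.
    raise DynamicError(f"cannot parse {media_type!r} as XML")
```

A file: URI with no media type would never reach the text/html branch, which is why the separate parse step looks simpler than teaching p:load about HTML.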


-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Wednesday, 23 May 2007 14:56:07 UTC