Re: HTML Parsing Step?

On 7/3/07, Norman Walsh wrote:
> / Alex Milowski was heard to say:
> | At the end of our e-mail discussion in May I suggested we have a separate
> | step for parsing HTML.  I still think this is a good idea.  Anyone else?
> So this is the equivalent of "tidy" not the equivalent of "tagsoup",
> right?
> I guess I'm ok with this, but I wonder if we'll need a
> vocabulary-agnostic cleanup step too. Maybe not.
> I guess the next step is to propose a specific step with a description
> and the options you think it needs.

In proofing the steps, we already have this for p:unescape-markup:

"If the 'content-type' option is specified, an implementation can use
a different parser to produce XML content. Such a behavior is
implementation defined. For example, for the mime type 'text/html', an
implementation might provide an HTML to XHTML parser (e.g. Tidy)."

That means if you want to parse HTML into XHTML, you just set the 'content-type'
option to 'text/html' on a p:unescape-markup and hope for the best.
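
As a rough sketch of what that looks like in a pipeline (assuming the option
can be written as an attribute on the step and that the "p" prefix is bound to
the XProc namespace by the enclosing pipeline; the exact syntax in the draft
may differ):

```xml
<!-- Illustrative fragment only: asks the implementation to parse the
     escaped text as HTML. Whether an HTML parser is actually used is
     implementation defined. -->
<p:unescape-markup content-type="text/html"/>
```

Whether you get XHTML back, or an error, depends entirely on the
implementation.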

With this as the status quo we could:

   1. Remove the 'content-type' option and create a new step type.

   2. Specify some kind of text/html processing for p:unescape-markup.

I don't think we want to do both.

--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Tuesday, 3 July 2007 19:24:54 UTC