Re: HTML Parsing Step?

On 7/3/07, Norman Walsh <ndw@nwalsh.com> wrote:
> / Alex Milowski <alex@milowski.org> was heard to say:
> | On 7/3/07, Norman Walsh <ndw@nwalsh.com> wrote:
> |> / Alex Milowski <alex@milowski.org> was heard to say:
> |> | At the end of our e-mail discussion in May I suggested we have a separate
> |> | step for parsing HTML.  I still think this is a good idea.  Anyone else?
> |>
> |> So this is the equivalent of "tidy" not the equivalent of "tagsoup",
> |> right?
> |
> | I don't understand this question.
> |
> | Tidy and Tagsoup cleanup HTML.
>
> You're right. Brain cramp. I was thinking that tidy had knowledge of
> the HTML vocabulary (that img and hr are empty, for example) whereas
> tagsoup just cleaned up not-well-formed XML. But that's not the case.
> So nevermind.

To be clear, I think we want this step to take in HTML and output XHTML.

Tidy can clean up HTML and produce HTML.

Tagsoup can only produce XHTML.
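
For what it's worth, here is a rough sketch of the "HTML in, XHTML out" behaviour I have in mind. It is not Tidy or Tagsoup and not a proposal for the step's definition; it only uses Python's standard html.parser, and the names XhtmlSerializer, html_to_xhtml and VOID are made up for illustration. It assumes that closing unclosed elements, emptying elements such as img and hr, quoting attributes and escaping text is enough for the example; it does not add the XHTML namespace, keep comments or doctypes, or check that attribute names are valid XML names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that HTML defines as empty (assumed list, not exhaustive).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class XhtmlSerializer(HTMLParser):
    """Hypothetical helper: re-serializes HTML events as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open, non-empty elements

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        seen = {}
        for name, value in attrs:
            # Keep the first occurrence; expand minimized attributes (checked -> checked="checked").
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore end tags for empty elements and stray end tags.
        if tag in VOID or tag not in self.stack:
            return
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Character references were already decoded; re-escape &, < and >.
        self.out.append(escape(data))


def html_to_xhtml(html):
    parser = XhtmlSerializer()
    parser.feed(html)
    parser.close()
    # Close whatever the input left open.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    result = html_to_xhtml('<p>Hi<br><img src=a.png alt="x">')
    ET.fromstring(result)  # raises ParseError if the result is not well-formed
    print(result)
```

The point of the sketch is only that the output of such a step can always be handed to an XML parser, whatever the input looked like.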


-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Tuesday, 3 July 2007 15:40:50 UTC