Re: HTML Parsing Step?

On 7/3/07, Norman Walsh <ndw@nwalsh.com> wrote:
> / Alex Milowski <alex@milowski.org> was heard to say:
> [...]
> | With this as the status quo we could:
> |
> |   1. Remove the 'content-type' option and create a new step type.
> |
> |   2. Specify some kind of text/html processing for p:unescape-markup.
> |
> | I don't think we want to do both.
>
> Ok. So, what's the right answer? :-)

In thinking about this, since the p:http-request step will produce
escaped HTML when text/html is returned, I think the right thing
to do is to keep the content-type option on p:unescape-markup.

What I'd like to do is make the option's behavior a little less
arbitrary, as follows:

* If the content type is 'text/html', the output of unescaping and
  parsing is XHTML, produced by some process like tidy or tagsoup.

* Unfortunately, there is no standard outcome, so the *exact* XHTML
  conversion is implementation defined.  You'd just be guaranteed that
  the result, if any, comes out in the XHTML namespace.

* The handling of all other content types would be implementation defined.

* A dynamic error is raised when a step doesn't support the specified
  content type.
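
To make the rules above concrete, here is a rough Python sketch of the
dispatch I have in mind. It is not spec text and not how any XProc
processor works: the names unescape_markup, XhtmlBuilder and
UnsupportedContentType are made up for illustration, Python's html.parser
stands in for tidy or tagsoup, and support for 'application/xml' is just
one implementation-defined choice. It also assumes the input has a single
root element.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Elements that never take an end tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class UnsupportedContentType(Exception):
    """Hypothetical stand-in for the dynamic error."""


class XhtmlBuilder(HTMLParser):
    """Very rough tag-soup repair: every element lands in the XHTML namespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tree = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names; valueless attributes arrive as None.
        qname = "{%s}%s" % (XHTML_NS, tag)
        self.tree.start(qname, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.tree.end(qname)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Close any elements left open inside the one being ended.
        while True:
            top = self.open_tags.pop()
            self.tree.end("{%s}%s" % (XHTML_NS, top))
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.tree.data(data)

    def finish(self):
        self.close()  # flush any buffered trailing text
        while self.open_tags:
            self.tree.end("{%s}%s" % (XHTML_NS, self.open_tags.pop()))
        return self.tree.close()


def unescape_markup(text, content_type):
    """Parse already-unescaped markup according to its content type."""
    base = content_type.split(";", 1)[0].strip().lower()
    if base == "text/html":
        builder = XhtmlBuilder()
        builder.feed(text)
        return builder.finish()
    if base == "application/xml":  # implementation-defined choice
        return ET.fromstring(text)
    raise UnsupportedContentType(content_type)
```

Calling unescape_markup("<p>Hello<br>world", "text/html") gives back a
p element in the XHTML namespace with a br child, even though neither tag
was closed in the input. A different repair tool could legitimately build
a different tree from the same input, which is exactly why the conversion
has to be implementation defined. Passing "text/plain" raises the error.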

In the future, if and when HTML5 defines an outcome for HTML to XHTML
conversion (hopefully it will), this feature could have an expected outcome.

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Tuesday, 3 July 2007 22:36:05 UTC