Re: Parsing HTML

On 5/16/07, Alex Milowski <alex@milowski.org> wrote:
>
> There are at least two steps that may need to deal with HTML content:
>
>    * p:unescape-markup for things like the RSS description element
>    * p:http-request for response messages that are HTML
>
> The obvious issue with HTML is the problem of malformed HTML
> content.  There are good solutions to that in the form of tidy, tagsoup,
> and the like.  Some of them (e.g. tagsoup) turn HTML into XHTML
> markup.
>
> I think we will want to have the ability to handle HTML markup and
> convert it into XHTML.  Unfortunately, there isn't a well-defined
> "what to do with malformed HTML" process out there.

Isn't that the aim of HTML5?
http://www.whatwg.org/specs/web-apps/current-work/#parsing
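
To make the "malformed HTML in, well-formed markup out" step concrete, here is
a minimal sketch using only the Python standard library. It is NOT the HTML5
parsing algorithm and not what tidy or tagsoup do; it is a naive recovery
strategy (close what is left open, drop stray end tags) shown only to
illustrate the idea. The names NaiveXhtmlWriter and to_xhtml are hypothetical,
and the sketch assumes a single root element and attribute names that are
already legal XML names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have content; emitted as self-closing tags.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "param", "source", "track", "wbr"}


class NaiveXhtmlWriter(HTMLParser):
    """Hypothetical helper: re-serializes tag soup as balanced markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output fragments
        self.stack = []  # names of the currently open elements

    def handle_starttag(self, tag, attrs):
        seen = {}
        for name, value in attrs:
            # Keep the first occurrence of a duplicated attribute; a bare
            # attribute such as "checked" becomes checked="checked".
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID_ELEMENTS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag.
        while True:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text is re-escaped so "&" and "<" stay legal in XML.
        self.out.append(escape(data))


def to_xhtml(html):
    """Return a balanced serialization of the given HTML text."""
    writer = NaiveXhtmlWriter()
    writer.feed(html)
    writer.close()
    while writer.stack:  # close anything still open at end of input
        writer.out.append(f"</{writer.stack.pop()}>")
    return "".join(writer.out)
```

With this sketch, to_xhtml("<p>Hello <b>world<br></p>") returns
<p>Hello <b>world<br/></b></p>. A different tool could legitimately repair the
same input differently, which is exactly the interoperability gap a
specification would close.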

>
> I think we can assume the following:
>
>   * There is a well-defined outcome for taking in HTML and outputting
>      XHTML.  HTML5 will codify this into a specification at some point.

OK.
>
>   * Many HTML parsers have a "tidy mode" that will handle malformed
>      HTML.
>
>   * XHTML is an XML content type and there is no such thing as
>      XHTML that isn't well-formed.
>
> That last point is critical.   Firefox and other XHTML compliant browsers
> won't display XHTML that isn't well-formed.  Atom won't allow XHTML
> in text constructs without it being well-formed.  So, we don't have
> to worry about XHTML because a regular XML parser will work just
> fine.

OK, but what about well-formed XML that is declared as XHTML but is not valid XHTML? For example:
<a><div/></a>
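
A rough sketch of the distinction, assuming Python's standard
xml.etree.ElementTree as the "regular XML parser": the fragment above parses
without error, so a well-formedness check alone does not catch it. The
function name check_declared_xhtml and the BLOCK_ELEMENTS set are hypothetical
and cover only one content-model rule (no block-level elements inside <a>);
a real check would validate against the full XHTML DTD or schema.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of block-level element names, not the full XHTML list.
BLOCK_ELEMENTS = {"div", "p", "ul", "ol", "table", "blockquote", "pre"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def check_declared_xhtml(text):
    """Return (well_formed, problems) for markup that claims to be XHTML."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return False, [str(exc)]
    problems = []
    for anchor in root.iter():
        if local_name(anchor.tag) != "a":
            continue
        # Look at every descendant of this <a>, skipping the <a> itself.
        for node in anchor.iter():
            if node is not anchor and local_name(node.tag) in BLOCK_ELEMENTS:
                problems.append(
                    f"<{local_name(node.tag)}> is not allowed inside <a>")
    return True, problems
```

For "<a><div/></a>" this reports the input as well-formed but lists one
problem, while "<a><div></a>" is rejected by the XML parser outright.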

>
> I suggested we use the media type (e.g. text/html) to trigger handling
> of HTML.  That allows us to have similar behaviors for other content
> types that must be unmarshalled into XML to be processed by an
> XML pipeline.  Many of these would be implementation defined but you
> could expect others in the future to be more standard.
>
> I think we could say that for text/html content types we apply a
> conversion process to XHTML.  This conversion process is allowed to
> handle malformed HTML in a non-standard and non-interoperable
> way to create some kind of well-formed XHTML.
>
> Here's a couple of questions to see what our preferences are:
>
>    1. Do we need a "strict" option that forbids conversion of malformed HTML?
>
>    2. Should we have a more general "HTML" section for the steps
>         that details how text/html is handled and then specific steps could
>         point to that for particular contexts?
>
> In the end, the idea would be that each step that needs to handle HTML
> would have some way of:
>
>    * invoking HTML->XHTML conversion
>    * having options for controlling strict processing, etc.
>
> We would try to keep these options as consistent as possible.
>
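
The media-type trigger and the "strict" option discussed above could be
sketched roughly as follows. Everything here is an assumption made for
illustration: the function name unmarshal, the html_to_xhtml callback (any
HTML->XHTML converter supplied by the implementation), and in particular the
reading of "strict" as "the text/html body must already parse as XML, with no
repair attempted". None of this comes from a draft of the spec.

```python
import xml.etree.ElementTree as ET

# Media types handed straight to the XML parser in this sketch.
XML_TYPES = {"application/xml", "text/xml", "application/xhtml+xml"}


def unmarshal(body, media_type, html_to_xhtml, strict=False):
    """Turn a response body into an XML tree based on its media type.

    html_to_xhtml is a caller-supplied converter (hypothetical); it is only
    invoked for text/html when strict is False.
    """
    # Ignore parameters such as "; charset=utf-8".
    base_type = media_type.split(";", 1)[0].strip().lower()
    if base_type in XML_TYPES or base_type.endswith("+xml"):
        return ET.fromstring(body)
    if base_type == "text/html":
        if strict:
            # Assumed meaning of "strict": no repair, so malformed input
            # surfaces as xml.etree.ElementTree.ParseError.
            return ET.fromstring(body)
        return ET.fromstring(html_to_xhtml(body))
    raise ValueError(f"no XML conversion defined for {base_type!r}")
```

A step such as p:http-request would then call something like
unmarshal(body, content_type, converter, strict=...) and let other media
types fall through to implementation-defined handling.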


Mohamed

-- 
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 8 72 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €

Received on Wednesday, 16 May 2007 15:38:57 UTC