- From: Innovimax SARL <innovimax@gmail.com>
- Date: Wed, 16 May 2007 17:38:53 +0200
- To: "Alex Milowski" <alex@milowski.org>
- Cc: "XProc WG" <public-xml-processing-model-wg@w3.org>
On 5/16/07, Alex Milowski <alex@milowski.org> wrote:
> There are at least two steps that may need to deal with HTML content:
>
>   * p:unescape-markup for things like the RSS description element
>   * p:http-request for response messages that are HTML
>
> The obvious issue with HTML is the problem of malformed HTML
> content. There are good solutions to that in the form of tidy, tagsoup,
> and the like. Some of them (e.g. tagsoup) turn HTML into XHTML
> markup.
>
> I think we will want to have the ability to handle HTML markup and
> convert it into XHTML. Unfortunately, there isn't a well-defined
> "what to do with malformed HTML" process out there.

Isn't that the aim of HTML5?
http://www.whatwg.org/specs/web-apps/current-work/#parsing

> I think we can assume the following:
>
>   * There is a well-defined outcome for taking in HTML and outputting
>     XHTML. HTML5 will codify this into a specification at some point.

Ok

>   * Many HTML parsers have a "tidy mode" that will handle malformed
>     HTML.
>
>   * XHTML is an XML content type and there is no such thing as
>     XHTML that isn't well-formed.
>
> That last point is critical. Firefox and other XHTML-compliant browsers
> won't display XHTML that isn't well-formed. Atom won't allow XHTML
> in text constructs without it being well-formed. So, we don't have
> to worry about XHTML because a regular XML parser will work just
> fine.

Ok, but what about well-formed XML that is declared as XHTML but is not
actually XHTML? For example:

<a><div/></a>

> I suggested we use the media type (e.g. text/html) to trigger handling
> of HTML. That allows us to have similar behaviors for other content
> types that must be unmarshalled into XML to be processed by an
> XML pipeline. Many of these would be implementation defined but you
> could expect others in the future to be more standard.
>
> I think we could say that for text/html content types we apply a
> conversion process to XHTML. This conversion process is allowed to
> handle malformed HTML in a non-standard and non-interoperable
> way to create some kind of well-formed XHTML.
>
> Here are a couple of questions to see what our preferences are:
>
> 1. Do we need a "strict" option that forbids conversion of malformed HTML?
>
> 2. Should we have a more general "HTML" section for the steps
>    that details how text/html is handled and then specific steps could
>    point to that for particular contexts?
>
> In the end, the idea would be that each step that needs to handle HTML
> would have some way of:
>
>   * invoking HTML->XHTML conversion
>   * having options for controlling strict processing, etc.
>
> We would try to keep these options as consistent as possible.

Mohamed

--
Innovimax SARL
Consulting, Training & XML Development
9, impasse des Orteaux
75020 Paris
Tel : +33 8 72 475787
Fax : +33 1 4356 1746
http://www.innovimax.fr
RCS Paris 488.018.631
SARL au capital de 10.000 €
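
To make the proposal in the quoted message concrete, here is a minimal Python sketch of media-type-triggered unmarshalling with a "strict" switch. It is only an illustration, not XProc behaviour: the names `unmarshal`, `_LenientBuilder`, and `VOID` are hypothetical, the recovery rules are ad hoc (exactly the "non-standard and non-interoperable" kind the message allows), and "strict" is assumed here to mean "reject anything that is not already well-formed XML", which is a crude stand-in for a real definition of malformed HTML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Small, incomplete set of elements that never have content (assumption).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _LenientBuilder(HTMLParser):
    """Hypothetical tag-soup recovery: builds a tree, closing tags as needed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("html", {"xmlns": XHTML_NS})
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag == "html":  # a synthetic root already exists
            return
        # Valueless attributes (e.g. "checked") get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        el = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):  # text after a closed child goes into that child's tail
            cur[-1].tail = (cur[-1].tail or "") + data
        else:
            cur.text = (cur.text or "") + data


def unmarshal(body, media_type, strict=False):
    """Turn a message body into an XML tree, dispatching on the media type."""
    base = media_type.split(";", 1)[0].strip().lower()
    if base == "text/html":
        if strict:
            # Assumed meaning of "strict": no repair, parse as XML or fail.
            return ET.fromstring(body)
        builder = _LenientBuilder()
        builder.feed(body)
        builder.close()  # flushes any buffered trailing text
        return builder.root
    if base.endswith("xml"):  # application/xml, application/xhtml+xml, ...
        return ET.fromstring(body)
    raise ValueError("no XML unmarshalling defined for " + base)


if __name__ == "__main__":
    tree = unmarshal("<p>Hello <b>world<br>again", "text/html; charset=utf-8")
    print(ET.tostring(tree, encoding="unicode"))
```

With `strict=True` the same unclosed input raises `ET.ParseError` instead of being repaired. Note that the XML branch only checks well-formedness, so `<a><div/></a>` served as `application/xhtml+xml` is accepted; catching that case would need validation on top of parsing.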
Received on Wednesday, 16 May 2007 15:38:57 UTC