- From: Norman Walsh <ndw@nwalsh.com>
- Date: Wed, 23 May 2007 08:03:59 -0400
- To: public-xml-processing-model-wg@w3.org
- Message-ID: <874pm3hhlc.fsf@nwalsh.com>
/ Alex Milowski <alex@milowski.org> was heard to say:
| There are at least two steps that may need to deal with HTML content:
|
|   * p:unescape-markup for things like the RSS description element

Yeah, fine. You all know how I feel about escaped markup.

|   * p:http-request for response messages that are HTML

Ugh. I hadn't considered this case. Do we have use cases that require
support for non-well-formed messages returned by p:http-request?

I fear if we support it here, then demand for supporting it on p:load
can't be far behind.

| The obvious issue with HTML is the problem of malformed HTML
| content. There are good solutions to that in the form of tidy, tagsoup,
| and the like. Some of them (e.g. tagsoup) turn HTML into XHTML
| markup.
|
| I think we will want to have the ability to handle HTML markup and
| convert it into XHTML. Unfortunately, there isn't a well-defined
| "what to do with malformed HTML" process out there.
|
| I think we can assume the following:
|
|   * There is a well-defined outcome for taking in HTML and outputting
|     XHTML. HTML5 will codify this into a specification at some point.

I don't think we can make that assumption. If we have to deal with
not-well-formed markup, I would prefer to say that the cleanup is
"implementation dependent". I don't care if not-well-formed crud is
not interoperable.

|   * Many HTML parsers have a "tidy mode" that will handle malformed
|     HTML.

I don't have, nor do I want to get, an "HTML parser". Nothing about
XProc suggests that I should need such a beast.

|   * XHTML is an XML content type and there is no such thing as
|     XHTML that isn't well-formed.

Right.

| I suggested we use the media type (e.g. text/html) to trigger handling
| of HTML. That allows us to have similar behaviors for other content
| types that must be unmarshalled into XML to be processed by an
| XML pipeline. Many of these would be implementation defined but you
| could expect others in the future to be more standard.

I don't feel strongly about whether we use a media type mechanism or a
simple boolean flag.

| I think we could say that for text/html content types we apply a
| conversion process to XHTML. This conversion process is allowed to
| handle malformed HTML in a non-standard and non-interoperable
| way to create some kind of well-formed XHTML.
|
| Here's a couple of questions to see what our preferences are:
|
| 1. Do we need a "strict" option that forbids conversion of malformed
|    HTML?

For unescaped markup, definitely not. If we wind up with something on
p:http-request to untangle NWF markup, I want that to be explicit. Not
explicitly asking should cause the step to fail if the data returned
isn't WF.

| 2. Should we have a more general "HTML" section for the steps
|    that details how text/html is handled and then specific steps could
|    point to that for particular contexts?

I suppose that will make sense if we wind up using it in more than one
place.

| In the end, the idea would be that each step that needs to handle HTML
| would have some way of:
|
|   * invoking HTML->XHTML conversion
|   * having options for controlling strict processing, etc.
|
| We would try to keep these options as consistent as possible.

Indeed. The more places we need it, the more I feel we should keep it
dead simple. I'm now feeling more strongly in favor of just having a
boolean option to do the cleanup. If we need more control in V2, we
can add new options.

                                        Be seeing you,
                                          norm

--
Norman Walsh <ndw@nwalsh.com> | Impatient people always arrive too
http://nwalsh.com/            | late.--Jean Dutourd
Received on Wednesday, 23 May 2007 12:04:32 UTC