- From: Alex Milowski <alex@milowski.org>
- Date: Wed, 16 May 2007 07:55:48 -0700
- To: "XProc WG" <public-xml-processing-model-wg@w3.org>
There are at least two steps that may need to deal with HTML content:

* p:unescape-markup for things like the RSS description element
* p:http-request for response messages that are HTML

The obvious issue with HTML is the problem of malformed HTML content. There are good solutions to that in the form of tidy, tagsoup, and the like. Some of them (e.g. tagsoup) turn HTML into XHTML markup.

I think we will want to have the ability to handle HTML markup and convert it into XHTML. Unfortunately, there isn't a well-defined "what to do with malformed HTML" process out there.

I think we can assume the following:

* There is a well-defined outcome for taking in HTML and outputting XHTML. HTML5 will codify this into a specification at some point.
* Many HTML parsers have a "tidy mode" that will handle malformed HTML.
* XHTML is an XML content type and there is no such thing as XHTML that isn't well-formed.

That last point is critical. Firefox and other XHTML-compliant browsers won't display XHTML that isn't well-formed. Atom won't allow XHTML in text constructs unless it is well-formed. So, we don't have to worry about XHTML because a regular XML parser will work just fine.

I suggested we use the media type (e.g. text/html) to trigger handling of HTML. That allows us to have similar behaviors for other content types that must be unmarshalled into XML to be processed by an XML pipeline. Many of these would be implementation-defined, but you could expect others to become more standard in the future.

I think we could say that for text/html content types we apply a conversion process to XHTML. This conversion process is allowed to handle malformed HTML in a non-standard and non-interoperable way to create some kind of well-formed XHTML.

Here are a couple of questions to see what our preferences are:

1. Do we need a "strict" option that forbids conversion of malformed HTML?
2. Should we have a more general "HTML" section for the steps that details how text/html is handled, so that specific steps could point to it for particular contexts?

In the end, the idea would be that each step that needs to handle HTML would have some way of:

* invoking HTML->XHTML conversion
* having options for controlling strict processing, etc.

We would try to keep these options as consistent as possible.

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the inflexions, i.e. to the degree of analysis effected by the language considered."
Bertrand Russell in a footnote of Principles of Mathematics
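
A minimal sketch of the media-type trigger described in this message, assuming Python and only its standard library. The names `to_xml`, `html_to_xhtml`, and the `strict` flag are hypothetical and do not come from any XProc draft. The repair rules are one arbitrary, non-interoperable choice of the kind the message permits, and wrapping the result in an XHTML `div` is a further assumption made so that fragments always have a single root.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
# HTML elements that never take an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


def _q(tag):
    """Qualify a local HTML tag name with the XHTML namespace."""
    return "{%s}%s" % (XHTML_NS, tag)


class _Repairer(HTMLParser):
    """Builds a well-formed tree from possibly malformed HTML (assumed rules)."""

    def __init__(self, strict):
        super().__init__(convert_charrefs=True)
        self.strict = strict
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.builder.start(_q("div"), {})  # synthetic single root (assumption)

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as "checked" get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        self.builder.start(_q(tag), attrib)
        if tag in VOID_ELEMENTS:
            self.builder.end(_q(tag))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.strict and (not self.open_tags or self.open_tags[-1] != tag):
            raise ValueError("malformed HTML: unexpected </%s>" % tag)
        if tag not in self.open_tags:
            return  # lenient mode: drop a stray end tag
        while True:  # lenient mode: close any unclosed descendants first
            top = self.open_tags.pop()
            self.builder.end(_q(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        if self.strict and self.open_tags:
            raise ValueError("malformed HTML: unclosed <%s>" % self.open_tags[-1])
        while self.open_tags:
            self.builder.end(_q(self.open_tags.pop()))
        self.builder.end(_q("div"))
        return self.builder.close()


def html_to_xhtml(text, strict=False):
    """Convert HTML text to an XHTML element tree wrapped in a div."""
    parser = _Repairer(strict)
    parser.feed(text)
    parser.close()
    return parser.finish()


def to_xml(body, media_type, strict=False):
    """Pick a conversion based on the media type, ignoring parameters."""
    base = media_type.split(";", 1)[0].strip().lower()
    if base == "text/html":
        return html_to_xhtml(body, strict)
    if base in ("application/xml", "text/xml") or base.endswith("+xml"):
        return ET.fromstring(body)  # XHTML included: a regular XML parser
    raise ValueError("no XML conversion for media type %r" % base)
```

With these assumed rules, `to_xml("<p>Hello <b>world<br>again</p>", "text/html")` closes the dangling `b` when it sees `</p>`, while the same call with `strict=True` raises `ValueError`. Content labelled `application/xhtml+xml` never reaches the repair code and fails in the XML parser if it is not well-formed.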
Received on Wednesday, 16 May 2007 14:55:59 UTC