Re: Dealing with non-XML documents

From: Norman Walsh <Norman.Walsh@Sun.COM>
Date: Thu, 01 Feb 2007 13:22:59 -0800
To: public-xml-processing-model-wg@w3.org
Message-ID: <873b5ptvws.fsf@nwalsh.com>
/ Alex Milowski <alex@milowski.org> was heard to say:
|> Another answer, I think, is that components can produce some sort of
|> quoting element (I forget what name Alex proposed) like
|>   <p:quoted-content type="text/html">
|>   ...
|>   </p:quoted-content>
|> If we adopted this, I think I'd want some sort of user option to
|> enable it.
| As some kind of pipeline option?

No, as a component option:

  <p:step type="p:xslt" quote-non-xml-resources="yes">
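A minimal sketch of how such a component option might behave, assuming the step first tries an ordinary XML parse and only quotes the content when that fails and the option is enabled. The function name `load_resource`, the `quote_non_xml` flag, and the namespace URI bound to `p` are all hypothetical; nothing here is taken from a draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for the "p" prefix; an assumption for illustration.
P_NS = "http://example.org/ns/pipeline"
ET.register_namespace("p", P_NS)


def load_resource(body: str, media_type: str, quote_non_xml: bool = False) -> ET.Element:
    """Return the resource as an element, quoting it if it is not well-formed XML."""
    try:
        # Well-formed XML passes through untouched.
        return ET.fromstring(body)
    except ET.ParseError:
        if not quote_non_xml:
            # Option off: the step fails, as it would for any bad XML.
            raise
        # Option on: carry the raw text inside a quoting element.
        wrapper = ET.Element(f"{{{P_NS}}}quoted-content", {"type": media_type})
        wrapper.text = body
        return wrapper
```

With the option off the step fails on non-XML input; with it on, the caller always receives a well-formed element and can decide later what to do with the quoted text.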

| The last answer I can think of is that we could try to tidy/tagsoup.
|> I suppose, if we can't agree on the simplest answer, I'm inclined to
|> say we do the quoted content thing and have a standard component that
|> takes a quoted content thing and attempts (through an implementation
|> defined mechanism) to turn it into well-formed XML.
| The big snag comes in when we consider the HTTP request.  There you
| need a way to deal with making requests that aren't a simple XML
| post and deal with responses that aren't XML or may not have any
| content.

Only if we consider the non-XML cases critical for V1.

| While I've considered using a "quoted content" element, I haven't really
| spent the implementation time to go there.  What I have done is
| look at the mime-type or component parameters and run the appropriate
| "make this HTML goo XML" component (e.g. TagSoup).  I have also

But as I said before, there are no standards we can point to for the
"make this HTML goo XML" algorithm and I don't want the results to be
implementation dependent.

| allow you to just ask for the HTTP response codes and then return
| the entity body as quoted data.
| I think the cases we need to consider are specific to components we're
| contemplating:
|   * What happens when an XSLT transformation specifies an output
|     mode of 'html' or 'text' ?
|   * Can you use the parse component to handle HTML content?
|   * What does the Load or "Http Request" component do when the mime-type
|     (or assumed mime type) is not an XML type?

Those are three good examples, but I don't want to solve this on a
component-by-component basis. We should be consistent.
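One way to picture a consistent rule is a single decision that every component consults, sketched below. It assumes "XML type" means `text/xml`, `application/xml`, or any type with a `+xml` suffix (a simplified reading of RFC 3023), and it reuses the idea of a quoting option; the names `is_xml_media_type` and `handle` are hypothetical.

```python
def is_xml_media_type(media_type: str) -> bool:
    """True for text/xml, application/xml, and any type with a +xml suffix."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    return essence in {"text/xml", "application/xml"} or essence.endswith("+xml")


def handle(media_type: str, quote_non_xml: bool) -> str:
    """One decision shared by every component: 'parse', 'quote', or 'error'."""
    if is_xml_media_type(media_type):
        return "parse"
    # Non-XML content is either quoted (if the option is on) or rejected.
    return "quote" if quote_non_xml else "error"
```

Under this sketch the XSLT, parse, and load cases would all give the same answer for the same media type and option setting.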

| It would be unfortunate for an implementor not to have an option to
| extend the behavior of our core components to allow handling of HTML or
| other media types.

The standard components have to be interoperable so either we define a
precise mechanism for handling these cases or the implementor must
write custom components to do the extended behavior (IMHO).

| That is, more specifically, if an implementor was required to create a
| different component to do "parse HTML into XML when you see HTML"
| then authors would be forced to switch to use the non-standard
| component in all cases (assuming they wanted to be assured that
| the pipeline would succeed).
| On the other hand, if the "Load" or "Http Request" component was
| allowed to handle HTML in some implementations then we'd have an
| interoperability problem.
| In the end, I'm torn.  I'm going to have the "handle HTML" components
| in my implementation somehow because I need that feature.  I'd love
| to have an "optional" feature that falls back to XML parsing.  That way
| there would be interoperability amongst the implementations who
| choose to have that option.
| ...keep in mind that HTML isn't an edge case as there is a lot of it
| hanging around that needs to be processed.

By XProc, by all implementations, in V1?

                                        Be seeing you,

Norman Walsh
XML Standards Architect
Sun Microsystems, Inc.

Received on Thursday, 1 February 2007 21:55:59 UTC