
Re: Dealing with non-XML documents

From: Alex Milowski <alex@milowski.org>
Date: Thu, 1 Feb 2007 12:22:30 -0800
Message-ID: <28d56ece0702011222q15fa1c49n40f5a9809bf29639@mail.gmail.com>
To: public-xml-processing-model-wg@w3.org
On 2/1/07, Norman Walsh <Norman.Walsh@sun.com> wrote:
> We've had a couple of proposals in the component thread that amount to
> allowing non-XML documents to flow through the pipeline in some fashion.
> That looks like a slippery slope to me. With sharp spikes at the bottom.
> But if we're going to entertain it, I think we should consider it
> generally and not in isolation around one or two components.
> First off, can we agree that we're talking about things like text/html
> or text/plain or image/jpeg that are manifestly not XML. If a
> component is supposed to generate XML but sends mis-matched start and
> end tag events, the processor is required to consider that an error.


> The simplest answer to the question, "how do I process text/html with
> XProc?" is: you don't. Implementors can provide extension components
> that do anything they want, but the standard components like load
> simply produce errors.

That is one possibility.

> Another answer, I think, is that components can produce some sort of
> quoting element (I forget what name Alex proposed) like
>   <p:quoted-content type="text/html">
>   ...
>   </p:quoted-content>
> If we adopted this, I think I'd want some sort of user option to
> enable it.

As some kind of pipeline option?

> The last answer I can think of is that we could try to tidy/tagsoup.
> I suppose, if we can't agree on the simplest answer, I'm inclined to
> say we do the quoted content thing and have a standard component that
> takes a quoted content thing and attempts (through an implementation
> defined mechanism) to turn it into well-formed XML.

The big snag comes in when we consider the HTTP request.  There you
need a way to deal with making requests that aren't a simple XML
post and deal with responses that aren't XML or may not have any

While I've considered using a "quoted content" element, I haven't really
spent the implementation time to go there.  What I have done is
look at the mime-type or component parameters and run the appropriate
"make this HTML goo XML" component (e.g. TagSoup).  I also
allow you to just ask for the HTTP response codes and then return
the entity body as quoted data.
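
The dispatch described above can be sketched roughly as follows.  This is
a minimal Python illustration, not code from any implementation.  The
namespace URI, the function names, the "encoding" attribute and the
html_tidier hook (standing in for something like TagSoup) are all
assumptions made for the example.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, for illustration only.
P_NS = "http://example.org/ns/pipeline"


def base_type(media_type):
    """Strip parameters such as '; charset=utf-8' and lowercase the type."""
    return media_type.split(";")[0].strip().lower()


def is_xml_type(media_type):
    """True for text/xml, application/xml and any '+xml' media type."""
    base = base_type(media_type)
    return base in ("text/xml", "application/xml") or base.endswith("+xml")


def quote_content(body, media_type):
    """Wrap a non-XML entity body in a quoted-content element.

    Text types are carried as character data; anything else is base64
    encoded.  The 'encoding' attribute is an assumption of this sketch.
    """
    elem = ET.Element("{%s}quoted-content" % P_NS, {"type": media_type})
    if base_type(media_type).startswith("text/"):
        elem.text = body.decode("utf-8", errors="replace")
    else:
        elem.set("encoding", "base64")
        elem.text = base64.b64encode(body).decode("ascii")
    return elem


def to_document(body, media_type, html_tidier=None):
    """Turn an entity body into an XML element.

    html_tidier is a hypothetical callable (bytes -> well-formed XML bytes)
    that an implementation may or may not supply.
    """
    if is_xml_type(media_type):
        return ET.fromstring(body)
    if base_type(media_type) == "text/html" and html_tidier is not None:
        return ET.fromstring(html_tidier(body))
    return quote_content(body, media_type)
```

With no tidier supplied, an HTML or JPEG body comes back wrapped in the
quoting element, while XML media types are parsed as usual.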

I think the cases we need to consider are specific to components we're

  * What happens when an XSLT transformation specifies an output
    method of 'html' or 'text'?

  * Can you use the parse component to handle HTML content?

  * What does the Load or "Http Request" component do when the mime-type
    (or assumed mime type) is not an XML type?

It would be unfortunate for an implementor not to have an option to
extend the behavior of our core components to allow handling of HTML or
other media types.

That is, more specifically, if an implementor was required to create a
different component to do "parse HTML into XML when you see HTML"
then authors would be forced to switch to use the non-standard
component in all cases (assuming they wanted to be assured that
the pipeline would succeed).

On the other hand, if the "Load" or "Http Request" component was
allowed to handle HTML in some implementations then we'd have an
interoperability problem.

In the end, I'm torn.  I'm going to have the "handle HTML" components
in my implementation somehow because I need that feature.  I'd love
to have an "optional" feature that falls back to XML parsing.  That way
there would be interoperability amongst the implementations that
choose to have that option.
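
One way to picture that optional feature is the Python sketch below.  It
is only an illustration under stated assumptions: the load function, the
html_support hook and the PipelineError class are hypothetical names, not
anything from a draft or an implementation.

```python
import xml.etree.ElementTree as ET


class PipelineError(Exception):
    """Raised when a component cannot produce well-formed XML."""


def load(body, media_type, html_support=None):
    """Load an entity body as XML.

    html_support is a hypothetical callable (bytes -> well-formed XML
    bytes) present only in implementations that choose to offer the
    optional HTML feature.  Without it, the body falls back to plain XML
    parsing and fails if it is not well-formed.
    """
    base = media_type.split(";")[0].strip().lower()
    if base == "text/html" and html_support is not None:
        body = html_support(body)
    try:
        return ET.fromstring(body)
    except ET.ParseError as exc:
        raise PipelineError("not well-formed XML (%s): %s" % (base, exc)) from exc
```

Under this sketch a pipeline behaves identically on every implementation
when the HTML happens to be well-formed, and differs only in whether
tag soup is repaired or reported as an error.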

...keep in mind that HTML isn't an edge case as there is a lot of it
hanging around that needs to be processed.

--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language

Bertrand Russell in a footnote of Principles of Mathematics
Received on Thursday, 1 February 2007 20:22:44 UTC
