
Dealing with non-XML documents

From: Norman Walsh <Norman.Walsh@Sun.COM>
Date: Thu, 01 Feb 2007 10:07:18 -0800
To: public-xml-processing-model-wg@w3.org
Message-ID: <87fy9pu4yx.fsf@nwalsh.com>
We've had a couple of proposals in the component thread that amount to
allowing non-XML documents to flow through the pipeline in some fashion.

That looks like a slippery slope to me. With sharp spikes at the bottom.

But if we're going to entertain it, I think we should consider it
generally and not in isolation around one or two components.

First off, can we agree that we're talking about things like text/html
or text/plain or image/jpeg that are manifestly not XML? If a
component is supposed to generate XML but sends mis-matched start and
end tag events, the processor is required to consider that an error.

The simplest answer to the question, "how do I process text/html with
XProc?" is: you don't. Implementors can provide extension components
that do anything they want, but the standard components like load
simply produce errors.

Another answer, I think, is that components can produce some sort of
quoting element (I forget what name Alex proposed) like

  <p:quoted-content type="text/html">
  ...
  </p:quoted-content>

If we adopted this, I think I'd want some sort of user option to
enable it.
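
As a rough illustration of the quoting idea with an opt-in switch, here
is a minimal Python sketch. Everything in it apart from the element name
and the type attribute is an assumption made for the example: the
load_document function, the allow_quoted flag, the namespace URI bound
to the p: prefix, and the restriction to UTF-8 text payloads are all
hypothetical, not anything proposed in the thread.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for the p: prefix; the message does not give one.
P_NS = "http://example.org/ns/pipeline"
ET.register_namespace("p", P_NS)


def load_document(data: bytes, content_type: str,
                  allow_quoted: bool = False) -> ET.Element:
    """Hypothetical load step: parse XML, or quote non-XML text if enabled."""
    is_xml = (content_type in ("text/xml", "application/xml")
              or content_type.endswith("+xml"))
    if is_xml:
        # XML input must be well-formed; fromstring raises ParseError if not.
        return ET.fromstring(data)
    if not allow_quoted:
        # The "simplest answer": non-XML input is just an error.
        raise ValueError(f"cannot load non-XML content of type {content_type}")
    wrapper = ET.Element(f"{{{P_NS}}}quoted-content", {"type": content_type})
    # Assumes a UTF-8 text payload. Markup characters in the payload are
    # escaped on serialization, so they stay inert inside the wrapper.
    wrapper.text = data.decode("utf-8")
    return wrapper
```

With allow_quoted left at its default, a text/html input raises an
error; with it set to True, the same input comes back as a single
quoted-content element whose text is the original markup.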

The last answer I can think of is that we could try to tidy/tagsoup.

I suppose, if we can't agree on the simplest answer, I'm inclined to
say we do the quoted content thing and have a standard component that
takes a quoted content thing and attempts (through an
implementation-defined mechanism) to turn it into well-formed XML.
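
To make "attempts to turn it into well-formed XML" a little more
concrete, here is one possible mechanism sketched in Python with the
standard library's HTMLParser. It is only an illustration of a
tag-soup style repair, not tidy or TagSoup themselves. The unquote
function, the "unquoted" wrapper element, and the short list of void
elements are all hypothetical, and the sketch does not check that tag
or attribute names are legal XML names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A few HTML elements that never have content (deliberately incomplete).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _SoupBuilder(HTMLParser):
    """Builds a properly nested element tree from loosely tagged input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("unquoted")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag,
                           {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, implicitly
        # closing anything left open inside it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def unquote(quoted: ET.Element) -> ET.Element:
    """Hypothetical component: quoted text in, well-nested tree out."""
    builder = _SoupBuilder()
    builder.feed(quoted.text or "")
    builder.close()
    return builder.root
```

Given quoted text such as "<p>one<br>two<b>bold</p>", this yields a
tree in which br is an empty element and the unclosed b is closed
before its parent p, so the result serializes with matched start and
end tags.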

Simpler is better though, I think.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
XML Standards Architect
Sun Microsystems, Inc.

Received on Thursday, 1 February 2007 18:07:30 GMT
