Re: unescaping markup from Alex Milowski on 2007-05-15 (public-xml-processing-model-wg@w3.org from May 2007)

From: Alex Milowski <alex@milowski.org>
Date: Tue, 15 May 2007 07:56:31 -0700
To: public-xml-processing-model-wg@w3.org
Message-ID: <28d56ece0705150756s23b9ef9ai35c2421b496f3752@mail.gmail.com>

On 5/15/07, Norman Walsh <ndw@nwalsh.com> wrote:
>
> / Alex Milowski <alex@milowski.org> was heard to say:
> | 2) What are the appropriate content types to use for other types of
> | documents to trigger vs. not trigger this behavior?
> |
> | Using media types allows vendors to specify content handling behavior
> | using specialized media types or unregistered types (e.g.
> | "application/x-goop" or "application/vnd-random-stuff")
>
> So the content-type tells me the type that I should expect to get when I
> unwrap the escaped markup. So, told to expect text/html, I know that I
> will have to run tagsoup or some similar component to turn it into well
> formed XML.
>
> What would a content-type of "application/vnd-random-stuff" tell me?
> That instead of running tagsoup I should run some other cleanup process?

Basically, yes.  Completely non-interoperable but that is what the "vnd-"
prefix says.

> Can I (as an implementor) assert that I accept "image/png" as a
> content-type and that my cleanup process is to base64 encode the data
> and wrap it in an <image> tag? Am I allowed (or required?) to fail if
> the data isn't a PNG image?

Well, keep in mind that you can't "parse" element content as image/png,
so none of the image/* media types would make sense.

> If I see a content-type I don't recognize, can I try running tagsoup or
> must I fail?

I think you should fail.

> This all seems to expose complexity and interoperability issues that
> don't have much obvious value.

The obvious value is extensibility.

> Are there any actual use cases for content-types other than
> application/xml (or application/*+xml) and text/html?

New media types can be registered independently of our specification,
so we'd be ready to handle them.

In the case of parsing string element content, the main use cases
are XML, HTML, and XHTML.  While XHTML is an XML media
type (application/xhtml+xml), someone might want to run a "cleanup"
parser on it like tagsoup or tidy.

We could enumerate a set of fixed values: 'xml', 'html', and 'xhtml'.  If
you specify 'xhtml', you're allowed to fix bad content.  If you want to
strictly enforce XHTML as an XML media type, then you'd just use 'xml'.
I'd make 'xml' the default mode.
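
A minimal sketch of how the three fixed values might behave, in Python.
Everything here is illustrative: the name `unescape` and the `cleanup`
parameter are hypothetical, and `cleanup` merely stands in for tagsoup,
tidy, or a similar component. Trying a strict parse before repairing in
'xhtml' mode is an assumption; the proposal only says that fixing bad
content is allowed.

```python
import xml.etree.ElementTree as ET

MODES = ("xml", "html", "xhtml")


def unescape(text, mode="xml", cleanup=None):
    """Parse escaped markup according to a fixed mode token.

    `cleanup` is a hypothetical callable (str -> well-formed XML str)
    standing in for tagsoup, tidy, or a similar repair step.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "xml":
        # Strict: a well-formedness error propagates to the caller.
        return ET.fromstring(text)
    if cleanup is None:
        raise ValueError(f"mode {mode!r} requires a cleanup parser")
    if mode == "xhtml":
        # Assumption: try a strict parse first, repair only on failure.
        try:
            return ET.fromstring(text)
        except ET.ParseError:
            return ET.fromstring(cleanup(text))
    # mode == "html": always run the cleanup parser.
    return ET.fromstring(cleanup(text))
```

With this shape, strict XHTML handling is just `unescape(text)`, since
'xml' is the default.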

We could follow the way Atom's 'type' attribute works on its 'content'
element and allow media types as well as a fixed number of tokens
like 'xml', 'html', and 'xhtml'.  What happens for media types would
be implementation defined except for XML, XHTML, and HTML media types.
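
As a rough sketch of the Atom-style option, the value could be
classified like this in Python. The names `classify` and
`vendor_handlers` are hypothetical. Mapping application/xhtml+xml to
'xhtml' rather than 'xml' is an assumption, and raising an error for an
unrecognized type follows the "you should fail" answer earlier in this
message.

```python
TOKENS = {"xml", "html", "xhtml"}


def classify(type_value, vendor_handlers=()):
    """Map a 'type' value (token or media type) to a handling mode.

    `vendor_handlers` is a hypothetical collection of extra media types
    that this particular implementation chooses to support.
    """
    value = type_value.strip().lower()
    if value in TOKENS:
        return value
    # Drop media type parameters such as "; charset=utf-8".
    media_type = value.split(";", 1)[0].strip()
    if media_type == "text/html":
        return "html"
    if media_type == "application/xhtml+xml":
        # Assumption: treat the XHTML media type like the 'xhtml' token.
        return "xhtml"
    if media_type in ("application/xml", "text/xml") or media_type.endswith("+xml"):
        return "xml"
    if media_type in vendor_handlers:
        # Behavior here is implementation defined (e.g. a vendor type).
        return "implementation-defined"
    # Unrecognized types fail rather than guessing at a cleanup parser.
    raise ValueError(f"unsupported content type: {type_value!r}")
```

A type such as image/png falls through to the error unless an
implementation explicitly lists it in `vendor_handlers`.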

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Tuesday, 15 May 2007 14:56:35 UTC