Re: unescaping markup from Alex Milowski on 2007-05-14 (public-xml-processing-model-wg@w3.org from May 2007)

From: Alex Milowski <alex@milowski.org>
Date: Mon, 14 May 2007 07:54:31 -0700
To: public-xml-processing-model-wg@w3.org
Message-ID: <28d56ece0705140754r5713ab73v5ae67fede6492549@mail.gmail.com>

On 5/14/07, Norman Walsh <ndw@nwalsh.com> wrote:
>
> / Alex Milowski <alex@milowski.org> was heard to say:
> [...]
> | Ah... I missed that last bit.
> |
> | Maybe we should have a "content-type" option that would allow you to
> | specify something like "text/html".  What happens for HTML would have
> | to be implementation defined because there is no definition of what
> | "make it well-formed" means.
> |
> | I think if you are parsing an XML-typed media, it should be at least
> | well-formed in accordance with the XML 1.0/1.1 specifications.  If you
> | specify a non-XML media type, then anything appropriate for that
> | media type can happen.  This gives implementors the option of
> | using unregister media types like: "application/x-random-junk" or
> | "application/vnd-tidy-html".
>
> With respect, I think you're still missing the point. Perhaps our
> experiences are different, but the escaped markup that I've encountered
> in the wild is, when unescaped, not well formed about 99 times out of
> 100.
>
> So to my mind that means the unescape markup step is going to fail
> 99 times out of 100 which doesn't seem very useful.
>
> So it seems like it should have a "fix the broken $@$#%@! markup"
> option, even if the exact details of how it does the fixup are
> implementation dependent.

I'm not missing the point.  I get that many documents (especially RSS
feeds) contain escaped markup that is not well-formed once unescaped.
If your experience is that it is broken 99% of the time, then you'll
always set the content-type option and expect magic to be performed by
tidy/tagsoup/etc.

There are other situations where the markup is escaped
but generated by software and so you might very well expect it
to be well-formed.  If one of the main use cases is parsing
escaped XHTML (which isn't what the RSS description element
usually contains), then, by the standard, you shouldn't have
XHTML that isn't well-formed XML.  Of course, in the wild that
may not be true (which is where Atom shines because it doesn't
allow escaping of XHTML and so this problem goes away).

We just need to provide options for handling the important "make
this glob of HTML well-formed" case.

I proposed using media types for that so that it is extensible and
can be specialized by vendors.

The nice thing here is that a vendor can choose to only implement
well-formed XML parsing as a baseline of interoperability.  In that
case they can safely ignore the "content-type" option.
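
To make the dispatch concrete, here is a rough sketch in Python of how
a step might behave under this proposal.  It is only an illustration
under stated assumptions: the names unescape_markup, FIXUP_PARSERS and
the "wrapper" element are hypothetical and do not come from any draft,
and wrapping the text in a single element is just one way to cope with
mixed content.  XML media types get strict well-formedness checking;
anything else is handed to whatever the implementation has registered,
or rejected.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry of vendor-supplied fixup parsers, keyed by media
# type (e.g. "text/html" backed by tidy or tagsoup). Empty by default,
# which models a baseline implementation that only parses XML.
FIXUP_PARSERS = {}


def base_media_type(content_type):
    """Strip parameters such as '; charset=utf-8' and normalize case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_xml_media_type(content_type):
    base = base_media_type(content_type)
    return base in ("text/xml", "application/xml") or base.endswith("+xml")


def unescape_markup(text, content_type="application/xml"):
    """Parse already-unescaped text according to the requested media type.

    XML media types must be well-formed; a failure raises ET.ParseError.
    Other media types are implementation-defined: they are handled only
    if a parser has been registered for them.
    """
    if is_xml_media_type(content_type):
        # Assumption: wrap the text so that mixed content or several
        # top-level elements still form a single document.
        return ET.fromstring("<wrapper>" + text + "</wrapper>")

    parser = FIXUP_PARSERS.get(base_media_type(content_type))
    if parser is None:
        raise ValueError("unsupported content-type: " + content_type)
    return parser(text)
```

With an empty registry this is the baseline case: well-formed content
parses, broken content raises a parse error, and "text/html" is
rejected as unsupported.  A vendor who wants the fixup behavior would
add an entry to the registry for the media types it supports.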

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Monday, 14 May 2007 15:01:58 UTC