Re: unescaping markup

On 5/14/07, Alessandro Vernet <avernet@orbeon.com> wrote:
>
>
> On 5/8/07, Alex Milowski <alex@milowski.org> wrote:
> > In theory, the same is true for RSS.  So, for example, you could write
> > an XProc pipeline that checks whether all the description elements
> > are correctly escaped XHTML by using unescape-markup and try/catch.
>
> In theory. But like they say, in practice theory often doesn't hold. I
> am trying to think of cases where I am parsing escaped XHTML embedded
> in XML. In most cases, I can't assume the XHTML is well-formed, and
> have to use something like JTidy/TagSoup. So I agree with Norm: I
> think it would be convenient to have the option
> "force-markup-to-be-well-formed" right there.


Ah... I missed that last bit.

Maybe we should have a "content-type" option that would allow you to
specify something like "text/html".  What happens for HTML would have
to be implementation-defined, because there is no agreed definition of
what "make it well-formed" means.

I think if you are parsing content with an XML media type, it should
at least be well-formed according to the XML 1.0/1.1 specifications.
If you specify a non-XML media type, then anything appropriate for
that media type can happen.  This gives implementors the option of
using unregistered media types like "application/x-random-junk" or
"application/vnd-tidy-html".

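To make the content-type idea above concrete, here is a rough sketch in
Python rather than XProc.  Everything in it is an assumption for
illustration: the function name unescape_markup, the default media
type, and the naive tag-closing repair for "text/html" are hypothetical
and are not what any implementation or tidy-like tool actually does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_xml_media_type(content_type):
    """True for text/xml, application/xml and any */*+xml type."""
    base = content_type.split(";")[0].strip().lower()
    return base in ("text/xml", "application/xml") or base.endswith("+xml")


class _TagSoupBuilder(HTMLParser):
    """Hypothetical stand-in for a tidy-like repair step.

    It only closes unclosed elements and drops stray end tags.
    """

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open_tags = []
        # Wrapper element, so that bare text always has a parent.
        self._builder.start("div", {})

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in self.VOID:
            self._builder.end(tag)
        else:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open_tags:
            return  # stray end tag: ignore it
        while self._open_tags:
            top = self._open_tags.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self._open_tags:
            self._builder.end(self._open_tags.pop())
        self._builder.end("div")
        return self._builder.close()


def unescape_markup(text, content_type="application/xml"):
    """Parse escaped markup according to the declared media type."""
    if is_xml_media_type(content_type):
        # XML media types: must be well-formed, else ET.ParseError.
        return ET.fromstring(text)
    if content_type.split(";")[0].strip().lower() == "text/html":
        # Non-XML media type: the behaviour here is our own choice.
        parser = _TagSoupBuilder()
        parser.feed(text)
        return parser.finish()
    raise ValueError("unsupported content type: " + content_type)
```

With this sketch, unescape_markup("<p>a<b>c") raises a parse error,
which is what a try/catch around the step would observe, while the same
string with content_type="text/html" comes back as a repaired tree
inside a wrapper element.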

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Monday, 14 May 2007 14:16:51 UTC