- From: Alex Milowski <alex@milowski.org>
- Date: Mon, 14 May 2007 07:54:31 -0700
- To: public-xml-processing-model-wg@w3.org
- Message-ID: <28d56ece0705140754r5713ab73v5ae67fede6492549@mail.gmail.com>
On 5/14/07, Norman Walsh <ndw@nwalsh.com> wrote:
>
> / Alex Milowski <alex@milowski.org> was heard to say:
> [...]
> | Ah... I missed that last bit.
> |
> | Maybe we should have a "content-type" option that would allow you to
> | specify something like "text/html". What happens for HTML would have
> | to be implementation defined because there is no definition of what
> | "make it well-formed" means.
> |
> | I think if you are parsing XML-typed media, it should be at least
> | well-formed in accordance with the XML 1.0/1.1 specifications. If you
> | specify a non-XML media type, then anything appropriate for that
> | media type can happen. This gives implementors the option of
> | using unregistered media types like: "application/x-random-junk" or
> | "application/vnd-tidy-html".
>
> With respect, I think you're still missing the point. Perhaps our
> experiences are different, but the escaped markup that I've encountered
> in the wild is, when unescaped, not well formed about 99 times out of
> 100.
>
> So to my mind that means the unescape markup step is going to fail
> 99 times out of 100 which doesn't seem very useful.
>
> So it seems like it should have a "fix the broken $@$#%@! markup"
> option, even if the exact details of how it does the fixup are
> implementation dependent.

I'm not missing the point. I get that many instances (especially RSS)
contain incorrectly escaped markup. If your experience is that it is broken
99% of the time, then you'll always set the content-type option and expect
magic to be performed by tidy/tagsoup/etc.

There are other situations where the markup is escaped but generated by
software, so you might very well expect it to be well-formed.

If one of the main use cases is parsing escaped XHTML (which isn't what the
RSS description element usually contains), then, by the standard, you
shouldn't have XHTML that isn't well-formed XML. Of course, in the wild
that may not be true (which is where Atom shines, because it doesn't allow
escaping of XHTML and so this problem goes away).

We just need to provide options for handling the important "make this glob
of HTML well-formed" case. I proposed using media types for that so that
it is extensible and can be specialized by vendors.

The nice thing here is that a vendor can choose to implement only
well-formed XML parsing as a baseline of interoperability. In that case
they can safely ignore the "content-type" option.

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics
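
To make the "content-type" proposal above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the working group's definition of the step: the names `unescape_markup`, `is_xml_media_type`, `_NaiveFixup` and the `unescaped` wrapper element are all hypothetical, and the tiny tag-balancing parser only stands in for whatever tidy/tagsoup-style fixup an implementation might choose for "text/html".

```python
import html
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

WRAPPER = "unescaped"  # hypothetical wrapper element so a fragment has one root
VOID = {"br", "hr", "img", "input", "link", "meta"}  # HTML elements with no end tag


def is_xml_media_type(content_type):
    """True for application/xml, text/xml and any '+xml' media type."""
    base = content_type.split(";", 1)[0].strip().lower()
    return base in ("application/xml", "text/xml") or base.endswith("+xml")


class _NaiveFixup(HTMLParser):
    """Assumed stand-in for tidy/tagsoup: only balances start and end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.builder.start(WRAPPER, {})
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v if v is not None else "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)  # void elements are closed immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: drop it
        while True:  # close everything opened since the matching start tag
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:  # close whatever the input left open
            self.builder.end(self.open_tags.pop())
        self.builder.end(WRAPPER)
        return self.builder.close()


def unescape_markup(escaped, content_type="application/xml"):
    """Unescape markup and parse it according to the given media type."""
    markup = html.unescape(escaped)
    if is_xml_media_type(content_type):
        # Baseline behaviour: the result must be well-formed XML or parsing fails.
        return ET.fromstring(f"<{WRAPPER}>{markup}</{WRAPPER}>")
    if content_type.split(";", 1)[0].strip().lower() == "text/html":
        # Implementation-defined fixup; this sketch only balances tags.
        parser = _NaiveFixup()
        parser.feed(markup)
        return parser.finish()
    raise ValueError(f"unsupported content type: {content_type}")
```

With this sketch, the escaped fragment `&lt;p&gt;Hello &lt;b&gt;world&lt;/p&gt;` raises `xml.etree.ElementTree.ParseError` under the default XML media type, while passing "text/html" yields a tree in which the unclosed `b` element has been closed. An implementation that only supports the XML baseline would simply omit the "text/html" branch.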
Received on Monday, 14 May 2007 15:01:58 UTC