- From: <noah_mendelsohn@us.ibm.com>
- Date: Thu, 22 Jan 2009 21:31:36 -0500
- To: "Anne van Kesteren" <annevk@opera.com>
- Cc: elharo@metalab.unc.edu, www-tag <www-tag@w3.org>
Anne van Kesteren asks:

> Why can the author not use an HTML parser for the database?
> Henri Sivonen e.g. has written tools for parsing HTML in Java
> that conform to the HTML5 specification (means that you get the
> same tree as browsers get) that plug directly into an XML
> toolchain if desired so you can use XSLT etc.

Well, if you're willing to go around and find all the tools that were written to handle XML, and augment them with parsers that special-case the conversion of HTML to XML, yes, you can do that. Sometimes that will be practical. Then again, there's an awful lot of software out there that already handles XML and that doesn't have such special-case code. There are, I would think, tens of millions of copies of Microsoft Office applications like Excel, for example.

Part of the value of XML is that it handles pretty much all input the same way. You don't have to go around adding new converting parsers to all of your tools, first for HTML, then for the next language that comes along that needs to be almost-but-not-quite XML.

It's a tradeoff. Given that there's a lot of HTML out there that isn't XML, there is of course incremental value in doing what Henri has done in cases where that's practical. I'm pointing out that part of the value of XML comes in scenarios where you are using the same tools (e.g. XML databases, styling tools, etc.) >unmodified< to manage or even integrate a wide range of data formats. I was also responding specifically to a question of why there was value in specifying syntax and parsing rules for XML separately from the specifications for particular applications that use XML.
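
As a rough illustration of the "unmodified tools" point, here is a minimal Python sketch using the standard library's generic XML parser. The helper name `attribute_values` and the three sample documents are hypothetical and are not taken from this thread. The same function queries an XHTML document and an unrelated vocabulary without any per-language code, while tag-soup HTML is rejected as not well-formed:

```python
import xml.etree.ElementTree as ET


def attribute_values(document, attr):
    """Return every value of `attr`, in document order, for any XML vocabulary."""
    # ET.fromstring raises ET.ParseError when the input is not well-formed XML.
    root = ET.fromstring(document)
    return [el.attrib[attr] for el in root.iter() if attr in el.attrib]


# Hypothetical sample inputs: well-formed XHTML, an unrelated XML vocabulary,
# and HTML "tag soup" (unquoted attributes, unclosed elements).
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p id="a">x</p><p id="b">y</p></body></html>'
)
other = '<inventory><item id="sku-1"/><item id="sku-2"/></inventory>'
tag_soup = "<html><body><p id=a>x<br><p id=b>y</body></html>"

print(attribute_values(xhtml, "id"))
print(attribute_values(other, "id"))

try:
    attribute_values(tag_soup, "id")
except ET.ParseError as err:
    # A generic XML tool has no special-case HTML handling, so it stops here.
    print("rejected as not well-formed:", err)
```

Handling the third input would require putting an HTML-aware parser in front of the tool, which is the per-tool augmentation discussed above.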
Noah

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

"Anne van Kesteren" <annevk@opera.com>
01/21/2009 04:40 AM

To: noah_mendelsohn@us.ibm.com, elharo@metalab.unc.edu
cc: www-tag <www-tag@w3.org>
Subject: Re: Comments on HTML WG face to face meetings in France Oct 08

On Tue, 20 Jan 2009 21:17:12 +0100, <noah_mendelsohn@us.ibm.com> wrote:

> Consider, though, a different use case, in which some of the same XHTML
> documents are to be stored in an XML database and their attributes and
> other data used as the subjects of queries. Now you have an interesting
> tension. The database will presumably deal only with well-formed XML
> documents, which means that the messier content that browsers deal with
> won't work in the database, at least not in the obvious way. On the
> other hand, the positive value of the layering becomes a bit clearer.
> The XML specification describes the subset of the documents that will
> work in tools like the XML database. Conforming XML parsers will accept
> those documents and reject others (though, as Elliotte points out,
> nothing prevents those parsers from handing the input text up to a
> browser, that may still decide to render it.)

Why can the author not use an HTML parser for the database? Henri Sivonen e.g. has written tools for parsing HTML in Java that conform to the HTML5 specification (means that you get the same tree as browsers get) that plug directly into an XML toolchain if desired so you can use XSLT etc.

It seems to me that a toolchain problem is much better solved at the toolchain level than in the format which is used in the database and published on the Web.

--
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>
Received on Friday, 23 January 2009 02:32:28 UTC