
Re: Comments on HTML WG face to face meetings in France Oct 08

From: <noah_mendelsohn@us.ibm.com>
Date: Thu, 22 Jan 2009 21:31:36 -0500
To: "Anne van Kesteren" <annevk@opera.com>
Cc: elharo@metalab.unc.edu, www-tag <www-tag@w3.org>
Message-ID: <OFBC06637C.5D87622B-ON85257547.000D2A49-85257547.000DE226@lotus.com>

Anne van Kesteren asks:

> Why can the author not use an HTML parser for the database?
> Henri Sivonen e.g. has written tools for parsing HTML in Java
> that conform to the HTML5 specification (means that you get the
> same tree as browsers get) that plug directly into an XML
> toolchain if desired so you can use XSLT etc.

Well, if you're willing to go around and find all the tools that were 
written to handle XML, and augment them with parsers that special-case the 
conversion of HTML to XML, yes you can do that.  Sometimes that will be 
practical.  Then again, there's an awful lot of software out there that 
already handles XML that doesn't have such special case code.  There are, 
I would think, tens of millions of copies of Microsoft Office applications 
like Excel, for example.
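
To make the "converting parser" idea concrete, here is a minimal sketch in Python of a front end that turns lenient HTML into well-formed XML text. It is not Henri's tool and it does not implement the HTML5 parsing algorithm; it only handles unclosed <p> elements and a short list of void elements. The names TreeBuildingHTMLParser, VOID_ELEMENTS and html_to_xml are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumption: a deliberately short list, not the full set of HTML void elements.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuildingHTMLParser(HTMLParser):
    """Builds an ElementTree from HTML events, repairing a few common omissions."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that still need an end tag

    def _close_through(self, tag):
        # Close every open element up to and including the nearest `tag`.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        # HTML allows a <p> to be closed implicitly by the next <p>.
        if tag == "p" and "p" in self.open_tags:
            self._close_through("p")
        # Valueless attributes arrive as None; XML requires a value.
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_ELEMENTS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore end tags that match nothing currently open.
        if tag in self.open_tags:
            self._close_through(tag)

    def handle_data(self, data):
        # Text outside the root element is dropped in this sketch.
        if self.open_tags:
            self.builder.data(data)


def html_to_xml(html_text):
    """Return well-formed XML text for a single-rooted HTML document."""
    parser = TreeBuildingHTMLParser()
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:  # close anything the author left open
        parser.builder.end(parser.open_tags.pop())
    return ET.tostring(parser.builder.close(), encoding="unicode")


xml_text = html_to_xml('<html><body><p id="intro">Hi<br><p id="more">More</body></html>')
root = ET.fromstring(xml_text)  # an ordinary XML parser now accepts the result
print([p.get("id") for p in root.iter("p")])
```

The point of the sketch is where the work lands: every tool that wants to accept such input needs a step like html_to_xml placed in front of it.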

Part of the value of XML is that it handles pretty much all input the same 
way.  You don't have to go around adding new converting parsers to all of 
your tools, first for HTML, then for the next language that comes along 
that needs to be almost-but-not-quite XML.  It's a tradeoff.  Given that 
there's a lot of HTML out there that isn't XML, there is of course 
incremental value in doing what Henri has done in cases where that's 
practical.  I'm pointing out that part of the value of XML comes in 
scenarios where you are using the same tools (e.g. XML databases, styling 
tools, etc.) >unmodified< to manage or even integrate a wide range of data 
formats.  I was also responding specifically to a question of why there 
was value in specifying syntax and parsing rules for XML separately from 
the specifications for particular applications that use XML.
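
As a rough illustration of the "unmodified tools" scenario, the Python sketch below runs one generic query function over two unrelated vocabularies and shows the same parser rejecting input that is not well formed. The sample documents and the names attribute_values and is_well_formed are assumptions made up for this example; a real XML database would of course do far more.

```python
import xml.etree.ElementTree as ET


def attribute_values(document_text, attribute_name):
    """Return every value of the named attribute, in document order.

    Nothing here knows which vocabulary the document uses.
    """
    root = ET.fromstring(document_text)  # raises ET.ParseError if not well formed
    return [
        element.attrib[attribute_name]
        for element in root.iter()
        if attribute_name in element.attrib
    ]


def is_well_formed(document_text):
    """Report whether a conforming XML parser accepts the text."""
    try:
        ET.fromstring(document_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample inputs.
XHTML_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p id="intro">Hi<br/></p><p id="more">More</p>'
    "</body></html>"
)
INVOICE_DOC = '<invoice id="inv-1"><line id="l1" qty="2"/></invoice>'
TAG_SOUP = '<html><body><p id="intro">Hi<br><p id="more">More</body></html>'

print(attribute_values(XHTML_DOC, "id"))    # same function, XHTML input
print(attribute_values(INVOICE_DOC, "id"))  # same function, unrelated vocabulary
print(is_well_formed(TAG_SOUP))             # browser-style HTML is rejected
```

The query code is identical for both well-formed inputs; only the tag-soup document needs something extra in front of the tool.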

Noah

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------








"Anne van Kesteren" <annevk@opera.com>
01/21/2009 04:40 AM
 
        To:     noah_mendelsohn@us.ibm.com, elharo@metalab.unc.edu
        cc:     www-tag <www-tag@w3.org>
        Subject:        Re: Comments on HTML WG face to face meetings in France Oct 08


On Tue, 20 Jan 2009 21:17:12 +0100, <noah_mendelsohn@us.ibm.com> wrote:
> Consider, though, a different use case, in which some of the same XHTML
> documents are to be stored in an XML database and their attributes and
> other data used as the subjects of queries.  Now you have an interesting
> tension.  The database will presumably deal only with well formed XML
> documents, which means that the messier content that browsers deal with
> won't work in the database, at least not in the obvious way.  On the 
> other hand, the positive value of the layering becomes a bit clearer. 
> The XML specification describes the subset of the documents that will work in
> tools like the XML database.  Conforming XML parsers will accept those
> documents and reject others (though, as Elliotte points out, nothing
> prevents those parsers from handing the input text up to a browser, that
> may still decide to render it.)

Why can the author not use an HTML parser for the database? Henri Sivonen 
e.g. has written tools for parsing HTML in Java that conform to the HTML5 
specification (means that you get the same tree as browsers get) that plug 
directly into an XML toolchain if desired so you can use XSLT etc.

It seems to me that a toolchain problem is much better solved at 
the toolchain level than in the format which is used in the database and 
published on the Web.


-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>
Received on Friday, 23 January 2009 02:32:28 GMT
