- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Mon, 3 Dec 2012 15:04:56 +0100
- To: Robin Berjon <robin@w3.org>
- Cc: David Carlisle <davidc@nag.co.uk>, public-html WG <public-html@w3.org>, www-tag@w3.org
Hi Robin!

David replied to you like so:

David Carlisle, Mon, 03 Dec 2012 10:49:36 +0000:
> On 03/12/2012 10:35, Robin Berjon wrote:
>> Case in point: we have a few large HTML datasets at hand which we
>> can use to look at how HTML is used in the wild. But for the most
>> part we tend to be limited to grepping, to some simple indexing, or
>> to parsing them all and applying some ad hoc rules to extract data
>> (which is slow). It would be sweet to be able to just load those
>> into a DB and run XQuery (or something like it) on them. If that
>> were possible, you'd get very fast, high octane analysis with
>> off-the-shelf software (a lot of it open source).
>
> Not directly related to the document at hand but for the record on the
> list archives, I think it should be noted of course that that is
> possible now. If you hook (say) the validator.nu sax parser up to any
> java based xslt/xquery engine, say saxon, then you can process html now
> with xml tools.

My comment: Perhaps my point wasn't clear enough and could be misunderstood. That, in turn, might be coloured by me not understanding perfectly how these tool chains work. But I stand by - firmly - the claim that the XML/HTML task force mixed things up in their first conclusion. They present a false dichotomy.

Why? Let me first say that I agree that their problem statement was all right. Nothing wrong with it. There is a need to process HTML, and Henri's parser is one way to solve that problem. Fine. But what has that to do with polyglot markup?

By saying that polyglot cannot solve this problem, the task force sends the signal that someone thinks polyglot markup can - or was meant to - solve that problem. But of course it can't. Who said it could? If you are dealing with polyglot markup, then you don't need Henri's parser - though you can still use Henri's parser to process polyglot markup, if you so wish. And if you are *not* dealing with polyglot markup, then you can also use Henri's parser.
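To illustrate David's point about processing HTML with XML tools: his actual chain is the validator.nu SAX parser feeding a Java XSLT/XQuery engine such as Saxon. The sketch below is only a crude stdlib-Python stand-in for the same idea - a lenient HTML tokenizer building an XML tree that XML-side tooling can then query - not the validator.nu algorithm itself.

```python
# Crude stand-in for the validator.nu-to-XML-tools pipeline David
# describes: parse tag-soup HTML into an ElementTree, then query it.
# (Not an HTML5-conforming parser; a sketch of the principle only.)
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "img", "meta", "link", "input", "hr", "area", "base",
        "col", "embed", "source", "track", "wbr"}

class TreeBuilder(HTMLParser):
    """Builds an ElementTree from possibly non-well-formed HTML."""
    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")   # synthetic root element
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag,
                           {k: v or "" for k, v in attrs})
        if tag not in VOID:              # void elements never nest
            self.stack.append(el)
    def handle_endtag(self, tag):
        # Recover from stray or missing end tags instead of failing,
        # as any HTML parser must.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break
    def handle_data(self, data):
        el = self.stack[-1]
        if len(el):
            el[-1].tail = (el[-1].tail or "") + data
        else:
            el.text = (el.text or "") + data

def parse_html(text):
    tb = TreeBuilder()
    tb.feed(text)
    return tb.root

# Tag soup: unquoted attribute values, unclosed <li>, bare <br>.
tree = parse_html("<ul><li><a href=/a>A<br><li><a href=/b>B</ul>")
print([a.get("href") for a in tree.iter("a")])   # ['/a', '/b']
```

Once the soup is a tree, any XML-side machinery (XPath, XQuery, XSLT) applies - which is exactly the "process html now with xml tools" claim.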
And my point, my comment to the TAG, was that Henri's parser and/or the rest of the tool chain can take well-formed or non-well-formed HTML and spit out *polyglot* HTML. Because polyglot is an output format. How to parse polyglot, by contrast, is defined by XML, by HTML5 etc. - and not by the polyglot markup spec.

It would have been relevant if the XML/HTML TF had discussed whether it would be useful to spit out polyglot markup via that toolchain. Had they done so, they would have demonstrated that they understood the purpose of polyglot markup. But as I said in my reply, they don't discuss that option and prefer instead to mention that polyglot markup cannot replace Henri's parser.

-- 
leif halvard silli
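The "polyglot is an output format" point can be sketched concretely: an ordinary XML serializer that follows a few conventions (lowercase names, self-closing void elements, an explicit XHTML namespace) emits markup that an XML parser and an HTML5 parser both read as the same tree. This is a minimal illustration with Python's stdlib ElementTree, not any tool the task force discussed.

```python
# Sketch: polyglot markup as an *output* format. Build a tree, then let
# a plain XML serializer emit it; the result is both well-formed XML
# and valid HTML5 syntax (e.g. the void element serializes as <br />).
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML)   # serialize without a namespace prefix

html = ET.Element("{%s}html" % XHTML)
body = ET.SubElement(html, "{%s}body" % XHTML)
p = ET.SubElement(body, "{%s}p" % XHTML)
p.text = "one"
ET.SubElement(p, "{%s}br" % XHTML).tail = "two"

out = ET.tostring(html, encoding="unicode")
print(out)
# <html xmlns="http://www.w3.org/1999/xhtml"><body><p>one<br />two</p></body></html>
```

The serializer, not the parser, is where "polyglot" happens: feed the same tree to an HTML5 serializer and you get plain HTML; feed it to an XML serializer with these conventions and you get markup both worlds accept.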
Received on Monday, 3 December 2012 14:05:32 UTC