Re: HTML/XML TF Report glosses over Polyglot Markup (Was: Statement why the Polyglot doc should be informative) from Leif Halvard Silli on 2012-12-03 (public-html@w3.org from December 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Mon, 3 Dec 2012 15:04:56 +0100
To: Robin Berjon <robin@w3.org>
Cc: David Carlisle <davidc@nag.co.uk>, public-html WG <public-html@w3.org>, www-tag@w3.org
Message-ID: <20121203150456692839.1cc68f0b@xn--mlform-iua.no>

Hi Robin! David replied to you like so:

David Carlisle, Mon, 03 Dec 2012 10:49:36 +0000:
> On 03/12/2012 10:35, Robin Berjon wrote:
>> Case in point: we have a few large HTML datasets at hand which we
>> can use to look at how HTML is used in the wild. But for the most
>> part we tend to be limited to grepping, to some simple indexing, or
>> to parsing them all and applying some ad hoc rules to extract data
>> (which is slow). It would be sweet to be able to just load those
>> into a DB and run XQuery (or something like it) on them. If that
>> were possible, you'd get very fast, high octane analysis with
>> off-the-shelf software (a lot of it open source).
> 
> Not directly related to the document at hand but for the record on the
> list archives, I think it should be noted of course that that is
> possible now. If you hook (say) the validator.nu sax parser up to any
> java based xslt/xquery engine, say saxon, then you can process html now
> with xml tools.

My comment: Perhaps my point wasn't clear enough and could be
misunderstood. That, in turn, might be coloured by my not understanding
perfectly how these tool chains work. But I stand firmly by my claim
that the XML/HTML task force mixed things up in their first conclusion.
They present a false dichotomy. Why? Let me first say that I agree that
their problem statement was all right. Nothing wrong with it. There is
a need to process HTML with XML tools. And Henri's parser is one way to
solve that problem. Fine.

But what has that to do with Polyglot Markup? The task force, by saying
that polyglot cannot solve this problem, is sending the signal that
some think polyglot markup can - or was meant to - solve that problem.
But of course it can't. Who said it could?

If you are dealing with polyglot markup, then you don't need Henri's 
parser. Except that you can still use Henri's parser to process 
polyglot markup, if you so wish. And if you are *not* dealing with 
polyglot markup, then you can also use Henri's parser. And my point, my 
comment to the TAG, was that Henri's parser and/or the rest of the tool 
chain can take HTML, whether well-formed or not, and spit out 
*polyglot* HTML.
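
To make that direction concrete, here is a minimal sketch in Python of
"lenient HTML in, well-formed markup out". It is only an illustration
under stated assumptions: it uses the standard library's html.parser,
not Henri's validator.nu parser, so it does not implement the HTML5
tree builder, and its output is merely well-formed, not guaranteed to
meet every rule of the polyglot spec. The names WellFormedSerializer,
to_well_formed and parses_as_xml are hypothetical. Comments, doctypes,
namespaces and script/style content are deliberately not handled.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# HTML void elements: serialized as <br/> so the output is also valid XML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class WellFormedSerializer(HTMLParser):
    """Hypothetical helper: re-serializes lenient HTML as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open non-void elements

    def _tag_with_attrs(self, tag, attrs):
        parts = [tag]
        seen = set()
        for name, value in attrs:
            if name in seen:
                continue  # XML forbids duplicate attributes
            seen.add(name)
            # Valueless attributes (e.g. "disabled") become disabled=""
            parts.append(f"{name}={quoteattr(value or '')}")
        return " ".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self.out.append(f"<{self._tag_with_attrs(tag, attrs)}/>")
        else:
            self.out.append(f"<{self._tag_with_attrs(tag, attrs)}>")
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # HTML ignores a trailing slash on non-void elements.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        self.close()  # flush any buffered text
        while self.stack:
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def to_well_formed(html_text):
    serializer = WellFormedSerializer()
    serializer.feed(html_text)
    return serializer.finish()


def parses_as_xml(text):
    """True if an XML parser accepts the text (needs a single root element)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

For instance, to_well_formed('<P CLASS=x>Hi<BR><input disabled>')
returns '<p class="x">Hi<br/><input disabled=""/></p>', which an XML
parser accepts while an HTML parser reads it the same way as before.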

Because polyglot is an output format. How to parse polyglot, by
contrast, is defined by XML, by HTML5, etc. - and not by the polyglot
markup spec. It would have been relevant if the XML/HTML TF had
discussed whether it would be useful to spit out polyglot markup via
that toolchain. Had they done so, they would have demonstrated that
they understood the purpose of polyglot markup. But as I said in my
reply, they don't discuss that option and prefer instead to mention
that polyglot markup cannot replace Henri's parser.
-- 
leif halvard silli

Received on Monday, 3 December 2012 14:05:32 UTC