RE: EXI on HTML

Hi Robin,

Thanks for the inquiry. That is indeed a good question!

In response to your question, the WG held a discussion to collect the
experiences of its members. Below is a summary of what we noted in that
discussion.

There is an issue in supporting HTML on the encoder (i.e. server) side that
stems either from HTML's intrinsic differences from XML or, in some cases,
from markup errors in HTML documents.

For example, HTML allows certain void elements, attributes without values,
and attribute values without quotation marks. These are all legitimate HTML,
but nonetheless do not parse with XML parsers. In other cases, there are
erroneous HTML documents, such as ones that contain elements that are not
correctly nested.
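
As a rough illustration of the quirks listed above, the following sketch
feeds a few small fragments to a plain XML parser (Python's standard
xml.etree.ElementTree). The fragments and the helper name parses_as_xml are
made up for this example and are not taken from any EXI implementation.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments, one per quirk mentioned in the text.
samples = {
    "void element": "<p>line one<br>line two</p>",
    "attribute without value": "<input disabled/>",
    "unquoted attribute value": "<a href=index.html>home</a>",
    "misnested elements": "<p><b><i>text</b></i></p>",
    "polyglot-style markup": "<p>line one<br/>line two</p>",
}

results = {name: parses_as_xml(markup) for name, markup in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'parses' if ok else 'rejected'}")
```

Only the last fragment, which is written in a form acceptable to both HTML
and XML parsers, gets through.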

EXI encoders that operate on documents (i.e. files) find it difficult to
transform the input HTML document into an EXI stream when plain XML parsers
are used to parse a document that exhibits one or more of the HTML quirks
described above. On the other hand, when the document parses with XML
parsers without errors, as is the case with valid polyglot [1] documents,
EXI encoders are able to process the HTML document just as if it were an
XML document. It does not appear to be unusual for HTML documents served to
mobile devices to be consistently parsable by XML parsers. This may be
related to what is suggested in the Mobile Web Best Practices document [2].

The WG noted that any HTML document, including one with certain errors,
can be transformed into a model in a way that is consistent across browser
implementations by employing the parsing rules defined by HTML 5. The
expectation is that EXI can be universally applied to HTML documents when
EXI encoders use HTML parsers that adequately implement those rules.
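
To make the idea concrete, here is a minimal sketch of the tidying step
using only the Python standard library. Please note the assumptions:
html.parser is a lenient tokenizer and not an implementation of the HTML 5
parsing algorithm, the recovery rule for misnested tags below is a
simplification invented for this example, and the names TidyingParser and
tidy are hypothetical. A real encoder would use a conforming HTML 5 parser
and then hand the resulting tree to an EXI encoder, which is not shown.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Void elements never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TidyingParser(HTMLParser):
    """Builds a well-formed element tree from lenient HTML input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; use an empty string.
        attrib = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, attrib)
        if tag in VOID:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag: ignore it
        # Simplified recovery: close everything down to the matching tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def to_element(self):
        self.close()
        while self.open_tags:  # close anything left open at end of input
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def tidy(html: str) -> str:
    """Return a well-formed XML serialization of a single-rooted fragment."""
    parser = TidyingParser()
    parser.feed(html)
    return ET.tostring(parser.to_element(), encoding="unicode")


xml_text = tidy("<p class=note><b><i>text</b></i><br>more<input disabled></p>")
root = ET.fromstring(xml_text)  # the output now parses as XML
print(xml_text)
```

The output of tidy() is what a document-oriented EXI encoder with an XML
front end could then consume.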

EXI has a registered content-coding tag, "exi", that is available for use.
Conceptually, you may consider the combination of the transformation (i.e.
tidying up the HTML) and the encoding (i.e. generating EXI) collectively as
one operation that represents the content-coding "exi" over HTML documents.
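
On the HTTP side, a server might decide whether to apply that operation
roughly as sketched below. This is only an illustration: the function names
accepts_exi and response_headers are hypothetical, the Accept-Encoding
handling is deliberately simplified (no wildcard support), and the actual
EXI encoding of the body is left out.

```python
def accepts_exi(accept_encoding: str) -> bool:
    """Return True if the Accept-Encoding value lists "exi" with q > 0."""
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "exi":
            continue
        q = 1.0  # default quality value when none is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        return q > 0
    return False


def response_headers(accept_encoding: str) -> dict:
    """Pick response headers for an HTML document (body encoding not shown)."""
    headers = {"Content-Type": "text/html", "Vary": "Accept-Encoding"}
    if accepts_exi(accept_encoding):
        headers["Content-Encoding"] = "exi"
    return headers


print(response_headers("gzip, exi;q=0.8"))
```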

[1] http://www.w3.org/TR/html-polyglot/
[2] http://www.w3.org/TR/mobile-bp/

Thanks!

-taki


-----Original Message-----
From: Robin Berjon [mailto:robin@berjon.com] 
Sent: Monday, January 16, 2012 2:48 AM
To: public-exi@w3.org
Subject: EXI on HTML

Hi all!

I don't know if you're aware of this, but there is currently a W3C task force that's looking at HTML/XML reconciliation[0] and that is getting close to publishing a document[1] about some aspects of the problem.

One topic that has surfaced a few times already is the applicability of EXI to HTML, despite the X in its name. Arguing in the abstract that it's possible (at least for a class of documents) is both true and unconvincing, so I was wondering if there'd be anyone here willing to share experience with this there? I don't think that the TF is looking for a full report on the issue, but something along the lines of "We tried it, it works (or not) except in this or that case, we had to apply this magic here to work around that problem, etc." would likely prove quite illuminating. Equally, "we looked into it and it turns out to be a daft idea" would be helpful if only in putting the matter to rest.

Thanks for any input!

[0] http://lists.w3.org/Archives/Public/public-html-xml/
[1] http://www.w3.org/2010/html-xml/snapshot/report.html

-- 
Robin Berjon - http://berjon.com/ - @robinberjon

Received on Tuesday, 31 January 2012 02:27:24 UTC