Re: The non-polyglot elephant in the room from Michael[tm] Smith on 2013-01-21 (public-html@w3.org from January 2013)

From: Michael[tm] Smith <mike@w3.org>
Date: Mon, 21 Jan 2013 23:47:40 +0900
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <20130121144739.GD46651@sideshowbarker>

"\"Martin J. Dürst\"" <duerst@it.aoyama.ac.jp>, 2013-01-21 20:14 +0900:

> On 2013/01/21 18:46, Henri Sivonen wrote:
> Very clear explanation. But just a question: What would be the effort of
> checking for polyglot markup?

I think that's a reasonable question to ask but I think an even more
reasonable question to ask is whether the supposed benefits of adding it
are worth the effort at all.

> I don't know the internal structure of your validator, but at least in
> some ideal implementation, "validates as polyglot" could just be defined
> as "validates as HTML" AND "validates as application/xhtml+xml".

So people can already determine that with the validator just by manually
running their documents through it twice: once with the HTML option
selected, and then again with the XHTML option selected.

> So even for implementing polyglot validation, we might not need a
> document describing polyglot markup :-).
> 
> The problems with the above simple plan that I managed to come up with in
> the five minutes I wrote this mail are: a) although a document might be
> valid both ways, the DOMs wouldn't match; b) merging errors may be quite
> tricky (but maybe not necessary); and c) there may be additional user
> interface overhead (but it could be as simple as changing the HTML/XHTML
> choice from radio buttons to checkmarks).

In the simplest implementation, the validator would need to automatically
parse and validate the document twice: once with the HTML parser and once
with the XML parser. But the error messages would not be merged. It
would show the messages from the HTML pass and then the messages from the
XML pass. So the information shown to the user would be pretty much the
same as if they just did the two validation passes manually. So the only
thing they'd be gaining would be the relatively minor convenience of having
the validator automate one extra step for them.
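
To make the two-pass idea concrete, here is a minimal sketch in Python. It is
not the validator's code: the names TagBalanceChecker, html_pass, xml_pass and
validate_both are hypothetical, the "HTML pass" is a toy that only checks tag
nesting with the standard library's html.parser, and the "XML pass" only
checks well-formedness with xml.etree.ElementTree. It just shows the shape of
running both passes and listing the messages one after the other, unmerged.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements: they never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Toy stand-in for an HTML validation pass (assumption: nesting only)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.messages = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            line, col = self.getpos()
            self.messages.append(
                f"HTML pass: {line}:{col}: unexpected end tag </{tag}>")


def html_pass(text):
    """Run the toy HTML check and return its messages."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    for tag in checker.stack:
        checker.messages.append(f"HTML pass: unclosed element <{tag}>")
    return checker.messages


def xml_pass(text):
    """Run an XML well-formedness check and return its messages."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return [f"XML pass: {err}"]
    return []


def validate_both(text):
    """Messages from the HTML pass, then from the XML pass, not merged."""
    return html_pass(text) + xml_pass(text)
```

For example, validate_both("<html><body><p>hi<br></p></body></html>") gives no
message from the toy HTML pass and one from the XML pass, because the <br> is
never closed as far as the XML parser is concerned.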

As far as implementing it to merge the error messages from separate passes,
that would take quite a lot of effort. The validator does streaming parsing
and validation and error reporting. Merging the error messages on the
server side would require them to not be emitted in a streaming way but
instead stored in memory and processed further before emitting them. It
would be possible to merge them on the client side, in JavaScript, but that
would also take significant effort.
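
As a rough illustration of why merging rules out streaming, here is a sketch
in Python of what the buffering step could look like. The Message type and
merge_messages function are hypothetical, not anything in the validator, and
the sketch assumes both passes report a line, a column and comparable message
text, which real HTML and XML parsers generally would not do so neatly. Both
message lists have to be fully collected before anything can be emitted.

```python
import heapq
from typing import List, NamedTuple


class Message(NamedTuple):
    line: int
    column: int
    text: str
    source: str  # "html", "xml", or "both" after merging


def merge_messages(html_msgs: List[Message],
                   xml_msgs: List[Message]) -> List[Message]:
    """Merge two buffered message lists into one, ordered by position.

    A message reported by both passes at the same position with the same
    text is collapsed into a single entry whose source is "both".
    """
    ordered = heapq.merge(sorted(html_msgs), sorted(xml_msgs))
    merged: List[Message] = []
    for msg in ordered:
        prev = merged[-1] if merged else None
        if prev and prev[:3] == msg[:3]:
            merged[-1] = prev._replace(source="both")
        else:
            merged.append(msg)
    return merged
```

The hard part in practice is the step this sketch assumes away: deciding when
a message from the HTML pass and a message from the XML pass are about the
same underlying problem.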

Another way to implement it would be to not do two parsing/validation
passes at all but instead to add some additional error-checking/reporting
to the HTML parser. That would require less effort than doing the two-pass
thing, and actually Sam Ruby has already contributed patches that do some
of that. But it's not complete as far as reporting all the things it should
report to conform with the Polyglot spec.

For quite a while the partial support was actually implemented (using Sam's
patches) and exposed as an option in the validator. But it was not clear
that very many users were actually using it. And after it was removed, we
didn't get any bug reports asking where it had gone. So I don't think we
have any evidence to suggest that having it is a high priority for a lot of
users.

  --Mike

-- 
Michael[tm] Smith http://people.w3.org/mike

Received on Monday, 21 January 2013 14:47:52 UTC