
Patch for "HTML5" support

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 05 Nov 2011 01:16:34 +0100
To: tidy-develop@lists.sourceforge.net
Message-ID: <p0u8b75v28kqj37tqgosvq9l7m54u54ek0@hive.bjoern.hoehrmann.de>
Hi,

  http://lists.w3.org/Archives/Public/www-archive/2011Nov/0005.html has
a patch that adds support for "HTML5" and "XHTML5" as per W3C's "Last
Call" Working Draft <http://www.w3.org/TR/2011/WD-html5-20110525/>. The
intended level of support is just "Does not corrupt or mark as invalid
fully conforming documents". It is not intended to conform, say, to the
"HTML5" parsing requirements in any way beyond that.

The patch breaks the public `tidyAttrIsProp` function, which is supposed
to tell whether an attribute is proprietary. It is passed only the
attribute, and that is not enough to answer the question, so it now
always returns the same value. I doubt this affects anybody. I'll
probably make that explicit by returning just "no" instead of going
through the current indirect method, and change AttributeVersions back
into a static function.

Breaking it is a side effect of removing the versions column from the
attribute_defs table. As noted above, it's not useful to know which
versions have a "type" attribute on one or more elements, since we have
that information on a per-element basis for all important document
types and all their elements and attributes.
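
To illustrate why the attribute alone is not enough, here is a minimal
sketch of a per-element lookup. It is not code from the patch: the
flag names, the table and `attr_versions` are all hypothetical, and the
three rows are only examples of how "type" differs by element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative version flags; these are not Tidy's actual constants. */
enum {
    SKETCH_HTML4_LOOSE = 1,
    SKETCH_HTML5       = 2
};

/* One row per (element, attribute) pair instead of one row per attribute. */
struct elem_attr_versions {
    const char *element;
    const char *attribute;
    unsigned versions;
};

/* Hypothetical sample data: "type" means something different on each element. */
const struct elem_attr_versions sketch_table[] = {
    { "a",      "type", SKETCH_HTML4_LOOSE | SKETCH_HTML5 },
    { "li",     "type", SKETCH_HTML4_LOOSE },
    { "source", "type", SKETCH_HTML5 }
};

/* Returns the version flags for the pair, or 0 if the pair is unknown. */
unsigned attr_versions(const char *element, const char *attribute)
{
    const size_t n = sizeof(sketch_table) / sizeof(sketch_table[0]);
    size_t i;

    assert(element != NULL && attribute != NULL);

    for (i = 0; i < n; ++i) {
        if (strcmp(sketch_table[i].element, element) == 0 &&
            strcmp(sketch_table[i].attribute, attribute) == 0)
            return sketch_table[i].versions;
    }
    return 0;
}
```

A function that only sees "type" cannot choose between these rows,
which is why a per-attribute versions column carries little information.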

This currently rejects "data-*" attributes; they need a special case in
some place I haven't yet looked up. It also does not support inline SVG
and MathML content. I am not entirely sure how to support those without
breaking other content while not spending much effort on the problem. A
simple example is the SVG <title> element, which likely needs to be
handled differently from the HTML <title> element.
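
For the "data-*" special case, the check itself could be as small as
the sketch below. `is_data_attribute` is a hypothetical helper, not an
existing Tidy function, and where it would be called from is exactly
the part I have not looked up. It assumes the name has to start with
"data-", have at least one character after the hyphen, and contain no
ASCII uppercase letters; it does not check XML name validity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` looks like a custom data
   attribute, 0 otherwise. Only the prefix, a non-empty remainder and
   the absence of ASCII uppercase letters are checked. */
int is_data_attribute(const char *name)
{
    static const char prefix[] = "data-";
    const size_t prefix_len = sizeof(prefix) - 1;
    const char *p;

    assert(name != NULL);

    if (strncmp(name, prefix, prefix_len) != 0)
        return 0;               /* does not start with "data-" */
    if (name[prefix_len] == '\0')
        return 0;               /* nothing after the hyphen */

    for (p = name + prefix_len; *p != '\0'; ++p) {
        if (*p >= 'A' && *p <= 'Z')
            return 0;           /* uppercase is not allowed */
    }
    return 1;
}
```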

So the patch mainly just updates the element and attribute tables, and
I guessed some parsing approximations. For example, <section> is parsed
like a <div>, which is a good approximation, but others might not be so
good. I also updated the "auto" doctype logic, so if you use
"HTML5"-only markup and no non-"HTML5" markup you should get the
appropriate doctype. There is no --doctype setting to force "HTML5"
output. I might add a "plain" setting there; not the best choice, but
"five" would be misleading due to the lack of version numbers.
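
The idea behind the "auto" doctype choice can be sketched as
intersecting the version sets of everything found in the document and
then picking from what remains. This is only an illustration under
assumptions: the flag names, both functions and the preference order
(strict, then loose, then "HTML5") are made up here and are not taken
from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative document type flags; not Tidy's actual constants. */
enum {
    DOC_HTML4_STRICT = 1,
    DOC_HTML4_LOOSE  = 2,
    DOC_HTML5        = 4,
    DOC_ALL          = DOC_HTML4_STRICT | DOC_HTML4_LOOSE | DOC_HTML5
};

/* Intersects the version sets of all constructs seen in a document.
   Each array entry holds the versions in which one element or
   attribute is allowed. */
unsigned constrain_versions(const unsigned *construct_versions, size_t count)
{
    unsigned remaining = DOC_ALL;
    size_t i;

    assert(construct_versions != NULL || count == 0);

    for (i = 0; i < count; ++i)
        remaining &= construct_versions[i];
    return remaining;
}

/* Picks one document type from the remaining set, or 0 if no single
   type allows everything. The order of preference is an assumption.
   DOC_HTML5 would correspond to emitting <!DOCTYPE html>. */
unsigned pick_doctype(unsigned remaining)
{
    if (remaining & DOC_HTML4_STRICT)
        return DOC_HTML4_STRICT;
    if (remaining & DOC_HTML4_LOOSE)
        return DOC_HTML4_LOOSE;
    if (remaining & DOC_HTML5)
        return DOC_HTML5;
    return 0;
}
```

With this, a document that contains a <section> (allowed only in
"HTML5" in this sketch) ends up with DOC_HTML5, while mixing it with
markup that is not in "HTML5" leaves an empty set and no doctype to pick.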

I have not updated any of the already known elements in the tag_defs
table. I am unsure how to handle <menu> there, for instance, which used
to be CM_OBSOLETE but has been resurrected. <keygen> and <wbr> are
similar, so those likely need some fine-tuning. Similarly, there may
have been changes to the lexical space of some attribute values, which
may lead Tidy to complain about values that haven't been allowed
before. That too is fine-tuning that doesn't necessarily have to be
done by me.

If there is enough interest in this that we get some test reports to the
tidy-develop@lists.sourceforge.net mailing list, and people can't find
major bugs, I might polish the patch and commit it.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Saturday, 5 November 2011 00:43:51 GMT
