Making the HTML language self-describing

On Wed, 7 Jan 2009, Martin Atkins wrote:
> 
> It would be ideal if future versions of HTML would be parsable by today's
> parsers, even if they ultimately ignore elements they don't understand.
> 
> The best example of this is void elements that get parsed as non-void by 
> legacy parsers; it is therefore not possible to use new void elements 
> without breaking software that employs legacy parsers, since the entire 
> tree after the new void element will be incorrect.

On Wed, 7 Jan 2009, Jonas Sicking wrote:
> 
> So, sort of restarting this thread again. Here are the problems that 
> would be good to solve:
> 
> 1. When a new version of HTML6 comes out, it should be possible to write 
> a document that uses elements from HTML6, but that parses to the same 
> DOM in a browser that both supports HTML6 and HTML5. Ideally such a 
> document would also validate as valid HTML6 and HTML5. Note that this 
> doesn't mean that *every* document should parse to the same DOM, just 
> that it is possible to write one that uses a new element but still 
> produces the same DOM in both parsers. So for example it's IMHO ok to 
> require that </p> elements are closed and that no tags are misnested
> for the same DOM to be produced.

If you never use optional end tags, the only thing that would cause a DOM 
difference that I can think of is void elements.
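
As a rough illustration of that difference, here is a minimal sketch in
Python. It is a toy tree builder, not the HTML5 parsing algorithm: it
assumes every non-void element is explicitly closed, and the element name
"newvoid" and the helper names (ToyTreeBuilder, parse) are hypothetical,
made up for this example. The only thing that varies between the two runs
is the set of void elements the parser knows about.

```python
from html.parser import HTMLParser


class ToyTreeBuilder(HTMLParser):
    """Builds nested (tag, children) tuples; NOT a conforming HTML parser."""

    def __init__(self, void_elements):
        super().__init__()
        self.void_elements = frozenset(void_elements)
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        # A known void element never becomes the current open element.
        if tag not in self.void_elements:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())


def parse(markup, void_elements):
    builder = ToyTreeBuilder(void_elements)
    builder.feed(markup)
    builder.close()
    return builder.root


# "newvoid" is a hypothetical future void element.
MARKUP = "<p>before</p><newvoid><p>after</p>"

legacy_tree = parse(MARKUP, {"br"})             # parser unaware of newvoid
new_tree = parse(MARKUP, {"br", "newvoid"})     # parser aware of newvoid
```

In the legacy run the second <p> ends up as a child of <newvoid>; in the
other run it is a sibling. Everything before the new void element is
identical in both trees.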

However, DOM differences would be the least of your problems if the UA 
doesn't support the void elements. With flow elements like <section> or 
<meter>, you might be able to use the elements even though the UA doesn't 
support them because you can style them. But with void elements, the 
elements are useless if the UA doesn't support them.

In other words, it basically *doesn't matter* if the DOM is different if 
you're using void elements the UA doesn't support.

In fact, as far as I can tell, the only problem would be with 
round-tripping, which is a serialisation issue:

> 2. Make it possible to create a generic serializer that takes a DOM and 
> produces HTML that parses into the same DOM. Independent of which HTML 
> version (>= 5) is used to parse.

As far as I can tell, if you have a conforming document and you're willing 
to not omit any of the optional end tags, all you need to have a generic 
serialiser is a list of void elements, a list of CDATA elements, a list of 
RCDATA elements, and the list of elements that are affected by the 
historical pre/textarea implied newline processing.

This can be trivially encoded as four lines in a configuration file.
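
A minimal sketch of what such a configuration-driven serialiser might look
like, in Python. The four lists below are illustrative guesses rather than
the authoritative lists for any particular HTML version, and the node
representation (tag, attrs, children) and the names CONFIG and serialize
are assumptions made for this example. It always emits end tags for
non-void elements, as described above.

```python
from html import escape

# The "four lines of configuration". Contents are illustrative only.
CONFIG = {
    "void": {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"},
    "cdata": {"script", "style"},
    "rcdata": {"textarea", "title"},
    "newline": {"pre", "textarea"},
}


def serialize(node, config=CONFIG):
    """Serialise a str (text) or a (tag, attrs, children) tuple to markup."""
    if isinstance(node, str):
        return escape(node, quote=False)

    tag, attrs, children = node
    attr_text = "".join(
        f' {name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    start_tag = f"<{tag}{attr_text}>"

    # Void elements have no content and no end tag.
    if tag in config["void"]:
        return start_tag

    out = [start_tag]

    # A parser drops one newline right after <pre>/<textarea>, so emit an
    # extra one when the content itself starts with a newline.
    if (tag in config["newline"] and children
            and isinstance(children[0], str) and children[0].startswith("\n")):
        out.append("\n")

    for child in children:
        if isinstance(child, str) and tag in config["cdata"]:
            # CDATA (raw text) content is written out without escaping.
            out.append(child)
        else:
            # In this sketch, RCDATA text is escaped like ordinary text.
            out.append(serialize(child, config))

    out.append(f"</{tag}>")
    return "".join(out)


doc = ("div", {"class": "a&b"}, [
    "x < y",
    ("br", {}, []),
    ("pre", {}, ["\ncode"]),
    ("script", {}, ["if (a < b) {}"]),
])
markup = serialize(doc)
```

Supporting a later HTML version would then mean editing the four sets in
CONFIG, not the serialize function.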


> 3. Write a generic parser that can be used to parse HTML markup of any 
> version (>= 5) into a DOM.

I don't think we'll ever be able to do this. For example, there is no way 
I could have predicted how we were going to add <ruby> parsing to the spec 
before I added it. This would be possible if we could guarantee that for 
all time, all new inventions would always be done in a regular way, but 
history has shown that we would be naive to assume this.


> 1 seems very important to me to allow for adoption of new elements. I'd 
> hate it if people were forced to use document.write hacks along with 
> browser detection to be able to use new elements.

You can use new elements other than void elements easily; void elements 
are only useful once the UAs support the feature anyway.


> 2 [and 3] seems important to allow generic tools, such as XSLT or DOM to 
> produce [and consume] HTML.

I would strongly encourage people who are using such pipelines to use XML, 
and just stick an XML-to-HTML converter on the end of their pipelines (and 
an HTML-to-XML converter on the front of their pipelines). These tools 
already exist, and they can be updated when HTML is updated.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 8 January 2009 01:13:28 UTC