Re: Formal definition of HTML5 (was Re: Version information) from Ian Hickson on 2007-04-16 (public-html@w3.org from April 2007)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 16 Apr 2007 21:22:32 +0000 (UTC)
To: Henrik Dvergsdal <henrik.dvergsdal@hibo.no>
Cc: public-html@w3.org
Message-ID: <Pine.LNX.4.62.0704162102030.17772@dhalsim.dreamhost.com>

On Mon, 16 Apr 2007, Henrik Dvergsdal wrote:
> 
> First of all, I'm not suggesting the schema should be an add-on to the 
> English prose. What I suggest is that we make an official schema and 
> then hardwire it into the prose by replacing the syntax definitions in 
> the prose with excerpts from the schema, much like it's done in the HTML 
> 4 standard.

Ah, ok. I don't want to do that because I have received feedback from a 
number of implementors and authors that they find English prose easier to 
understand than formal grammars.


> 1. It will facilitate a more tidy and efficient way of managing the standard.

I don't think so. My understanding is that it is inordinately complicated 
to express some of the spec's current restrictions in a formal grammar.


> To have competing schemas (or other specification techniques) reflect a 
> spec like this, will lead to a chaotic situation in which a lot of 
> people will waste a lot of time.

How is this different to the situation with the browsers?


> There should (and probably will) only be one schema. This will become a 
> de facto standard so we might as well make it official. To make it 
> unofficial or call it an "implementation detail" may have some 
> rhetorical value but nothing more.

I believe there already are multiple schemas.


> 2. It will make changes versus HTML4 more explicit.

Only if we use DTDs, which are amongst the least expressive of the schema 
languages out there. DTDs are so inexpressive they can't even express the 
XHTML1 element syntax completely (not even considering attributes).


> Take for instance the current definition of the TABLE element content model:
> 
> "In this order: Optionally a caption element, followed by either zero or 
> more colgroup elements, followed optionally by a thead element, 
> followed optionally by a tfoot element, followed by either zero or more 
> tbody elements or one or more tr elements, followed optionally by a 
> tfoot element (but there can only be one tfoot element child in total)."
> 
> This definition contains three differences with respect to HTML4. Try to 
> spot these without translating it into a grammar (at least in your head) 
> and comparing it to the HTML 4 DTD.

The three [sic] differences are:

   COL can no longer be a child of TABLE (that was a bug in HTML4)
   TFOOT can be at the end of TABLE
   TBODY can be omitted as a child of TABLE
   TR can be a child of TABLE

It didn't take me any effort but I guess I'm not a fair test.
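
For anyone who would rather check those four differences mechanically, 
here is a rough sketch in Python. It is only an illustration: the two 
regular expressions are my own transcription of the HTML4 DTD content 
model for TABLE and of the prose quoted above, and the helper names 
(compile_model, allows) are made up for this example.

```python
import re

def compile_model(pattern):
    # Hypothetical helper: turn each element name into a "name " token so
    # that whole names are matched, never prefixes (col vs. colgroup).
    return re.compile(re.sub(r"[a-z]+", lambda m: f"(?:{m.group(0)} )", pattern))

# HTML4 DTD: (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)
HTML4_TABLE = compile_model("caption?(?:col*|colgroup*)thead?tfoot?tbody+")

# Prose quoted above, with "only one tfoot in total" folded into the
# alternation: tfoot either precedes the body or optionally follows it.
HTML5_TABLE = compile_model(
    "caption?colgroup*thead?(?:tfoot(?:tbody*|tr+)|(?:tbody*|tr+)tfoot?)"
)

def allows(model, children):
    # children is the list of child element names, in document order.
    return model.fullmatch("".join(name + " " for name in children)) is not None

# One probe per difference listed above.
probes = {
    "col as a child of table": ["col", "tbody"],
    "tfoot at the end": ["tbody", "tfoot"],
    "tbody omitted": [],
    "tr as a child of table": ["tr"],
}
for label, children in probes.items():
    print(label, allows(HTML4_TABLE, children), allows(HTML5_TABLE, children))
```

Each probe is accepted by exactly one of the two models, which is the 
whole point of the list above.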


> 3. It will be easier to spot bugs.
> 
> I suspect two of the differences above are bugs. Where are they?

The first is a bug in HTML4 (it is incompatible with the parsing that 
browsers do). The rest are intentional changes.


> 4. We will reduce the number of bugs caused by the translation step from 
> prose to schema.

...while increasing the number of bugs caused by translating the spec to a 
different schema or a programming language. Implementors don't learn the 
schema language used in the spec; they just assume they understand it. 
However, by and large they _do_ know English, and so they understand that 
better.


> In Sivonen's schema the TABLE element is defined as 
> follows (my formatting):
>
> table.inner =
>   (caption.elem?, colgroup.elem*, thead.elem?,
>   ((tfoot.elem, (tbody.elem+ | tr.elem+)) | (( tbody.elem+ | tr.elem+),
> tfoot.elem?)))
> 
> Can you spot the difference versus the English prose above?

No, but that says more about my ability to understand an arbitrary grammar 
without knowing its language than it does about the suitability of English 
prose. (What is the difference? Oh, is it that it doesn't allow the lack 
of TBODYs and TRs?)
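
That guess can be tested by brute force. The sketch below is a rough 
illustration in Python, not anything from the spec or from the schema 
itself: both regular expressions are my own transcriptions (of the prose 
and of the grammar as quoted above), and compile_model, allows and 
disagreements are hypothetical helper names. It assumes the prose really 
does mean "zero or more tbody elements".

```python
import re
from itertools import product

NAMES = ["caption", "colgroup", "thead", "tfoot", "tbody", "tr"]

def compile_model(pattern):
    # Hypothetical helper: each element name becomes a "name " token.
    return re.compile(re.sub(r"[a-z]+", lambda m: f"(?:{m.group(0)} )", pattern))

# The prose content model: zero or more tbody, or one or more tr.
PROSE = compile_model(
    "caption?colgroup*thead?(?:tfoot(?:tbody*|tr+)|(?:tbody*|tr+)tfoot?)"
)
# The schema as quoted: one or more tbody, or one or more tr.
SCHEMA = compile_model(
    "caption?colgroup*thead?(?:tfoot(?:tbody+|tr+)|(?:tbody+|tr+)tfoot?)"
)

def allows(model, children):
    return model.fullmatch("".join(name + " " for name in children)) is not None

def disagreements(max_len=3):
    # Try every sequence of child elements up to max_len and collect the
    # ones that only one of the two models accepts.
    prose_only, schema_only = [], []
    for n in range(max_len + 1):
        for children in product(NAMES, repeat=n):
            p, s = allows(PROSE, children), allows(SCHEMA, children)
            if p and not s:
                prose_only.append(children)
            elif s and not p:
                schema_only.append(children)
    return prose_only, schema_only

prose_only, schema_only = disagreements()
print(prose_only)
print(schema_only)
```

Under those assumptions the schema accepts nothing that the prose 
rejects, and every table that only the prose accepts has neither a tbody 
nor a tr child (the empty table being the simplest case).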


> 5. We will gain precision
> 
> Take the current content model of the OBJECT element:
> 
> "When used as the child of a figure element, or, when used as a figure 
> fallback object: Zero or more param elements, followed by either zero or 
> more block-level elements or a single object element, which is then 
> considered to be a figure fallback object.
> 
> Otherwise: Zero or more param elements, followed by inline-level 
> content."
> 
> This can be interpreted in at least three different ways. Which one is 
> correct?

The only one that makes sense is the one where the single object element 
is the figure fallback object, since the others would have mismatched use 
of the plural/singular endings. However, this is academic for two reasons. 
First, you couldn't express this in non-prose at all, let alone more 
clearly, and second, this entire section needs to be revamped anyway.


> 6. It will make the text of the standard more accessible, at least for 
> "competent" developers. When you get used to the formal syntax it is 
> much easier to read than the prose.

Sadly, most developers and implementors (and spec writers!) do not fall 
under the label "competent" by that definition.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 16 April 2007 21:22:43 UTC