Re: Formal definition of HTML5 (was Re: Version information) from Henrik Dvergsdal on 2007-04-16 (public-html@w3.org from April 2007)

From: Henrik Dvergsdal <henrik.dvergsdal@hibo.no>
Date: Mon, 16 Apr 2007 14:33:53 +0200
To: public-html@w3.org
Message-Id: <648D4A86-88BF-425A-8E53-E2E42D4B11DC@hibo.no>
On 16 Apr 2007, at 03:50, Ian Hickson wrote:

> What's the _advantage_ of
> having an official formal schema?

> Note that I'm not at all opposed to
> making publicly available unofficial formal grammars.

First of all, I'm not suggesting the schema should be an add-on to the
English prose. What I suggest is that we make an official schema and
then hardwire it into the prose by replacing the syntax definitions
in the prose with excerpts from the schema, much like it's done in the
HTML 4 standard. Where the schema language is not expressive enough,
and where it is not feasible to use the schema syntax directly, we
include additional formal descriptions and rules in the text.

The advantages are:

1. It will facilitate a tidier and more efficient way of managing the
standard. It is already quite extensive and complex, and when the
rendering section is included, it's going to be huge. Having
competing schemas (or other specification techniques) reflect a spec
like this will lead to a chaotic situation in which a lot of people
will waste a lot of time. There should (and probably will) only be
one schema. This will become a de facto standard, so we might as well
make it official. To make it unofficial or call it an "implementation
detail" may have some rhetorical value but nothing more.


2. It will make changes versus HTML 4 more explicit.

Take for instance the current definition of the TABLE element content  
model:

"In this order: Optionally a caption element, followed by either zero  
or more colgroup elements,
followed optionally by a thead element), followed optionally by a  
tfoot element,
followed by either zero or more tbody elements or one or more tr  
elements,
followed optionally by a tfoot element (but there can only be one  
tfoot element child in total)."

This definition contains three differences with respect to HTML 4. Try
to spot these without translating it into a grammar (at least in your
head) and comparing it to the HTML 4 DTD.


3. It will be easier to spot bugs.

I suspect two of the differences above are bugs. Where are they?


4. We will reduce the number of bugs caused by the translation step
from prose to schema. In Sivonen's schema the TABLE element is defined
as follows (my formatting):

table.inner =
   (caption.elem?, colgroup.elem*, thead.elem?,
   ((tfoot.elem, (tbody.elem+ | tr.elem+)) |
    ((tbody.elem+ | tr.elem+), tfoot.elem?)))

Can you spot the difference versus the English prose above?
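
One way to make this comparison mechanical is to encode both content
models as regular expressions over sequences of child elements and
enumerate short sequences. The following is a minimal Python sketch,
not part of either specification: the one-letter encoding, the function
names and the way the prose is read into a pattern are all assumptions
made for illustration.

```python
import re
from itertools import product

# Hypothetical one-letter encoding of the TABLE child elements.
TOKENS = {"caption": "c", "colgroup": "g", "thead": "h",
          "tfoot": "f", "tbody": "b", "tr": "r"}

# The quoted prose, read literally: optional caption, zero or more
# colgroup, optional thead, optional tfoot, (zero or more tbody OR
# one or more tr), optional tfoot.
PROSE = re.compile(r"c?g*h?f?(b*|r+)f?")

# The quoted schema excerpt (table.inner), transcribed term by term.
SCHEMA = re.compile(r"c?g*h?(f(b+|r+)|(b+|r+)f?)")


def prose_accepts(seq):
    # The "only one tfoot in total" rule is a side condition in the prose.
    return PROSE.fullmatch(seq) is not None and seq.count("f") <= 1


def schema_accepts(seq):
    return SCHEMA.fullmatch(seq) is not None


def differences(max_len=3):
    """Return (prose_only, schema_only) for all sequences up to max_len."""
    prose_only, schema_only = [], []
    for n in range(max_len + 1):
        for combo in product(TOKENS.values(), repeat=n):
            seq = "".join(combo)
            p, s = prose_accepts(seq), schema_accepts(seq)
            if p and not s:
                prose_only.append(seq)
            elif s and not p:
                schema_only.append(seq)
    return prose_only, schema_only


if __name__ == "__main__":
    prose_only, schema_only = differences()
    print("accepted by prose only: ", prose_only)
    print("accepted by schema only:", schema_only)
```

Under these assumptions, every sequence accepted by only one of the two
definitions is on the prose side, and none of them contains a tbody or
tr element.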


5. We will gain precision.

Take the current content model of the OBJECT element:

"When used as the child of a figure element, or, when used as a  
figure fallback object:
Zero or more param elements, followed by either zero or more block- 
level elements
or a single object element, which is then considered to be a figure  
fallback object.
Otherwise: Zero or more param elements, followed by inline-level  
content."

This can be interpreted in at least three different ways. Which one  
is correct?

When used as the child of a figure element, or, when used as a figure  
fallback object:
(Zero or more param elements, followed by either zero or more block- 
level elements)
or a single object element, which is then considered to be a figure  
fallback object.
Otherwise: Zero or more param elements, followed by inline-level  
content.

When used as the child of a figure element, or, when used as a figure  
fallback object:
Zero or more param elements, followed by (either zero or more block- 
level elements
or a single object element), which is then considered to be a figure  
fallback object.
Otherwise: Zero or more param elements, followed by inline-level  
content.

When used as the child of a figure element, or, when used as a figure  
fallback object:
(Zero or more param elements, followed by either zero or more block- 
level elements
or a single object element), which is then considered to be a figure  
fallback object.
Otherwise: Zero or more param elements, followed by inline-level  
content.
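
To show that these readings are not just a matter of taste, the first
two can be written as patterns and tested against the same input. This
is a rough Python sketch under my own assumptions: the letters p, k and
o stand for a param element, a block-level element and an object
element, and the names READING_1 and READING_2 are made up for the
example. As far as I can tell, the third reading accepts the same
sequences as the second and differs only in what the "which is then
considered" clause refers to, so it is not modelled separately.

```python
import re

# Hypothetical encoding: p = param, k = block-level element, o = object.

# Reading 1: (zero or more param, then zero or more block-level) OR a
# single object element.
READING_1 = re.compile(r"(p*k*)|o")

# Reading 2: zero or more param, then (zero or more block-level OR a
# single object element).
READING_2 = re.compile(r"p*(k*|o)")


def accepts(pattern, seq):
    # True when the whole child sequence matches the content model.
    return pattern.fullmatch(seq) is not None


if __name__ == "__main__":
    # A param element followed by a nested fallback object.
    sample = "po"
    print("reading 1 accepts 'po':", accepts(READING_1, sample))
    print("reading 2 accepts 'po':", accepts(READING_2, sample))
```

The two patterns disagree on a param element followed by an object
element, which is exactly the kind of case a formal grammar would settle.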


6. It will make the text of the standard more accessible, at least
for "competent" developers. Once you get used to the formal syntax, it
is much easier to read than the prose.

--
henrik
Received on Monday, 16 April 2007 12:34:17 UTC