Re: Formal definition of HTML5 (was Re: Version information) from Henrik Dvergsdal on 2007-04-17 (public-html@w3.org from April 2007)

From: Henrik Dvergsdal <henrik.dvergsdal@hibo.no>
Date: Tue, 17 Apr 2007 02:23:40 +0200
To: public-html@w3.org
Message-Id: <C6D95844-316C-4F5B-A4AF-A387FA6809D0@hibo.no>
On 16 Apr 2007, at 23:22, Ian Hickson wrote:

>
> On Mon, 16 Apr 2007, Henrik Dvergsdal wrote:
>>
>> First of all I'm not suggesting the schema should be an add-on to the
>> English prose. What I suggest is that we make an official schema and
>> then hardwire it into the prose by replacing the syntax definitions in
>> the prose with excerpts from the schema, much like it's done in the
>> HTML 4 standard.
>
> Ah, ok. I don't want to do that because I have received feedback from a
> number of implementors and authors that they find English prose easier to
> understand than formal grammars.

I don't have enough data on this to argue with you, but I guess this  
depends on training as well as the complexity of what is described  
and precision requirements.

>
>> 1. It will facilitate a more tidy and efficient way of managing the
>> standard.
>
> I don't think so. My understanding is that it is inordinately complicated
> to express some of the spec's current restrictions in a formal grammar.

Which is exactly the point. This is a lot of very complicated work.  
It should only be done once.

>
>> To have competing schemas (or other specification techniques) reflect a
>> spec like this will lead to a chaotic situation in which a lot of
>> people will waste a lot of time.
>
> How is this different to the situation with the browsers?

I'm not sure what you mean here. If the situation with the browsers  
is a chaotic one where people are wasting time, we don't want to  
repeat that in web applications.

>
>> There should (and probably will) only be one schema. This will become a
>> de facto standard so we might as well make it official. To make it
>> unofficial or call it an "implementation detail" may have some
>> rhetorical value but nothing more.
>
> I believe there already are multiple schemas.

OK, maybe it could be useful to maintain different versions for  
different schema languages.
It would be interesting to have an overview of what schemas are  
actually being developed. Who knows, maybe there's even a DTD here  
somewhere.

>
>> 2. It will make changes versus HTML4 more explicit.
>
> Only if we use DTDs, which are amongst the least expressive of the schema
> languages out there. DTDs are so inexpressive they can't even express the
> XHTML1 element syntax completely (not even considering attributes).

I find it quite easy to compare DTD and RELAX NG syntax, provided both  
describe the same syntax.

>
>> Take for instance the current definition of the TABLE element content
>> model:
>>
>> "In this order: Optionally a caption element, followed by either zero or
>> more colgroup elements, followed optionally by a thead element,
>> followed optionally by a tfoot element, followed by either zero or more
>> tbody elements or one or more tr elements, followed optionally by a
>> tfoot element (but there can only be one tfoot element child in total)."
>>
>> This definition contains three differences with respect to HTML4. Try to
>> spot these without translating it into a grammar (at least in your head)
>> and comparing it to the HTML 4 DTD.
>
> The three [sic] differences are:
>
>    COL can no longer be a child of TABLE (that was a bug in HTML4)
>    TFOOT can be at the end of TABLE
>    TBODY can be omitted as a child of TABLE
>    TR can be a child of TABLE
>
> It didn't take me any effort but I guess I'm not a fair test.

No, you seem to be rather good at these things :-)
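
For readers who would rather check the comparison mechanically, here is a minimal sketch in Python. It assumes the HTML 4.01 DTD content model for TABLE, (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+), and the draft prose quoted above. The one-letter encoding and the function names are inventions of this sketch, and it compares content models only, ignoring HTML 4 tag omission rules.

```python
import re
from itertools import product

# One letter per child element (an encoding chosen for this sketch only):
# c=caption, o=col, g=colgroup, h=thead, f=tfoot, b=tbody, r=tr
ALPHABET = "coghfbr"

# HTML 4.01 DTD: (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)
HTML4 = re.compile(r"c?(?:o*|g*)h?f?b+")
# Draft prose quoted above; the "one tfoot in total" rule is checked separately.
DRAFT = re.compile(r"c?g*h?f?(?:b*|r+)f?")


def html4_allows(children: str) -> bool:
    return HTML4.fullmatch(children) is not None


def draft_allows(children: str) -> bool:
    return DRAFT.fullmatch(children) is not None and children.count("f") <= 1


def compare_models(max_len: int = 3):
    """Return (html4_only, draft_only) child sequences up to max_len."""
    html4_only, draft_only = [], []
    for n in range(max_len + 1):
        for combo in product(ALPHABET, repeat=n):
            s = "".join(combo)
            old, new = html4_allows(s), draft_allows(s)
            if old and not new:
                html4_only.append(s)
            elif new and not old:
                draft_only.append(s)
    return html4_only, draft_only


html4_only, draft_only = compare_models()
```

Under these assumptions, every sequence in html4_only contains a col child, while draft_only contains tables without tbody, tables with tr children, and tables whose tfoot follows the body. That agrees with the four differences listed in the quoted reply.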

>
>> 3. It will be easier to spot bugs.
>>
>> I suspect two of the differences above are bugs. Where are they?
>
> The first is a bug in HTML4 (it is incompatible with the parsing that
> browsers do). The rest are intentional changes.

That's good, but I think decisions like this should be recorded  
somewhere, just so that we can rule them out as bugs.

>
>> 4. We will reduce the number of bugs caused by the translation step from
>> prose to schema.
>
> ...while increasing the number of bugs caused by translating the spec to a
> different schema or a programming language. Implementors don't learn the
> schema in the spec. They just assume. However, by and large they _do_ know
> English, and so they understand that better.
>
>
>> In Sivonen's schema the TABLE element is defined as
>> follows (my formatting):
>>
>> table.inner =
>>   (caption.elem?, colgroup.elem*, thead.elem?,
>>   ((tfoot.elem, (tbody.elem+ | tr.elem+)) |
>>    ((tbody.elem+ | tr.elem+), tfoot.elem?)))
>>
>> Can you spot the difference versus the English prose above?
>
> No, but that says more about my ability to understand an arbitrary grammar
> without knowing its language than it does about the suitability of English
> prose. (What is the difference? Oh, is it that it doesn't allow the lack
> of TBODYs and TRs?)

My point is that the prose will eventually have to be translated to a  
schema, even if it's just in order to define an implementation detail.
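
As a rough illustration of what such a translation involves, the sketch below encodes both the draft prose and the quoted table.inner pattern as regular expressions in Python and enumerates short child sequences that one accepts and the other rejects. The one-letter encoding and the function names are assumptions of this sketch, not part of the spec or of the schema.

```python
import re
from itertools import product

# One letter per child element (an encoding chosen for this sketch only).
LETTERS = {"c": "caption", "g": "colgroup", "h": "thead",
           "f": "tfoot", "b": "tbody", "r": "tr"}

# Draft prose; the "only one tfoot in total" rule is checked separately.
PROSE = re.compile(r"c?g*h?f?(?:b*|r+)f?")
# table.inner as quoted from the schema.
SCHEMA = re.compile(r"c?g*h?(?:f(?:b+|r+)|(?:b+|r+)f?)")


def prose_allows(children: str) -> bool:
    return PROSE.fullmatch(children) is not None and children.count("f") <= 1


def schema_allows(children: str) -> bool:
    return SCHEMA.fullmatch(children) is not None


def differences(max_len: int = 4):
    """Return (prose_only, schema_only) child sequences up to max_len."""
    prose_only, schema_only = [], []
    for n in range(max_len + 1):
        for combo in product(LETTERS, repeat=n):
            s = "".join(combo)
            p, q = prose_allows(s), schema_allows(s)
            if p and not q:
                prose_only.append(s)
            elif q and not p:
                schema_only.append(s)
    return prose_only, schema_only


prose_only, schema_only = differences()
```

With this encoding, schema_only comes out empty and every sequence in prose_only lacks both tbody and tr, which is the difference guessed at in the quoted reply: the schema requires at least one tbody or tr, while the prose allows zero tbody elements.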

>
>> 5. We will gain precision
>>
>> Take the current content model of the OBJECT element:
>>
>> "When used as the child of a figure element, or, when used as a figure
>> fallback object: Zero or more param elements, followed by either zero or
>> more block-level elements or a single object element, which is then
>> considered to be a figure fallback object.
>>
>> Otherwise: Zero or more param elements, followed by inline-level
>> content."
>>
>> This can be interpreted in at least three different ways. Which one is
>> correct?
>
> The only one that makes sense is the one where the single object element
> is the figure fallback object, since the others would have mismatched use
> of the plural/singular endings. However, this is academic for two reasons.
> First, you couldn't express this in non-prose at all, let alone more
> clearly, and second, this entire section needs to be revamped anyway.

I'm not sure that I agree with all of this, but let's leave it for now.

>
>> 6. It will make the text of the standard more accessible, at least for
>> "competent" developers. When you get used to the formal syntax it is
>> much easier to read than the prose.
>
> Sadly, most developers and implementors (and spec writers!) do not fall
> under the label "competent" by that definition.

I guess I should have written "developers with basic training in  
writing/using formal grammars".


--
Henrik
Received on Tuesday, 17 April 2007 00:24:08 UTC