Clarifying presentation of microsyntax semantics in HTML 5

Ian:  when we chatted at breakfast this morning at TPAC you asked that I 
remind you of a couple of issues relating to the HTML 5 draft.  I assume 
it eases bug tracking if I send them in separate notes, so this is the 
first of two.

The first concern we discussed is that the semantics of microsyntaxes like 
signed integer [1] are a) unduly buried in the imperative parsing rules 
and b) thus at some risk of not making it into any authoring 
specification.  I suspect that's enough to remind you of the concern, but 
for the benefit of readers who weren't with us at breakfast, here's the 
same thing in more detail:

The declarative part of the explanation of signed integer says:

"A string is a valid integer if it consists of one or more characters in 
the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), optionally 
prefixed with a U+002D HYPHEN-MINUS ("-") character."

Sympathetic readers will obviously infer that the characters "123" in fact 
refer to the number one hundred twenty-three, but nothing in the above 
says that.  If I were to claim that these characters represented the 
number three hundred twenty-one, you couldn't prove me wrong from the 
above. 
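
To make the gap concrete, here is a minimal Python sketch.  The function 
names are hypothetical and nothing here is taken from the draft beyond the 
quoted validity rule.  It shows that the rule accepts or rejects strings 
but is equally compatible with two different readings of "123":

```python
import re

# The quoted declarative rule: an optional "-" followed by one or more
# characters in the range "0".."9".  [0-9] is used rather than \d so that
# only ASCII digits match.
_VALID_INTEGER = re.compile(r"-?[0-9]+")


def is_valid_integer(s: str) -> bool:
    """Return True if s satisfies the quoted validity rule."""
    return _VALID_INTEGER.fullmatch(s) is not None


def value_high_order_left(s: str) -> int:
    """Hypothetical reading 1: the conventional one, high-order digit first."""
    return int(s)


def value_high_order_right(s: str) -> int:
    """Hypothetical reading 2: digits taken in reverse, high-order digit last."""
    sign = -1 if s.startswith("-") else 1
    digits = s.lstrip("-")
    return sign * int(digits[::-1])


if __name__ == "__main__":
    print(is_valid_integer("123"))        # the rule is satisfied...
    print(value_high_order_left("123"))   # ...under the conventional reading
    print(value_high_order_right("123"))  # ...and under the reversed reading
```

The validity rule alone cannot distinguish the two value functions; only 
an explicit statement of the mapping from strings to numbers can.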

Now, immediately following the above is a set of step-by-step parsing 
rules, which implement what appears to be the logic of a function.  The 
last step says "If sign is "positive", return value, otherwise return 
0-value", and a sympathetic reader will understand that these rules 
have computed a result that defines the intended semantic to be 
"one hundred twenty-three".  So, the semantic is there in the parsing 
rules, at least if you're willing to make the assumption that what's 
referred to as the return value is in fact the intended semantic of the 
string being parsed.
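
For readers without the draft at hand, the following Python sketch gives 
the general shape of such imperative rules.  It is a simplified 
illustration, not a transcription of the draft: only the final step is 
quoted from it, and the function name, the use of None for errors, and 
the handling of trailing characters are all assumptions.

```python
from typing import Optional

DIGITS = "0123456789"


def parse_signed_integer(s: str) -> Optional[int]:
    """Simplified, assumed sketch of step-by-step integer parsing.

    Returns the parsed value, or None where the rules would report an error.
    """
    position = 0
    sign = "positive"
    value = 0

    # An optional leading "-" flips the sign.
    if position < len(s) and s[position] == "-":
        sign = "negative"
        position += 1

    # At least one digit must follow.
    if position >= len(s) or s[position] not in DIGITS:
        return None

    # Accumulate digits left to right.  This loop is the only place where
    # "high-order digit on the left" is actually expressed.
    while position < len(s) and s[position] in DIGITS:
        value = value * 10 + (ord(s[position]) - ord("0"))
        position += 1

    # Assumption: anything after the digits is ignored in this sketch.
    # Final step, as quoted from the draft:
    # "If sign is "positive", return value, otherwise return 0-value"
    return value if sign == "positive" else 0 - value
```

Note that the meaning of the string is stated nowhere except implicitly, 
in the accumulation loop and the return statement.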

So, to reiterate the concern, now that the details have been set out:

a) There are probably clearer and simpler ways of conveying the intended 
semantics than burying them in the parsing rules.  Alternatives range from 
an informal "these strings have the obvious interpretation as integers, 
high-order digits on the left, etc., with '-' indicating negative numbers" 
to more rigorous or even formal mappings using the appropriate polynomial. 
I'm not here recommending which of the many options should be chosen, just 
suggesting that burying the semantics in the parsing rules is suboptimal.

b) I believe the intention is to produce a specification for HTML 5 
authors by, among other things, stripping out the parsing rules.  There is 
a risk that the resulting specification would lack any indication at all 
of the intended interpretation of the strings. 
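
As one illustration of the "appropriate polynomial" option in (a), and 
not a proposal for specific wording, a declarative mapping might look 
like the following.  The symbols s, n, and d_i are notation introduced 
here, not taken from the draft: the string consists of an optional "-" 
followed by n digits, d_{n-1} is the numeric value of the leftmost digit, 
and d_0 that of the rightmost.

```latex
\[
  \mathrm{value} \;=\; (-1)^{s} \sum_{i=0}^{n-1} d_i \cdot 10^{i},
  \qquad
  s =
  \begin{cases}
    1 & \text{if the string begins with ``-''} \\
    0 & \text{otherwise}
  \end{cases}
\]
```

For "123" this gives 3 + 2*10 + 1*100 = 123, and for "-123" it gives 
-123, with no reference to any parsing procedure.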

I believe that similar comments would apply to many of the other 
microsyntaxes, and perhaps to other parts of the specification as well. 
Thank you.

Noah

[1] http://www.w3.org/html/wg/html5/#signed-integers



--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Monday, 20 October 2008 16:35:26 UTC