datatypes in HTML 5, Java, and XML Schema and the principle of well-defined behavior [... TAG Review of HTML 5] from Dan Connolly on 2009-09-03 (www-tag@w3.org from September 2009)

From: Dan Connolly <connolly@w3.org>
Date: Thu, 03 Sep 2009 14:39:59 -0500
To: ashok.malhotra@oracle.com
Cc: noah_mendelsohn@us.ibm.com, www-tag@w3.org
Message-Id: <1252006799.22683.11306.camel@pav.lan>

On Wed, 2009-09-02 at 19:39 -0700, ashok malhotra wrote:
> I did some of my homework re HTML5.  I had some comments and questions 
> on section 2.4
> 
> Section 2.4 describes several datatypes.  The syntax for these datatypes 
> is described informally.
[...]
> Q2.  Why are these algorithms required?  Typically, it is hard to get 
> the bugs out of them.
> Larry says they are for conformance/consistency.  If so, why not just
> reference standard works such as
> ISO 8601 or IEEE 754?
> 
> Q3.  Does HTML5 convert the string representation to binary for, say, 
> floating point numbers?
> If so, I'm sure implementations just use the native language libraries
> such as the Java Math library.
> Why not just refer to these?
> 
> Note that XML Schema covers much of the same ground and may be a good 
> reference.

The problem is that the details of the way these datatypes are
implemented in the web platform don't quite match Java or
XML Schema.

For example, in JavaScript,
parseInt("1a1") gives 1
(try it yourself at http://www.squarefree.com/shell/shell.html )

but in Java the comparable call (e.g. Integer.parseInt("1a1")) throws
an exception:

java.lang.NumberFormatException: java.lang.NumberFormatException: For
input string: "1a1"
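
The contrast can be reproduced without leaving JavaScript. The sketch
below compares the built-in parseInt with a hypothetical strictParseInt
helper (not part of any standard library; introduced here only for
illustration) that rejects trailing garbage, roughly the way the Java
call does.

```javascript
// Built-in behavior: parseInt stops at the first character that is not
// a valid digit in the given radix and returns what it has read so far.
const lenient = parseInt("1a1", 10);

// Hypothetical strict variant (illustration only): accept the string
// only if it is an optional sign followed entirely by decimal digits.
function strictParseInt(s) {
  if (!/^[+-]?[0-9]+$/.test(s)) {
    throw new RangeError('For input string: "' + s + '"');
  }
  return parseInt(s, 10);
}

console.log(lenient);          // prints 1
try {
  strictParseInt("1a1");
} catch (e) {
  console.log(e.message);      // prints: For input string: "1a1"
}
```

Both behaviors are defensible; the point is that a spec which says
nothing about "1a1" lets implementations pick either one.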

It's somewhat traditional to say that cases like "1a1" are
out of scope and leave them implementation-defined, but
that goes against one of the principles of the HTML 5 effort:

"Prefer to clearly define behavior that content authors could rely on,
in preference to vague or implementation-defined behavior. This way, it
is easier to author content that works in a variety of user agents."
http://www.w3.org/TR/html-design-principles/#well-defined-behavior
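
To make the principle concrete: a fully specified parsing algorithm
assigns exactly one result to every input string, malformed ones
included. The sketch below is loosely modeled on the step-by-step style
of the HTML 5 number-parsing rules; it is not a transcription of the
spec text, and the name parseIntegerPrefix and the use of null as the
error value are assumptions made for illustration.

```javascript
// Illustrative sketch: every input maps to exactly one outcome,
// either an integer or the error value null.
function parseIntegerPrefix(input) {
  let pos = 0;
  // Step 1: skip leading whitespace.
  while (pos < input.length && " \t\n\f\r".includes(input[pos])) pos++;
  // Step 2: consume an optional minus sign.
  let sign = 1;
  if (input[pos] === "-") { sign = -1; pos++; }
  // Step 3: collect digits; at least one is required, otherwise error.
  const start = pos;
  while (pos < input.length && input[pos] >= "0" && input[pos] <= "9") pos++;
  if (pos === start) return null;
  // Step 4: anything after the digits is ignored, by definition.
  return sign * Number(input.slice(start, pos));
}

console.log(parseIntegerPrefix("1a1"));     // prints 1
console.log(parseIntegerPrefix("  -12px")); // prints -12
console.log(parseIntegerPrefix("a1"));      // prints null
```

Because each step says what happens on unexpected input, two
independent implementations of these steps cannot disagree on "1a1".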

And yes, it's hard to get the bugs out of specifications of
this style. Given the number of details and the interactions
between them, my mind boggles at the size of the test suite
that would give me confidence about interoperability.
Numbers like 50,000 tests get thrown around. Considering
that XQuery's test suite was about that big and XQuery is
more regular (having been designed rather than
reverse-engineered), even that many tests will leave lots of holes.

I found the "1a1" case in test materials that cover just number parsing;
the test file is 2773 lines long, with about 250 test cases.
http://hg.gsnedders.com/php-html-5-direct/file/8c27462f5f41/tests/numbersTest

  Implementation + Test Cases Available For Numbers Subsection of
  Common Microsyntaxes
  Geoffrey Sneddon
  12 Jul 2007
  http://lists.w3.org/Archives/Public/public-html/2007Jul/0650.html

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E

Received on Thursday, 3 September 2009 19:40:09 UTC