Notations in (x)HTML from Arjun Ray on 2000-02-06 (www-html@w3.org from February 2000)

From: Arjun Ray <aray@q2.net>
Date: Sun, 6 Feb 2000 02:48:17 -0500 (EST)
To: www-html@w3.org
Message-ID: <Pine.LNX.4.10.10002060102320.21982-100000@mail.q2.net>

On Sat, 5 Feb 2000, Murray Altheim wrote:

> [I] have lobbied within the HTML WG to begin work on figuring out
> exactly what all of the data types (ie., notations) currently used
> in XHTML are, and come to some determination on how they can be
> declared in a way that is the same between XHTML DTDs and Schemas.

Hoo boy, that last bit is a biggie...:)

Right now, the Schema activity has Second System Syndrome.  I still
can't get through the new WDs without my eyes glazing over, and I
suspect my experience isn't atypical.  Even though some sort of "DTD
compatibility" is a requirement, I'm guessing this will be the first
thing to go when the Schema stuff manages to articulate its direction
to the General Public.  Which is to say, the best we can hope for,
IMHO, is the eventual availability of an explicit conversion program,
for the old fogies still out there who Have Not Seen The Way And Thus
Realized That DTD Syntax Is Imperishably Ugly And Has Gotta Go.

> We need to understand better the data types we're using anyway,
> esp. as we move into schemas for XHTML.

The -datatypes.mod is a good start.  Is there something important
missing?

IMHO, Schemas are tending too much towards an "ontological" view of
datatypes (the 'Datatypes' moiety relegating regular expressions to a
"string" type says it all, I think.)  For a long time, I used to be
troubled by the set of "datatypes" provided by SGML.  Then I realized
that the real problem was the illusion of a "datatype" itself - i.e.
it's not useful to think of those thingies in these terms, because
SGML is basically just a taxonomic formalism.  It's all about names.
Attribute value literals, regardless of the declared value, are
replaceable character data and thus just strings: the only distinction
is whether the string can (or should) be tokenized into combinations
of name characters - CDATA vs all the others - because name tokens are
the *only* true native notation ("datatype") in SGML, and so that
particular tokenization service "comes for free" in the parser.
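
To make the "comes for free" tokenization concrete, here is a minimal sketch of what a parser does with a non-CDATA attribute value: split on whitespace and check that each piece consists only of name characters. The function name `tokenize_nmtokens` is hypothetical, and the character class is an ASCII-only approximation of XML name characters, not the full definition from either the SGML or the XML specification.

```python
import re

# Assumption: ASCII-only approximation of name characters
# (letters, digits, '.', '-', '_', ':'). The real definitions
# in SGML and XML are broader and syntax-dependent.
NAME_TOKEN = re.compile(r"[A-Za-z0-9._:-]+")


def tokenize_nmtokens(value):
    """Split an attribute value literal into name tokens.

    Returns the list of tokens if every whitespace-separated piece
    is a name token, or None if the value is empty or any piece
    contains a non-name character.
    """
    tokens = value.split()
    if tokens and all(NAME_TOKEN.fullmatch(t) for t in tokens):
        return tokens
    return None


if __name__ == "__main__":
    print(tokenize_nmtokens("left  top-10 a.b"))
    print(tokenize_nmtokens("50%"))
```

Anything that fails this check (a percentage, a URI, a length with units) is, from the parser's point of view, just a string; giving it meaning is a job for something outside the parser.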

This is why notations are so important: one of their taxonomic roles
is to "hook" to other tokenizing, structuring or machine-processing
schemes (note the typical usage of system identifiers for notations
pointing to interpreters.)  Beyond that is the black art of judging
when enough is enough: how much do you try to encompass within your
formalism, and how much are you content to simply point to (i.e.
record a reference only)?

As an example, suppose the tokenization of strings were taken beyond
name tokens, to regular expressions (surely the natural generalization
in text processing.)  Well, can a URI be described by a regular
expression (and therefore specifiable "internally" to a parser or
validator that groks regexps)? Here's one answer:

  http://www.deja.com/=dnc/getdoc.xp?AN=513160002
  http://www.deja.com/=dnc/getdoc.xp?AN=513219055

Be careful what you ask for!  Sometimes the better part of wisdom is
to declare a notation, rather than invest hope in a built-in schema
validator:)
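
As a small illustration of the gap between "a regexp exists" and "a regexp validates", the sketch below uses the component-splitting expression given in Appendix B of RFC 2396. It is not taken from the posts linked above, and the helper name `split_uri` is hypothetical. The expression is designed to break a reference into parts, not to reject bad input, so it matches any string at all.

```python
import re

# Component-splitting expression from RFC 2396, Appendix B.
# Every part is optional, so this matches any input string.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(ref):
    """Split a URI reference into its five generic components.

    Components that are absent come back as None; the path is
    always a string (possibly empty).
    """
    m = URI_SPLIT.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


if __name__ == "__main__":
    print(split_uri("http://www.deja.com/=dnc/getdoc.xp?AN=513160002"))
    # Garbage is "accepted" too: it all lands in the path component.
    print(split_uri("not a uri at all"))
```

A regexp that actually rejects malformed URIs has to encode the per-component character rules and escaping, which is where the size and fragility come from.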
  
> If the 'DATA' attributes feature proves that valuable, then
> perhaps we can lobby for its inclusion in a future version of XML.

We need data attributes and something like DAFE, too.


Arjun
Received on Sunday, 6 February 2000 02:32:07 UTC