Re: including a schema with "HTML: The Markup Language" Clarifying TAG Re: Courtesy notification

Dan Connolly <connolly@w3.org>, 2010-03-16 09:38 -0500:

> On Mon, 2010-03-15 at 19:49 +0900, Michael(tm) Smith wrote:
> > I moved the status of bug 8611 to resolved=wontfix.
> > I think discussion of what kind of schema or set of schemas (e.g.,
> > RelaxNG plus Schematron plus whatever else) to publish, and how to
> > publish it, should be raised either as a new bug or as an HTML WG
> > Tracker issue.
> 
> That seems kinda odd, Mike; the way the document is built uses
> a schema for its backbone. The schema is there in the source:
>   http://dev.w3.org/cvsweb/html5/markup/schema/

The schema is not in the source any longer (I've removed it).

As far as the relationship between that schema and the H:TML doc,
I'm not sure "backbone" is an accurate word to describe that; the
RelaxNG grammar itself is one among several sources of information
the document is built from.

And that grammar on its own doesn't even completely express all
the constraint checks that need to be done by an actual validation
tool that uses it as an input -- in particular, datatype constraints.
For those it relies on a datatype library written in Java:

  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/

(That works with the RELAX NG Pluggable Datatype Libraries
interface that James Clark and KAWAGUCHI Kohsuke developed.)

A number of those datatypes are not practically expressible in any
formalism other than actual program code. For example, this is the
code that checks whether a language tag is valid or not:

  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/Language.java
  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/data/LanguageData.java

If you take a look at that, you'll see that what it's doing is,
it's grabbing a copy of the IANA language-subtag registry, parsing
it, and then doing a number of checks against it. (You may also
notice that along with emitting errors for invalid subtags, it is
also -- when it is used within the context of the validator.nu
service -- capable of emitting warnings for cases of subtags that
are valid but deprecated, which is something not typically done by
any validation language or validation tool that is strictly
grammar-based.)
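
As a rough illustration of that kind of check (this is not the
validator.nu code, just a minimal Python sketch), the following parses a
few records in the registry's "%%"-separated format and classifies the
primary language subtag of a tag. The names SAMPLE_REGISTRY,
parse_registry and check_primary_language are made up for this example,
and the sample records are hand-written stand-ins, not a copy of the real
registry. It also ignores folded continuation lines and everything after
the first subtag.

```python
# Hand-written sample in the registry's record format; NOT the real registry.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: iw
Description: Hebrew
Deprecated: 1989-01-01
Preferred-Value: he
%%
Type: language
Subtag: he
Description: Hebrew
"""


def parse_registry(text):
    """Return {subtag: fields} for every record whose Type is 'language'."""
    languages = {}
    for record in text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
        if fields.get("Type") == "language":
            languages[fields["Subtag"].lower()] = fields
    return languages


def check_primary_language(tag, languages):
    """Return (level, message) for the primary language subtag only."""
    primary = tag.split("-")[0].lower()
    fields = languages.get(primary)
    if fields is None:
        return ("error", f"unknown language subtag '{primary}'")
    if "Deprecated" in fields:
        preferred = fields.get("Preferred-Value")
        hint = f"; use '{preferred}' instead" if preferred else ""
        return ("warning", f"language subtag '{primary}' is deprecated{hint}")
    return ("ok", "")


LANGUAGES = parse_registry(SAMPLE_REGISTRY)
```

The three-way result (error, warning, ok) is the part a purely
grammar-based datatype cannot give you: a grammar can only accept or
reject a value.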

And in addition to relying on that datatype library -- which is
nominally part of the Relax NG schema but not expressible within
the schema itself -- there's also Java code that acts as a
Schematron workalike to do assertions-based checking:

  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/schematronequiv/Assertions.java
  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/ConformingButObsoleteWarner.java

Those checks are fully expressible in Schematron, but there are
other constraints that are not, practically; the Java code that
does those checks is here:

  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/
  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/table/
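
To give a feel for why some checks are awkward to write as XPath-style
assertions, here is a minimal Python sketch of one plausible table check:
detecting cells that overlap because of rowspan/colspan. It is an assumed
example, not a description of what the code in that directory does. The
function name find_overlaps and the input shape (each row a list of
(colspan, rowspan) pairs) are invented for illustration, and the slot
assignment is a simplification of the HTML table model.

```python
def find_overlaps(rows):
    """Find cells that cover a slot already claimed by an earlier cell.

    rows: list of rows; each row is a list of (colspan, rowspan) tuples.
    Returns a list of (row_index, cell_index) for each overlapping cell.
    """
    occupied = set()  # (row, col) slots claimed so far
    overlaps = []
    for r, row in enumerate(rows):
        col = 0
        for i, (colspan, rowspan) in enumerate(row):
            # A cell starts at the first free slot in its row.
            while (r, col) in occupied:
                col += 1
            slots = {
                (r + dr, col + dc)
                for dr in range(rowspan)
                for dc in range(colspan)
            }
            if slots & occupied:
                overlaps.append((r, i))
            occupied |= slots
            col += colspan
    return overlaps
```

The check needs running state (the set of occupied slots) that is carried
from row to row, which is easy in program code and clumsy in a
declarative assertion language.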

There are also constraint checks that are performed by -- and
error messages emitted for -- the validator.nu/Mozilla HTML5
parser code:

  http://hg.mozilla.org/projects/htmlparser/file/546412142175/src/nu/validator/htmlparser/impl

If you want to see what I mean, grep through that directory for
"err" and "warn".

Anyway, my point is, that RelaxNG grammar is far from being on its
own a worthwhile means for doing many of the most useful
constraint checks that a modern, useful HTML validation tool
really needs to do in practice -- so far, IMHO, that it would be
potentially very misleading to publish it and encourage its use
outside the context of its function as part of a complete HTML
validation tool (like validator.nu).

Another problem with that particular grammar is that it is not
optimized for use as a part of a formal description of the HTML
language. It is instead (to some degree at least) optimized for
the particular use that it is being put to within the validator.nu
service. And it may be in the future that it will end up being
optimized for that use to an even greater degree. For example,
there are a number of constraint checks that are currently done in
the grammar but that -- in order to generate more useful error
messages reported for them -- would instead much better be done
using Schematron (or in Schematron-workalike code such as
validator.nu uses). (That's maybe due in part to the poor quality
of error messages that Jing (which validator.nu uses) currently
emits for certain cases -- like the case of required-but-missing
attributes.)
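
For comparison, an assertion-style check for the required-but-missing
attribute case might look like the following minimal Python sketch. It is
not validator.nu's code. The REQUIRED table and the function name are
hypothetical, the message wording is invented, and the input is assumed
to be well-formed XML without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> attributes it must carry.
REQUIRED = {"img": ["src"]}


def check_required_attributes(xml_text):
    """Return one targeted message per missing required attribute."""
    messages = []
    for element in ET.fromstring(xml_text).iter():
        for attr in REQUIRED.get(element.tag, []):
            if attr not in element.attrib:
                messages.append(
                    f"Element '{element.tag}' is missing "
                    f"required attribute '{attr}'."
                )
    return messages
```

Because each rule carries its own message, the report can name the
element and the attribute directly, rather than leaving the user to
interpret a generic grammar-mismatch error.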

Anyway, I also think that in a validation system that relies on
multiple means for doing constraint checks, just because something
is expressible in a grammar-based schema that's part of that
system doesn't mean it *should* be expressed in a grammar, nor
that the grammar is necessarily the best place to express it. So
in the case of this particular schema, there is no guarantee that
some constraints that are currently in that grammar won't
eventually be moved out to a different part of the system.

> So you're already publishing the information that's in
> the schema;

The information in that schema was developed from the almost
completely prose-based document-conformance constraint expressions
in the HTML5 spec itself. So it can reasonably be viewed as just
one possible implementation of just some particular document-
conformance constraints that are expressed in the HTML5 spec.

> is there some reason not to just include the
> source of the schema in an informative
> appendix (with whatever disclaimers you like), so that
> other people can make similar uses of it?

For one thing, the H:TML doc, as it stands now, by design does not
actually use the same expression language as that schema. The
parts that express the same constraints that are in that schema
are instead expressed in the H:TML doc in natural language.
That natural language is currently generated in part (through a
very baroque, fragile, inelegant build process I hacked together)
from that schema, but I do not guarantee that it will always
remain so. I originally did it that way in part as an experiment
to see how doable and useful it'd end up being, and I'm still not
sure myself that it's been a successful experiment; it may be
better in the long run to sever the tie and just maintain those
parts manually rather than generating them.

On top of that, if the HTML WG were to publish a formalism or set
of formalisms for the HTML language, I think there is a good
argument against making that particular schema the basis for it
(for the reasons I mentioned above -- that it's already optimized
to some degree for particular use with validator.nu, and may in
the future end up being optimized for that even more so).

  --Mike

-- 
Michael(tm) Smith
http://people.w3.org/mike

Received on Thursday, 18 March 2010 10:11:56 UTC