- From: Michael(tm) Smith <mike@w3.org>
- Date: Thu, 18 Mar 2010 19:11:48 +0900
- To: Dan Connolly <connolly@w3.org>
- Cc: noah_mendelsohn@us.ibm.com, Paul Cotton <paul.cotton@microsoft.com>, Philippe Le Hegaret <plh@w3.org>, Sam Ruby <rubys@intertwingly.net>, "www-tag@w3.org WG" <www-tag@w3.org>, Maciej Stachowiak <mjs@apple.com>
Dan Connolly <connolly@w3.org>, 2010-03-16 09:38 -0500:

> On Mon, 2010-03-15 at 19:49 +0900, Michael(tm) Smith wrote:
> > I moved the status of bug 8611 to resolved=wontfix.
> > I think discussion of what kind of schema or set of schemas (e.g.,
> > RelaxNG plus Schematron plus whatever else) to publish, and how to
> > publish it, should be raised either as a new bug or as an HTML WG
> > Tracker issue.
>
> That seems kinda odd, Mike; the way the document is built uses
> a schema for its backbone. The schema is there in the source:
> http://dev.w3.org/cvsweb/html5/markup/schema/

The schema is not in the source any longer (I've removed it).

As far as the relationship between that schema and the H:TML doc, I'm
not sure "backbone" is an accurate word to describe that; the RelaxNG
grammar itself is one among several sources of information the
document is built from. And that grammar on its own doesn't even
completely express all the constraint checks that need to be done by
an actual validation tool that uses it as an input -- in particular,
datatype constraints. For those it relies on a datatype library
written in Java:

  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/

(That works with the RELAX NG Pluggable Datatype Libraries interface
that James Clark and KAWAGUCHI Kohsuke developed.)

A number of those datatypes are not practically expressible in any
formalism other than actual program code. For example, this is the
code that checks whether a language tag is valid or not:

  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/Language.java
  https://whattf.svn.cvsdude.com/syntax/trunk/relaxng/datatype/java/src/org/whattf/datatype/data/LanguageData.java

If you take a look at that, you'll see that what it's doing is, it's
grabbing a copy of the IANA language-subtag registry, parsing it, and
then doing a number of checks against it.

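As a rough illustration of that registry-driven approach (this is not
the validator.nu code), here is a minimal Python sketch. It parses
records in the registry's "%%"-separated record format and looks up
each subtag of a tag. The sample registry text is a hand-written,
abbreviated excerpt rather than a verbatim copy, the names
parse_registry and check_language_tag are hypothetical, and the sketch
only handles tags of the form "language" or "language-region". It
reports unknown subtags as errors and deprecated ones as warnings.

```python
# Hand-written, abbreviated sample in the registry's record format
# (an assumption for illustration; the real registry has many more
# records and fields).
SAMPLE_REGISTRY = """\
File-Date: 2010-01-01
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: he
Description: Hebrew
%%
Type: language
Subtag: iw
Description: Hebrew
Deprecated: 1989-01-01
Preferred-Value: he
%%
Type: region
Subtag: US
Description: United States
"""


def parse_registry(text):
    """Index registry records by (type, lowercased subtag)."""
    records = {}
    for block in text.split("\n%%\n"):
        fields = {}
        last_key = None
        for line in block.splitlines():
            if not line.strip():
                continue
            if line[0].isspace() and last_key:
                # Folded continuation line: append to the previous field.
                fields[last_key] += " " + line.strip()
                continue
            key, _, value = line.partition(":")
            last_key = key.strip()
            fields[last_key] = value.strip()
        # The leading File-Date record has no Type/Subtag and is skipped.
        if "Type" in fields and "Subtag" in fields:
            records[(fields["Type"], fields["Subtag"].lower())] = fields
    return records


def check_language_tag(tag, records):
    """Return (errors, warnings) for a 'language[-region]' tag."""
    errors, warnings = [], []
    kinds = ["language", "region"]
    parts = tag.split("-")
    if len(parts) > len(kinds):
        errors.append(f"'{tag}' has more subtags than this sketch handles")
        return errors, warnings
    for kind, subtag in zip(kinds, parts):
        record = records.get((kind, subtag.lower()))
        if record is None:
            errors.append(f"unknown {kind} subtag '{subtag}'")
        elif "Deprecated" in record:
            message = f"{kind} subtag '{subtag}' is deprecated"
            if "Preferred-Value" in record:
                message += f"; use '{record['Preferred-Value']}' instead"
            warnings.append(message)
    return errors, warnings
```

Calling check_language_tag("iw", parse_registry(SAMPLE_REGISTRY))
yields no errors and one warning that points to "he", which is the
kind of distinction a purely grammar-based check has no natural way to
make.
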
(You may also notice that along with emitting errors for invalid
subtags, it is also -- when it is used within the context of the
validator.nu service -- capable of emitting warnings for cases of
subtags that are valid but deprecated, which is something not
typically done by any validation language or validation tool that is
strictly grammar-based.)

And in addition to relying on that datatype library -- which is
nominally part of the RelaxNG schema but not expressible within the
schema itself -- there's also Java code that acts as a Schematron
workalike to do assertions-based checking:

  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/schematronequiv/Assertions.java
  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/ConformingButObsoleteWarner.java

Those checks are fully expressible in Schematron, but there are other
constraints that are not, practically; the Java code that does those
checks is here:

  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/
  https://whattf.svn.cvsdude.com/syntax/trunk/non-schema/java/src/org/whattf/checker/table/

There are also constraint checks that are performed by -- and error
messages emitted by -- the validator.nu/Mozilla HTML5 parser code:

  http://hg.mozilla.org/projects/htmlparser/file/546412142175/src/nu/validator/htmlparser/impl

If you want to see what I mean, grep through that directory for "err"
and "warn".

Anyway, my point is that the RelaxNG grammar is far from being, on its
own, a worthwhile means for doing many of the most useful constraint
checks that a modern, useful HTML validation tool really needs to do
in practice -- so far, IMHO, that it would be potentially very
misleading to publish it and encourage its use outside the context of
its function as part of a complete HTML validation tool (like
validator.nu).

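To make the assertions-based style of checking mentioned above a bit
more concrete, here is a small Python sketch of one such rule: an
element that must not appear inside a particular ancestor. It is only
an illustration and does not reproduce what Assertions.java does; the
FORBIDDEN_ANCESTORS table, the check_ancestors function, and the
message wording are all hypothetical, and the input is assumed to be
well-formed XML without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> ancestor it must not be inside.
FORBIDDEN_ANCESTORS = {
    "a": "a",
    "form": "form",
    "label": "label",
}


def check_ancestors(element, ancestors=(), messages=None):
    """Walk the tree and collect a message for each rule violation."""
    if messages is None:
        messages = []
    banned = FORBIDDEN_ANCESTORS.get(element.tag)
    if banned is not None and banned in ancestors:
        messages.append(
            f"The element '{element.tag}' must not appear as a "
            f"descendant of the '{banned}' element."
        )
    for child in element:
        # Pass the path of ancestor names down to each child.
        check_ancestors(child, ancestors + (element.tag,), messages)
    return messages


doc = ET.fromstring(
    "<p><a href='x'>outer <span><a href='y'>inner</a></span></a></p>"
)
for message in check_ancestors(doc):
    print(message)
```

Because the rule is stated as an assertion about context rather than
as part of a content model, the checker is free to word the message in
terms the author will recognize.
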
Another problem with that particular grammar is that it is not
optimized for use as a part of a formal description of the HTML
language. It is instead (to some degree at least) optimized for the
particular use that it is being put to within the validator.nu
service. And it may be in the future that it will end up being
optimized for that use to an even greater degree. For example, there
are a number of constraint checks that are currently done in the
grammar but that -- in order to generate more useful error messages
for them -- would instead much better be done using Schematron (or in
Schematron-workalike code such as validator.nu uses). (That's maybe
due in part to the poor quality of error messages that Jing (which
validator.nu uses) currently emits for certain cases -- like the case
of required-but-missing attributes.)

Anyway, I also think that in a validation system that relies on
multiple means for doing constraint checks, just because something is
expressible in a grammar-based schema that's part of that system
doesn't mean it *should* be expressed in a grammar, nor that the
grammar is necessarily the best place to express it. So in the case of
this particular schema, there is no guarantee that some constraints
that are currently in that grammar won't eventually be moved out to a
different part of the system.

> So you're already publishing the information that's in
> the schema;

The information in that schema was developed from the almost
completely prose-based document-conformance constraint expressions in
the HTML5 spec itself. So it can reasonably be viewed as just one
possible implementation of just some particular document-conformance
constraints that are expressed in the HTML5 spec.

> is there some reason not to just include the
> source of the schema in an informative
> appendix (with whatever disclaimers you like), so that
> other people can make similar uses of it?

For one thing, the H:TML doc, as it stands now, by design does not
actually use the same expression language as that schema. The parts
that express the same constraints that are in that schema are instead
expressed in the H:TML doc in natural language. That natural language
is currently generated in part (through a very baroque, fragile,
inelegant build process I hacked together) from that schema, but I do
not guarantee that it will always remain so. I originally did it that
way in part as an experiment to see how doable and useful it'd end up
being, and I'm still not sure myself that it's been a successful
experiment; it may be better in the long run to sever the tie and just
maintain those parts manually rather than generating them.

On top of that, if the HTML WG were to publish a formalism or set of
formalisms for the HTML language, I think there is a good argument
against making that particular schema the basis for it (for the
reasons I mentioned above -- that it's already optimized to some
degree for particular use with validator.nu, and may in the future end
up being optimized for that even more so).

  --Mike

-- 
Michael(tm) Smith
http://people.w3.org/mike
Received on Thursday, 18 March 2010 10:11:56 UTC