How the markup spec is put together [was: Who is the Intended Audience of the Markup Spec Proposal?]

Henri Sivonen <hsivonen@iki.fi>, 2008-11-19 21:09 +0200:

>  The syntax is RELAX NG Compact Syntax. The syntax for the regular 
>  expressions appearing in the document is the XSD regular expression syntax. 
>  (Instead of pulling the regexps from schema comment, I think it would be 
>  nicer to pull the same descriptions Validator.nu uses as UI strings: 
>  http://wiki.whatwg.org/wiki/MicrosyntaxDescriptions )

Yeah, I agree, and I'll update the build for the document to
scrape that page and pull those descriptions in instead.

>  I guess the methodology behind the document isn't clear to everyone on the 
>  list. The document is not manually written.

On last week's HTML WG telcon[1], I talked a bit about how the
document is put together, though the record in the minutes is not
very detailed:

  http://www.w3.org/html/wg/markup-spec/schema.html

I will be adding an Acknowledgments section to give credits and
copyright statements for the sources (the whattf.org schema, the
existing HTML5 draft, the default user-agent stylesheet from
WebKit) of the parts of the spec that are generated in the output
as part of the build -- possibly also along with a short Colophon
that describes how the generated parts of the document are built.

For now, here are a few more details -

Some parts of the spec are manually written, though the
per-element "Content model", "Attribute", and "Assertions"
subsections are generated, as is everything else from section 5
"Common Content Models" on.

The parts that I'm manually maintaining now are the Syntax
section (based largely on initial text from the existing HTML5
draft, and reorganized) and the prose descriptions of the elements
and attributes.

In most cases, the current prose descriptions for the elements are
primarily still verbatim text initially pulled from the HTML5
draft, though I think I may have reworded some slightly.

Some of the attribute descriptions I have already re-written a bit
(or maybe more than a bit) from descriptions initially pulled in
from the HTML5 draft. I think so far, I've done that only for some
of the "A" ones -- e.g., <a>, <area>, <audio> -- and <base>.
Mostly that re-writing has amounted to attempting to make those
descriptions more succinct (where it seemed like they could be)
and doing rephrasing to fit the context of this document.

The per-element Examples subsections are all currently pulled in
by the build verbatim from the HTML5 draft. But I may change some
or remove some later.

>  It has been generated from various sources using XSLT.

...and a specially modified/hacked version of Trang, and some Perl
hacks, and maybe some other things I'm forgetting about.

For those that are interested, the Makefile that does the build is
here:

  http://www.w3.org/html/wg/markup-spec/Makefile

...and the main XSLT driver stylesheet is here:

  http://www.w3.org/html/wg/markup-spec/tools/generate-spec-source.xsl

>  The document has some original text, but a lot 
>  of content in pulled in and mashed up from the HTML 5 spec proper, the 
>  whattf.org HTML5+ARIA schema used by html5.validator.nu

...which, for the record, is here:

  http://svn.versiondude.net/whattf/syntax/trunk/relaxng/

The nature of the build is such that whenever that schema changes
and I re-build, the per-element "Content model", "Attribute", and
"Assertions" subsections, etc., will get regenerated and will
reflect any changes made to the schema.
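
To make that dependency concrete, here is a minimal sketch of the
make-style staleness rule the build relies on: an output is
regenerated when it is missing or older than any of its sources.
This is only an illustration in Python, not what the actual
Makefile contains; the file names (schema.rnc, generated.html)
and the function name are hypothetical.

```python
import os
import tempfile


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


with tempfile.TemporaryDirectory() as tmp:
    schema = os.path.join(tmp, "schema.rnc")      # hypothetical source
    output = os.path.join(tmp, "generated.html")  # hypothetical output
    open(schema, "w").close()

    # No output yet, so a build is needed.
    missing = needs_rebuild(output, [schema])

    # Output newer than the schema: nothing to do.
    open(output, "w").close()
    os.utime(schema, (1000, 1000))
    os.utime(output, (2000, 2000))
    fresh = needs_rebuild(output, [schema])

    # Schema touched after the last build: regenerate.
    os.utime(schema, (3000, 3000))
    stale = needs_rebuild(output, [schema])

print(missing, fresh, stale)
```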

The basic intent is for the specification to be automatically
consistent with the same conformance rules that are checked by
validator.nu.

I fully recognize the potential issues of tying the spec to a
particular schema and too closely to a particular conformance-
checking tool. It could be that the draft might eventually use a
different schema instead of the whattf.org schema, or I may
dispense entirely with the idea of trying to use a schema to
auto-generate those parts of the draft, and use manually
maintained prose descriptions instead. But for now, I think it's
kind of useful to experiment at least with keeping it closely in
sync with the one HTML5 conformance checker that we do have.

>  and from the UA style sheet of WebKit.

...the source for which is here:

  http://svn.webkit.org/repository/webkit/trunk/WebCore/css/html4.css

The build actually takes that and converts it to an XML
representation (yeah, go ahead and say ugh) and then chops it up
per-element and adds syntax highlighting to it to produce what's
actually shown in the draft.
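
As a rough illustration of the "convert to XML, then chop up
per-element" step (leaving out the syntax highlighting), here is a
small Python sketch. It is not the code the build uses. The XML
vocabulary (stylesheet, rule, selector, declaration) and the
function names are assumptions made up for this example, and the
regex-based parsing only handles flat rules, not nested blocks
such as @media.

```python
import re
import xml.etree.ElementTree as ET

# Matches "selectors { declarations }" for flat (non-nested) rules.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def css_to_xml(css_text):
    """Convert a flat CSS stylesheet into a <stylesheet> element tree."""
    # Strip /* ... */ comments before matching rules.
    css_text = re.sub(r"/\*.*?\*/", "", css_text, flags=re.S)
    root = ET.Element("stylesheet")
    for selectors, body in RULE_RE.findall(css_text):
        rule = ET.SubElement(root, "rule")
        for sel in selectors.split(","):
            ET.SubElement(rule, "selector").text = sel.strip()
        for decl in body.split(";"):
            if ":" in decl:
                name, value = decl.split(":", 1)
                prop = ET.SubElement(rule, "declaration", name=name.strip())
                prop.text = value.strip()
    return root


def rules_for_element(root, element_name):
    """Return the rules with a selector that is exactly element_name."""
    return [rule for rule in root.findall("rule")
            if any(s.text == element_name for s in rule.findall("selector"))]


# Hypothetical excerpt in the style of a UA stylesheet.
SAMPLE = """
/* hypothetical excerpt */
address { display: block }
h1, h2 { display: block; font-weight: bold }
"""

tree = css_to_xml(SAMPLE)
h1_rules = rules_for_element(tree, "h1")
print(ET.tostring(h1_rules[0], encoding="unicode"))
```

A real stylesheet would need a proper CSS parser; the point here is
only that once the rules are in a tree, picking out the ones that
apply to a given element is a simple filter.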

>  I think the document is very cool as documentation of the whattf.org schema 
>  and works as a reference for people who are comfortable with reading RELAX 
>  NG. (I link to it from the Validator.nu documentation.) However, I don't 
>  support putting it forward as a normative spec.

As far as providing a reference for people who are comfortable with
reading RELAX NG, there's also a hyperlinked HTML representation
of the whattf.org schema here:

  http://www.w3.org/html/wg/markup-spec/schema.html

That's auto-generated by the build from the schema sources, so if
the schema sources change, it will get automatically updated.

> > with some weird anomalies with the way attributes are seemingly included 
> > within the element's content model.
> 
>  That's a pretty cool feature in RELAX NG, actually.

Along with the simplicity of the RELAX NG compact syntax, I think
it makes for relatively readable content models (though I
recognize they're less friendly to casual readers than prose
descriptions of the content models are).

  --Mike

-- 
Michael(tm) Smith
http://people.w3.org/mike/

Received on Thursday, 20 November 2008 08:22:10 UTC