Starting point

We currently have 26 participants on this CG, which is rather more
than I expected.  For this CG to be successful, I think we are going
to have to be careful to avoid the natural tendency for large groups
to produce large specs. For this reason, I would suggest that we start
with something ultra-small and then add to it only if we get consensus.

At the moment, three grammars have been proposed:

- my first blog post: http://blog.jclark.com/2010/12/microxml.html
- my second blog post: http://blog.jclark.com/2010/12/more-on-microxml.html
- John Cowan's Editor's Draft: http://home.ccil.org/~cowan/MicroXML.html

However, I think these all include features that a reasonable person
might want to leave out. So here's my suggested starting point, which
is a subset of the intersection of these three grammars.  The goal is
that this shouldn't have anything in it that anybody on the CG thinks
they might want to leave out.  I expect everybody (including me) will
have stuff that they want to add.

# Documents
document ::= s* element s*
# Elements
element ::= startTag content endTag
content ::= (element | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= name s* '=' s* attributeValue
attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                 | "'" ((attributeValueChar - "'") | charRef)* "'"
attributeValueChar ::= char - ('<' | '&')
# Data characters
dataChar ::= char - ('<' | '&' | '>')
# Character references
charRef ::= decCharRef | hexCharRef | namedCharRef
decCharRef ::= '&#' [0-9]+ ';'
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Names
name ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6]
                | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF]
                | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
                | [#x10000-#xEFFFF]
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7
           | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
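
To make the grammar above concrete, here is a minimal recursive-descent
sketch in Python. It is an illustration only, not a reference
implementation. The names (parse_document, Parser, MicroXMLError) are
hypothetical, and it makes several assumptions the grammar does not
state: any amount of white space is allowed around the root element,
end-tag names must match start-tag names, attribute names must be
unique within a tag, and a character reference must denote a char.

```python
import re

# Name character classes transcribed from the grammar.
NAME_START = (
    "A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_RE = re.compile(
    "[" + NAME_START + "][" + NAME_START + "0-9.\u00B7\u0300-\u036F\u203F-\u2040-]*"
)
CHAR_REF_RE = re.compile(r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));")
NAMED = {"amp": 0x26, "lt": 0x3C, "gt": 0x3E, "quot": 0x22, "apos": 0x27}
WS = "\t\n\r "

class MicroXMLError(ValueError):
    pass

def is_char(cp):
    """char ::= s | ([#x21-#x10FFFF] - forbiddenChar)"""
    if cp in (0x9, 0xA, 0xD, 0x20):
        return True
    return (0x21 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
            and cp not in (0xFFFE, 0xFFFF))

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def fail(self, msg):
        raise MicroXMLError("%s at offset %d" % (msg, self.pos))

    def peek(self):
        return self.text[self.pos:self.pos + 1]  # "" at end of input

    def skip_s(self):
        start = self.pos
        while self.peek() != "" and self.peek() in WS:
            self.pos += 1
        return self.pos - start

    def expect(self, lit):
        if not self.text.startswith(lit, self.pos):
            self.fail("expected %r" % lit)
        self.pos += len(lit)

    def name(self):
        m = NAME_RE.match(self.text, self.pos)
        if not m:
            self.fail("expected name")
        self.pos = m.end()
        return m.group()

    def char_ref(self):
        m = CHAR_REF_RE.match(self.text, self.pos)
        if not m:
            self.fail("bad character reference")
        dec, hexa, named = m.groups()
        cp = int(dec) if dec is not None else (
            int(hexa, 16) if hexa is not None else NAMED[named])
        if not is_char(cp):  # assumption: references must denote a char
            self.fail("reference to forbidden character")
        self.pos = m.end()
        return chr(cp)

    def chars(self, stop, banned, out):
        """Read data characters and charRefs until text at pos starts with stop."""
        while not self.text.startswith(stop, self.pos):
            c = self.peek()
            if c == "":
                self.fail("unexpected end of input")
            if c == "<" and "<" not in banned:
                out.append(self.element())
            elif c == "&":
                out.append(self.char_ref())
            elif c in banned or not is_char(ord(c)):
                self.fail("illegal character %r" % c)
            else:
                out.append(c)
                self.pos += 1
        return out

    def element(self):
        self.expect("<")
        name, attrs = self.name(), {}
        while True:
            had_s = self.skip_s()
            if self.peek() == ">":
                self.pos += 1
                break
            if not had_s:
                self.fail("expected white space or '>'")
            attr = self.name()
            if attr in attrs:  # assumption: attribute names are unique
                self.fail("duplicate attribute %r" % attr)
            self.skip_s(); self.expect("="); self.skip_s()
            quote = self.peek()
            if quote == "" or quote not in "\"'":
                self.fail("expected quoted attribute value")
            self.pos += 1
            attrs[attr] = "".join(self.chars(quote, "<", []))
            self.pos += 1  # closing quote
        content = self.chars("</", ">", [])
        self.expect("</")
        if self.name() != name:  # assumption: tag names must match
            self.fail("mismatched end tag")
        self.skip_s(); self.expect(">")
        return [name, attrs, content]

def parse_document(text):
    p = Parser(text)
    p.skip_s()
    root = p.element()
    p.skip_s()
    if p.pos != len(text):
        p.fail("unexpected content after root element")
    return root
```

For example, parse_document('<a href="x&amp;y">hi</a>') yields
["a", {"href": "x&y"}, ["h", "i"]]. For brevity the sketch returns
names and attribute values as Python strings rather than arrays of
characters.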

There are lots of different ways to describe the data model. Here's
one way of doing it, which is designed to be very close to JsonML.
This defines the data model as a grammar over a particular kind of
tree.  These trees have one atomic type, a character (equivalent to a
Unicode code-point), and two composite types, arrays and maps. In the
following, [...] denotes arrays, and {...} denotes maps:

document ::= element
element ::= [name, attributes, content]
attributes ::= { (name => attributeValue)* }
attributeValue ::= [ char* ]
content ::= [ (char | element)* ]
name ::= [ nameStartChar, nameChar* ]
char, nameStartChar, nameChar ::=
    <single character as in grammar for concrete syntax>
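
As an illustration of this data model, here is a small Python sketch
that checks whether a value conforms to the tree grammar. The
representation is an assumption of mine: arrays are Python lists, maps
are dicts, a character is a one-character string, and names and
attribute values are written as Python strings (sequences of
characters) so that names can be used as dict keys. The function names
are hypothetical.

```python
# Ranges transcribed from the nameStartChar and nameChar productions.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]

def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)

def is_char(c):
    """A single character allowed by the char production."""
    if not (isinstance(c, str) and len(c) == 1):
        return False
    cp = ord(c)
    if cp in (0x9, 0xA, 0xD, 0x20):
        return True
    return (0x21 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
            and cp not in (0xFFFE, 0xFFFF))

def is_name(n):
    """name ::= [ nameStartChar, nameChar* ], written here as a str."""
    if not (isinstance(n, str) and n):
        return False
    if not in_ranges(ord(n[0]), NAME_START_RANGES):
        return False
    return all(in_ranges(ord(c), NAME_START_RANGES + NAME_EXTRA_RANGES)
               for c in n[1:])

def is_attribute_value(v):
    """attributeValue ::= [ char* ], written here as a str."""
    return isinstance(v, str) and all(is_char(c) for c in v)

def is_element(e):
    """element ::= [name, attributes, content]"""
    if not (isinstance(e, list) and len(e) == 3):
        return False
    name, attributes, content = e
    if not is_name(name):
        return False
    if not (isinstance(attributes, dict)
            and all(is_name(k) and is_attribute_value(v)
                    for k, v in attributes.items())):
        return False
    if not isinstance(content, list):
        return False
    # Each content item is either a single character or a nested element.
    return all(is_char(item) or is_element(item) for item in content)

# The tree for <p class="x">Hi<b>!</b></p>
example = ["p", {"class": "x"}, ["H", "i", ["b", {}, ["!"]]]]
```

Note that characters such as "<" and "&" are perfectly valid in the
data model; they only need escaping in the concrete syntax.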

With this starting point, the list of features to consider adding would be:

- empty element tags, e.g. <foo/>
- comments
- bare DOCTYPE declaration, e.g. <!DOCTYPE html>
- namespaces/prefixes on elements/attributes
- processing instructions

Note that all of these features have implications for the data model
and/or HTML5-friendliness.

I would suggest we discuss further the goals and the starting point,
and then consider each of these features.

James

Received on Tuesday, 24 July 2012 03:47:12 UTC