Re: Starting point

Taking out the things that John questions (all reasonably questionable
in my view) leaves us with:

# Documents
document ::= s* element s*
# Elements
element ::= startTag content endTag
content ::= (element | dataChar | charRef)*
startTag ::= '<' name (s+ attribute)* s* '>'
endTag ::= '</' name s* '>'
# Attributes
attribute ::= name s* '=' s* attributeValue
attributeValue ::= '"' ((dataChar - '"') | charRef)* '"'
         | "'" ((dataChar - "'") | charRef)* "'"
# Data characters
dataChar ::= char - ('<' | '&' | '>')
# Character references
charRef ::= hexCharRef | namedCharRef
hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
namedCharRef ::= '&' charName ';'
charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
# Names
name ::= nameStartChar nameChar*
nameStartChar ::= [A-Z] | [a-z] | "_"
nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7
         | [#x0300-#x036F] | [#x203F-#x2040]
# White space
s ::= #x9 | #xA | #xD | #x20
# Characters
char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
surrogateChar ::= [#xD800-#xDFFF]
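
As a rough illustration of how small a recognizer for this grammar can be, here is a Python sketch. It is not part of the proposal: the names (Recognizer, is_well_formed) are made up for the example, and it assumes two things the productions do not state, namely that any amount of white space may surround the root element and that an end tag must carry the same name as its start tag.

```python
import re

S = "\t\n\r "
NAME = re.compile(r"[A-Za-z_][A-Za-z_0-9.\-\u00B7\u0300-\u036F\u203F-\u2040]*")
CHAR_REF = re.compile(r"&(?:#x[0-9a-fA-F]+|amp|lt|gt|quot|apos);")


def is_char(c):
    # char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
    cp = ord(c)
    if c in S:
        return True
    return cp >= 0x21 and not (0xD800 <= cp <= 0xDFFF) and cp not in (0xFFFE, 0xFFFF)


class ParseError(Exception):
    pass


class Recognizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def fail(self, msg):
        raise ParseError(f"{msg} at offset {self.pos}")

    def expect(self, literal):
        if not self.text.startswith(literal, self.pos):
            self.fail(f"expected {literal!r}")
        self.pos += len(literal)

    def match(self, pattern, what):
        m = pattern.match(self.text, self.pos)
        if m is None:
            self.fail(f"expected {what}")
        self.pos = m.end()
        return m.group()

    def spaces(self):
        # s*; returns how many white space characters were consumed
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] in S:
            self.pos += 1
        return self.pos - start

    def chars(self, excluded):
        # ((dataChar - excluded) | charRef)*
        while self.pos < len(self.text):
            c = self.text[self.pos]
            if c == "&":
                self.match(CHAR_REF, "character reference")
            elif c in "<>" or c in excluded or not is_char(c):
                return
            else:
                self.pos += 1

    def attribute(self):
        self.match(NAME, "attribute name")
        self.spaces()
        self.expect("=")
        self.spaces()
        quote = self.text[self.pos:self.pos + 1]
        if quote not in ('"', "'"):
            self.fail("expected quoted attribute value")
        self.pos += 1
        self.chars(quote)
        self.expect(quote)

    def element(self):
        self.expect("<")
        name = self.match(NAME, "name")
        while True:  # (s+ attribute)* s*
            n = self.spaces()
            if self.text.startswith(">", self.pos):
                break
            if n == 0:
                self.fail("expected white space or '>'")
            self.attribute()
        self.expect(">")
        while True:  # content
            self.chars("")
            if self.text.startswith("</", self.pos):
                break
            if self.text.startswith("<", self.pos):
                self.element()
            else:
                self.fail("unexpected character or end of input")
        self.expect("</")
        if self.match(NAME, "name") != name:
            self.fail("end tag does not match start tag")  # assumed constraint
        self.spaces()
        self.expect(">")


def is_well_formed(text):
    r = Recognizer(text)
    try:
        r.spaces()  # assumption: s* around the root element
        r.element()
        r.spaces()
    except ParseError:
        return False
    return r.pos == len(text)
```

With this sketch, is_well_formed('<a x="1">hi &amp; &#x41;<b></b></a>') is accepted, while '<a/>', '<a>&#65;</a>' and '<a x="a>b"></a>' are all rejected, since empty element tags, decimal references and > in attribute values are not in the grammar above.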

> In addition, UTF-8 as the only character encoding.

Yes, although I think I would like to have both of these concepts:

- a well-formed MicroXML byte sequence, which would be encoded UTF-8 only, and
- a well-formed MicroXML character sequence, for which encoding is irrelevant.
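
One way to picture the split, as a sketch only: the byte-level notion is just strict UTF-8 decoding followed by the character-level check. The function names here are hypothetical, and the character-level checker is passed in as a parameter rather than defined, since its details are what the grammar is for.

```python
from typing import Callable, Optional


def decode_utf8_strict(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None


def is_well_formed_bytes(data: bytes,
                         is_well_formed_chars: Callable[[str], bool]) -> bool:
    # A byte sequence is well-formed if it is valid UTF-8 and the
    # resulting character sequence is well-formed.
    text = decode_utf8_strict(data)
    return text is not None and is_well_formed_chars(text)
```

The point of the design is that everything encoding-related lives in the first function, so an API that already has characters in hand (a string in memory, say) only ever needs the second notion.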

The list of issues to consider then becomes:

- empty element tags, e.g. <foo/>
- comments
- bare DOCTYPE declaration, e.g. <!DOCTYPE html>
- namespaces/prefixes on elements/attributes
- processing instructions
- Unicode names for elements/attributes
- allow > in attribute values for Canonical XML compatibility?
- decimal character references

James

Received on Tuesday, 24 July 2012 07:42:32 UTC