A Selectors3 parser based on Syntax3
####################################

First revision. Simon Sapin, 2012-06-09

Introduction
============

This is an attempt at a formal specification for a Selectors Level 3 parser
based on the css3-syntax state machine (in its 2012-06-09 editor’s draft)
rather than the CSS 2.1 Core Grammar. It is a proposal for replacing the
grammar in section 10 of the Selectors spec:
http://www.w3.org/TR/css3-selectors/#w3cselgrammar

The parser in this document accepts a superset of the allowed Level 3
selectors: the list of allowed pseudo-elements and pseudo-classes is
deliberately not included. This allows e.g. css3-lists to define the
`::marker` pseudo-element without changing the "main" parser.

A functional pseudo-class is represented by its name and its unparsed list
of arguments. The rules for parsing the arguments are defined separately,
case by case.

All strings are sequences of Unicode codepoints, unless otherwise specified.

"The original string" is the input to the tokenizer. It is in CSS syntax and
can thus contain comments or backslash-escapes. These are resolved by the
tokenizer.

Input
=====

The input for this parser is assumed to have already gone through the
tokenizer and tree construction steps for Syntax3. It matches what Syntax3
puts in the selector of a style rule.

An additional procedure should probably be added to Syntax3 for the tree
construction of a stand-alone selector, as found for example in
getElementsBySelector().

The input is a sequence of "regrouped tokens". (TODO: bikeshed the name.
Maybe something with 'tree'?) Regrouped tokens are close to, but slightly
higher-level than, the tokens produced by the Syntax3 tokenizer.

A regrouped token is either:

* An identifier, at-keyword, hash, string, url, delim, number, percentage,
  dimension, unicode-range, whitespace, colon or semicolon token.
* A block
* A function

Blocks have:

* a type: '{', '[' or '('
* content: a sequence of regrouped tokens.
  Does not include the opening or closing tokens for this block.

Functions have:

* a name: a string
* arguments (content): a sequence of regrouped tokens.
  Does not include the opening or closing parentheses.

Regrouped tokens are never function, bad-string, bad-url, cdo, cdc,
open-brace, close-brace, open-paren, close-paren, open-bracket or
close-bracket tokens. These either trigger a parse error (in which case there
is no selector to parse) or are turned into regrouped blocks or functions.

Although they are never allowed in Level 3 selectors, some tokens like url
and unicode-range are included to keep this definition as wide as possible
for future levels.

Side note: An explicitly-defined concept like regrouped tokens can be useful
as Syntax3’s output not only for selectors, but also for property values and
unparsed at-rules.

Namespaces
----------

In addition to the regrouped tokens, the input for this parser is made of
a (possibly empty) mapping of namespace prefixes to URIs and an optional URI
for the default namespace.

In CSS, namespaces are declared with @namespace rules.
See http://www.w3.org/TR/css3-namespace/

Output
======

If the selector is valid, the output of this parser is a tree of objects.
The root of the tree is always a group of selectors.

A group of selectors object is a list of one or more selector objects.

A selector object is made of:

* A combined selector or sequence of simple selectors
* An optional pseudo-element name (a string)

A combined selector is made of:

* On its left: a combined selector or sequence of simple selectors
* A combinator: one of descendant, child, adjacent sibling or general sibling
* On its right: a sequence of simple selectors

A simple selector is either a type selector, a universal selector, a class
selector, an ID selector, an attribute selector, a simple pseudo-class,
a functional pseudo-class or a negation pseudo-class.

A sequence of simple selectors must start with a type selector or
a universal selector.
A type selector or universal selector must be first in a sequence of simple
selectors. An implicit universal selector in the original string will be
explicit in the parsed sequence of simple selectors.

A namespace object is made of:

* A namespace "type": one of 'URI', 'any' or 'none'
* If the type is 'URI', the URI (a string)

`ns|E` in the original string becomes 'URI' (or gives a parse error if the
prefix is not declared), `*|E` becomes 'any', `|E` becomes 'none', and `E`
becomes 'URI' if a default namespace is declared, 'any' otherwise.

Type selectors are made of:

* A namespace object
* A type name (a string)

Universal selectors are made of:

* A namespace object

Class selectors are made of:

* A class name (a string)

ID selectors are made of:

* An identifier (a string)

Attribute selectors are made of:

* A namespace object
* An attribute name (a string)
* An operator: one of 'exists', '=', '~=', '|=', '^=', '$=' or '*='
* If the operator is not 'exists', a value (a string)

Simple pseudo-classes are made of:

* A name (a string)

Functional pseudo-classes are made of:

* A name (a string)
* Its arguments. The shape/type is defined for each pseudo-class.
  Level 3 pseudo-classes are defined at the end of this document.

Negation pseudo-classes are made of:

* A negated selector (a simple selector that is not a negation)

Issue 1: These definitions encode the constraint that a pseudo-element can
only be last. Should they be more general, in case future levels want to
relax the constraint?

Invalid selectors
-----------------

If at any point an invalid selector is encountered, the parser is aborted and
there is no output/result tree. It is up to the host language to define what
happens to invalid selectors. For CSS style rules, it is up to the Syntax3
tree construction to make sure that an invalid selector and its declaration
block are completely consumed and ignored.

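
As an illustration of the namespace object and of the "abort with no output"
behaviour for invalid selectors, here is a minimal Python sketch of the
`ns|E` / `*|E` / `|E` / `E` mapping described under Output. The names
``Namespace``, ``InvalidSelector`` and ``resolve_namespace`` are hypothetical,
as is the convention for encoding the prefix as written; the sketch only
covers the element-name case spelled out in the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Namespace:
    """Hypothetical namespace object: a type plus, for 'URI', the URI."""
    type: str                  # one of 'URI', 'any' or 'none'
    uri: Optional[str] = None  # only meaningful when type is 'URI'


class InvalidSelector(ValueError):
    """Raised to abort parsing: there is no result tree."""


def resolve_namespace(prefix: Optional[str],
                      prefixes: Dict[str, str],
                      default_uri: Optional[str] = None) -> Namespace:
    """Map the prefix of an element name to a namespace object.

    Assumed encoding of ``prefix`` (not defined by the document):
    None for `E` (no `|`), '' for `|E`, '*' for `*|E`,
    and the prefix name for `ns|E`.
    """
    if prefix is None:
        # `E`: 'URI' if a default namespace is declared, 'any' otherwise.
        if default_uri is not None:
            return Namespace('URI', default_uri)
        return Namespace('any')
    if prefix == '':
        return Namespace('none')   # `|E`
    if prefix == '*':
        return Namespace('any')    # `*|E`
    if prefix not in prefixes:
        # `ns|E` with an undeclared prefix: the selector is invalid.
        raise InvalidSelector('undeclared namespace prefix: %r' % prefix)
    return Namespace('URI', prefixes[prefix])
```

For example, with the mapping ``{'svg': 'http://www.w3.org/2000/svg'}``,
``resolve_namespace('svg', mapping)`` gives a 'URI' namespace object, while
``resolve_namespace('foo', mapping)`` raises ``InvalidSelector``.
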

Selector parsing
================

Just like "raw" tokens, sequences of regrouped tokens can be consumed
item by item with an implicit iterator/index. When the end of a sequence has
been reached, consuming it further yields eof tokens.

Note that an eof token while consuming the content of a block or a function
marks the end of the block or function, not the end of the selector.
Likewise, eof while consuming the input sequence marks the end of the
selector, not that of any larger unit (like a CSS stylesheet) where the
selector was read.

TODO: the actual state-machine-based parser.

Parsing of functional pseudo-classes arguments
==============================================

Each functional pseudo-class has a specialized parser for its arguments.
These parsers can either make the selector invalid or return the arguments in
a higher-level form.

The input is the arguments of the function object that represents the
pseudo-class, with whitespace tokens removed at the start and end of the
sequence. It is a sequence of regrouped tokens.

:lang()
-------

Output: a string

If there is exactly one argument and that argument is an ident token, return
the token’s value. Otherwise, the selector is invalid.

Issue 2: Are string tokens allowed instead of ident?

:nth-child(), :nth-last-child(), :nth-of-type() and :nth-last-of-type()
-----------------------------------------------------------------------

Output: a pair (a, b) of integers

Internal state: a, b (integers), negative-b (flag, initially unset)

This should match the grammar defined in the current level 3 spec::

    nth
      : S* [ ['-'|'+']? INTEGER? {N} [ S* ['-'|'+'] S* INTEGER ]? |
             ['-'|'+']? INTEGER | {O}{D}{D} | {E}{V}{E}{N} ] S*
      ;

Issue 3: Whitespace is allowed on either side of b’s sign, but not between
a and its sign (if any). Is this what we want? This seems consistent with the
"whitespace" examples in the spec.

Consume the arguments one by one, and start in nth-start mode.

nth-start mode
..............

Consume the next argument.


ident token with the value 'even'
    Set a to 2, b to 0. Switch to the nth-end mode.
ident token with the value 'odd'
    Set a to 2, b to 1. Switch to the nth-end mode.
ident token with the value 'n'
    Set a to 1. Switch to the nth-after-n mode.
ident token with the value '-n'
    Set a to -1. Switch to the nth-after-n mode.
ident token with the value 'n-'
    Set a to 1. Set the negative-b flag.
    Switch to the nth-after-b-sign mode.
ident token with the value '-n-'
    Set a to -1. Set the negative-b flag.
    Switch to the nth-after-b-sign mode.
dimension token with the integer flag and the unit 'n'
    Set a to the token’s value. Switch to the nth-after-n mode.
dimension token with the integer flag and the unit 'n-'
    Set a to the token’s value. Set the negative-b flag.
    Switch to the nth-after-b-sign mode.
dimension token with the integer flag and a unit that matches 'n-[0-9]+'
    Set a to the token’s value. Set b to the token’s unit parsed as a decimal
    integer, after removing the initial 'n'. Switch to the nth-end mode.
number token with the integer flag
    Set a to 0. Set b to the token’s value. Switch to the nth-end mode.
anything else
    The selector is invalid.

nth-after-n mode
................

Consume the next argument.

eof token
    Set b to 0. Return (a, b)
whitespace token
    Do nothing. Remain in this mode.
delim token with the value '+'
    Switch to the nth-after-b-sign mode.
delim token with the value '-'
    Set the negative-b flag. Switch to the nth-after-b-sign mode.
number token with the integer flag
    Set b to the token’s value. Switch to the nth-end mode.
anything else
    The selector is invalid.

nth-after-b-sign mode
.....................

Consume the next argument.

whitespace token
    Do nothing. Remain in this mode.
number token with the integer flag and a representation that does not start with a '-' or a '+'
    Set b to the opposite of the token’s value if negative-b is set,
    to the token’s value otherwise. Switch to the nth-end mode.
anything else
    The selector is invalid.

nth-end mode
............

Consume the next argument.

eof token
    Return (a, b)
anything else
    The selector is invalid.
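
The four modes above can be sketched in Python as a small loop over a mode
variable. This is only an illustration that follows the rules as written in
this document: the ``Token`` class, its field names, the ``EOF`` sentinel and
the ``InvalidSelector`` exception are hypothetical stand-ins for regrouped
tokens, not something defined by Syntax3.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical regrouped token; all field names are assumptions."""
    type: str                 # 'ident', 'dimension', 'number', 'delim', ...
    value: object = None      # ident/delim text, or the numeric value
    unit: str = ''            # dimension tokens only
    is_integer: bool = False  # the "integer flag"
    representation: str = ''  # the number as written in the original string


EOF = Token('eof')


class InvalidSelector(ValueError):
    """Raised when the arguments make the selector invalid."""


def parse_nth(arguments: List[Token]) -> Tuple[int, int]:
    # Whitespace tokens are removed at the start and end of the sequence.
    tokens = list(arguments)
    while tokens and tokens[0].type == 'whitespace':
        tokens.pop(0)
    while tokens and tokens[-1].type == 'whitespace':
        tokens.pop()
    stream = iter(tokens)
    a = b = 0
    negative_b = False
    mode = 'nth-start'
    while True:
        tok = next(stream, EOF)  # consuming past the end yields eof
        kind, is_int = tok.type, tok.is_integer
        if mode == 'nth-start':
            if kind == 'ident' and tok.value in ('even', 'odd'):
                a, b, mode = 2, (0 if tok.value == 'even' else 1), 'nth-end'
            elif kind == 'ident' and tok.value in ('n', '-n'):
                a, mode = (1 if tok.value == 'n' else -1), 'nth-after-n'
            elif kind == 'ident' and tok.value in ('n-', '-n-'):
                a = 1 if tok.value == 'n-' else -1
                negative_b, mode = True, 'nth-after-b-sign'
            elif kind == 'dimension' and is_int and tok.unit == 'n':
                a, mode = tok.value, 'nth-after-n'
            elif kind == 'dimension' and is_int and tok.unit == 'n-':
                a, negative_b, mode = tok.value, True, 'nth-after-b-sign'
            elif (kind == 'dimension' and is_int
                    and re.fullmatch(r'n-[0-9]+', tok.unit)):
                # The unit minus its initial 'n' is b, e.g. 'n-2' gives -2.
                a, b, mode = tok.value, int(tok.unit[1:]), 'nth-end'
            elif kind == 'number' and is_int:
                a, b, mode = 0, tok.value, 'nth-end'
            else:
                raise InvalidSelector(tok)
        elif mode == 'nth-after-n':
            if kind == 'eof':
                return a, 0
            elif kind == 'whitespace':
                pass  # remain in this mode
            elif kind == 'delim' and tok.value in ('+', '-'):
                negative_b, mode = (tok.value == '-'), 'nth-after-b-sign'
            elif kind == 'number' and is_int:
                b, mode = tok.value, 'nth-end'
            else:
                raise InvalidSelector(tok)
        elif mode == 'nth-after-b-sign':
            if kind == 'whitespace':
                pass  # remain in this mode
            elif (kind == 'number' and is_int
                    and tok.representation[:1] not in ('+', '-')):
                b, mode = (-tok.value if negative_b else tok.value), 'nth-end'
            else:
                raise InvalidSelector(tok)
        else:  # nth-end
            if kind == 'eof':
                return a, b
            raise InvalidSelector(tok)
```

Assuming a tokenizer that turns ``2n+1`` into a dimension token (value 2,
unit 'n') followed by a number token (value 1, representation '+1'),
``parse_nth`` returns ``(2, 1)``. A signed number after an explicit sign
delim, as in ``n + +1``, raises ``InvalidSelector``.
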