Lexical Syntax

This is a proposal for the core lexical syntax for tokenizing the
text strings:

An expression is one or more tokens, optionally separated by
white space characters, although white space is sometimes needed
to disambiguate where one token ends and the next begins.

White space is one or more characters from the set:

    space characters
    horizontal tabs
    carriage returns
    line feeds
    form feeds
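
As a minimal sketch, the white space set above could be written down
as follows. The names WHITE_SPACE and skip_white_space are
illustrative only and are not part of the proposal.

```python
# The five white space characters listed above, as Python escapes:
# space, horizontal tab, carriage return, line feed, form feed.
WHITE_SPACE = frozenset(" \t\r\n\f")


def skip_white_space(text, pos=0):
    """Return the index of the first non-white-space character at or after pos."""
    while pos < len(text) and text[pos] in WHITE_SPACE:
        pos += 1
    return pos
```

A tokenizer would call skip_white_space between tokens and treat the
returned offset as the start of the next token.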

Each token takes one of the following forms:

    Name tokens:
        A letter [a-zA-Z] followed by zero or more characters
        from the set: [a-zA-Z.] e.g.  x i j k T M pi big.apple

    Number tokens:
        Start with a digit, e.g.  0  1.2  0.05   1.2e-23

    Symbol tokens:
        One or more characters from the set:
            + - / * ^ _ | < >
            ! ~ @ $ % & = : ? . " ` '

    Fence tokens:
        One character from the set:
            ( ) [ ]

        These are restricted to `{' and `}'
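
The token forms above can be sketched as a small regular-expression
tokenizer. This is an illustration under assumptions, not a
definitive implementation: the exact number grammar (optional
fraction and exponent) is inferred from the examples, a name is
taken to be a letter followed by any further letters or dots, and
`{' and `}' are reported as a separate "brace" kind. The names
TOKEN_RE and tokenize are hypothetical.

```python
import re

# Symbol characters exactly as listed in the proposal.
SYMBOL_CHARS = "+-/*^_|<>!~@$%&=:?.\"`'"

TOKEN_RE = re.compile(
    r"(?P<space>[ \t\r\n\f]+)"
    r"|(?P<name>[a-zA-Z][a-zA-Z.]*)"
    # Assumed number grammar: digits, optional fraction, optional exponent.
    r"|(?P<number>[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)"
    r"|(?P<symbol>[" + re.escape(SYMBOL_CHARS) + r"]+)"
    r"|(?P<fence>[()\[\]])"
    # Assumption: braces form their own token kind.
    r"|(?P<brace>[{}])"
)


def tokenize(text):
    """Split text into (kind, text) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("x^2 + big.apple"))
print(tokenize("f(1.2e-23)"))
```

Because each alternative matches greedily, `a<=b' yields the single
symbol token `<=' without any white space, while `x^2' splits into
a name, a symbol and a number.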

The `\' character can be used as a prefix to any of the above
(e.g. `\pi' and `\^') to yield distinct tokens. Note that `\\'
behaves as a literal backslash character and may be used within
symbol tokens, while `\{' and `\}' behave as literal brace
characters (fence tokens) when you want explicit curly braces
in expressions. Tokens swallow characters until an unambiguous
delimiter is seen.
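
One possible reading of the backslash rules is sketched below. It
makes several assumptions that the proposal does not settle: a
prefixed token keeps its backslash in the token text (so `\pi'
stays distinct from `pi'), a doubled backslash is accepted anywhere
inside a symbol token, unescaped braces are reported as a separate
"brace" kind, and the number grammar is inferred from the examples.
The names SYM_BODY, TOKEN_RE and tokenize are hypothetical.

```python
import re

SYM = re.escape("+-/*^_|<>!~@$%&=:?.\"`'")
# One unit of a symbol token: a symbol character or a doubled backslash.
SYM_BODY = r"(?:[" + SYM + r"]|\\\\)"

TOKEN_RE = re.compile(
    r"(?P<space>[ \t\r\n\f]+)"
    r"|(?P<name>\\?[a-zA-Z][a-zA-Z.]*)"
    r"|(?P<number>\\?[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)"
    # Escaped braces are treated as fence tokens, as the text describes.
    r"|(?P<fence>\\?[()\[\]]|\\[{}])"
    r"|(?P<brace>[{}])"
    # Either a plain symbol run, or one prefixed with a single backslash.
    r"|(?P<symbol>" + SYM_BODY + r"+|\\" + SYM_BODY + r"+)"
)


def tokenize(text):
    """Split text into (kind, raw_text) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize(r"\pi \^ \{ x \}"))
```

Under these assumptions `\{' comes out with kind "fence" while a
bare `{' comes out with kind "brace", and a backslash followed by
white space is rejected as an error.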


    a)  It is unclear whether `<', `|' or `>' should be treated as
        fence tokens or not. If they are, this prohibits multi-character
        tokens such as `<<'. If they aren't, then white space may be
        needed for disambiguation.

    b)  Should we reserve certain characters for future use?
        e.g. for comments or escaping to other notations

    c)  What about characters > 127, either Latin-1 or Unicode?
        We should look forward to when support for Unicode is ubiquitous.
        What does this mean in practice?

    d)  What about breaking and non-breaking spaces?
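
To make the trade-off in issue (a) concrete, the following sketch
builds a tokenizer for each policy and compares the results. It is
an illustration only: make_tokenizer and its angle_as_fence
parameter are hypothetical names, and the number grammar is an
assumption inferred from the examples given earlier.

```python
import re


def make_tokenizer(angle_as_fence):
    """Build a tokenizer for one of the two policies discussed in issue (a)."""
    symbol_chars = "+-/*^_!~@$%&=:?.\"`'"
    fence_chars = "()[]"
    if angle_as_fence:
        fence_chars += "<|>"    # each of < | > is a one-character fence
    else:
        symbol_chars += "<|>"   # < | > may combine into longer symbols
    pattern = re.compile(
        r"[a-zA-Z][a-zA-Z.]*"
        r"|[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?"
        r"|[" + re.escape(symbol_chars) + r"]+"
        r"|[" + re.escape(fence_chars) + r"]"
    )
    # findall skips anything that matches no alternative, which here
    # means white space (and, in this sketch, any unlisted character).
    return pattern.findall


as_fence = make_tokenizer(True)
as_symbol = make_tokenizer(False)

print(as_fence("a<<b"))     # ['a', '<', '<', 'b']
print(as_symbol("a<<b"))    # ['a', '<<', 'b']
print(as_symbol("x<-y"))    # ['x', '<-', 'y']
print(as_symbol("x< -y"))   # ['x', '<', '-', 'y']
```

The last two lines show the disambiguation cost of the symbol
policy: without white space, `<' and `-' merge into one token.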

-- Dave Raggett <dsr@w3.org> tel: +1 (617) 258 5741 fax: +1 (617) 258 5999
   World Wide Web Consortium, 545 Technology Square, Cambridge, MA 02139
   url = http://www.w3.org/People/Raggett