This is a proposal for the core lexical syntax for tokenizing the text
strings:

An expression is one or more tokens, optionally separated by white space
characters, although white space is sometimes needed to disambiguate
where one token ends and the next begins.

White space is one or more characters from the set:

    space characters
    horizontal tabs
    carriage returns
    line feeds
    form feeds
    end-of-file

Each token takes one of the following forms:

Name tokens:

    A letter [a-zA-Z] followed by zero or more characters
    from the set: [a-zA-Z.]

    e.g.  x  i  j  k  T  M  pi  big.apple

Number tokens:

    Start with a digit, e.g.  0  1.2  0.05  1.2e-23

Symbol tokens:

    One or more characters from the set:

    + - / * ^ _ | < > ! ~ @ $ % & = : ? . " ` '

Fence tokens:

    One character from the set: ( ) [ ]

Brackets:

    These are restricted to `{' and `}'

The `\' character can be used as a prefix to any of the above
(e.g. `\pi' and `\^') to yield distinct tokens. Note that `\\' behaves
as a literal backslash character and may be used within symbol tokens,
while `\{' and `\}' behave as literal brace characters (fence tokens)
when you want explicit curly braces in expressions.

Tokens swallow characters until an unambiguous delimiter is seen.

Issues:

a) It is unclear whether `<' `|' or `>' should be treated as fence
   tokens or not. If they are, this prohibits `<<' as a token. If they
   aren't, then whitespace may be needed for disambiguation.

b) Should we reserve certain characters for future use?
   e.g. for comments or escaping to other notations

c) What about characters > 127, either Latin-1 or Unicode?
   We should look forward to when support for Unicode is ubiquitous.
   What does this mean in practice?

d) What about breaking and non-breaking spaces?

-- Dave Raggett <dsr@w3.org> tel: +1 (617) 258 5741 fax: +1 (617) 258 5999
   World Wide Web Consortium, 545 Technology Square, Cambridge, MA 02139
   url = http://www.w3.org/People/Raggett

Received on Friday, 12 April 1996 15:45:26 UTC
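
As an illustration only, the token classes proposed above could be
sketched with regular expressions roughly as follows. This is not part of
the proposal. The names TOKEN_SPEC, TOKEN_RE and tokenize are hypothetical,
and several points are assumptions: the number grammar is guessed from the
examples (the proposal only says "start with a digit"); a name may be a
single letter, since the examples include x and T; `<', `|' and `>' are
treated as symbol characters, which leaves issue (a) open; and any
character outside the listed sets is reported as an error.

```python
import re

# Symbol characters exactly as listed in the proposal.
SYMBOL_CHARS = "+-/*^_|<>!~@$%&=:?.\"`'"
SYMBOL_CLASS = "[" + re.escape(SYMBOL_CHARS) + "]"

# (kind, pattern) pairs, tried in order at each position.
# An optional leading backslash marks the "distinct" escaped form.
TOKEN_SPEC = [
    ("space", r"[ \t\r\n\f]+"),
    ("name", r"\\?[a-zA-Z][a-zA-Z.]*"),
    # Assumption: digits, optional fraction, optional exponent.
    ("number", r"\\?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"),
    # \{ and \} are literal braces, classed as fence tokens.
    ("fence", r"\\[{}]|\\?[()\[\]]"),
    ("bracket", r"[{}]"),
    # A doubled backslash is a literal backslash inside a symbol token.
    ("symbol", r"\\?(?:" + SYMBOL_CLASS + r"|\\\\)+"),
]

TOKEN_RE = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {text[pos]!r} at offset {pos}"
            )
        if match.lastgroup != "space":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, lexeme in tokenize(r"big.apple << (\pi + 1.2e-23)"):
        print(kind, lexeme)
```

Because each pattern is greedy, a token swallows characters until something
outside its own character set appears, so `x<<y' splits into three tokens
without white space, while `big.apple' stays a single name.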