Here is a summary of my present understanding of our consensus on the general framework for HTML-Math lexical analysis, parsing, and extensibility, modified in some cases by my own proposals. This supersedes my original "position papers" (linked to from the bottom of Dave's HTML-Math home page), even though they are much more detailed, and some aspects of them remain either in our consensus view or as my proposals. I provide this for the purpose of helping to organize the present discussion. With this background established, I'll later discuss some of the specific issues mentioned here in separate letters.

1. SGML parsing, character set.

The source document is parsed in accordance with an SGML DTD. Individual HTML-Math elements are encapsulated by the markup <math>...</math>, with the interior largely parsed in subsequent stages in an HTML-Math-specific way (i.e. not by the DTD), though to the extent that this element contains SGML markup, that must also be parsable by the DTD.

Extended characters are denoted by the "entity notation" &xxx;, where xxx is the name or abbreviation of the character. This is parsed in accordance with SGML, which means that the subsequent HTML-Math parsing process can't tell whether a character was given in this form. This means, for example, that if there is an entity notation for a blank space, it will still be treated as whitespace by the HTML-Math parser and used only to separate tokens.

2. HTML-Math tokenization (lexical analysis).

The characters between <math> and </math> in each HTML-Math element are turned into a linear sequence of HTML-Math lexical tokens. Tokens include "terms", which can be HTML-Math identifiers, numbers, or embedded non-Math HTML elements, and "operators", which include not only infix, prefix, and postfix operators, but also HTML-Math markup tags (such as <math>, <mn>, </mn>, etc.; see below) and bracketing characters. { and } are SHORTREF for <math> and </math> (or possibly <me> and </me>) respectively.
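As a concrete illustration of the tokenization step just described (this is not part of the proposal), here is a minimal sketch in Python. The token patterns and character classes are my own assumptions, since the actual classification of characters is author-extensible and its format is undecided:

```python
import re

# Illustrative token classes; the actual character classification is
# author-extensible in the proposal, so these patterns are assumptions.
TOKEN_RE = re.compile(r"""
    (?P<number>     \d+(?:\.\d+)?   )   # numbers
  | (?P<identifier> [A-Za-z]\w*     )   # identifiers ("terms")
  | (?P<entity>     &\w+;           )   # SGML-style entity references
  | (?P<bracket>    [(){}\[\]]      )   # bracketing characters
  | (?P<operator>   \\\w+|[^\s\w]   )   # \word operators, symbol operators
""", re.VERBOSE)

def tokenize(source):
    """Turn the text between <math> and </math> into (kind, text) tokens.
    Whitespace only separates tokens and is discarded."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]
```

Note that whitespace never survives into the token stream; it only separates matches, in keeping with the rule that no token can contain embedded whitespace.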
(<me>...</me> stands for "math expression", and is markup which forces the contents to be treated as a single subexpression; but it may turn out to be entirely redundant with <math>...</math>, which encapsulates an HTML-Math expression in a non-Math section of an HTML document.)

Whitespace is thrown out, and is used only to separate adjacent tokens. (No token can contain embedded whitespace, except an embedded non-math HTML element.) The specifics of lexical analysis can be discussed in separate letters.

Some of the details of tokenization (especially the way characters are classified into different types) are author-extensible. We have not yet discussed the format in which this is specified. The location and inheritance of such specifications is the same as for all other author extensions (see below).

3. HTML-Math parsing (use of operator precedences and brackets to form the "parse tree").

The HTML-Math parser uses a context-dependent set of operator declarations, which declares the list of operator and bracket tokens, and for each one, its precedence and associativity. This also includes a precedence for the "missing infix operator", <juxt>, which is assumed by the parser to lie invisibly between adjacent terms (as necessary to obtain a single expression), and is often used by authors to represent multiplication. The details of all this can be discussed elsewhere.

The result of parsing is an "expression tree" or "parse tree" (which has the same nature as a "display list"). All of these terms refer to structures with the same data type, and any such structure *could* be generated from parsing, or *could* be directly rendered (provided each operator used has a well-defined rendering behavior).
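To make the role of operator precedences and the invisible <juxt> operator concrete, here is a minimal precedence-climbing sketch in Python. The precedence values and the tuple representation of expression trees are placeholder assumptions, not agreed-upon details:

```python
# A minimal precedence-climbing parser over a token list, inserting the
# invisible infix operator <juxt> between adjacent terms.  Precedence
# values are assumed; all operators here are left-associative.
PREC = {"+": 10, "*": 20, "<juxt>": 30}

def parse(tokens):
    tree, rest = parse_expr(list(tokens), 0)
    assert not rest, "trailing tokens"
    return tree

def parse_expr(tokens, min_prec):
    left, tokens = parse_term(tokens)
    while tokens and tokens[0] != ")":
        tok = tokens[0]
        # Two adjacent terms with no operator between them: assume <juxt>.
        op = tok if tok in PREC else "<juxt>"
        if PREC[op] < min_prec:
            break
        rest = tokens[1:] if op == tok else tokens  # <juxt> consumes nothing
        right, tokens = parse_expr(rest, PREC[op] + 1)  # +1: left-assoc
        left = (op, left, right)
    return left, tokens

def parse_term(tokens):
    tok, rest = tokens[0], tokens[1:]
    if tok == "(":                      # bracketed subexpression -> "(X)"
        inner, rest = parse_expr(rest, 0)
        assert rest and rest[0] == ")", "missing close bracket"
        return ("(X)", inner), rest[1:]
    return ("id", tok), rest
```

Under these assumptions, the token sequence ( a + b ) parses to the nested tuple ("(X)", ("+", ("id", "a"), ("id", "b"))), matching the LISP-like lists used in the examples.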
Generally I will use the term "parse tree" to denote the direct output of parsing, and "display list" to denote the structure which is directly rendered, and refer to both of these generically as "expression trees"; the display list will be generated from the parse tree by macro processing (step 4).

We can assume that an expression tree takes the form of a LISP-like list whose first element (or "head") corresponds to an operator, and whose remaining elements (or "arguments") correspond to subexpressions. However, in some cases, the head of this list may represent several operator tokens as one unit, when these operators are collected together with intervening subexpressions into one larger expression -- for example, if the token sequence is

  ( a + b )

the parse tree or display list representation might be

  ("(X)" ("+" (id a) (id b)))

where the heads of the lists have the following meanings:

  id     identifier
  "+"    ... + ...
  "(X)"  ( ... )

We have not yet reached complete consensus on the details of this; so far each piece of email includes a different form of parse tree, so these should be considered illustrations of principles under discussion. We also have not yet reached agreement on the details of operator precedences and associativities, the relation of this to brackets, or the format in which all this will be specified by authors (when extending the defaults provided by the standard). (I mailed a proposal for the details of operator precedences and associativities and brackets to the list some time ago, but it is one of the letters which seems to have mysteriously disappeared from the w3c-math-erb archives. If it is not dug up soon, I will reconstruct it (unfortunately I trusted the archives enough not to keep a perfect copy of my own).)

E. Extensibility (a digression from the chronological discussion of the stages of HTML-Math parsing, in the order in which they occur).
All these details of parsing (and everything else which is author-extensible) are part of the Math Context (which I previously called the Math Syntax Model), which is available as part of the context of each subexpression. There is a default Math Context specified by HTML-Math, but it can be incrementally extended or wholly replaced by authors, for all or any part of a document, with scoping compatible with SGML markup tags such as the <div> element (a proposal for HTML3 which sets up properly nested scopes for general HTML context information), as well as the { and } tags within HTML-Math.

We haven't discussed the details of how extensions are specified by authors, but we agree that extensions can be given directly in the source file, or by reference to another URL, and that parts of the Math Context for an outer context are inherited by inner (directly nested) contexts unless something specifies otherwise.

All extensible information mentioned in this letter (and some more) is part of the Math Context; it includes parameters for lexical tokenization, operator parsing, macro rules, and display parameters (e.g. "chemistry" or "math" mode for vertical positioning of sub/super-scripts). It also includes whatever semantic information can be specified. (That's why I renamed it from Math Syntax Model -- not nearly all of it concerns syntax. I have also avoided the term "Math Model" to avoid confusion with the term "model" from mathematical logic, which might suggest that this extendable contextual information is *entirely* semantic, whereas it is mostly *not* semantic.)

4. Macro expansion (and its relation to layout schema and rendering).

The Math Context contains "macro rules", each consisting of a "template", i.e. a pattern that can match certain parts of an expression tree (and which can contain named formal parameters), and a "result", which is a replacement for the matched subexpression (and which can also contain instances of the named formal parameters).
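The inheritance of Math Context settings by nested scopes might be sketched as follows; the key names and values are hypothetical, since we have not agreed on a specification format:

```python
from collections import ChainMap

# A sketch of Math Context scoping: an inner scope (e.g. a <div> or a
# {...} group) inherits every setting from its enclosing context unless
# it overrides it.  The keys shown here are illustrative, not agreed upon.
default_context = {
    "precedence": {"+": 10, "*": 20, "<juxt>": 30},  # assumed values
    "script_mode": "math",   # vs. "chemistry" sub/superscript positioning
    "macro_rules": [],
}

def enter_scope(outer, overrides):
    """Layer a nested scope's overrides over its parent context."""
    return ChainMap(overrides, outer)

# A nested scope switching to chemistry-style scripts inherits the rest.
inner = enter_scope(default_context, {"script_mode": "chemistry"})
```

The layered-lookup design mirrors the inheritance rule above: an inner context consults its own overrides first and falls through to the outer context for everything else, without copying or mutating it.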
The purposes of the macro facility include:

- allowing authors to use abbreviations;

- allowing authors to specify semantic connotations, by the choice of source form for expressions which might render the same way;

- allowing viewers (e.g. the human users of renderers) to use specialized rules for viewing (or otherwise processing) special classes of expressions, especially when those classes correspond to unique semantic connotations;

- allowing the HTML-Math proposal to specify certain high-level constructs (e.g. an "integral" operator) in the form of standard macros which translate them into lower-level rendering primitives ("layout schemas") (and which also carry informal semantic connotations), thus avoiding the need for an excessive number of specialized rendering rules, and easing the customization of rendering rules by users.

In my own proposal, the Math Context need not contain any rules for the most ordinary operators like +, which can be fully specified by their precedence and associativity. Such operators can be considered to be display primitives; in fact, any expression of the form (infix-operator term1 term2) can be rendered in the same manner by default; in a typical 2-D renderer this manner would be a horizontal sequence of term1, operator, term2. (If the precise spacing should depend on the operator, I will consider this a rendering parameter rather than requiring that a macro rule somehow specify the spacing.) Similarly, any expression of the form (prefix-operator term1) can be rendered as the sequence "operator term1", and similarly for postfix operators.

Even expression trees like this example from above,

  ("(X)" ("+" (id a) (id b)))

can be rendered entirely using default rendering behavior (perhaps modified by operator-specific rendering parameters) without the need for any template matching.
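The default rendering behavior just described (no macro rule needed for ordinary infix, prefix, or postfix operators) can be sketched for a simple linear renderer; the fixity table and the string output format are my assumptions, standing in for the operator declarations of the Math Context:

```python
# A sketch of default rendering for a linear (1-D) output: any expression
# (infix-op term1 term2) renders as "term1 op term2" without any macro
# rule or template matching.  The fixity table is an assumed stand-in for
# the operator declarations in the Math Context.
FIXITY = {"+": "infix", "/": "infix", "!": "postfix", "(X)": "bracket"}

def render(tree):
    if tree[0] == "id":                     # (id a) renders as its name
        return tree[1]
    head, args = tree[0], [render(sub) for sub in tree[1:]]
    fixity = FIXITY.get(head, "infix")      # default: treat as infix
    if fixity == "bracket":                 # "(X)" restores the brackets
        return "( " + args[0] + " )"
    if fixity == "prefix":
        return head + " " + args[0]
    if fixity == "postfix":
        return args[0] + " " + head
    return args[0] + " " + head + " " + args[1]
```

A 2-D renderer would replace string concatenation with horizontal box layout, but the dispatch on fixity, with no per-operator rules, is the same idea.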
Constructs with traditional "2-dimensional" notations, however, will typically require macros (provided by the standard or by the author of a document) to translate from a "logical source form" (that is not a standard term) into a "layout schema". None of the details of this are yet worked out fully, let alone agreed upon. However, we have general agreement about the flavor and scope of this, so I will fill in the details in one possible manner in order to illustrate what I think we agree on. I'll concentrate on the examples of fractions (with horizontal bars) and integrals. I'll assume for these examples that "words" are always identifiers, but "\words" can be operators.

Every layout schema (e.g. in Neil's list of "Candidate Rendering Schema" recently forwarded to the group by Dave under that subject heading) will have a standard low-level form, which can be directly rendered without the use of macros. For example, we might specify that

  {\frac \num ... \den ...}

is a low-level form (in HTML-Math source format) for fractions with horizontal bars. One way to make this work is to declare (in the standard Math Context) that \frac is a prefix operator (with lower precedence than anything except bracket-interiors), and that \num and \den are *infix* operators with the same left-associative precedence, a bit higher than \frac but lower than anything else. Then everything of the above form (provided the ...s are wrapped in {}s if they contain any \word-like operators themselves) will parse into an expression tree that looks like

  ( \frac ( \den ( \num <missing> ... ) ... ))

which can be recognized and directly interpreted by the renderer. ( <missing> is an empty element inserted by the parser for missing terms. The reason to use it here is mainly to allow \num and \den to come in either order.)
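The operator declarations realizing this first low-level form might be written out in a hypothetical table form like this; the numeric values are placeholders, and only their relative order matters:

```python
# Hypothetical operator declarations for the first low-level fraction
# form: \frac below everything except bracket interiors, \num and \den
# just above \frac (same left-associative precedence) but below all
# ordinary operators such as +.  Numbers are placeholders.
DECLARATIONS = {
    "\\frac": {"fixity": "prefix", "prec": 1},
    "\\num":  {"fixity": "infix",  "prec": 2, "assoc": "left"},
    "\\den":  {"fixity": "infix",  "prec": 2, "assoc": "left"},
    "+":      {"fixity": "infix",  "prec": 10, "assoc": "left"},
}
```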
There are other ways to set this up; for example, the low-level form could instead be

  {\frac {\num ...} {\den ...}}

with \frac a prefix operator of lower precedence than <juxt>, and \num and \den prefix operators of the lowest possible precedence (except higher than the left-precedence of "}"), so that the above parses to the expression tree

  ( \frac ( <juxt> ( \num ... ) ( \den ... )))

The reasons to prefer one of these low-level forms (or some other one) over another are beyond the scope of this letter; either of the above methods is sufficient to serve as an example. I'll assume the first one for now.

Finally I can explain how this relates to macros. For each layout schema, besides the low-level form there may be one or more higher-level constructs which make use of it. These would not be necessary if the only goal was to permit rendering of anything expressible using the layout schema; rather, by providing predefined macros for certain higher-level constructs, we permit authors to use abbreviations, imply certain (perhaps informal) semantic connotations, and permit customization of rendering rules by viewers (who can override the standard macro rules for these constructs).

For example, we might choose to define an infix operator "\over" (perhaps of the same precedence as the infix operator "/" for division with linear (horizontal) syntax), and provide a standard macro rule which transforms

  { ... \over ... }

into

  { \frac \num ... \den ... }

Why would we want to do this? Abbreviation and semantic connotation might be motivating factors; in more complex examples these reasons will be more compelling. E.g., if the \frac construct has additional options (e.g. to replace or remove the horizontal bar), and is sometimes used for non-fractions (e.g.
for the insides of binomial coefficients), then by declaring the \over syntax and macro-expanding it into the \frac form, we allow authors to declare which instances of the \frac layout schema actually represent fractions in the semantic sense (i.e. the ones represented by the \over operator), and thus allow viewers to use different rendering rules for true fractions but not for general uses of the \frac construct. (One might suggest, in light of this, that my example \frac layout schema is misnamed, and that in general one should avoid naming layout schemas in accordance with one of their common semantic uses; and I would tend to agree.)

It bears emphasis that, in this macro proposal, there is no distinction between a "macro instance" and any other expression, except whether there happens to be a macro rule which transforms that expression, and that this may depend on which viewer is viewing a document (or even on the mode being used by that viewer, e.g. on-screen or copy-command). Potentially any expression structure could be transformed by a macro rule. I believe that we have consensus on this feature, even if not on all details of macros as I'm describing them. Also, I believe we have complete consensus that macro processing occurs after operator-precedence parsing, and can't reorganize the expression tree as if it were re-parsed (as can happen in macro systems which operate at the textual or token stage, as in the C programming language).

It is very important that we are providing both authors and viewers the ability to define their own macro rules. However, to the extent that we provide predefined, standard macros for common constructs like fractions and integrals, this does not release us from the responsibility to design them as well as if we weren't providing authors with the ability to override us.
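The tree-to-tree macro expansion described here, matching a template with named formal parameters and substituting the matched bindings into a result, might be sketched as follows. The rule shown transforms the \over form into the first \frac low-level form; tuples stand in for the LISP-like lists, and the "?name" convention for formal parameters is my own assumption:

```python
# A sketch of tree-to-tree macro expansion: a template with named formal
# parameters ("?num", "?den") matches a subtree, and the bindings are
# substituted into the result tree.  The rule below rewrites
# (\over x y) into the \frac / \num / \den low-level form, with
# "<missing>" filling the empty slot before \num, as in the parse above.
OVER_RULE = (
    ("\\over", "?num", "?den"),                                    # template
    ("\\frac", ("\\den", ("\\num", "<missing>", "?num"), "?den")),  # result
)

def match(template, tree, bindings):
    if isinstance(template, str):
        if template.startswith("?"):        # formal parameter: bind subtree
            bindings[template] = tree
            return True
        return template == tree             # literal head or atom
    return (isinstance(tree, tuple) and len(tree) == len(template)
            and all(match(t, s, bindings) for t, s in zip(template, tree)))

def substitute(result, bindings):
    if isinstance(result, str):
        return bindings.get(result, result)
    return tuple(substitute(part, bindings) for part in result)

def expand(rule, tree):
    """Apply one macro rule to one node.  A full expander would walk the
    whole expression tree and recurse into rewritten results."""
    template, result = rule
    bindings = {}
    if match(template, tree, bindings):
        return substitute(result, bindings)
    return tree
```

Note that, consistent with the consensus described above, this operates on the already-parsed expression tree: the rewrite cannot regroup tokens or change precedences the way a textual macro system could.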
Authors will be stuck with our decisions about standard macros (and their semantic connotations, if any), to the extent that they want to represent semantics in a way which is a universal standard. (Within specific subcommunities, such as submitters of manuscripts to some set of math journals, it is of course possible and hoped-for that an additional set of standard macros with formal or informal semantic connotations would be agreed upon.)

---

I hope that this clarifies some of the structure we have arrived at, even though I have not completely spelled out what part of it is the group's consensus, and what part remains my proposals. To the extent that any of it is disagreed with, it would be very helpful if this was specifically pointed out.

Received on Wednesday, 17 April 1996 19:10:29 UTC