Terminology and Framework.

From: Bruce Smith <bruce@wolfram.com> Date: Wed, 17 Apr 1996 16:08:26 -0700 Message-Id: <199604171608.4873@uvea.wolfram.com> To: w3c-math-erb@w3.org

Here is a summary of my present understanding of our consensus on
the general framework for HTML-Math lexical analysis, parsing, and
extensibility, modified in some cases by my own proposals.

This supersedes my original "position papers" (linked to from the
bottom of Dave's HTML-Math home page), even though they are much
more detailed, and some aspects of them remain either in our
consensus view or as my proposals.

I provide this for the purpose of helping to organize the present
discussion. With this background established, I'll later discuss
some of the specific issues mentioned here in separate letters.

1. SGML parsing, character set.

The source document is parsed in accordance with an SGML DTD.
Individual HTML-Math elements are encapsulated by the markup

	<math>...</math>

with the interior largely parsed in subsequent stages in an HTML-Math
specific way (i.e. not by the DTD), though to the extent that this
element contains SGML markup, that must also be parsable by the
DTD.

Extended characters are denoted by the "entity notation"

	&xxx;

where xxx is the name or abbreviation of the character. This is
parsed in accordance with SGML, which means that the subsequent
HTML-Math parsing process can't tell whether a character was given
in this form. This means, for example, that if there is an entity
notation for a blank space, the HTML-Math parser will still treat
it as whitespace, i.e. as serving only to separate tokens.
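
A minimal sketch of this ordering, in Python, may make the consequence
concrete. The entity names in the table are hypothetical (the real
ones would come from the DTD), and both functions are only stand-ins
for the two stages.

```python
import re

# Hypothetical entity table; the real names would be defined by the DTD.
ENTITIES = {"sp": " ", "alpha": "\u03b1"}

def expand_entities(source):
    """Stand-in for the SGML stage: replace each &name; with its character."""
    return re.sub(r"&(\w+);", lambda m: ENTITIES[m.group(1)], source)

def split_tokens(text):
    """Stand-in for the HTML-Math stage: whitespace only separates tokens."""
    return text.split()

# The later stage sees the same characters either way, so it cannot
# tell a literal blank from one written in entity notation.
literal = split_tokens(expand_entities("a b"))
via_entity = split_tokens(expand_entities("a&sp;b"))
```

Here `literal` and `via_entity` are the same token list, because the
entity has already been replaced by the time the text is split.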

2. HTML-Math tokenization (lexical analysis).

The characters between <math> and </math> in each HTML-Math element
are turned into a linear sequence of HTML-Math lexical tokens.

Tokens include "terms", which can be HTML-Math identifiers, numbers,
or embedded non-Math HTML elements, and also "operators", which
include not only infix, prefix, and postfix operators, but also
HTML-Math markup tags (such as <math>, <mn>, </mn>, etc.; see below)
and bracketing characters.

{ and } are SHORTREF for <math> and </math> (or possibly <me> and
</me>) respectively. (<me>...</me> stands for "math expression",
and is markup which forces the contents to be treated as a single
subexpression; but it may turn out to be entirely redundant with
<math>...</math> which encapsulates an HTML-Math expression in a
non-Math section of an HTML document.)

Whitespace is thrown out, and is used only to separate adjacent
tokens. (No token can contain embedded whitespace, except an embedded
non-math HTML element.)
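
As an illustration only, the tokenization described so far could be
sketched in Python as follows. The character classes are assumptions
(they are exactly the part that is meant to be author-extensible),
and embedded non-Math HTML elements and markup tags are left out to
keep the sketch short.

```python
import re

# Assumed character classes: digits form numbers, letters form
# identifiers, "\word" and a few single characters are operators.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z]+)
  | (?P<operator>\\[A-Za-z]+|[+\-*/=(){}])
  | (?P<space>\s+)
""", re.VERBOSE)

def tokenize(source):
    """Turn the text of one HTML-Math element into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        # Whitespace separates tokens but is not itself a token.
        if m.lastgroup != "space":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With these classes, `tokenize("( a + b )")` and `tokenize("(a+b)")`
give the same five tokens, since whitespace is discarded once it has
done its job of separating tokens.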

The specifics of lexical analysis can be discussed in separate
letters.

Some of the details of tokenization (especially the way characters
are classified into different types) are author-extensible. We
have not yet discussed the format in which this is specified. The
location and inheritance of such specifications is the same as for
all other author extensions (see below).

3. HTML-Math parsing (use of operator precedences and brackets
to form the "parse tree").

The HTML-Math parser uses a context-dependent set of operator
declarations, which declares the list of operator and bracket
tokens, and for each one, its precedence and associativity. This
also includes a precedence for the "missing infix operator", <juxt>,
which is assumed by the parser to lie invisibly between adjacent
terms (as necessary to obtain a single expression), and is often
used by authors to represent multiplication. The details of all
this can be discussed elsewhere.

The result of parsing is an "expression tree" or "parse tree" (which
has the same nature as a "display list"). All of these terms refer
to structures with the same data type, and any such structure
*could* be generated from parsing, or *could* be directly rendered
(provided each operator used has a well-defined rendering behavior).
Generally I will use the term "parse tree" to denote the direct
output of parsing, and "display list" to denote the structure which
is directly rendered, and refer to both of these generically as
"expression trees"; the display list will be generated from the
parse tree by macro processing (step 4).

We can assume that an expression tree takes the form of a LISP-like
list whose first element (or "head") corresponds to an operator,
and whose remaining elements (or "arguments") correspond to
subexpressions. However, in some cases, the head of this list may
represent several operator tokens as one unit, when these operators
are collected together with intervening subexpressions into one
larger expression -- for example, if the token sequence is

	(
	a
	+
	b
	)

the parse tree or display list representation might be

	("(X)" ("+" (id a) (id b)))

where the heads of the lists have the following meanings:

id	identifier
"+"	... + ...
"(X)"	( ... )

We have not yet reached complete consensus on the details of this;
so far each piece of email includes a different form of parse tree,
so these should be considered illustrations of principles under
discussion.
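
To make the principle concrete, here is a small operator-precedence
parser in Python that produces trees of the form shown above, with
nested tuples standing in for LISP-like lists. The precedence values,
the "num" head, and the restriction to left-associative infix
operators and parentheses are all assumptions made for this sketch,
not agreed details.

```python
# Hypothetical precedence table (higher binds tighter); every operator
# here is treated as left-associative.
PRECEDENCE = {"=": 1, "+": 2, "-": 2, "<juxt>": 3, "*": 3, "/": 3}

def parse(tokens):
    """Parse a list of token strings into a nested-tuple expression tree."""
    tree, pos = parse_expr(tokens, 0, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree

def parse_term(tokens, pos):
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        inner, pos = parse_expr(tokens, pos + 1, 0)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        # Both bracket tokens are collected into the single head "(X)".
        return ("(X)", inner), pos + 1
    if tok.isalnum():
        return ("num" if tok.isdigit() else "id", tok), pos + 1
    raise ValueError(f"expected a term, got {tok!r}")

def parse_expr(tokens, pos, min_prec):
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] != ")":
        tok = tokens[pos]
        # Adjacent terms imply the invisible <juxt> operator, which
        # consumes no token of its own.
        op, width = (tok, 1) if tok in PRECEDENCE else ("<juxt>", 0)
        prec = PRECEDENCE[op]
        if prec < min_prec:
            break
        right, pos = parse_expr(tokens, pos + width, prec + 1)
        left = (op, left, right)
    return left, pos
```

For the token sequence above, `parse(["(", "a", "+", "b", ")"])`
returns `("(X)", ("+", ("id", "a"), ("id", "b")))`, and
`parse(["2", "x", "+", "y"])` groups `2` and `x` under `<juxt>`
before applying `+`.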

We also have not yet reached agreement on the details of operator
precedences and associativities, the relation of this to brackets,
or the format in which all this will be specified by authors
(when extending the defaults provided by the standard).

(I mailed a proposal for the details of operator precedences and
associativities and brackets to the list some time ago, but it is
one of the letters which seems to have mysteriously disappeared
from the w3c-math-erb archives. If it is not dug up soon, I will
reconstruct it (unfortunately I trusted the archives enough not
to keep a perfect copy of my own).)

E. Extensibility (a digression from the chronological discussion
of the stages of HTML-Math parsing in the order in which they occur).

All these details of parsing (and everything else which is author-
extensible) are part of the Math Context (which I previously called
the Math Syntax Model) which is available as part of the context
of each subexpression. There is a default Math Context specified
by HTML-Math, but it can be incrementally extended or wholly replaced
by authors, for all or any part of a document, with scoping compatible
with SGML markup tags such as the <div> element (a proposal for
HTML3 which sets up properly nested scopes for general HTML context
information), as well as the { and } tags within HTML-Math.

We haven't discussed the details of how extensions are specified
by authors, but we agree that extensions can be given directly in
the source file, or by reference to another URL, and that parts of
the Math Context for an outer context are inherited by inner
(directly nested) contexts unless something specifies otherwise.
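
The inheritance rule can be modelled, purely as an illustration, with
Python's `collections.ChainMap`, where each nested scope is a child
mapping that falls back to its parent. The keys and values below are
hypothetical, since the contents of the Math Context have not been
specified.

```python
from collections import ChainMap

# Hypothetical default Math Context provided by the standard.
default_context = ChainMap({
    "precedence": {"+": 2, "<juxt>": 3},
    "script_mode": "math",
})

# An author extension scoped to part of a document (e.g. one <div>):
# it overrides one setting and inherits the rest.
section_context = default_context.new_child({"script_mode": "chemistry"})

# A directly nested scope (e.g. one { ... } group) that specifies
# nothing itself and so inherits everything from the enclosing scope.
inner_context = section_context.new_child()

# Wholly replacing the context is just starting a fresh chain.
replaced_context = ChainMap({"precedence": {}, "script_mode": "math"})
```

Lookups in `inner_context` see the "chemistry" override, while
`default_context` is left unchanged. Note that this simple model
overrides whole entries; incrementally extending a single table such
as "precedence" would need a merge step that is not shown here.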

All extensible information mentioned in this letter (and some more)
is part of the Math Context; it includes parameters for lexical
tokenization, operator parsing, macro rules, and display parameters
(e.g. "chemistry" or "math" mode for vertical positioning of
sub/super-scripts). It also includes whatever semantic information
can be specified.

(That's why I renamed it from Math Syntax Model -- not nearly all
of it concerns syntax. I have also avoided the term "Math Model"
to avoid confusion with the term "model" from mathematical logic,
which might suggest that this extendable contextual information is
*entirely* semantic, whereas it is mostly *not* semantic.)

4. Macro expansion (and its relation to layout schema and rendering).

The Math Context contains "macro rules" which consist of a "template",
i.e. a pattern that can match certain parts of an expression tree
(and which can contain named formal parameters), and a "result",
which is a replacement for the matched subexpression (which can
also contain instances of the named formal parameters).
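
One possible reading of "template" and "result" is sketched below in
Python, over expression trees written as nested tuples. The
convention that a string beginning with "?" is a formal parameter is
an assumption of this sketch, as is the choice to expand
subexpressions before the expression that contains them.

```python
def match(template, tree, bindings=None):
    """Return the parameter bindings if tree matches template, else None."""
    bindings = {} if bindings is None else bindings
    if isinstance(template, str) and template.startswith("?"):
        # A formal parameter matches anything, but consistently.
        if template in bindings and bindings[template] != tree:
            return None
        bindings[template] = tree
        return bindings
    if isinstance(template, tuple):
        if not isinstance(tree, tuple) or len(tree) != len(template):
            return None
        for t, s in zip(template, tree):
            if match(t, s, bindings) is None:
                return None
        return bindings
    return bindings if template == tree else None

def substitute(result, bindings):
    """Build the replacement by filling in the formal parameters."""
    if isinstance(result, str) and result.startswith("?"):
        return bindings[result]
    if isinstance(result, tuple):
        return tuple(substitute(r, bindings) for r in result)
    return result

def expand(tree, rules):
    """Expand subexpressions first, then try each rule on the node itself."""
    if isinstance(tree, tuple):
        tree = tuple(expand(sub, rules) for sub in tree)
    for template, result in rules:
        bindings = match(template, tree)
        if bindings is not None:
            return expand(substitute(result, bindings), rules)
    return tree

# Hypothetical abbreviation: (sq x) stands for x raised to the power 2.
RULES = [(("sq", "?x"), ("^", "?x", ("num", "2")))]
```

With these rules, `expand(("sq", ("id", "a")), RULES)` gives
`("^", ("id", "a"), ("num", "2"))`. A rule whose result matches its
own template would loop forever in this sketch; a real design would
have to decide how to handle that.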

The purposes of the macro facility include:

- allowing authors to use abbreviations;

- allowing authors to specify semantic connotations, by the choice
of source form for expressions which might render the same way;

- allowing viewers (e.g. the human users of renderers) to use
specialized rules for viewing (or otherwise processing) special
classes of expressions, especially when those classes correspond
to unique semantic connotations;

- allowing the HTML-Math proposal to specify certain high-level
constructs (e.g. an "integral" operator) in the form of standard
macros which translate them into lower-level rendering primitives
("layout schemas") (and which also carry informal semantic
connotations), thus avoiding the need for an excessive number of
specialized rendering rules, and easing the customization of
rendering rules by users.

In my own proposal, the Math Context need not contain any rules
for the most ordinary operators like +, which can be fully specified
by their precedence and associativity. Such operators can be
considered to be display primitives; in fact, any expression of
the form

	(infix-operator term1 term2)

can be rendered in the same manner by default; in a typical 2-D
renderer this manner would be a horizontal sequence of term1,
operator, term2. (If the precise spacing should depend on the
operator, I will consider this a rendering parameter rather than
requiring that a macro rule somehow specify the spacing.)

Similarly, any expression of the form

	(prefix-operator term1)

can be rendered as the sequence "operator term1", and similarly
for postfix operators. Even expression trees like this example
from above,

	("(X)" ("+" (id a) (id b)))

can be rendered entirely using default rendering behavior (perhaps
modified by operator-specific rendering parameters) without the
need for any template matching.
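
A rough sketch of such default rendering, in Python, with plain text
standing in for a 2-D renderer. The operator classification and the
spacing choices are hypothetical; in the proposal they would come
from the Math Context and from rendering parameters.

```python
# Hypothetical operator classes for this sketch.
INFIX = {"+", "-", "="}
PREFIX = {"~"}
POSTFIX = {"!"}

def render(tree):
    """Render an expression tree to text using only default behavior."""
    head, *args = tree
    if head in ("id", "num"):
        return args[0]
    if head == "(X)":
        return "(" + render(args[0]) + ")"
    if head == "<juxt>":
        # The invisible operator contributes no visible symbol.
        return render(args[0]) + render(args[1])
    if head in INFIX:
        return f"{render(args[0])} {head} {render(args[1])}"
    if head in PREFIX:
        return head + render(args[0])
    if head in POSTFIX:
        return render(args[0]) + head
    raise ValueError(f"no default rendering for {head!r}")
```

For the example tree, `render(("(X)", ("+", ("id", "a"), ("id", "b"))))`
produces `(a + b)` without consulting any macro rule.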

Constructs with traditional "2-dimensional" notations, however,
will typically require macros (provided by the standard or by the
author of a document) to translate from a "logical source form"
(that is not a standard term) into a "layout schema".

None of the details of this are yet worked out fully, let alone
agreed upon. However, we have general agreement about the flavor
and scope of this, so I will fill in the details in one possible
manner in order to illustrate what I think we agree on. I'll
concentrate on the examples of fractions (with horizontal bars)
and integrals. I'll assume for these examples that "words" are
always identifiers, but "\words" can be operators.

Every layout schema (e.g. in Neil's list of "Candidate Rendering
Schema" recently forwarded to the group by Dave under that subject
heading) will have a standard low-level form, which can be directly
rendered without the use of macros. For example, we might specify
that

	{\frac \num ... \den ...}

is a low-level form (in HTML-Math source format) for fractions with
horizontal bars.

One way to make this work is to declare (in the standard Math
Context) that \frac is a prefix operator (with lower precedence
than anything except bracket-interiors), and that \num and \den
are *infix* operators with the same left-associative precedence,
a bit higher than \frac but lower than anything else. Then everything
of the above form (provided the ...s are wrapped in {}s if they
contain any \word-like operators themselves) will parse into an
expression tree that looks like

	( \frac
	  ( \den
	    ( \num
	      <missing>
	      ... )
	    ... ))

which can be recognized and directly interpreted by the renderer.
( <missing> is an empty element inserted by the parser for missing
terms. The reason to use it here is mainly to allow \num and \den
to come in either order.)
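
The following Python sketch shows how these declarations could yield
that tree. The numeric precedences are assumptions chosen only to
respect the ordering just described, terms are kept as bare strings,
and {} grouping and <juxt> are omitted for brevity.

```python
# Assumed precedences: \frac lowest, \num and \den a bit higher and
# equal to each other, everything else higher still.
PREFIX_PREC = {"\\frac": 1}
INFIX_PREC = {"\\num": 2, "\\den": 2, "+": 3}

def parse(tokens):
    tree, pos = parse_expr(tokens, 0, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree

def parse_expr(tokens, pos, min_prec):
    tok = tokens[pos] if pos < len(tokens) else None
    if tok in PREFIX_PREC:
        operand, pos = parse_expr(tokens, pos + 1, PREFIX_PREC[tok])
        left = (tok, operand)
    elif tok is None or tok in INFIX_PREC:
        # No term where one is expected: insert the empty element.
        left = "<missing>"
    else:
        left, pos = tok, pos + 1
    # Left-associative infix operators at or above the current level.
    while pos < len(tokens) and tokens[pos] in INFIX_PREC:
        op = tokens[pos]
        prec = INFIX_PREC[op]
        if prec < min_prec:
            break
        right, pos = parse_expr(tokens, pos + 1, prec + 1)
        left = (op, left, right)
    return left, pos
```

Under these assumptions, `parse(["\\frac", "\\num", "a", "\\den", "b"])`
returns `("\\frac", ("\\den", ("\\num", "<missing>", "a"), "b"))`.
Writing \den before \num also parses, with the two infix operators
nested the other way round.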

There are other ways to set this up; for example, the low-level
form could instead be

	{\frac {\num ...} {\den ...}}

with \frac a prefix operator of lower precedence than <juxt>, and
\num and \den prefix operators of the lowest possible precedence
(except higher than the left-precedence of "}"), so that the above
parses to the expression tree

	( \frac
	  ( <juxt>
	    ( \num ... )
	    ( \den ... )))

The reasons to prefer one of these low-level forms (or some other
one) over another are beyond the scope of this letter; either of
the above methods is sufficient to serve as an example. I'll assume
the first one for now.

Finally I can explain how this relates to macros.

For each layout schema, besides the low-level form there may be
one or more higher-level constructs which make use of it. These
would not be necessary if the only goal was to permit rendering of
anything expressible using the layout schema; rather, by providing
predefined macros for certain higher-level constructs, we permit
authors to use abbreviations, imply certain (perhaps informal)
semantic connotations, and permit customization of rendering rules
by viewers (who can override the standard macro rules for these
constructs).

For example, we might choose to define an infix operator "\over"
(perhaps of the same precedence as the infix operator "/" for
division with linear (horizontal) syntax), and provide a standard
macro rule which transforms

	{ ... \over ... }

into

	{ \frac \num ... \den ... }
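
Expressed on expression trees rather than on source text, such a rule
might look like the Python sketch below. It assumes that \over parses
to a two-argument node and that the target is the first low-level
form discussed above; both are illustrations rather than settled
details.

```python
def expand_over(tree):
    """Rewrite every (\\over n d) node into the low-level \\frac form."""
    if not isinstance(tree, tuple):
        return tree
    head, *args = tree
    # Expand inside the subexpressions first.
    args = [expand_over(a) for a in args]
    if head == "\\over" and len(args) == 2:
        num, den = args
        return ("\\frac", ("\\den", ("\\num", "<missing>", num), den))
    return (head, *args)

source_tree = ("\\over", ("id", "a"), ("+", ("id", "b"), ("id", "c")))
low_level = expand_over(source_tree)
```

Because the rule works on the tree that the parser has already built,
it replaces one subexpression with another and cannot change how the
surrounding tokens were grouped.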

Why would we want to do this? Abbreviation and semantic connotation
might be motivating factors; in more complex examples these reasons
will be more compelling. E.g., if the \frac construct has additional
options (e.g. to replace or remove the horizontal bar), and is
sometimes used for non-fractions (e.g. for the insides of binomial
coefficients), then by declaring the \over syntax and macro-expanding
it into the \frac form, we allow authors to declare which instances
of the \frac layout schema actually represent fractions in the
semantic sense (i.e. the ones represented by the \over operator),
and thus allow viewers to use different rendering rules for true
fractions but not for general uses of the \frac construct.

(One might suggest, in light of this, that my example \frac layout
schema is misnamed, and that in general one should avoid naming
layout schemas in accordance with one of their common semantic
uses; and I would tend to agree.)

It bears emphasis that, in this macro proposal, there is no distinction
between a "macro instance" and any other expression, except whether
there happens to be a macro rule which transforms that expression,
and that this may depend on which viewer is viewing a document (or even
on the mode being used by that viewer, e.g. on-screen or copy-command).
Potentially any expression structure could be transformed by a macro
rule. I believe that we have consensus on this feature, even
if not on all details of macros as I'm describing them.

Also, I believe we have complete consensus that macro processing
occurs after operator-precedence parsing, and can't reorganize the
expression tree as if it were re-parsed (as can happen in macro
systems which operate at the textual or token stage, as in the C
programming language).

It is very important that we are providing both authors and viewers
the ability to define their own macro rules. However, to the extent
that we provide predefined, standard macros for common constructs
like fractions and integrals, this does not release us from the
responsibility to design them as well as if we weren't providing
authors with the ability to override us. Authors will be stuck with
our decisions about standard macros (and their semantic connotations,
if any), to the extent that they want to represent semantics in a
way which is a universal standard. (Within specific subcommunities,
such as submitters of manuscripts to some set of math journals, it
is of course possible and hoped-for that an additional set of
standard macros with formal or informal semantic connotations would
be agreed upon.)

---

I hope that this clarifies some of the structure we have arrived
at, even though I have not completely spelled out what part of it
is the group's consensus, and what part remains my proposals. To
the extent that any of it is disagreed with, it would be very
helpful if this were specifically pointed out.