(long) Content Markup suggestions for next version of MathML

Hi all,


I have no idea where the discussion on the next version of MathML is
going right now, or if Content MathML has been worked over yet in that
process.  In case these points have not yet been discussed in the
working group, I would like to point out a few places that I believe
could use some clarification or adjustment in order to come closer to
the stated requirement for MathML content markup:

  "Since the intent of MathML content markup is to encode mathematical
  expressions in such a way that the mathematical structure of the
  expression is clear, the syntax and usage of content markup must be 
  consistent enough to facilitate automated semantic interpretation." 

Here goes:

 - Compare these two quotes from the specs:

   -- "The condition element is always used together with one or more
   bvar elements." 

   -- "Note that the bound variable may be implicit: 

    <apply><max/>
      <condition>
        <apply><and/>
          <reln><in/><ci>x</ci><ci type="set">B</ci></reln>
          <reln><notin/><ci>x</ci><ci type="set">C</ci></reln>
        </apply>
      </condition>
    </apply>"

   These two obviously contradict each other.  

   I strongly recommend striking the second quote from the
   spec. Making the bound variables implicit like that is always a
   very bad idea in a semantically oriented language, as the specs
   note in another place: 

   "(The condition may involve more than one symbol.)"

   Once the condition involves more than one symbol, nothing tells
   an interpreter which of them is meant to be bound.

   This same kind of mistake was made in the specs for KIF 3.0 for the 
   "setofall" operator, leading to incorrect semantics whenever the
   condition contained a parameter (arbitrary constant).
   Consequently(?), the current ANSI draft spec for KIF no longer 
   contains a "setofall" operator.

   The main point is that correct automated semantic interpretation
   cannot be guaranteed unless the requirement for listing the bound 
   variables is always enforced.

 - "It is an error to enclose a relation in an element other than
   reln."

   Actually, they may also be enclosed in a <fn>, since <fn> turns
   anything into a function (e.g., a relation into its characteristic
   function mapping to {0,1}).

   Also, the definition of <apply> says that anything appearing as
   its first argument is automagically interpreted as a function, as
   if it were wrapped in a <fn>.  Consequently, relations may also
   appear in <apply>s.
   
 - "When used with int [or sum or product], each qualifier schema is
   expected to contain a single child schema; otherwise an error is
   generated." 

   This clashes with the definition of <interval>, doesn't it?

   But then, I also noticed this:

 - <interval> is used in a dual fashion:  as a qualifier, and as a
   constructor.  This can in some rare cases lead to ambiguities:

     "Considering interval-valued functions F bounded by functions
     f and g (i.e., F=[f,g], to abuse notation), it is easy to see
     that the integral of F (i.e., the integral of [f,g]) is 
     [integral of f, integral of g]."

   If you consider representing this in content-MathML, here is
   how you would want to do it interpreting <interval> as a 
   constructor:

   <reln><eq/>
     <apply><int/>
       <interval> <fn><ci>f</ci></fn>
                  <fn><ci>g</ci></fn>
       </interval>
     </apply>
     <interval>
       <apply><int/> <fn><ci>f</ci></fn>  </apply>
       <apply><int/> <fn><ci>g</ci></fn>  </apply>
     </interval>
   </reln>

   However, the MathML spec would wrongly lead to an interpretation 
   of the first of these <interval>s as a qualifier, because that's
   the syntactic disambiguation specified in MathML.

   It is possible, of course, to circumvent this problem by wrapping 
   that <interval> expression with a <fn>, but that kind of a 
   disambiguation technique may not always be desirable in similar
   cases.

   The easiest solution here would be to strike <interval> from the
   list of qualifiers entirely and instead note that <interval> is a
   kind of set, and we can therefore simply specify

       <condition> <interval> ... </interval> </condition>

   because <condition> is allowed to contain a set rather than a 
   boolean expression, giving it precisely the meaning we want.

   However, note that in the case of a set-valued condition, the
   sibling <bvar> variable's scope excludes that set (and thus the
   <condition> qualifier), because there is an implicit "var \in set"
   wrapped around it.  In the case of a predicative <condition>, on
   the other hand, the <condition> is inside the scope of the sibling
   <bvar> variable(s) because the integration variable appears in the
   condition.  

   This observation would argue for replacing the
   <interval> qualifier by a more general qualifier for sets that the
   dependent variable ranges over instead:

      <rangesover> <interval> ... </interval> </rangesover>

   (I should reiterate my opinion here that it's not a good idea to
   allow sibling nodes to be in different scopes.  OpenMath has
   rightly opted against doing that, albeit after long and hard
   discussions: solutions are much cleaner, and "automated semantic
   interpretation" much "facilitated", if scope boundaries are always
   container element boundaries, both for operator *and* for variable
   scopes.  See [1].)

 - Note that we could also write

       <set> <interval/> ... </set> 

   instead of <interval> ... </interval> if we generalize the <set>
   constructor to take set operators as a first argument and act a
   little like <fn> and <reln> in this respect.  <interval/> would be
   one such operator, set union and similar operators would be others.
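
To make the <interval> ambiguity concrete, here is a minimal Python
sketch of a purely syntactic disambiguation rule of the kind described
above.  The rule itself (an <interval> that is a direct child of an
<apply> headed by int, sum, or product is read as a qualifier) is my
reading of the behaviour discussed in this message, not a quotation
from the spec, and the names classify_intervals and
QUALIFIED_OPERATORS are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Operators assumed (for this sketch only) to accept qualifiers.
QUALIFIED_OPERATORS = {"int", "sum", "product"}

def classify_intervals(root):
    """Label each <interval>, in document order, as 'qualifier' or
    'constructor' using only its syntactic position."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    labels = []
    for node in root.iter("interval"):
        parent = parent_of.get(node)
        is_qualifier = (
            parent is not None
            and parent.tag == "apply"
            and len(parent) > 0
            and parent[0].tag in QUALIFIED_OPERATORS
        )
        labels.append("qualifier" if is_qualifier else "constructor")
    return labels

# The interval-valued integral example from above.
EXPRESSION = """
<reln><eq/>
  <apply><int/>
    <interval><fn><ci>f</ci></fn><fn><ci>g</ci></fn></interval>
  </apply>
  <interval>
    <apply><int/><fn><ci>f</ci></fn></apply>
    <apply><int/><fn><ci>g</ci></fn></apply>
  </interval>
</reln>
"""

labels = classify_intervals(ET.fromstring(EXPRESSION))
print(labels)  # ['qualifier', 'constructor']
```

Under this rule the first <interval> is taken to be a qualifier even
though a constructor was intended, which is exactly the
misinterpretation described above.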


Another major suggestion that I would like to make is to include
discussion of variable scoping in the discussion of semantics of
MathML elements "to facilitate automated semantic interpretation".
Here are some rules that may cover that topic in the current version
of MathML: 

 - The scope of a variable appearing in a bvar qualifier element is the
   container element containing the bvar qualifier, and all its
   children except <interval>, <lowlimit>, or <uplimit> 
   qualifiers appearing as siblings of the <bvar> qualifier.  

   (I discussed the reason why interval, lowlimit, and uplimit
   (as well as a potential "rangesover") are outside that scope at
   some length in a message to this forum a year and a half ago.
   See also [1] for a more detailed discussion, and how the
   compositionality principle comes in.)

   In particular, a condition qualifier is within the scope of a
   sibling bvar qualifier's variable.  (But see above comment on
   the <interval> qualifier/constructor:  if the <condition> is
   a set rather than a predicative expression, the sibling <bvar>'s
   variable's scope should *not* include the <condition>!)

 - Variables in a bvar element are bound within their scope;
   identifiers with identical names appearing outside their scope are
   semantically distinct entities that may take on different values in
   a valid interpretation, even if they denote the same concept.

   To illustrate the point, consider the example

      <apply><plus/>
        <ci>x</ci>
        <apply><int/>
          <bvar> <ci>x</ci> </bvar>
          <ci>x</ci>
        </apply>
      </apply>
   
    Here, the third x is within the scope of the second x, but the 
    first x is outside its scope.  Conceptually, the third x would
    range over some interval while calculating the value of this
    expression for one particular value of the first x.  Nevertheless,
    all three occurrences denote the concept of "the x-axis" -- in
    particular, the integral is implicitly assumed to produce a 
    function in x (a variable that is semantically identical to the 
    first x!).
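
The two rules above can be sketched as a small recursive function.
This is only an illustration under simplifying assumptions: it looks
at <ci> names only, ignores namespaces and anything else that may sit
inside a <bvar>, and the names free_identifiers and OUTSIDE_SCOPE are
invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Sibling qualifiers that lie outside the scope of a <bvar> variable.
OUTSIDE_SCOPE = {"interval", "lowlimit", "uplimit"}

def free_identifiers(node, bound=frozenset()):
    """Names of <ci> identifiers in `node` that no enclosing <bvar> binds."""
    if node.tag == "ci":
        name = (node.text or "").strip()
        return set() if name in bound else {name}
    # Variables declared by <bvar> children of this container element.
    declared = {
        (ci.text or "").strip()
        for bvar in node.findall("bvar")
        for ci in bvar.findall("ci")
    }
    free = set()
    for child in node:
        if child.tag == "bvar":
            continue  # a declaration, not an occurrence
        # Limits and intervals do not see the variables declared here.
        inner = bound if child.tag in OUTSIDE_SCOPE else bound | declared
        free |= free_identifiers(child, inner)
    return free

# x + (integral of x dx): only the first x is free.
outer = ET.fromstring(
    "<apply><plus/><ci>x</ci>"
    "<apply><int/><bvar><ci>x</ci></bvar><ci>x</ci></apply>"
    "</apply>"
)
# Integral from 0 to x of x dx: the x in <uplimit> is outside the scope.
limits = ET.fromstring(
    "<apply><int/><bvar><ci>x</ci></bvar>"
    "<lowlimit><cn>0</cn></lowlimit><uplimit><ci>x</ci></uplimit>"
    "<ci>x</ci></apply>"
)

print(free_identifiers(outer))                # {'x'}
print(free_identifiers(outer.find("apply")))  # set()
print(free_identifiers(limits))               # {'x'}
```

Note that the function never needs to know what the operator means; it
only follows container boundaries and the list of excluded qualifiers.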

As far as I can tell, these simple rules would allow one to correctly
interpret the semantics of bound variables in the current MathML.  A
set of rules like this would also make it possible to add additional
operators of the product and sum variety (which automatically come
with any n-ary operator) or new quantifiers ("there exists exactly
one" and "for almost all" are two such quantifiers that I have met
with in college), and to correctly interpret them as long as they
adhere to the style used by current MathML practices.  Moreover, you
could write a general-purpose MathML interpreter that would obey
variable scoping semantics both for the current and for user-extended
MathML:

  <apply> <fn definitionURL="...">exists_uniquely</fn>
     <bvar> <ci type="real">x</ci> </bvar>
     <apply> <and/>
         <reln> <gt/> <ci>x</ci> <cn>0</cn> </reln>
         <reln> <eq/> 
            <apply> <times/> <ci>x</ci> <ci>x</ci> </apply>
            <cn> 2 </cn>
         </reln>
     </apply>
  </apply>

(Incidentally, I think all those applys should be relns, and there may
be need for an equivalent of fn for relations.)
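
A general-purpose checker of this kind could also enforce the
requirement, argued for at the top of this message, that a <condition>
always comes with explicit <bvar> siblings.  The following is a
minimal sketch under the same simplifying assumptions as before (plain
element names, <ci> identifiers only); implicit_bvar_sites is a
hypothetical helper, not part of any MathML tool.

```python
import xml.etree.ElementTree as ET

def implicit_bvar_sites(root):
    """Return (container tag, candidate names) for every element that has
    a <condition> child but no <bvar> child.  The candidates are all
    identifiers used in the condition; any of them might be the one
    that was meant to be bound."""
    sites = []
    for container in root.iter():
        conditions = container.findall("condition")
        if conditions and not container.findall("bvar"):
            names = sorted({
                (ci.text or "").strip()
                for cond in conditions
                for ci in cond.iter("ci")
            })
            sites.append((container.tag, names))
    return sites

# The "implicit bound variable" example quoted from the spec above.
IMPLICIT = """
<apply><max/>
  <condition>
    <apply><and/>
      <reln><in/><ci>x</ci><ci type="set">B</ci></reln>
      <reln><notin/><ci>x</ci><ci type="set">C</ci></reln>
    </apply>
  </condition>
</apply>
"""
# The same expression with the bound variable listed explicitly.
EXPLICIT = IMPLICIT.replace("<max/>", "<max/><bvar><ci>x</ci></bvar>")

implicit_report = implicit_bvar_sites(ET.fromstring(IMPLICIT))
explicit_report = implicit_bvar_sites(ET.fromstring(EXPLICIT))
print(implicit_report)  # [('apply', ['B', 'C', 'x'])]
print(explicit_report)  # []
```

In the implicit form, a program has three candidates and no syntactic
way to pick x over B or C; in the explicit form there is nothing to
guess.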


Some other minor points:

 - "implies" is listed as "relation", while "and", "or", "xor", and
   all the rest are "operators".  This is inconsistent.

   (Personally, I'd list them all, along with the quantifiers, as 
   boolean operators/ relations.)

 - `4.3.2.7 order
    list 
        indicates ordering on the list. Predefined values:
        lexicographic, numeric  

         Default = "numeric" '
  
   Shouldn't the default be "unordered" or some such thing, for the
   case where the list is given by naming its elements, which may be
   totally unordered?


Finally, I would like to verify my understanding of this point, because
it may be a source of incompatibility with OpenMath:

  "real: A real number is presented in decimal notation. Decimal
  notation consists of an optional sign ("+" or "-") followed by a
  string of digits possibly separated into an integer and a fractional
  part by a "decimal point". Some examples are .3, 1, and -31.56. If a
  different BASE is specified, then the digits are interpreted as being
  digits computed to that base.  

  "A real number may also be presented in scientific notation. Such
  numbers have two parts (a mantissa and an exponent) separated by
  e. The first part is a real number while the second part is an integer
  exponent indicating a power of the base. For example, 12.3e5
  represents 12.3 times 10 ^5."

Does this mean that MathML represents "big floats" -- floats to an
arbitrary precision (read: number of digits)?
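
To show what I mean, here is a minimal Python sketch that reads the
lexical form quoted above into an exact rational, so that no digit is
lost however long the digit string is.  It says nothing about what a
MathML processor is required to do.  The function name parse_real is
made up, and two points are my own assumptions: the exponent is
written in decimal, and the base is small enough that "e" is not
itself a digit.

```python
from fractions import Fraction

def parse_real(text, base=10):
    """Parse '[sign]digits[.digits][e exponent]' into an exact Fraction.

    Assumptions of this sketch: the exponent is a decimal integer and
    'e' is not a valid digit in `base`.
    """
    mantissa, _, exponent = text.strip().partition("e")
    sign = -1 if mantissa.startswith("-") else 1
    mantissa = mantissa.lstrip("+-")
    integer_part, _, fraction_part = mantissa.partition(".")
    # Read all digits as one integer, then scale by the fractional length.
    digits = integer_part + fraction_part
    value = Fraction(int(digits, base), base ** len(fraction_part))
    if exponent:
        value *= Fraction(base) ** int(exponent)
    return sign * value

print(parse_real(".3"))      # 3/10
print(parse_real("-31.56"))  # -789/25
print(parse_real("12.3e5"))  # 1230000
```

A 50-digit fraction such as "0." followed by fifty 3s comes back with
denominator 10^50, i.e. with every digit preserved, which is what a
"big float" reading of the quoted text would amount to.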


Sorry for the long message...


Regards,

               Andreas Strotmann



[1] L.J.Kohout, A.Strotmann: "Understanding and Improving Content
Markup for the Web: from the Perspectives of Formal Linguistics,
Algebraic Logics, and Cognitive Science." in: ISIC/CIRA/ISAS '98 Joint
Conference on the Science and Technology of Intelligent Systems.


PS:  My apologies for not sending this sooner, but my move to and
first time at FSU was a bit time-consuming.

Received on Thursday, 13 May 1999 16:52:17 UTC