comments on MathML last call from Martin Duerst on 2003-05-07 (www-math@w3.org from May 2003)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 07 May 2003 18:02:43 -0400
To: www-math@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20030507115916.0416af10@localhost>

Dear MathML WG,

This is my review of your last call at
http://www.w3.org/TR/2003/WD-MathML2-20030411/, mainly based
on the diff-marked HTML version.

This is currently a personal review. I'm sending this to you
because your last call period is closing shortly.

The core task force of the Internationalization (I18N) WG meets
again next Tuesday, and will have a look at my comments and may
then endorse some of the comments, add some, or modify some.
I hope you can grant the I18N WG an extension of a few days.



Front page:

"Please refer to the errata for this document, which may include some 
normative corrections."

According to the new process document (now in review), errata
pages are not normative.


Overall: It would be extremely nice to have an index of elements
and attributes. Given all the technology used for producing the
spec, this shouldn't be a problem at all.


3.2.8 String Literal (ms)

"In practice, non-ASCII characters will typically be represented by entity 
references."

This sentence should be removed. It is technologically biased
(there are encodings and tools that don't need entity references),
and it is also biased with regard to language. E.g., a Japanese
mathematician would never represent Japanese (i.e. non-ASCII)
characters with entity references.


3.2.9 Adding new character glyphs to MathML (mglyph)

We have earlier complained about this: "character glyph" is
not a term that we know, nor is it defined in this spec, nor
should it be used, because it is highly confusing. The text
of the section now mostly manages to avoid it. If you want
an easy way out, substitute "character glyphs" with
"characters/glyphs". This at least makes it clear that
these are two different things.



4.4.11.1 Annotation (annotation)

(and same for 4.4.11.3 XML-based annotation (annotation-xml) )

 >>>>
The annotation element takes the attributes definitionURL and encoding that 
can be used to override the default semantics. Only the encoding attribute 
is required whenever the semantics remains unchanged.
 >>>>

It would be good to have a clear explanation of 'encoding' here,
so that people don't confuse it with the 'encoding' pseudo-attribute
in the XML declaration.
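
To illustrate the distinction, here is a minimal Python sketch. The sample
document and the attribute value "TeX" are illustrative assumptions, not
taken from the draft. The 'encoding' pseudo-attribute in the XML declaration
describes how the document's bytes map to characters, while the 'encoding'
attribute on annotation names the notation of the annotation's content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the XML declaration says how the bytes are encoded,
# the annotation attribute says which notation the annotation text uses.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<math xmlns="http://www.w3.org/1998/Math/MathML">'
    '<semantics><mi>x</mi>'
    '<annotation encoding="TeX">x</annotation>'
    '</semantics></math>'
).encode("utf-8")

root = ET.fromstring(doc)
ns = {"m": "http://www.w3.org/1998/Math/MathML"}
annotation = root.find(".//m:annotation", ns)

# This is an ordinary attribute, unrelated to the byte encoding of the file.
annotation_encoding = annotation.get("encoding")
print(annotation_encoding)  # TeX
```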


6.1 Introduction

 >>>>
It did not fall naturally within the purview of developing a specification 
enabling mathematics to be used with HTML and producing a DTD for the 
Working group this to worry about more than the entities allowed in the DTD.
 >>>>

"this" is weird.

More generally, the I18N WG has on various occasions requested that the
introduction in chapter 6 be seriously shortened to make sure the document
stays a spec rather than an account of a spec's history.


"While a long process of review and adoption by UTC and ISO/IEC of the 
characters of special interest to mathematics and MathML is now  complete 
(Unicode Work in Progress) there remains the possibility of some further 
modification of the lists of characters accepted, of the code assignments 
for those adopted, or of the names given them by Unicode. To make sure any 
possible corrections to relevant standards are taken into account, and for 
the latest character tables and font information, see the W3C Math Working 
Group home page and the Unicode site."

This is highly misleading. There is a very strong commitment by
Unicode and ISO to not change any codepoints or names. To our knowledge,
all characters referenced in the spec have been fully accepted, and any
language such as the above, suggesting there will be further changes,
is highly confusing and misleading and should be removed.


"The parenthetical notation beginning with U+ is one recommended by Unicode 
for referring to Unicode characters [see [Unicode], page xxviii]."

What about this notation is parenthetical? Proposal: remove 'parenthetical'.
'is one' -> 'is the one'; also, just introduce the notation, and then
avoid listing the same numbers twice, once without and once with U+.


6.2.1 Unicode Character Data

 >>>>>>>>
     * Using characters directly: For example, an A may be entered as 'A' 
from a keyboard (character U+0041J). This option is only available if the 
character encoding specified for the XML document includes the character. 
Most commonly used encodings will have 'A' in the ASCII position. In many 
encodings, characters may need more than one byte. Note that if the 
document is, for example, encoded in Latin-1 (ISO-8859-1) then only the 
characters in that encoding are available directly. Unfortunately, most 
mathematical symbols may not be encoded as character data in this way.
 >>>>>>>>

The last sentence is misleading. Using UTF-8 or UTF-16, the only two
encodings that all XML processors are required to accept, mathematical
symbols can be encoded as character data.
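
A minimal Python sketch of this point follows. The choice of U+2211 (N-ARY
SUMMATION) and of the mo element is an illustrative assumption. The symbol
cannot be written directly in ISO-8859-1, but it can be in UTF-8 and UTF-16,
and an XML parser reads it back as plain character data.

```python
import xml.etree.ElementTree as ET

markup = "<mo>\u2211</mo>"  # U+2211 N-ARY SUMMATION as direct character data

utf8_bytes = markup.encode("utf-8")
utf16_bytes = markup.encode("utf-16")  # Python prepends a byte order mark

# ISO-8859-1 has no position for this symbol, so direct entry fails there.
try:
    markup.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

parsed_text = ET.fromstring(utf8_bytes).text
utf16_round_trip = utf16_bytes.decode("utf-16")

print(latin1_ok)                   # False
print(parsed_text == "\u2211")     # True
print(utf16_round_trip == markup)  # True
```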

 >>>>
By using Character references it is always possible to access the entire 
Unicode range.
 >>>>

'Character references': inconsistent capitalization.



6.2.2 Special Characters Not in Unicode

 >>>>
In these cases one may use the mglyph  element for direct access to a glyph 
from some font and creation of a MathML character corresponding.
 >>>>

corresponding to what?


6.2.3 Mathematical Alphanumeric Symbols Characters.

there should not be a dot after the title

 >>>>
  The new Mathematical Alphanumeric Symbols provided in Unicode 3.1
 >>>>

remove 'new'. Otherwise, the spec already looks outdated
before it is approved.

 >>>>
... in contrast to the Basic Multilingual Plane (BMP) which has been used 
by Unicode so far.
 >>>>

remove temporal context ('so far')

 >>>>
For example, a Mathematical Fraktur alphabet is being added, and the code 
point for Mathematical Fraktur A is U1D504.
 >>>>

'is being added' seems to refer to some activity that is now complete.
Please update. Also, U1D504 -> U+1D504
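
As a small illustration of the U+ notation (a Python sketch, with variable
names chosen here for illustration), the code point can be formatted from
the character itself, which also shows that it lies in Plane 1 and takes a
surrogate pair in UTF-16:

```python
import unicodedata

fraktur_a = "\U0001D504"
code_point = ord(fraktur_a)

label = f"U+{code_point:04X}"       # conventional U+ notation
plane = code_point >> 16            # plane number: 0 is the BMP
name = unicodedata.name(fraktur_a)
utf16_hex = fraktur_a.encode("utf-16-be").hex()  # two 16-bit code units

print(label)      # U+1D504
print(plane)      # 1
print(name)       # MATHEMATICAL FRAKTUR CAPITAL A
print(utf16_hex)  # d835dd04
```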


6.2.4 Non-Marking Characters

 >>>>
Some characters, although important for the quality of print or alternative 
rendering, do not have glyph marks that correspond directly.
 >>>>

correspond to what?


 >>>>
The Universal Character Set (UCS) of Unicode and ISO 10646 continues to 
evolve, see Section 6.4.4 Status of Character Encodings. A small number of 
the changes recently introduced, relative to those resulting from the needs 
of Asian languages, are those designed exactly to facilitate the use of 
Unicode by the 'equation-writing' community. This specification is written 
on the assumption that the code assignments suggested to ISO/IEC 
JTC1/SC2/WG2 by the UTC will be confirmed as they are in public draft forms 
of Unicode 3.1 and 3.2. As before, we can only reiterate that for latest 
developments on details of character standards as far as they influence 
mathematical formalism the home page of the W3C Math Working Group should 
be consulted.
 >>>>

This seems to be totally outdated. Also, http://www.w3.org/Math/workingGroup
does not provide any relevant info. As text such as this has appeared
in older versions, http://www.w3.org/Math/workingGroup should contain
such info, even if it is just to say that all characters in question have
been approved in the meantime.


6.3 Character Symbol Listings

 >>>>
  The characters are listed by name, and sample glyphs provided for all of 
them. Each character name is accompanied by a code for a character grouping 
chosen from a list given below, a short verbal description, and a Unicode 
hex code drawn from ISO 10646, now extended in accordance with the proposal 
forwarded by the UTC to ISO/IEC WG2 in March 2000.
 >>>>

outdated, please fix


6.3.1 Special Constants

 >>>>
These have been accorded new Unicode values.
 >>>>

'have been accorded': remove temporal reference


6.3.4 Negated Mathematical Characters

 >>>>
Note that it is the policy of the W3C and of Unicode that if a single 
character is already defined for what can be achieved with a combining 
character, that character must be used instead of the decomposed form. It 
is also intended that no new single characters representing what can be 
done by with existing compositions will be introduced.
 >>>>

There should be an explicit mention of NFC, with a reference to Unicode
Standard Annex #15.
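
For illustration, a short Python sketch of what NFC does to a negated
operator. The choice of EQUALS SIGN plus COMBINING LONG SOLIDUS OVERLAY is
an example picked here, not one taken from the draft.

```python
import unicodedata

# Decomposed form: EQUALS SIGN followed by COMBINING LONG SOLIDUS OVERLAY.
decomposed = "=\u0338"

# NFC (Unicode Standard Annex #15) composes the pair into the single
# precomposed character where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(f"U+{ord(composed):04X}")        # U+2260
print(unicodedata.name(composed))      # NOT EQUAL TO
```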



6.3.6 Mathematical Alphanumeric Symbols

 >>>>
Most of these characters come from the additions to Plane 1, however a few 
characters (such as the double-struck letters N, P, Z, Q, R, C, H 
representing common number sets) were already present in Unicode 3.0 and 
retain their original positions.
 >>>>

This is again more version/history-oriented than necessary. What about:

Most of these characters are in Plane 1, except for a few characters (such 
as the double-struck letters N, P, Z, Q, R, C, H representing common number 
sets) which are in the BMP.
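
The split between the BMP and Plane 1 can be checked with a short Python
sketch. The helper names are illustrative, and the comparison with
MATHEMATICAL DOUBLE-STRUCK CAPITAL A is an example added here.

```python
import unicodedata

# The double-struck letters for common number sets, looked up by name.
letters = "NPZQRCH"
double_struck = [
    unicodedata.lookup(f"DOUBLE-STRUCK CAPITAL {letter}") for letter in letters
]

# A code point is in the BMP if it is at most U+FFFF.
all_in_bmp = all(ord(ch) <= 0xFFFF for ch in double_struck)

# By contrast, the remaining double-struck capitals are in Plane 1.
plane1_example = unicodedata.lookup("MATHEMATICAL DOUBLE-STRUCK CAPITAL A")
plane1_example_plane = ord(plane1_example) >> 16

print(all_in_bmp)            # True
print(plane1_example_plane)  # 1
```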



6.4.2 Fewer Non-marking Characters

 >>>>
It used to be in MathML 1.0 that there were a number more non-marking 
character entities listed.
 >>>>

'It used to be' reads like 'once upon a time'. But this is a spec, not
a fairy tale. What about:

MathML 1.0 contained a small number of non-marking character entities that
have been removed in MathML 2.0.



6.4.4 Status of Character Encodings

This section needs serious rework. Some of the (updated) text is speaking
about events in 2001. The section should simply say that earlier
versions may have mentioned that different characters were in different
stages of adoption in the standards process, but that all characters
now in the spec are fully standardized. This is the message that
we need to get out, and this is the way to keep the spec from
looking silly in a few years.


 >>>>
Even with the good will shown to the mathenatical community by the Unicode 
process a small number of characters of special interest to some may not 
yet have been included. The obvious solution of avoiding their use may not 
satisfy all. For these characters the Unicode mechanism involving Private 
Use Area codes could be deployed, in spite of all the dangers of confusion 
and collisions of conventions this brings with it. However, this is the 
situation for which mglyph was introduced.
 >>>>

This paragraph should be rewritten and shortened, if it belongs
in this section at all. It is particularly important to us
that mention of the private use area is removed. What about:

To refer to symbols not included in Unicode, please use the <mglyph>
element.


A.1 Use of MathML as Well-Formed XML

 >>>>
The document should be encoded in an encoding (for example UTF-8) in which 
al needed characters may be encoded as character data,...
 >>>>

al -> all

Finally UTF-8 is mentioned. Great!

 >>>>
However, in many circumstance,
 >>>>

circumstance -> circumstances; rest of this paragraph needs some
work too, e.g. "specification, Following" -> "specification. Following";
"the a schema validating processor schema" ->
"a schema validating processor"


A.2.2.2 Plane 1 Characters

As discussed earlier, what this section tries to do
(to provide workarounds for non-compliant XML implementations)
is unacceptable. This is all the more so because the problems in
IE, to our knowledge, have been fixed. This section
should be removed, and the corresponding DTD fragments fixed
to eliminate the "plane1D" parameter entity.


B Content Markup Validation Grammar

 >>>>
[4]     Char     ::=      Space | [#x21 - #xFFFD] | [#x00010000 - 
#x7FFFFFFFF]  /* valid XML chars */
 >>>>

This production is clearly wrong, and needs to be fixed.
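
For comparison, here is a Python sketch of the Char production from XML 1.0.
Measuring the quoted production against XML 1.0 is an assumption made here;
the review itself does not spell out the defect. Under that reading, the
quoted upper bound lies far beyond the last Unicode code point, and the
range #x21 - #xFFFD includes the surrogate block that XML excludes.

```python
# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
UNICODE_MAX = 0x10FFFF


def is_xml_char(cp: int) -> bool:
    """Return True if the code point is matched by XML 1.0's Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= UNICODE_MAX
    )


# Upper bound exactly as it appears in the quoted production.
quoted_upper_bound = 0x7FFFFFFFF

print(quoted_upper_bound > UNICODE_MAX)  # True
print(is_xml_char(0xD800))               # False (surrogate code point)
print(is_xml_char(0x1D504))              # True  (Mathematical Fraktur A)
```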


XML Schema, at http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd
(several files):


<?xml version="1.0" encoding="UTF-8"?>
...
<xs:annotation>
   <xs:documentation>
   This is an XML Schema for MathML.
   Author: St&#233;phane Dalmas, INRIA.
   </xs:documentation>
</xs:annotation>

"&#233;": If this is UTF-8, then please use UTF-8 (i.e. the character
itself rather than a character reference).

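A brief Python sketch (the doc element is a hypothetical stand-in for the
schema's documentation element) shows that in a UTF-8 document the literal
character and the numeric character reference parse to the same text, so the
reference is not needed:

```python
import xml.etree.ElementTree as ET

# Same content twice: once with a numeric character reference,
# once with the character written directly and encoded as UTF-8.
with_reference = b'<?xml version="1.0" encoding="UTF-8"?><doc>St&#233;phane</doc>'
with_literal = (
    '<?xml version="1.0" encoding="UTF-8"?><doc>St\u00e9phane</doc>'
).encode("utf-8")

text_from_reference = ET.fromstring(with_reference).text
text_from_literal = ET.fromstring(with_literal).text

print(text_from_reference == text_from_literal)  # True
print("\u00e9".encode("utf-8").hex())            # c3a9
```
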

Regards,    Martin.
Received on Wednesday, 7 May 2003 18:05:50 UTC