Re: New Editions of MathML3 and XML Entities specifications published. from David Carlisle on 2014-06-18 (www-math@w3.org from June 2014)

From: David Carlisle <davidc@nag.co.uk>
Date: Wed, 18 Jun 2014 14:34:39 +0100
To: Christian Lerch <christian.p.lerch@gmail.com>
CC: "www-math@w3.org" <www-math@w3.org>
Message-ID: <53A1956F.7040803@nag.co.uk>
On 18/06/2014 06:06, Christian Lerch wrote:
> I'm building a database of Unicode character properties from various 
> sources.
> I was also considering to import from W3C XML Entity Names unicode.xml 
> but had no luck in finding a description or spec for the meaning or 
> application of data items contained in this file.
> Some exemplary questions:
> What exactly does it mean for a character to have a @mode value of 
> e.g. "text" or a @type="other"?
> What are the applications for this information, what is it used for 
> (at W3C)?
> What are the elements  AMS, APS, ACS, AIP, etc used for (I can 
> understand the acronyms).
> To what version of Mathematica does element Wolfram refer to?
> etc
> Although there is a formal rnc schema for unicode.xml, there seems to 
> be no authoritative documentation of the data items in this file.
> Can anybody please help me with gathering these field definitions?
> Thank you in advance!
> Regards,
> Chris


The original file was put together by Sebastian in the 1990s while he
was working on his jadetex DSSSL processor, primarily to map Unicode
code points to TeX macros. Then it morphed into a version of Barbara's
data on math characters for the STIX consortium font proposal and math
character submissions. Then it became the source of the MathML entities.

So in recent years the entity mapping part has been the most actively
managed. I have tried to keep the other parts accurate as the STIX
characters moved from private use area slots in the original proposal
to their final Unicode positions, but as Fred Wang commented the other
day, the TeX mappings could probably do with some updating.

That said...

 > What exactly does it mean for a character to have a @mode value of e.g. "text"

mode was a classification of the TeX mapping: U+03B1 was mapped to
\alpha, which is a TeX math mode command that generates an error if
used in normal text, hence mode="math".

But really that shows up the 1980s origins of the TeX macro mappings,
which are primarily set up for English mathematical documents. If you
(or I) were updating this now as a mapping from Unicode to TeX, you
would not map Greek in this way. In fact it would probably make more
sense to assume XeTeX or LuaTeX, which just handle Unicode input
natively, so the need to name every character and map it to a specific
128 (or 256) glyph font encoding goes away.


 > or a @type="other"?

I think type was intended to summarise the Unicode character class, but
it is better now to use the category attribute of <unicodedata>.

So U+0001 is type="other", but more specifically
<unicodedata category="Cc"> (control character).
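
As a rough illustration only (this is not part of the unicode.xml
build), the same category values can be looked up with Python's
standard unicodedata module. The Unicode version it reports depends on
the Python release, and the variable names are just for this sketch.

```python
import unicodedata

# General category of U+0001: the precise value that type="other"
# only summarises.
cat_control = unicodedata.category("\u0001")

# For comparison, U+0021 EXCLAMATION MARK, which unicode.xml marks
# as type="punctuation".
cat_excl = unicodedata.category("!")

print(cat_control, cat_excl)
```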

 > What are the applications for this information, what is it used for (at W3C)?
It was originally used to build the MathML DTD and the character tables
in the MathML spec.
It is still used in parts of the build of the MathML spec (to implement
the choice between showing characters as entities or as numeric
character references, for example).
It's used by the scripts at
http://www.w3.org/2003/entities/2007doc/
to generate the tables in that spec, and the DTD, JSON, and XSLT files.
It is also used to derive the list of entities used in HTML(5) at the
WHATWG, from where they are pulled back into HTML5 at W3C.
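
To give a feel for the kind of derivation involved, here is a minimal
Python sketch that turns entity names and code points into DTD entity
declarations. It is not the actual build script: the input dictionary
and the entity_declaration helper are hypothetical, and real entity
files need extra care for characters such as & and <, which this
sketch ignores.

```python
# Hypothetical input: entity name -> code point, as could be read from
# the <entity id="..."> children of each <character> in unicode.xml.
entities = {"excl": 0x21, "alpha": 0x3B1}


def entity_declaration(name: str, code_point: int) -> str:
    """Build one DTD entity declaration using a hex character reference."""
    return f'<!ENTITY {name} "&#x{code_point:05X};">'


# One declaration per entity, sorted by name for stable output.
declarations = [entity_declaration(n, cp) for n, cp in sorted(entities.items())]
print("\n".join(declarations))
```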

 > What are the elements  AMS, APS, ACS, AIP, etc used for (I can understand the acronyms).
That comes from the STIX data, I believe: as part of the project plan
to build up a representative list of characters that should be in a
math font and submitted to Unicode, the private character lists of the
various publishers were pooled. In the case of the AMS, the names
correspond to the TeX macro names in amsmath/amssymb; in the case of
the other publishers they don't correspond to anything that is public
(or available to me :-) as far as I know.

 > To what version of Mathematica does element Wolfram refer to?
Whatever version was current at the start of the STIX project (so 15 or
so years ago :-).
I think they were just treated as a "publisher" and supplied a list of
named math characters.


So, to take one entry:


       <character
  id="U00021"
The Unicode code point, written as "U" plus five hex digits with no
"+", so that it is a legal XML id and sorts easily.

dec="33"
The same value in decimal rather than hex.

  mode="text"
indication of its use in TeX (probably obsolete)

  type="punctuation"
indication of Unicode character class (obsolete)
 >
          <unicodedata
           category="Po"
          combclass="0"
         bidi="ON"
        mirror="N"
       mathclass="N"/>
These correspond to the Unicode data fields.

0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;

Note that the UCD should be taken as authoritative here. unicode.xml is
intended to map additional information, such as TeX and entity names,
to the Unicode code points; it isn't intended to be a replacement for
the UCD data itself (although I believe it is accurate up to
Unicode 6.2).
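
A minimal Python sketch of how one might cross-check those attributes
against the UnicodeData.txt line quoted above, using the standard
unicodedata module as a stand-in for the UCD. The dictionary keys
simply reuse the unicode.xml attribute names for this illustration.
mathclass is left out because, as far as I know, it is not one of the
UnicodeData.txt fields.

```python
import unicodedata

# The UnicodeData.txt line for U+0021, as quoted in the message.
line = "0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;"
fields = line.split(";")

# Pick out the fields that the <unicodedata> attributes mirror.
ucd = {
    "code": fields[0],
    "name": fields[1],
    "category": fields[2],
    "combclass": fields[3],
    "bidi": fields[4],
    "mirror": fields[9],
}

# Compare each field with what Python's own copy of the UCD reports.
ch = chr(int(ucd["code"], 16))
checks = {
    "name": unicodedata.name(ch) == ucd["name"],
    "category": unicodedata.category(ch) == ucd["category"],
    "combclass": str(unicodedata.combining(ch)) == ucd["combclass"],
    "bidi": unicodedata.bidirectional(ch) == ucd["bidi"],
    "mirror": ("Y" if unicodedata.mirrored(ch) else "N") == ucd["mirror"],
}
print(checks)
```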




          <afii>EE35</afii>

The AFII glyph index. These were probably correct when Barbara and
Sebastian originally added them, and I've tried not to break them while
moving data around over the years, but I have no independent check of
this data.

          <latex>!</latex>
The LaTeX coding (possibly obsolete in general, as it assumes ASCII input).

          <entity id="excl" set="8879-isonum">
             <desc>=exclamation mark</desc>
          </entity>
          <entity id="excl" set="9573-2003-isonum">
             <desc>=exclamation mark</desc>
          </entity>

The entity descriptions (in ISO 8879 and ISO 9573 respectively), along
with text from the comments in those entity files.

          <font name="ptmr7t" pos="33"/>
A source for the character in some font (Times Roman here), assuming
8-bit TeX encodings.
Mostly this should be ignored in favour of a more modern Unicode-aware
font mapping.

          <operator-dictionary priority="810" form="postfix" lspace="1" rspace="0"/>
The default spacing attributes for these characters as used in MathML.
Appendix C of the MathML spec is just this data in HTML table form.

          <description unicode="1.1">EXCLAMATION MARK</description>
The Unicode name of the character and an indication of the Unicode
release in which it was added.
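
Putting the pieces together, here is a minimal Python sketch of reading
one such entry with the standard xml.etree.ElementTree module. The
SAMPLE string is reassembled from the fragments quoted above, with the
closing </character> tag assumed, so it is an illustration rather than
a verbatim copy of unicode.xml. The record layout is hypothetical too.

```python
import xml.etree.ElementTree as ET

# One <character> entry, reassembled from the fragments in this message.
SAMPLE = """
<character id="U00021" dec="33" mode="text" type="punctuation">
  <unicodedata category="Po" combclass="0" bidi="ON" mirror="N" mathclass="N"/>
  <afii>EE35</afii>
  <latex>!</latex>
  <entity id="excl" set="8879-isonum"><desc>=exclamation mark</desc></entity>
  <entity id="excl" set="9573-2003-isonum"><desc>=exclamation mark</desc></entity>
  <font name="ptmr7t" pos="33"/>
  <operator-dictionary priority="810" form="postfix" lspace="1" rspace="0"/>
  <description unicode="1.1">EXCLAMATION MARK</description>
</character>
"""

char = ET.fromstring(SAMPLE)

# The id is "U" plus hex digits; dec holds the same code point in decimal.
# This sketch only handles ids that name a single code point.
code_point = int(char.get("id")[1:], 16)
assert code_point == int(char.get("dec"))

# Collect the fields that are most likely to be useful in a database.
record = {
    "char": chr(code_point),
    "category": char.find("unicodedata").get("category"),
    "latex": char.findtext("latex"),
    "entities": sorted({e.get("id") for e in char.findall("entity")}),
    "operator": dict(char.find("operator-dictionary").attrib),
    "name": char.findtext("description"),
    "since": char.find("description").get("unicode"),
}
print(record)
```

Given the caveats above, it is probably safest to import the entity and
operator-dictionary fields from unicode.xml and take category, bidi and
similar properties from the UCD directly.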



David
Received on Wednesday, 18 June 2014 13:35:11 UTC