- From: David Carlisle <davidc@nag.co.uk>
- Date: Wed, 18 Jun 2014 14:34:39 +0100
- To: Christian Lerch <christian.p.lerch@gmail.com>
- CC: "www-math@w3.org" <www-math@w3.org>
On 18/06/2014 06:06, Christian Lerch wrote:
> I'm building a database of Unicode character properties from various
> sources.
> I was also considering to import from W3C XML Entity Names unicode.xml
> but had no luck in finding a description or spec for the meaning or
> application of data items contained in this file.
> Some exemplary questions:
> What exactly does it mean for a character to have a @mode value of
> e.g. "text" or a @type="other"?
> What are the applications for this information, what is it used for
> (at W3C)?
> What are the elements AMS, APS, ACS, AIP, etc used for (I can
> understand the acronyms).
> To what version of Mathematica does element Wolfram refer to?
> etc
> Although there is a formal rnc schema for unicode.xml, there seems to
> be no authoritative documentation of the data items in this file.
> Can anybody please help me with gathering these field definitions?
> Thank you in advance!
> Regards,
> Chris

The original file was put together by Sebastian in the 1990s while he
was working on his jadetex DSSSL processor, primarily to map Unicode
code points to TeX macros. Then it morphed into a version of Barbara's
data on math characters for the STIX consortium font proposal and math
character submissions. Then it became the source of the MathML entities.

So in recent years the entity mapping part has been the most actively
managed, while I have tried to keep the other parts accurate as the STIX
characters moved from private use area slots in the original proposal to
their final Unicode positions. As Fred Wang commented the other day, the
TeX mappings could probably do with some updating.

That said...

> What exactly does it mean for a character to have a @mode value of
> e.g. "text"

mode was a classification of the TeX mapping, so U+03B1 was mapped to
\alpha, which is a TeX math mode command that generates an error if used
in normal text, hence the mode=math.
But really that shows up the 1980s origins of the TeX macro mappings,
which are primarily set up for English mathematical documents. If you
(or I) were updating this now for a mapping from Unicode to TeX, you
would not map Greek in this way; in fact it would probably make more
sense to assume XeTeX or LuaTeX, which just handle Unicode input
natively, so the need to name every character and map it to a specific
127 (or 255) glyph font encoding goes away.

> or a @type="other"?

I think type was intended to summarise the Unicode character class, but
it is better now to use the <unicodedata category= attribute, so U+0001
is type=other but more specifically unicodedata category="Cc" (control
character).

> What are the applications for this information, what is it used for
> (at W3C)?

It was originally used to build the MathML DTD and the character tables
in the MathML spec. It is still used in parts during the build of the
MathML spec (to implement the choice between showing characters as
entities or numeric data, for example). It's used by the scripts at
http://www.w3.org/2003/entities/2007doc/ to generate the tables in that
spec, and the DTD, JSON, and XSLT files. It is also used to derive the
list of entities used in HTML(5) at the WHATWG, from where they are
pulled back into HTML5 at W3C.

> What are the elements AMS, APS, ACS, AIP, etc used for (I can
> understand the acronyms).

That comes from the STIX data, I believe: as part of the project plan to
build up a representative list of characters that should be in a math
font and submitted to Unicode, the private character lists of the
various publishers were pooled. In the case of the AMS, the names
correspond to the TeX macro names in amsmath/amssymb; in the case of
other publishers they don't correspond to anything that is public (or
available to me:-) as far as I know.

> To what version of Mathematica does element Wolfram refer to?
Whatever version was current at the start of the STIX project (so 15 or
so years ago:-). I think they were just treated as a "publisher" and
supplied a list of named math characters.

So to take one entry:

  <character id="U00021"

The Unicode code point, written with 5 hex digits and no + so that it is
a legal XML id, and for ease of sorting.

  dec="33"

The same in decimal rather than hex.

  mode="text"

Indication of its use in TeX (probably obsolete).

  type="punctuation">

Indication of Unicode character class (obsolete).

  <unicodedata category="Po" combclass="0" bidi="ON" mirror="N" mathclass="N"/>

These correspond to the Unicode data fields:

  0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;

Note that the UCD should be taken as authoritative here. unicode.xml is
intended to map additional information such as TeX and entity names to
the Unicode code points; it isn't intended to be a replacement for the
UCD data itself (although I believe it is accurate up to Unicode 6.2).

  <afii>EE35</afii>

The AFII glyph index. These were probably correct when Barbara and
Sebastian added them originally, and I've tried not to break them while
moving data around over the years, but I have no independent check of
this data.

  <latex>!</latex>

The LaTeX coding (possibly obsolete in general, as it assumes ASCII
input).

  <entity id="excl" set="8879-isonum">
    <desc>=exclamation mark</desc>
  </entity>
  <entity id="excl" set="9573-2003-isonum">
    <desc>=exclamation mark</desc>
  </entity>

The entity definitions (in ISO 8879 and 9573 respectively), along with
text from the comments in those entity files.

  <font name="ptmr7t" pos="33"/>

A source for the character in some font (Times Roman here), assuming
8-bit TeX encodings. Mostly this should be ignored in favour of a more
modern Unicode-aware font mapping.

  <operator-dictionary priority="810" form="postfix" lspace="1" rspace="0"/>

The default spacing attributes for these characters as used in MathML.
Appendix C of the MathML spec is just this data in HTML table form.
  <description unicode="1.1">EXCLAMATION MARK</description>

The Unicode name of the character and an indication of the Unicode
release in which it was added.

David
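For anyone importing the file into a database, the entry walked through
above can be read along the following lines. This is only a minimal
sketch in Python, not part of the unicode.xml tooling: the SAMPLE string
is assembled from the fragments quoted in this message, the function
name read_character is made up for illustration, and the code assumes an
id consisting of "U" followed by the hex digits of a single code point.
Since the UCD is the authoritative source, the sketch cross-checks the
category and name against Python's bundled unicodedata module instead of
trusting the XML copy.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Sample entry assembled from the fragments quoted in the message above.
SAMPLE = """
<character id="U00021" dec="33" mode="text" type="punctuation">
  <unicodedata category="Po" combclass="0" bidi="ON" mirror="N" mathclass="N"/>
  <afii>EE35</afii>
  <latex>!</latex>
  <entity id="excl" set="8879-isonum"><desc>=exclamation mark</desc></entity>
  <entity id="excl" set="9573-2003-isonum"><desc>=exclamation mark</desc></entity>
  <font name="ptmr7t" pos="33"/>
  <operator-dictionary priority="810" form="postfix" lspace="1" rspace="0"/>
  <description unicode="1.1">EXCLAMATION MARK</description>
</character>
"""


def read_character(xml_text):
    """Hypothetical helper: pull the commonly useful fields from one entry.

    Assumption: the id is "U" followed by the hex digits of one code point.
    """
    elem = ET.fromstring(xml_text)
    code_point = int(elem.get("id")[1:], 16)  # drop the leading "U"
    ucd = elem.find("unicodedata")
    return {
        "code_point": code_point,
        "dec": int(elem.get("dec")),
        "category": ucd.get("category") if ucd is not None else None,
        "latex": elem.findtext("latex"),
        # The same entity name can appear once per entity set.
        "entities": sorted({e.get("id") for e in elem.findall("entity")}),
        "name": elem.findtext("description"),
    }


record = read_character(SAMPLE)
char = chr(record["code_point"])

# Cross-check against the UCD data shipped with the Python interpreter.
ucd_category = unicodedata.category(char)
ucd_name = unicodedata.name(char)
category_matches = record["category"] == ucd_category
name_matches = record["name"] == ucd_name
```

Entries whose id does not fit the single-code-point pattern would need
separate handling, and the mode and type attributes are left out on
purpose because they are described above as probably obsolete.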
Received on Wednesday, 18 June 2014 13:35:11 UTC