Re: "XML Entity Definitions for Characters" Last Call Draft from David Carlisle on 2009-12-06 (www-math@w3.org from December 2009)

From: David Carlisle <davidc@nag.co.uk>
Date: Sun, 6 Dec 2009 15:20:17 GMT
To: duerst@it.aoyama.ac.jp
Cc: www-math@w3.org, member-i18n-core@w3.org
Message-Id: <200912061520.nB6FKHrK032322@edinburgh.nag.co.uk>

Martin,

Thanks again for your comments on the last call draft of
XML Entity Definitions for Characters.


The last call draft is at the URI:

http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/

An Editors' draft showing the changes made in response to LC comments so
far is available at the URI:

http://www.w3.org/2003/entities/2007doc/Overview.html


I hope we have addressed all the points that you have raised.  As you
will know, the W3C process requires that we log the resolution of every
last call comment, so we would appreciate it if you could confirm via an
email to www-math list whether all the points you have raised have been
addressed satisfactorily.

David




> 
>    Now for the comments themselves:
> 
>    Title: "XML Entity definitions for Characters" looks very ambigous. I 
>    think something like "XML Entity Definitions for Characters used by 
>    MathML" or so would help the general public a lot to understand the 
>    context and coverage of the document.


Although parts of this document were derived from the MathML2 spec
sources, this is explicitly _not_ just for MathML. It includes several
entity sets that are not included in the MathML DTD (isogrk1, isogrk2,
isogrk4, xhtml1-lat1, xhtml1-special, xhtml1-symbol, html5-uppercase). So
as well as being used for MathML it can be used for HTML (HTML5 uses
these definitions, for example) and serves as an update for the (now
cancelled) ISO/IEC document 9573-13 defining the ISO entity sets. It
was, for example, cited in the DocBook documentation for use with DocBook
(now that DocBook 5 is defined in RELAX NG and does not have its own set
of entity definitions). Thus it is important that the title does not
mention MathML, as it is explicitly not just for MathML.


> 
>    abstract: "This document defines several sets of names which are 
>    assigned to Unicode characters. Each of these sets is also implemented 
>    as a file of XML entity declarations.":
>    First, this says that the names are the main stuff, and the XML entities 
>    are just an implementation detail. This is a contradiction to the title, 
>    where XML entities are the main thing.

The statement you quote is factually true; however, we have reworded
it to remove the implied relative importance of the different
aspects.

>    Second, "sets of names which are assigned to Unicode characters" is 
>    unclear as to whether a set of names is assigned to a Unicode 
>    character, or something else. The same problem is present elsewhere 
>    (e.g. first sentence of the Introduction)

This has been reworded to clarify this.


>    Third, all Unicode characters have official names (e.g. LATIN CAPITAL 
>    LETTER A for U+0041). These are a very important part of nailing down 
>    the identity of a character. It would be good if either the abstract or 
>    the Introduction or both would make clear that what you are dealing with 
>    are short mnemotic names that are different from the official Unicode names.

A comment pointing this out has been added to the introduction.

>    Fourth, names being *assigned* to Unicode characters doesn't sound 
>    right. This may be a programmer's viewpoint, but what you are doing, in 
>    terms of an average programmig language, is to assign Unicode 
>    codepoints/characters to entity names, not the other way round. XML 
>    entities in this sense are not much different from variables in a 
>    programming language, so it would help a lot to keep things straight.
> 

It is of course possible to view this mapping in either direction,
and in fact the mappings are implemented in both directions, by the XML
entity files and the XSLT character maps respectively. Because the map is
many-to-many, these are not exact inverses. However, as you say, it is
probably clearer to use the wording of assigning codepoints to names
rather than the other way round, and the document has been edited
accordingly wherever it used "assigned".
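
The point about the two directions not being exact inverses can be
illustrated with a small sketch. It assumes a three-entry table chosen
only for illustration, and a hypothetical "first name wins" rule for the
reverse direction; the helper names are invented here, and this is not
how the published entity files or character maps are generated.

```python
# Illustrative subset only: entity name -> string of one or more characters.
ENTITY_TO_CHARS = {
    "half": "\u00BD",        # VULGAR FRACTION ONE HALF
    "frac12": "\u00BD",      # a second name for the same character
    "nvlt": "\u003C\u20D2",  # LESS-THAN SIGN + COMBINING LONG VERTICAL LINE OVERLAY
}


def build_character_map(entity_to_chars):
    """Invert the table, keeping the first name seen for each single character.

    Assumption: a character map keys on one character, so entries that
    expand to a sequence of characters have no reverse mapping.
    """
    char_to_entity = {}
    for name, chars in entity_to_chars.items():
        if len(chars) == 1:
            char_to_entity.setdefault(chars, name)
    return char_to_entity


CHAR_TO_ENTITY = build_character_map(ENTITY_TO_CHARS)


def round_trip(name):
    """Expand an entity name, then map the result back to a name (or None)."""
    return CHAR_TO_ENTITY.get(ENTITY_TO_CHARS[name])
```

Under these assumptions "frac12" does not survive a round trip (it comes
back as "half"), and "nvlt" has no reverse mapping at all.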

> 
>    Introduction:
>    "The W3C Math Working Group has been invited to take over the 
>    maintenance and development of these sets by the original standards 
>    committee (ISO/IECJTC1 SC34).": It should say somewhere that this 
>    document is the result of this "taking over".
> 
Well, historically the document began before SC34 considered updating
9573-13, and a long time before they decided to cancel that project.
Informally, they cancelled the project because this set was being more
actively maintained and, although I was editing both documents, I couldn't
keep to SC34 timescales as I couldn't get ahead of MathML3 and HTML5.
However, we shouldn't speculate on the reasons behind the SC34
decision in the W3C REC track document.

> 
>    There should be a section on Notation, which explains things such as U+ 
>    and leading slashes (is that TEX?).
> 

It's pseudo-TeX, used (without explanation) in the original ISO standard.
The original ISO entity definitions gave only those descriptions (and no
Unicode mappings), and the job really is to match those to Unicode in the
most sane way possible, subject to compatibility constraints. So I don't
want to change the entity description texts in any way, as they are the
reference point for comparison to the ISO standards.

> 
>    Tables:
>    http://www.w3.org/2003/entities/2007doc/bycodes.html:
>    - Instead of U00009 and the like, please use the official U+0009 
>    notation, and do not use a hyphen for character sequences, as this may 
>    look like a character range.

We have revised the document to use U+ notation consistently. The U12345
ID form is now used only for internal linking and for filenames, not
for referring to codepoints in the text or tables.
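
As a rough sketch of the two notations: the ID form below is inferred
from the examples quoted in this thread (U00009, U003FF), so the
five-digit padding is an assumption, and separating the members of a
sequence with a space rather than a hyphen simply follows the request
above. The function names are hypothetical.

```python
def u_plus(codepoint):
    """Standard U+ notation: uppercase hex, at least four digits."""
    return f"U+{codepoint:04X}"


def id_form(codepoint):
    """ID/filename form; assumed here to be 'U' plus five hex digits."""
    return f"U{codepoint:05X}"


def u_plus_sequence(codepoints):
    """A character sequence, space-separated so it cannot be read as a range."""
    return " ".join(u_plus(cp) for cp in codepoints)
```

For example, u_plus(0x9) gives "U+0009" while id_form(0x9) gives "U00009".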


>    - Use a <table> so that this displays decently even with 
>    non-proportional fonts (you can then eliminate the ugly commas). There 
>    are lots of cases where <table> is misused in Web pages, but this is 
>    clearly a case where it is "misunused" or "misnonused" or whatever one 
>    would call the absence of the use of a feature when such use is clearly 
>    warranted.
>    - Use proper table headings
>    - For character sequences, use e.g. "LESS-THAN SIGN with COMBINING LONG 
>    VERTICAL LINE OVERLAY" rather than "LESS-THAN SIGN with vertical line"
> 

There were explicit requests from developers (when this table was in
MathML2) for an ASCII file that could easily be tested against code;
the format that developed, with the monospace layout but including some
hyperlinking, is a compromise.


>    http://www.w3.org/2003/entities/2007doc/byalpha.html:
>    - Similar comments as for bycodes.html
>    - I don't understand why this table contains the origins/collections, 
>    but bycodes.html doesn't.
>    - I don't understand the lowercase stuff at the end of each line. It 
>    seems to be some kind of annotations, but in some cases is totally 
>    useless (e.g. [LATIN SMALL LETTER A WITH CIRCUMFLEX], latin capital 
>    letter A with circumflex)

The final field is the original ISO entity description. If it looks the
same as the Unicode formal name then that is good; it isn't superfluous:
it is confirmation that the entity has been paired with the right
Unicode character. We note again that the original ISO entity definitions
_only_ gave those lower case descriptions, not any Unicode mapping.
However, the order of the columns has now been changed so that this
entity description now comes after the entity name, with the Unicode
codepoint and formal name being the last two columns. Also, information
has been added to the top of the file explaining what is in each
column.



>    - This table puts the official Unicode names in "[" and "]", but 
>    bycodes.html doesn't. Why? There should be no such gratutious differences.

Accepted as an editorial improvement.  Also the order of the columns has
been changed to put the entity description after the entity name rather
than after the Unicode formal name, and a paragraph describing the
column format has been added at the start of the page.


>    http://www.w3.org/2003/entities/2007doc/000.html and similar:
>    Please add a note to all the pages with lots of small glyphs that it may 
>    take time to load all the images to see all the glyphs. (one test run 
>    with Mozilla Firebug took 37 seconds on a broadband connection).

A suitable warning note has been added.


>    Please use a stable, final location for all these GIFs. It's okay to 
>    have an occasional "301 Moved Permanently" for a page, but it 
>    essentially doubles the number of objects your page has to download from 
>    256 to 512. Even the former isn't pretty, the later is definitely bad 
>    and totally unnecessary. (the redirects come from URIs of the form 
>    http://www.w3.org/2003/entities/glyphs/003/U003FF.png, the actual images 
>    seem to be at places such as 
>    http://www.w3.org/2003/entities/2007doc/glyphs/003/U003FF.png)

You happened to review the document while it was in transition, and the
redirects were put in place to keep everything working. Current builds
directly reference the new location of the png images, and the redirects
would only be used if someone has linked to the old locations.



>    Codepoints U+0000 through U+0010 (with three exceptions) are shown as 
>    "Unicode or XML Non-Character". They are valid control characters in 
>    Unicode.

Yes, they are valid in Unicode but not in XML 1.0, hence "Unicode or XML";
but see below.


>     Strangely enough, there are also such cases (red background 
>    color) in the U+1D4xx and U+1D5xx 'blocks'. A codepoint such as U+1D53F 
>    is simply <reserved> in Unicode, the Unicode consortium could decide to 
>    allocate a character there in the future. This is no different at all 
>    from all the characters that you marked with a yellow background. The 
>    only codepoints that are actually non-characters in Unicode are cases 
>    such as U+FFFF and the like, but you don't have any of these. I 
>    therefore suggest that the red backgrounds in the U+1D4xx and U+1D5xx 
>    'blocks' have to be turned to yellow, and the text for the red 
>    background should be changed to "Characters not representable in XML 
>    1.0" or some such (most of them would be representable in XML 1.1).

All except U+0000 would be representable in XML 1.1 as numeric references, I think.
XML 1.1 came out after that text was written...
We don't want to mark the reserved "holes" in the 1Dxxx blocks the
same as completely unallocated codepoints.
The various cases are now separately distinguished (codepoint not usable
in XML 1.0, reserved codepoint in plane 1, unallocated codepoint); these
have been given different CSS classes and colours, and the key on each
table identifies the cases that occur on that page.


> 
>    For codepoints with a yellow background, the legend says "XML Character 
>    not currently described in Unicode". The term "XML Character" is really 
>    strange. XML uses Unicode, there are no "XML Characters".

"XML Characters" is intended to mean something matching the XML char
production, that is, a character usable as character data in XML,
which is a bit less than full unicode range as you know. However
the legend has been reworded as noted in the previous comment.
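
For reference, the Char productions of XML 1.0 and XML 1.1 can be
written out as simple range checks. The sketch below is only an
illustration of the distinction being drawn; the classify helper and its
labels are hypothetical, are not the wording used in the tables, and
depend on the Unicode version shipped with the Python interpreter.

```python
import unicodedata


def is_xml10_char(cp):
    """True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp):
    """True if cp matches the XML 1.1 Char production.

    XML 1.1 admits U+0001 onwards, although most C0 controls may only
    appear as numeric character references.
    """
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def classify(cp):
    """Rough three-way split; the labels are illustrative only."""
    if not is_xml10_char(cp):
        return "not usable in XML 1.0"
    # General category Cn covers reserved codepoints (and noncharacters).
    if unicodedata.category(chr(cp)) == "Cn":
        return "no character assigned"
    return "assigned"
```

So U+0001 fails the XML 1.0 check but passes the XML 1.1 one, whereas a
reserved codepoint such as U+1D53F matches Char even though Unicode
assigns no character to it.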


>     The cells with 
>    yellow backgrounds represent unassigned (reserved) Unicode codepoints. 
>    So the best legend would be "reserved Unicode codepoint (no character 
>    currently assigned)" or something similar.

Looking at it from a Unicode viewpoint it makes sense to say it's a
codepoint to which no character is currently assigned. But looking at it
from an XML viewpoint it _is_ a character (or more exactly it
corresponds to well-formed character data matching the Char production)
but Unicode has not assigned any interpretation for that character.
As noted above, the tables now distinguish more cases, separating out the
control characters (not usable directly in XML) from the reserved codepoints.



> 
>    Putting the "Next" link above the "Previous" link at the top and bottom 
>    of these tables seems counterintuitive, because the overall flow is from 
>    top to bottom.
> 

The ordering was inconsistent; we have now consistently ordered these
links as suggested.

> 
>    For http://www.w3.org/2003/entities/2007doc/double-struck.html and similar:
> 
>    Why do some rows have a yellow background? There's no explanation, so 
>    the reader is left guessing.

They highlight the cases that are in the BMP rather than in the
(possibly?) expected runs in the 1Dxxx block. This was explained in a
note at the start of the section (in the overview document); however, we
have added an additional footnote at the bottom of each affected page.
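
To make the highlighted cases concrete, here is a sketch for the
double-struck capitals. The codepoints are taken from the Unicode
character database; the function names are hypothetical, and the
"highlighted" test merely mirrors the description above rather than the
code that actually builds the pages.

```python
# Double-struck capitals that live in the BMP (Letterlike Symbols) and so
# leave reserved "holes" in the run starting at U+1D538.
DOUBLE_STRUCK_BMP = {
    "C": 0x2102, "H": 0x210D, "N": 0x2115, "P": 0x2119,
    "Q": 0x211A, "R": 0x211D, "Z": 0x2124,
}
DOUBLE_STRUCK_CAPITAL_A = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def double_struck_capital(letter):
    """Return the double-struck form of an ASCII capital letter."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single ASCII capital letter")
    if letter in DOUBLE_STRUCK_BMP:
        return chr(DOUBLE_STRUCK_BMP[letter])
    return chr(DOUBLE_STRUCK_CAPITAL_A + ord(letter) - ord("A"))


def is_highlighted(letter):
    """True when the double-struck form is in the BMP, not the 1Dxxx run."""
    return ord(double_struck_capital(letter)) <= 0xFFFF
```

Here double_struck_capital("C") is U+2102 (highlighted), while
double_struck_capital("A") is U+1D538 (not highlighted).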


> 
>    Why do some of these characters not have any corresponding entity names 
>    at all?

Because, as stated explicitly in the introduction, this specification
doesn't define any new names, it only allocates unicode code points to
names previously thought up by ISO or the W3C.

> 
> 
>    Section 3:
> 
>    Title: An "Unicode Character Block": As you can see from 
>    http://unicode.org/Public/UNIDATA/Blocks.txt, Unicode blocks are not of 
>    equal size of 256 characters, and are not all alligned on boundaries 
>    divisible by 256. But the reader can easily get such an impression. The 
>    title, or the text below it, should be changed to reflect this, unless 
>    (which would be more appropriate for the document (see next comment), 
>    but may be difficult in terms of production costs) actual Unicode blocks 
>    are used.
> 

Yes, in the table of contents all the block names that occur in the 256
square are listed, with "(continued)" added when the blocks run over.
The section title has been changed to use "Ranges" rather than "Blocks"
to avoid any impression that the 256 squares are Blocks.

>    I don't understand why Arabic presentation forms are (as indicated by 
>    the yellow background) available in the STIX fonts, when basic Arabic 
>    isn't. Turning things around, would a font for Math or Science have to 
>    support these? The sentence "The following tables display Unicode ranges 
>    containing the characters that are most used in mathematics." at the 
>    start of section 3 seems to suggest so.

Given the list of blocks most used in science/mathematics (e.g. as listed
in Unicode Technical Report 25), every 256-aligned range that covers those
blocks is listed, which means that some additional characters are shown in
the tables. The exact details of the Arabic support are somewhat in flux, as
there are Unicode proposals to add variant forms (in a similar manner to
the variants for Latin and Greek in 1D4xx and 1D5xx) and, as for the
Latin/Greek cases, there is some discussion as to whether existing
variant letters in the BMP should be reused.


> 
>    Turning things around: Are these tables for all the 256-character-sized, 
>    aligned parts that contain one or more of the characters for which 
>    entities have been defined in this document? If yes, please say so. If 
>    no, please say what the differences are.
> 
As above; they are tables for all the 256-character-sized, aligned parts
that contain a math/science related block as listed in Unicode TR 25.
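
The rule amounts to taking each block's first and last codepoint and
listing every 256-aligned range in between. A minimal sketch follows;
the three blocks are real Unicode blocks picked only as examples (not
the full TR 25 list), the three-hex-digit labels are inferred from page
names such as 000.html, and the function names are hypothetical.

```python
# Example blocks only: name -> (first codepoint, last codepoint).
EXAMPLE_BLOCKS = {
    "Letterlike Symbols": (0x2100, 0x214F),
    "Mathematical Operators": (0x2200, 0x22FF),
    "Mathematical Alphanumeric Symbols": (0x1D400, 0x1D7FF),
}


def pages_for_block(first, last):
    """Labels of the 256-aligned ranges that cover first..last inclusive."""
    return [f"{page:03X}" for page in range(first >> 8, (last >> 8) + 1)]


def pages_for_blocks(blocks):
    """Union of the ranges needed for all blocks, in codepoint order."""
    pages = set()
    for first, last in blocks.values():
        pages.update(pages_for_block(first, last))
    return sorted(pages, key=lambda label: int(label, 16))
```

A block shorter than 256 codepoints, such as Letterlike Symbols, still
pulls in its whole range (021), which is why characters from
neighbouring blocks also appear in the tables.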



> 
>    Section 5, first sentence: "there are some that use multiple character 
>    combinations": "multiple character combinations" is "multiple 
>    combinations of characters". However, characters are used in sequences, 
>    not in combinations. So "a sequence of multiple characters" or so would 
>    be better.
> 

OK, change made.

 
> 
> 
>    Editorial:
> 
>    - Please change 'definitions' to 'Definitions' it the title, or adopt 
>    any other W3C approved consistent casing convention. That such an 
>    inconsistency is 'traditional for this document' shouldn't be a reason 
>    to keep it.

Agreed, d changed to D.

> 
>    Section 1, first sentence: "especially in scientific documents, 
>    especially in mathematics": Repetition; unclear about the relationship 
>    between the two clauses introduced by 'especially'.

Agreed, this has been reworded.


>    Section 1, second sentence: "has grown in part because its notation 
>    continually changes": I suggest changing "changes" to "changed" to align 
>    the tenses.
> 

The tense of "changes" is intentional here. The evolution is still
in progress.



>    Section 1, first paragarph: "It is difficult to write science fluently" 
>    -> "It is difficult to write scientific texts fluently"; same later for 
>    "read science".


This has been reworded.

> 
>    Section 3, first sentence: "Certain characters are of of particular 
>    relevance": "of of" -> "of"

The spurious "of" has been deleted.


> 
>    Section 5, first sentence: "character, however" -> "character. However" 
>    or "character, but" (however starts a new sentence)
> 
I think "however" is being used as a conjunction there rather than starting
a new sentence; however, the phrase has been reworded.

Thanks again for the comments,


Received on Sunday, 6 December 2009 15:21:05 UTC