Re: "XML Entity Definitions for Characters" Last Call Draft from David Carlisle on 2009-11-17 (www-math@w3.org from November 2009)

From: David Carlisle <davidc@nag.co.uk>
Date: Tue, 17 Nov 2009 12:35:28 GMT
To: duerst@it.aoyama.ac.jp
Cc: www-math@w3.org, member-i18n-core@w3.org
Message-Id: <200911171235.nAHCZSfo030303@edinburgh.nag.co.uk>
Martin,

Thanks for your comments on the entities draft.
I've changed the CC list and will handle them as LC comments (as the LC
draft publication is imminent)

These are _personal_ first impressions, not a formal response to the
comments.

David



> 
>    Now for the comments themselves:
> 
>    Title: "XML Entity definitions for Characters" looks very ambiguous. I 
>    think something like "XML Entity Definitions for Characters used by 
>    MathML" or so would help the general public a lot to understand the 
>    context and coverage of the document.


Although parts of this document were derived from the MathML2 spec
sources, this is explicitly _not_ just for MathML. It includes several
entity sets that are not included in the MathML DTD (isogrk1, isogrk2,
isogrk4, xhtml1-lat1, xhtml1-special, xhtml1-symbol, html5-uppercase). So
as well as being used for MathML it can be used for HTML (HTML5 uses
these definitions for example) and serves as an update for the (now
cancelled) ISO/IEC document 9573-13 defining the ISO entity sets. It
was for example cited in the docbook documentation for use with docbook
(now that docbook5 is relaxNG defined and does not have its own set of
entity definitions). Thus I think it important that the title does not
mention MathML.


> 
>    abstract: "This document defines several sets of names which are 
>    assigned to Unicode characters. Each of these sets is also implemented 
>    as a file of XML entity declarations.":
>    First, this says that the names are the main stuff, and the XML entities 
>    are just an implementation detail. This is a contradiction to the title, 
>    where XML entities are the main thing.

The statement you quote is factually true, but we can probably reword it
a bit to remove the implied relative importance of the different
aspects. I assume it is the word "also" that troubles you?

>    Second, "sets of names which are assigned to Unicode characters" is 
>    unclear as to whether a set of names is assigned to a Unicode 
>    character, or something else. The same problem is present elsewhere 
>    (e.g. first sentence of the Introduction)

Yes, we can reword that to be more clear.


>    Third, all Unicode characters have official names (e.g. LATIN CAPITAL 
>    LETTER A for U+0041). These are a very important part of nailing down 
>    the identity of a character. It would be good if either the abstract or 
>    the Introduction or both would make clear that what you are dealing with 
>    are short mnemonic names that are different from the official Unicode names.
>    Fourth, names being *assigned* to Unicode characters doesn't sound 
>    right. This may be a programmer's viewpoint, but what you are doing, in 
>    terms of an average programming language, is to assign Unicode 
>    codepoints/characters to entity names, not the other way round. XML 
>    entities in this sense are not much different from variables in a 
>    programming language, so it would help a lot to keep things straight.
> 
Yes, I think I agree with this as well. I'd need to look at
exactly what wording changes this would imply, but I'm sure we can make
this clearer editorially.

> 
>    Introduction:
>    "The W3C Math Working Group has been invited to take over the 
>    maintenance and development of these sets by the original standards 
>    committee (ISO/IECJTC1 SC34).": It should say somewhere that this 
>    document is the result of this "taking over".
> 
Well, historically the document began before SC34 considered updating
9573-13 and a long time before they decided to cancel that project.
Informally, they cancelled the project because this set was being more
actively maintained, and although I was editing both documents I couldn't
keep to SC34 timescales as I couldn't get ahead of mathml3 and html5.
However, we shouldn't speculate on the reasons behind the SC34
decision in the W3C REC track document.

> 
>    There should be a section on Notation, which explains things such as U+ 
>    and leading slashes (is that TEX?).
> 

It's pseudo TeX used (without explanation) in the original ISO standard.
The original ISO entity definitions only gave those descriptions (and no
unicode mappings) and the job really is to match those to unicode in the
most sane way possible subject to compatibility constraints. So I don't
want to change the entity description texts in any way as they are the
reference point for comparison to the ISO standards. However, if that
isn't clear we should say that somewhere in the document.

> 
>    Tables:
>    http://www.w3.org/2003/entities/2007doc/bycodes.html:
>    - Instead of U00009 and the like, please use the official U+0009 
>    notation, and do not use a hyphen for character sequences, as this may 
>    look like a character range.

Yes, the U00009 notation is the internal ID format used for cross linking.
In virtually all places in the text I think we now use U+1234 notation;
however, the internal IDs are still showing up in the byalpha and bycodes
lists. Will fix.
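
As a rough illustration of that change, here is a minimal sketch of
turning an internal ID into U+ notation. The function name is made up,
and the hyphen-joined form for sequences is an assumption based on the
comment above, not taken from the actual production scripts.

```python
def internal_id_to_uplus(internal_id: str) -> str:
    """Convert an internal ID such as 'U00009' to 'U+0009'.

    Assumes (hypothetically) that character sequences are written as
    hyphen-joined IDs, and renders them space-separated so they cannot
    be mistaken for a range.
    """
    parts = []
    for part in internal_id.split("-"):
        if not part.startswith("U"):
            raise ValueError(f"not an internal ID: {part!r}")
        code_point = int(part[1:], 16)
        # U+ notation uses at least four hex digits, more only when needed.
        parts.append(f"U+{code_point:04X}")
    return " ".join(parts)


print(internal_id_to_uplus("U00009"))
print(internal_id_to_uplus("U0003C-U020D2"))
```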

>    - Use a <table> so that this displays decently even with 
>    non-proportional fonts (you can then eliminate the ugly commas). There 
>    are lots of cases where <table> is misused in Web pages, but this is 
>    clearly a case where it is "misunused" or "misnonused" or whatever one 
>    would call the absence of the use of a feature when such use is clearly 
>    warranted.
>    - Use proper table headings
>    - For character sequences, use e.g. "LESS-THAN SIGN with COMBINING LONG 
>    VERTICAL LINE OVERLAY" rather than "LESS-THAN SIGN with vertical line"
> 

There were explicit requests from developers (when this table was in
MathML2) for an ascii file that could easily be tested against code.
The format that developed, with the monospace layout but including some
hyperlinking, is a compromise.


>    http://www.w3.org/2003/entities/2007doc/byalpha.html:
>    - Similar comments as for bycodes.html
>    - I don't understand why this table contains the origins/collections, 
>    but bycodes.html doesn't.
>    - I don't understand the lowercase stuff at the end of each line. It 
>    seems to be some kind of annotations, but in some cases is totally 
>    useless (e.g. [LATIN SMALL LETTER A WITH CIRCUMFLEX], latin capital 
>    letter A with circumflex)

The final field is the original ISO entity description. If it looks the
same as the unicode formal name then that is good; it isn't superfluous:
it is confirmation that the entity has been paired with the right
unicode character. I note again that the original ISO entity definitions
_only_ gave those lower case descriptions not any unicode mapping.
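
A minimal sketch of that kind of confirmation check, assuming the
description is compared case-insensitively against the formal name.
The function name is hypothetical, and the description strings are
sample inputs modelled on the one quoted above, not read from the
entity files.

```python
import unicodedata


def description_confirms_name(char: str, description: str) -> bool:
    """True if the description matches the Unicode formal name of char,
    ignoring case and surrounding whitespace."""
    return unicodedata.name(char).casefold() == description.strip().casefold()


# U+00E2 is LATIN SMALL LETTER A WITH CIRCUMFLEX.
print(description_confirms_name("\u00e2", "latin small letter a with circumflex"))
# A description naming the capital letter would flag a bad pairing.
print(description_confirms_name("\u00e2", "latin capital letter A with circumflex"))
```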


>    - This table puts the official Unicode names in "[" and "]", but 
>    bycodes.html doesn't. Why? There should be no such gratuitous differences.

Accepted as an editorial improvement. 


>    http://www.w3.org/2003/entities/2007doc/000.html and similar:
>    Please add a note to all the pages with lots of small glyphs that it may 
>    take time to load all the images to see all the glyphs. (one test run 
>    with Mozilla Firebug took 37 seconds on a broadband connection).

Yes OK (although people seem to be less patient than they used to be,
the first time I looked at those tables in a mathml draft over an
internet dialup connection it took more like an hour:-)


>    Please use a stable, final location for all these GIFs. It's okay to 
>    have an occasional "301 Moved Permanently" for a page, but it 
>    essentially doubles the number of objects your page has to download from 
>    256 to 512. Even the former isn't pretty, the latter is definitely bad 
>    and totally unnecessary. (the redirects come from URIs of the form 
>    http://www.w3.org/2003/entities/glyphs/003/U003FF.png, the actual images 
>    seem to be at places such as 
>    http://www.w3.org/2003/entities/2007doc/glyphs/003/U003FF.png)

Those redirects were added within the last few days at the suggestion
of the w3c website team. Previously the pngs were copied, but copying
large numbers of binary files in a cvs tree isn't very nice. More recent
versions of the editors draft link directly to the new place rather than
through the redirect from the old place, and the version in TR space
_will_ have copies of the glyphs in the appropriate directory and not
have redirects. So the redirects are hopefully just a temporary
artifact.



>    Codepoints U+0000 through U+0010 (with three exceptions) are shown as 
>    "Unicode or XML Non-Character". They are valid control characters in 
>    Unicode.

Yes, they are valid in unicode but not in XML 1.0, hence "Unicode or XML",
but see below.


>     Strangely enough, there are also such cases (red background 
>    color) in the U+1D4xx and U+1D5xx 'blocks'. A codepoint such as U+1D53F 
>    is simply <reserved> in Unicode, the Unicode consortium could decide to 
>    allocate a character there in the future. This is no different at all 
>    from all the characters that you marked with a yellow background. The 
>    only codepoints that are actually non-characters in Unicode are cases 
>    such as U+FFFF and the like, but you don't have any of these. I 
>    therefore suggest that the red backgrounds in the U+1D4xx and U+1D5xx 
>    'blocks' have to be turned to yellow, and the text for the red 
>    background should be changed to "Characters not representable in XML 
>    1.0" or some such (most of them would be representable in XML 1.1).

All except 0000 would be representable in xml 1.1 as numeric references I think.
XML 1.1 came out after that text was written...
I don't want to mark the reserved "holes" in the 1Dxxx blocks the same
as the completely unallocated points but I agree that colouring them the
same as the control points may not be the best. I'll just split the
cases up and mark them all individually and have separate entries in the
legend for the different cases.
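
To make the separate cases concrete, here is a rough sketch of how the
code points could be sorted into legend entries. The labels and the
function name are hypothetical, and the "reserved" result depends on
the Unicode version behind Python's unicodedata module. It does not
distinguish the holes in the 1Dxxx blocks from other unassigned
points; that would need an extra list.

```python
import unicodedata


def legend_case(cp: int) -> str:
    """Return a (hypothetical) legend label for code point cp."""
    # Unicode noncharacters: U+FDD0..U+FDEF plus the last two code
    # points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return "control"
    if category == "Cn":
        return "reserved"
    if category == "Cs":
        return "surrogate"
    if category == "Co":
        return "private use"
    return "assigned"


for cp in (0x0001, 0x0041, 0x1D53E, 0x1D53F, 0xFFFF):
    print(f"U+{cp:04X}", legend_case(cp))
```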

> 
>    For codepoints with a yellow background, the legend says "XML Character 
>    not currently described in Unicode". The term "XML Character" is really 
>    strange. XML uses Unicode, there are no "XML Characters".

"XML Characters" is intended to mean something matching the XML char
production, that is, a character usable as character data in XML,
which is a bit less than full unicode range as you know. We can
re-word to make this clearer.
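
For reference, a small sketch of the Char productions as range checks.
The ranges are those of the XML 1.0 and XML 1.1 Char productions; the
function names are just for illustration.

```python
def is_xml10_char(cp: int) -> bool:
    """Code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """Code point matches the XML 1.1 Char production.

    Most C0 controls are allowed here, but only as character
    references (they are RestrictedChar); U+0000 is still excluded.
    """
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# U+0001 is a valid Unicode control, not an XML 1.0 Char, but an XML 1.1 Char.
print(is_xml10_char(0x1), is_xml11_char(0x1))
# U+1D53F is unassigned in Unicode yet still matches Char.
print(is_xml10_char(0x1D53F))
```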

>     The cells with 
>    yellow backgrounds represent unassigned (reserved) Unicode codepoints. 
>    So the best legend would be "reserved Unicode codepoint (no character 
>    currently assigned)" or something similar.

Looking at it from a unicode viewpoint it makes sense to say it's a
codepoint to which no character is currently assigned. But looking at it
from an xml viewpoint it _is_ a character (or more exactly it
corresponds to well formed character data matching the char production)
but unicode has not assigned any interpretation for that character.
I'll see if we can come up with some better wording.


> 
>    Putting the "Next" link above the "Previous" link at the top and bottom 
>    of these tables seems counterintuitive, because the overall flow is from 
>    top to bottom.
> 
> 
>    For http://www.w3.org/2003/entities/2007doc/double-struck.html and similar:
> 
>    Why do some rows have a yellow background? There's no explanation, so 
>    the reader is left guessing.

They are highlighting the cases that are in the BMP not in the
(possibly?) expected runs in the 1Dxxx block. Will add a note at the
bottom of the page that says this.


> 
>    Why do some of these characters not have any corresponding entity names 
>    at all?

Because, as stated explicitly in the introduction, this specification
doesn't define any new names, it only allocates unicode code points to
names previously thought up by ISO or the W3C.

> 
> 
>    Section 3:
> 
>    Title: An "Unicode Character Block": As you can see from 
>    http://unicode.org/Public/UNIDATA/Blocks.txt, Unicode blocks are not of 
>    equal size of 256 characters, and are not all aligned on boundaries 
>    divisible by 256. But the reader can easily get such an impression. The 
>    title, or the text below it, should be changed to reflect this, unless 
>    (which would be more appropriate for the document (see next comment), 
>    but may be difficult in terms of production costs) actual Unicode blocks 
>    are used.
> 

Yes in the table of contents I have put all the block names that occur
in the 256 square and added (continued) when the blocks run over, but I
agree the section title might lead one to think that "block" meant the
256-aligned ranges. We should think of a better title.

>    I don't understand why Arabic presentation forms are (as indicated by 
>    the yellow background) available in the STIX fonts, when basic Arabic 
>    isn't. Turning things around, would a font for Math or Science have to 
>    support these? The sentence "The following tables display Unicode ranges 
>    containing the characters that are most used in mathematics." at the 
>    start of section 3 seems to suggest so.

Given the list of blocks most used in science/mathematics (e.g. as listed
in unicode report 25) I list every 256-aligned range that covers those,
which means that some additional characters are shown in the tables.
That doesn't do any harm and is, I think, clearer than not showing
them. They are all valid characters after all. The exact details of the
Arabic support are somewhat in flux as there are unicode proposals to
add variant forms (in a similar manner to the variants for latin and greek
in 1d4xx and 1d5xx) and as for the latin/greek cases there is some
discussion as to whether existing variant letters in the BMP should be
reused.


> 
>    Turning things around: Are these tables for all the 256-character-sized, 
>    aligned parts that contain one or more of the characters for which 
>    entities have been defined in this document? If yes, please say so. If 
>    no, please say what the differences are.
> 
As above; they are tables for all the 256-character-sized, aligned parts
that contain a math/science related block as listed in unicode tr 25.
We can certainly say that explicitly if it makes it clearer.
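
Roughly, the selection works like the sketch below. The three blocks
are only samples (with ranges as in Blocks.txt), not the actual list
from TR 25, and the three-digit hex page labels are an assumption
modelled on names such as 000.html, not the real build code.

```python
def aligned_pages(first: int, last: int) -> list[str]:
    """Labels of the 256-aligned ranges that overlap first..last."""
    return [f"{page:03X}" for page in range(first >> 8, (last >> 8) + 1)]


# Sample blocks only; the real list would come from TR 25.
SAMPLE_BLOCKS = {
    "Letterlike Symbols": (0x2100, 0x214F),
    "Mathematical Operators": (0x2200, 0x22FF),
    "Mathematical Alphanumeric Symbols": (0x1D400, 0x1D7FF),
}


def pages_for(blocks: dict[str, tuple[int, int]]) -> list[str]:
    """All 256-aligned pages needed to cover every block, in code point order."""
    pages = set()
    for first, last in blocks.values():
        pages.update(aligned_pages(first, last))
    return sorted(pages, key=lambda label: int(label, 16))


print(pages_for(SAMPLE_BLOCKS))
```

Page 021 is pulled in by Letterlike Symbols alone, yet it also displays
everything else in U+2100 to U+21FF, which is how additional characters
end up in the tables.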


> 
>    Section 5, first sentence: "there are some that use multiple character 
>    combinations": "multiple character combinations" is "multiple 
>    combinations of characters". However, characters are used in sequences, 
>    not in combinations. So "a sequence of multiple characters" or so would 
>    be better.
> 

OK
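
For anyone wanting to see such a sequence, here is a small sketch using
Python's html.entities.html5 table, which holds the HTML5 named
character references. Whether a given Python version matches the draft
entry for entry is an assumption, and the helper name is made up.

```python
import unicodedata
from html.entities import html5


def describe_entity(name: str) -> list[str]:
    """List the code points (with Unicode names) that an entity expands to.

    Raises KeyError if the name is not in Python's HTML5 table.
    """
    text = html5[name + ";"]
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# A single-character entity, then one that expands to a two-character sequence.
print(describe_entity("lt"))
print(describe_entity("nvlt"))
```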

 
> 
> 
>    Editorial:
> 
>    - Please change 'definitions' to 'Definitions' in the title, or adopt 
>    any other W3C approved consistent casing convention. That such an 
>    inconsistency is 'traditional for this document' shouldn't be a reason 
>    to keep it.

agreed

> 
>    Section 1, first sentence: "especially in scientific documents, 
>    especially in mathematics": Repetition; unclear about the relationship 
>    between the two clauses introduced by 'especially'.

agreed 


>    Section 1, second sentence: "has grown in part because its notation 
>    continually changes": I suggest changing "changes" to "changed" to align 
>    the tenses.
> 

yes, or at least reword somehow to address this.

>    Section 1, first paragraph: "It is difficult to write science fluently" 
>    -> "It is difficult to write scientific texts fluently"; same later for 
>    "read science".

agreed I think
> 
>    Section 3, first sentence: "Certain characters are of of particular 
>    relevance": "of of" -> "of"

oops. will fix.

> 
>    Section 5, first sentence: "character, however" -> "character. However" 
>    or "character, but" (however starts a new sentence)
> 
I think "however" is being used as a conjunction there rather than starting
a new sentence, but if the phrase isn't clear we can reword to avoid any
ambiguity rather than argue about english grammar.


Thanks again for the comments,


David

________________________________________________________________________
The Numerical Algorithms Group Ltd is a company registered in England
and Wales with company number 1249803. The registered office is:
Wilkinson House, Jordan Hill Road, Oxford OX2 8DR, United Kingdom.

________________________________________________________________________
Received on Tuesday, 17 November 2009 12:38:51 UTC