[Bug 10067] this only lists entities whose replacement text is a single character, for example many of the negated operators, for example

http://www.w3.org/Bugs/Public/show_bug.cgi?id=10067

Ian 'Hixie' Hickson <ian@hixie.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |NEEDSINFO

--- Comment #5 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 08:28:32 UTC ---
How many named character names would the change add? Would the first two
letters of the additional names be evenly distributed? How long would the
additional expansions be in terms of a) UTF-16 code units and b) UTF-8 code
units? Are the names that aren't currently in HTML5 actually shown to be useful
for XML MathML authoring?

For the implementation in Gecko, it would be bad to introduce a large number of
names that shared the first two letters with commonly used named characters.
(Names starting with lt, gt, qu, nb or am would probably be the worst.) Also,
the implementation in Gecko now assumes the expansion is always one or two
UTF-16 code units.

I'd be the most OK with adding names whose first two letters don't collide with
pre-existing names and whose expansions aren't longer than two UTF-16 code
units. For other kinds of additional names, I'd be interested in the expected
benefit of the complication.
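
The two criteria above (no first-two-letter collision with a pre-existing
name, and an expansion of at most two UTF-16 code units) can be checked
mechanically. The following is only a rough sketch: the function names and the
short `existing` list are made up for illustration, and it does not reflect how
Gecko actually indexes named characters.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


def check_candidate(name: str, expansion: str, existing_names: list) -> tuple:
    """Return (prefix_collides, fits_two_units) for a proposed name.

    Hypothetical helper: it only mirrors the two criteria stated in the
    comment, nothing more.
    """
    existing_prefixes = {n[:2] for n in existing_names}
    prefix_collides = name[:2] in existing_prefixes
    fits_two_units = utf16_units(expansion) <= 2
    return prefix_collides, fits_two_units


# Illustrative input: the five common names whose prefixes are called out
# as the worst ones to collide with (lt, gt, qu, nb, am).
existing = ["lt", "gt", "quot", "nbsp", "amp"]

print(check_candidate("nvlt", "\u003C\u20D2", existing))   # (False, True)
print(check_candidate("nbump", "\u224E\u0338", existing))  # (True, True)
```

With a real list of existing names substituted for `existing`, the same check
would answer the distribution question for every proposed name.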

--- Comment #6 from Ian 'Hixie' Hickson <ian@hixie.ch> 2010-09-27 09:10:28 UTC ---
The character reference names and values would be:

 name: nvlt; value: U0003C-020D2
 name: bne; value: U0003D-020E5
 name: nvgt; value: U0003E-020D2
 name: fjlig; value: U00066-0006A
 name: ThickSpace; value: U0205F-0200A
 name: nrarrw; value: U0219D-00338
 name: npart; value: U02202-00338
 name: nang; value: U02220-020D2
 name: caps; value: U02229-0FE00
 name: cups; value: U0222A-0FE00
 name: nvsim; value: U0223C-020D2
 name: race; value: U0223D-00331
 name: acE; value: U0223E-00333
 name: NotEqualTilde; value: U02242-00338
 name: nesim; value: U02242-00338
 name: napid; value: U0224B-00338
 name: nvap; value: U0224D-020D2
 name: NotHumpDownHump; value: U0224E-00338
 name: nbump; value: U0224E-00338
 name: nbumpe; value: U0224F-00338
 name: NotHumpEqual; value: U0224F-00338
 name: nedot; value: U02250-00338
 name: bnequiv; value: U02261-020E5
 name: nvle; value: U02264-020D2
 name: nvge; value: U02265-020D2
 name: nlE; value: U02266-00338
 name: nleqq; value: U02266-00338
 name: ngE; value: U02267-00338
 name: ngeqq; value: U02267-00338
 name: NotGreaterFullEqual; value: U02267-00338
 name: lvnE; value: U02268-0FE00
 name: lvertneqq; value: U02268-0FE00
 name: gvnE; value: U02269-0FE00
 name: gvertneqq; value: U02269-0FE00
 name: nLtv; value: U0226A-00338
 name: NotLessLess; value: U0226A-00338
 name: nLt; value: U0226A-020D2
 name: nGtv; value: U0226B-00338
 name: NotGreaterGreater; value: U0226B-00338
 name: nGt; value: U0226B-020D2
 name: NotSucceedsTilde; value: U0227F-00338
 name: vnsub; value: U02282-020D2
 name: nsubset; value: U02282-020D2
 name: NotSubset; value: U02282-020D2
 name: vnsup; value: U02283-020D2
 name: nsupset; value: U02283-020D2
 name: NotSuperset; value: U02283-020D2
 name: vsubne; value: U0228A-0FE00
 name: varsubsetneq; value: U0228A-0FE00
 name: vsupne; value: U0228B-0FE00
 name: varsupsetneq; value: U0228B-0FE00
 name: NotSquareSubset; value: U0228F-00338
 name: NotSquareSuperset; value: U02290-00338
 name: sqcaps; value: U02293-0FE00
 name: sqcups; value: U02294-0FE00
 name: nvltrie; value: U022B4-020D2
 name: nvrtrie; value: U022B5-020D2
 name: nLl; value: U022D8-00338
 name: nGg; value: U022D9-00338
 name: lesg; value: U022DA-0FE00
 name: gesl; value: U022DB-0FE00
 name: notindot; value: U022F5-00338
 name: notinE; value: U022F9-00338
 name: nrarrc; value: U02933-00338
 name: NotLeftTriangleBar; value: U029CF-00338
 name: NotRightTriangleBar; value: U029D0-00338
 name: ncongdot; value: U02A6D-00338
 name: napE; value: U02A70-00338
 name: nles; value: U02A7D-00338
 name: NotLessSlantEqual; value: U02A7D-00338
 name: nleqslant; value: U02A7D-00338
 name: nges; value: U02A7E-00338
 name: NotGreaterSlantEqual; value: U02A7E-00338
 name: ngeqslant; value: U02A7E-00338
 name: NotNestedLessLess; value: U02AA1-00338
 name: NotNestedGreaterGreater; value: U02AA2-00338
 name: smtes; value: U02AAC-0FE00
 name: lates; value: U02AAD-0FE00
 name: npre; value: U02AAF-00338
 name: npreceq; value: U02AAF-00338
 name: NotPrecedesEqual; value: U02AAF-00338
 name: nsce; value: U02AB0-00338
 name: nsucceq; value: U02AB0-00338
 name: NotSucceedsEqual; value: U02AB0-00338
 name: nsubE; value: U02AC5-00338
 name: nsubseteqq; value: U02AC5-00338
 name: nsupE; value: U02AC6-00338
 name: nsupseteqq; value: U02AC6-00338
 name: vsubnE; value: U02ACB-0FE00
 name: varsubsetneqq; value: U02ACB-0FE00
 name: vsupnE; value: U02ACC-0FE00
 name: varsupsetneqq; value: U02ACC-0FE00
 name: nparsl; value: U02AFD-020E5
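
For anyone wanting to process the list above, each line can be turned into a
name and its two-character expansion. This is a minimal sketch that assumes the
"name: X; value: UXXXXX-YYYYY" layout used in this comment; `parse_entry` is a
hypothetical helper, not part of any spec tooling.

```python
def parse_entry(line: str) -> tuple[str, str]:
    """Parse ' name: nLt; value: U0226A-020D2' into (name, expansion)."""
    name_part, value_part = line.strip().split(";", 1)
    name = name_part.split(":", 1)[1].strip()
    value = value_part.split(":", 1)[1].strip()
    # Drop the leading 'U', then read each hyphen-separated hex code point.
    code_points = [int(h, 16) for h in value.removeprefix("U").split("-")]
    return name, "".join(chr(cp) for cp in code_points)


name, expansion = parse_entry(" name: nLt; value: U0226A-020D2")
print(name, [f"U+{ord(ch):04X}" for ch in expansion])
# nLt ['U+226A', 'U+20D2']
```

Requires Python 3.9 or later for `str.removeprefix`.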

--- Comment #7 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 12:12:42 UTC ---
(In reply to comment #6)
> The character reference names and values would be:

>  name: nLt; value: U0226A-020D2

Given that these proposed names always expand to 2 BMP characters, they aren't
worse than the pre-existing astral characters in UTF-16. Also, it seems the
first two letters don't collide too badly with the first two letters of the
most common named characters, so that seems OK, too. It also looks like the
longest of the proposed names is substantially longer than the longest existing
name, so the need to buffer in case of mismatch at the last possible point
doesn't get substantially worse.

There are a couple of unfortunate characteristics, but I guess they aren't too
bad:
 1) Many names start with "No". That is, the first two letters don't provide as
much uniqueness as one might hope. Anyway, chances are these names won't become
too popular on the Web scale, so it probably won't matter if these aren't
carefully optimized in Gecko.

 2) &nLt; is 5 bytes in UTF-8 but its expansion is 6 bytes. This changes the
buffering nature of named characters when the buffers are in UTF-8: The output
buffer may have to be larger than the input buffer. However, this problem
already exists when U+0000 is turned into U+FFFD, so the worst case for UTF-8
is already worse (output 3 times the size of input) than what these new names
require (output 1.2 times the size of input).
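
The byte counts in point 2 can be reproduced with a few lines of Python. This
is just an arithmetic check of the figures quoted above; `utf8_growth` is a
made-up helper name.

```python
def utf8_growth(source: str, expansion: str) -> float:
    """Ratio of output size to input size, both measured in UTF-8 bytes."""
    return len(expansion.encode("utf-8")) / len(source.encode("utf-8"))


# "&nLt;" is 5 bytes; U+226A U+20D2 is 3 + 3 = 6 bytes.
print(utf8_growth("&nLt;", "\u226A\u20D2"))  # 1.2

# U+0000 is 1 byte; its replacement U+FFFD is 3 bytes.
print(utf8_growth("\u0000", "\uFFFD"))  # 3.0
```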

I don't have immediate objections to the addition of these named characters.

However, the sheer size of the list is already rather excessive. I hope the
Math WG isn't planning on adding more names over time. If this list is just
going to grow and grow, maybe we should just say "no" now. OTOH, if there's a
promise that the list doesn't get bigger after this, I guess these additions
can be lived with.

--- Comment #8 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 12:15:24 UTC ---
s/is substantially/is NOT substantially/

--- Comment #9 from David Carlisle <davidc@nag.co.uk> 2010-09-27 13:24:35 UTC ---
(In reply to comment #7)
> I don't have immediate objections to the addition of these named characters.
> 
> However, the sheer size of the list is already rather excessive. I hope the
> Math WG isn't planning on adding more names over time. If this list is just
> going to grow and grow, maybe we should just say "no" now. OTOH, if there's a
> promise that the list doesn't get bigger after this, I guess these additions
> can be lived with.

It's dangerous to predict the future but I can promise there is absolutely no
intention of ever extending this list. MathML3 added no new names, MathML2
added just 1 (I think) so all but asympeq come from MathML1 in 1998 (and the
vast majority of them come from the earlier ISO entity sets).

As I commented this morning in IRC (but it didn't make the log for some reason),
we have the (self imposed) constraint that we never remove a name because if an
xml document gets used with a catalog that switches in a newer dtd the entity
would become undefined, so the entire document would be rejected as not well
formed.

HTML doesn't have the draconian error handling and the names were not
previously in html so the pressures on you are slightly different.

Some workflows (and my sanity) are probably helped if the lists are exactly the
same on the html and xml sides, but if the html entities are going to be a
subset, then (a) this should be mentioned somewhere in the html5 spec (and
I'd mention it and list the html5 ones in the editors draft (at least) of the
xml entities spec) and (b) there are probably some other ones that you could
drop in addition to the multiple character ones, specifically
NegativeMediumSpace;     U+0200B     &#8203; 
NegativeThickSpace;     U+0200B     &#8203; 
NegativeThinSpace;     U+0200B     &#8203; 
NegativeVeryThinSpace;     U+0200B
all expanding to zero width space. The only reason they are there is that
they were in mathml1 and kept as noted above.

MathML1 used the private use area for the majority of its characters, based on
the STIX submission to Unicode. These negative spaces were in the submission
but not accepted into Unicode when the other math characters went in in Unicode
3.1 and 3.2, which left them with nowhere to go to once we stopped using the
private use area. (Arguably they should have gone to the replacement character,
but zero width space had better behaviour in the systems of the time).

--- Comment #10 from Ian 'Hixie' Hickson <ian@hixie.ch> 2010-09-29 18:47:00 UTC ---
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.

Please keep an eye out for parts of the spec that assume a character reference
is one codepoint and let me know of any I missed.

Received on Wednesday, 29 September 2010 18:47:05 UTC