- From: <bugzilla@jessica.w3.org>
- Date: Wed, 29 Sep 2010 18:47:01 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=10067

Ian 'Hixie' Hickson <ian@hixie.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |NEEDSINFO

--- Comment #5 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 08:28:32 UTC ---
How many named character names would the change add? Would the first two
letters of the additional names be evenly distributed? How long would the
additional expansions be in terms of a) UTF-16 code units and b) UTF-8 code
units? Are the names that aren't currently in HTML5 actually shown to be useful
for XML MathML authoring?

For the implementation in Gecko, it would be bad to introduce a large number of
names that shared the first two letters with commonly used named characters.
(Names starting with lt, gt, qu, nb or am would probably be the worst.) Also,
the implementation in Gecko now assumes the expansion is always one or two
UTF-16 code units.

I'd be the most OK with adding names whose first two letters don't collide with
pre-existing names and whose expansions aren't longer than two UTF-16 code
units. For other kinds of additional names, I'd be interested in the expected
benefit of the complication.
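One way to approach the prefix question from comment #5 is to bucket candidate names by their first two letters and flag any that land on the prefixes called out as the worst to collide with. The following is only a minimal sketch: the helper names are hypothetical, the sample is a handful of the names proposed in comment #6 rather than the full list, and treating the two-letter prefix as case-sensitive is an assumption, not a description of Gecko's actual lookup.

```python
from collections import Counter

# Prefixes that comment #5 names as the worst to collide with.
HOT_PREFIXES = {"lt", "gt", "qu", "nb", "am"}

# A small sample of the names proposed in comment #6 (not the full list).
SAMPLE_NAMES = [
    "nvlt", "bne", "nvgt", "fjlig", "ThickSpace",
    "NotEqualTilde", "NotHumpEqual", "nLt", "nGt",
]


def prefix_histogram(names):
    """Count how many names share each two-letter prefix (case-sensitive)."""
    return Counter(name[:2] for name in names)


def hot_collisions(names):
    """Return the names whose first two letters fall on a hot prefix."""
    return sorted(name for name in names if name[:2] in HOT_PREFIXES)


if __name__ == "__main__":
    for prefix, count in prefix_histogram(SAMPLE_NAMES).most_common():
        print(prefix, count)
    print("collisions with hot prefixes:", hot_collisions(SAMPLE_NAMES))
```

Running the same two helpers over the complete list of proposed and pre-existing names would show how crowded each two-letter bucket becomes.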
--- Comment #6 from Ian 'Hixie' Hickson <ian@hixie.ch> 2010-09-27 09:10:28 UTC ---
The character reference names and values would be:

name: nvlt; value: U0003C-020D2
name: bne; value: U0003D-020E5
name: nvgt; value: U0003E-020D2
name: fjlig; value: U00066-0006A
name: ThickSpace; value: U0205F-0200A
name: nrarrw; value: U0219D-00338
name: npart; value: U02202-00338
name: nang; value: U02220-020D2
name: caps; value: U02229-0FE00
name: cups; value: U0222A-0FE00
name: nvsim; value: U0223C-020D2
name: race; value: U0223D-00331
name: acE; value: U0223E-00333
name: NotEqualTilde; value: U02242-00338
name: nesim; value: U02242-00338
name: napid; value: U0224B-00338
name: nvap; value: U0224D-020D2
name: NotHumpDownHump; value: U0224E-00338
name: nbump; value: U0224E-00338
name: nbumpe; value: U0224F-00338
name: NotHumpEqual; value: U0224F-00338
name: nedot; value: U02250-00338
name: bnequiv; value: U02261-020E5
name: nvle; value: U02264-020D2
name: nvge; value: U02265-020D2
name: nlE; value: U02266-00338
name: nleqq; value: U02266-00338
name: ngE; value: U02267-00338
name: ngeqq; value: U02267-00338
name: NotGreaterFullEqual; value: U02267-00338
name: lvnE; value: U02268-0FE00
name: lvertneqq; value: U02268-0FE00
name: gvnE; value: U02269-0FE00
name: gvertneqq; value: U02269-0FE00
name: nLtv; value: U0226A-00338
name: NotLessLess; value: U0226A-00338
name: nLt; value: U0226A-020D2
name: nGtv; value: U0226B-00338
name: NotGreaterGreater; value: U0226B-00338
name: nGt; value: U0226B-020D2
name: NotSucceedsTilde; value: U0227F-00338
name: vnsub; value: U02282-020D2
name: nsubset; value: U02282-020D2
name: NotSubset; value: U02282-020D2
name: vnsup; value: U02283-020D2
name: nsupset; value: U02283-020D2
name: NotSuperset; value: U02283-020D2
name: vsubne; value: U0228A-0FE00
name: varsubsetneq; value: U0228A-0FE00
name: vsupne; value: U0228B-0FE00
name: varsupsetneq; value: U0228B-0FE00
name: NotSquareSubset; value: U0228F-00338
name: NotSquareSuperset; value: U02290-00338
name: sqcaps; value: U02293-0FE00
name: sqcups; value: U02294-0FE00
name: nvltrie; value: U022B4-020D2
name: nvrtrie; value: U022B5-020D2
name: nLl; value: U022D8-00338
name: nGg; value: U022D9-00338
name: lesg; value: U022DA-0FE00
name: gesl; value: U022DB-0FE00
name: notindot; value: U022F5-00338
name: notinE; value: U022F9-00338
name: nrarrc; value: U02933-00338
name: NotLeftTriangleBar; value: U029CF-00338
name: NotRightTriangleBar; value: U029D0-00338
name: ncongdot; value: U02A6D-00338
name: napE; value: U02A70-00338
name: nles; value: U02A7D-00338
name: NotLessSlantEqual; value: U02A7D-00338
name: nleqslant; value: U02A7D-00338
name: nges; value: U02A7E-00338
name: NotGreaterSlantEqual; value: U02A7E-00338
name: ngeqslant; value: U02A7E-00338
name: NotNestedLessLess; value: U02AA1-00338
name: NotNestedGreaterGreater; value: U02AA2-00338
name: smtes; value: U02AAC-0FE00
name: lates; value: U02AAD-0FE00
name: npre; value: U02AAF-00338
name: npreceq; value: U02AAF-00338
name: NotPrecedesEqual; value: U02AAF-00338
name: nsce; value: U02AB0-00338
name: nsucceq; value: U02AB0-00338
name: NotSucceedsEqual; value: U02AB0-00338
name: nsubE; value: U02AC5-00338
name: nsubseteqq; value: U02AC5-00338
name: nsupE; value: U02AC6-00338
name: nsupseteqq; value: U02AC6-00338
name: vsubnE; value: U02ACB-0FE00
name: varsubsetneqq; value: U02ACB-0FE00
name: vsupnE; value: U02ACC-0FE00
name: varsupsetneqq; value: U02ACC-0FE00
name: nparsl; value: U02AFD-020E5

--- Comment #7 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 12:12:42 UTC ---
(In reply to comment #6)
> The character reference names and values would be:
> name: nLt; value: U0226A-020D2

Given that these proposed names always expand to 2 BMP characters, they aren't
worse than the pre-existing astral characters in UTF-16. Also, it seems the
first two letters don't collide too badly with the first two letters of the
most common named characters, so that seems OK, too.
It also looks like the longest of the proposed names is substantially longer
than the longest existing name, so the need to buffer in case of mismatch at
the last possible point doesn't get substantially worse.

There are a couple of unfortunate characteristics, but I guess they aren't too
bad:

1) Many names start with "No". That is, the first two letters don't provide as
much uniqueness as one might hope. Anyway, chances are these names won't become
too popular on the Web scale, so it probably won't matter if these aren't
carefully optimized in Gecko.

2) &nLt; is 5 bytes in UTF-8 but its expansion is 6 bytes. This changes the
buffering nature of named characters when the buffers are in UTF-8: The output
buffer may have to be larger than the input buffer. However, this problem
already exists when U+0000 is turned into U+FFFD, so the worst case for UTF-8
is already worse (output 3 times the size of input) than what these new names
require (output 1.2 times the size of input).

I don't have immediate objections to the addition of these named characters.

However, the sheer size of the list is already rather excessive. I hope the
Math WG isn't planning on adding more names over time. If this list is just
going to grow and grow, maybe we should just say "no" now. OTOH, if there's a
promise that the list doesn't get bigger after this, I guess these additions
can be lived with.

--- Comment #8 from Henri Sivonen <hsivonen@iki.fi> 2010-09-27 12:15:24 UTC ---
s/is substantially/is NOT substantially/

--- Comment #9 from David Carlisle <davidc@nag.co.uk> 2010-09-27 13:24:35 UTC ---
(In reply to comment #7)
> I don't have immediate objections to the addition of these named characters.
>
> However, the sheer size of the list is already rather excessive. I hope the
> Math WG isn't planning on adding more names over time. If this list is just
> going to grow and grow, maybe we should just say "no" now. OTOH, if there's a
> promise that the list doesn't get bigger after this, I guess these additions
> can be lived with.

It's dangerous to predict the future but I can promise there is absolutely no
intention of ever extending this list. MathML3 added no new names, MathML2
added just 1 (I think), so all but asympeq come from MathML1 in 1998 (and the
vast majority of them come from the earlier ISO entity sets).

As I commented this morning in IRC (but it didn't make the log for some
reason), we have the (self-imposed) constraint that we never remove a name,
because if an XML document gets used with a catalog that switches in a newer
DTD, the entity would become undefined, so the entire document would be
rejected as not well-formed. HTML doesn't have the draconian error handling and
the names were not previously in HTML, so the pressures on you are slightly
different.

Some workflows (and my sanity) are probably helped if the lists are exactly the
same on the HTML and XML sides, but if the HTML entities are going to be a
subset, then (a) this should be mentioned somewhere in the html5 spec (and I'd
mention it and list the html5 ones in the editor's draft (at least) of the xml
entities spec) and (b) there are probably some other ones that you could drop
in addition to the multiple character ones, specifically

NegativeMediumSpace; U+0200B
NegativeThickSpace; U+0200B
NegativeThinSpace; U+0200B
NegativeVeryThinSpace; U+0200B

all expanding to zero width space. The only reason they are there is that they
were in MathML1 and were kept, as noted above. MathML1 used the private use
area for the majority of its characters, based on the STIX submission to
Unicode. These negative spaces were in the submission but were not accepted
into Unicode when the other math characters went into Unicode 3.1 and 3.2,
which left them with nowhere to go once we stopped using the private use area.
(Arguably they should have gone to the replacement character, but zero width
space had better behaviour in the systems of the time).

--- Comment #10 from Ian 'Hixie' Hickson <ian@hixie.ch> 2010-09-29 18:47:00 UTC ---
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:
http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments. Please keep an eye out for parts
of the spec that assume a character reference is one codepoint and let me know
of any I missed.

--
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
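The size figures discussed in comment #7 (two UTF-16 code units per expansion, 5 bytes in versus 6 bytes out for nLt, and the 3x worst case when U+0000 becomes U+FFFD) can be rechecked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the "U0226A-020D2" notation is read as two hexadecimal code points joined by a hyphen, and the reference is assumed to be written as "&nLt;" in the input.

```python
def parse_value(value):
    """Turn the thread's 'U0226A-020D2' notation into a Python string.

    Assumes a leading 'U' followed by hex code points separated by '-'.
    """
    if not value.startswith("U"):
        raise ValueError("expected a value like 'U0226A-020D2'")
    return "".join(chr(int(part, 16)) for part in value[1:].split("-"))


def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text):
    """Number of UTF-8 bytes needed for text."""
    return len(text.encode("utf-8"))


def utf8_growth(source, expansion):
    """Ratio of UTF-8 output size to UTF-8 input size for one replacement."""
    return utf8_bytes(expansion) / utf8_bytes(source)


if __name__ == "__main__":
    nlt = parse_value("U0226A-020D2")
    print("nLt UTF-16 code units:", utf16_units(nlt))
    print("nLt UTF-8 bytes:", utf8_bytes(nlt))
    print("growth for &nLt;:", utf8_growth("&nLt;", nlt))
    print("growth for U+0000 -> U+FFFD:", utf8_growth("\x00", "\ufffd"))
```

The same helpers can be mapped over every value in comment #6 to confirm that no expansion exceeds two UTF-16 code units.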
Received on Wednesday, 29 September 2010 18:47:05 UTC