- From: John T. Whelan <whelan@physics.utah.edu>
- Date: Wed, 5 Aug 1998 08:51:53 -0600
- To: www-html@w3.org
Chris Simon writes:

>In the Welsh language, "y" and "w" are vowels and can both
>take a circumflex accent. Since Welsh *is* catered for by
>the LANG attributes ("cy" - "Cymraeg") it seems strange to
>be missing two characters from the HTML 4.0 spec.

>I would love to see 4 new character entities:
> &Ycirc;
> &ycirc;
> &Wcirc;
> &wcirc;

I have a similar wish list involving the Czech characters &rcaron;
and &ccaron;. All of these characters are part of the Unicode
standard [1], so they can be represented with decimal or hexadecimal
numeric character references. For instance, &ycirc; is currently
represented as &#375; or &#x177; (although the latter, hexadecimal,
reference will not be recognized as valid by the current generation
of validators, despite being legal).

I find character entities superior to numeric references, when the
former exist, since the fallback rendering on a browser which doesn't
recognize them is a lot easier to interpret. For instance,
"Dvo&rcaron;ak" reads more naturally than "Dvo&#345;ak" or
"Dvo&#x159;ak" and is not ambiguous like "Dvo?ak".

What's curious is that a few of the Latin Extended-A [2] Unicode
characters (namely, &OElig; &oelig; &scaron; &Scaron; and &Yuml;)
were included as character entity references [3] in the HTML 4.0
spec [4], while many others, which would seem to have obvious entity
names by analogy, such as &rcaron; and &wcirc;, were not. Was this
simply in the interest of not making the entity file too big, or
what? I've appended to this message a list of the LXA entities which
seem to have been "left out" and which have obvious names (e.g.,
&wcirc; by analogy with &acirc;, etc.; &amacr; is by analogy with the
spacing macron &macr;).

A reasonable feature to build into a future HTML spec (HTML 4.1 would
be nice) would be character entity references for all LXA Unicode
characters. In addition to the "indisputable" entities appended to
this message, that would require conventions for the following
diacritical marks:

  breve
      E.g., &Abreve; for &#258;=&#x102;, although &Abrev; would also
      be possible (hence the need for a convention!)
  ogonek
      E.g., &Aogon; for &#260;=&#x104;
  dot (above)
      Unfortunately, the use of &sdot; for &#8901; (mathematical dot
      operator) rules out the most likely choice. Perhaps &Cdoton;
      for &#266;=&#x10A;
  middle dot
      Perhaps, by analogy, &Ldotin; for &#319;=&#x13F;
  stroke
      E.g., &Dstrok; for &#272;=&#x110;
  preceding apostrophe
      (appearing only in &#329;=&#x149;)
  double acute
      appearing in e.g. &#336;=&#x150;

as well as for the single letters dotless i, kra, eng and long s,
which might be defined as

  Dec     Hex      Prop Char
  &#305;  &#x131;  &inodot; (or &i;)
  &#312;  &#x138;  &kra;
  &#330;  &#x14A;  &ENG;
  &#331;  &#x14B;  &eng;
  &#383;  &#x17F;  &ess; (or &s;)

Defining those accents and entities, and making the generalizations
proposed here, would make all Latin Extended-A Unicode characters
available as character entity references and not just numeric ones.
There are also a few caron-accented characters in Latin Extended-B
[5], but it seems reasonable to abide by Unicode's definition of the
most commonly used additional Latin entities.

John Whelan
whelan@iname.com
http://www.slack.net/~whelan/

References:
[1] http://charts.unicode.org/Unicode.charts/normal/Unicode.html
[2] http://charts.unicode.org/Unicode.charts/normal/U0100.html
[3] http://www.htmlhelp.com/reference/html40/entities/
[4] http://www.w3.org/TR/REC-html-40
[5] http://charts.unicode.org/Unicode.charts/normal/U0180.html

Appendix: Proposed Generalization of Latin Extended-A Character
Entities

  Dec     Hex      Prop Char     Dec     Hex      Prop Char
  &#256;  &#x100;  &Amacr;       &#325;  &#x145;  &Ncedil;
  &#257;  &#x101;  &amacr;       &#326;  &#x146;  &ncedil;
  &#262;  &#x106;  &Cacute;      &#327;  &#x147;  &Ncaron;
  &#263;  &#x107;  &cacute;      &#328;  &#x148;  &ncaron;
  &#264;  &#x108;  &Ccirc;       &#332;  &#x14C;  &Omacr;
  &#265;  &#x109;  &ccirc;       &#333;  &#x14D;  &omacr;
  &#268;  &#x10C;  &Ccaron;      &#340;  &#x154;  &Racute;
  &#269;  &#x10D;  &ccaron;      &#341;  &#x155;  &racute;
  &#270;  &#x10E;  &Dcaron;      &#342;  &#x156;  &Rcedil;
  &#271;  &#x10F;  &dcaron;      &#343;  &#x157;  &rcedil;
  &#274;  &#x112;  &Emacr;       &#344;  &#x158;  &Rcaron;
  &#275;  &#x113;  &emacr;       &#345;  &#x159;  &rcaron;
  &#282;  &#x11A;  &Ecaron;      &#346;  &#x15A;  &Sacute;
  &#283;  &#x11B;  &ecaron;      &#347;  &#x15B;  &sacute;
  &#284;  &#x11C;  &Gcirc;       &#348;  &#x15C;  &Scirc;
  &#285;  &#x11D;  &gcirc;       &#349;  &#x15D;  &scirc;
  &#290;  &#x122;  &Gcedil;      &#350;  &#x15E;  &Scedil;
  &#291;  &#x123;  &gcedil;      &#351;  &#x15F;  &scedil;
  &#292;  &#x124;  &Hcirc;       &#354;  &#x162;  &Tcedil;
  &#293;  &#x125;  &hcirc;       &#355;  &#x163;  &tcedil;
  &#296;  &#x128;  &Itilde;      &#356;  &#x164;  &Tcaron;
  &#297;  &#x129;  &itilde;      &#357;  &#x165;  &tcaron;
  &#298;  &#x12A;  &Imacr;       &#360;  &#x168;  &Utilde;
  &#299;  &#x12B;  &imacr;       &#361;  &#x169;  &utilde;
  &#306;  &#x132;  &IJlig;       &#362;  &#x16A;  &Umacr;
  &#307;  &#x133;  &ijlig;       &#363;  &#x16B;  &umacr;
  &#308;  &#x134;  &Jcirc;       &#366;  &#x16E;  &Uring;
  &#309;  &#x135;  &jcirc;       &#367;  &#x16F;  &uring;
  &#310;  &#x136;  &Kcedil;      &#372;  &#x174;  &Wcirc;
  &#311;  &#x137;  &kcedil;      &#373;  &#x175;  &wcirc;
  &#313;  &#x139;  &Lacute;      &#374;  &#x176;  &Ycirc;
  &#314;  &#x13A;  &lacute;      &#375;  &#x177;  &ycirc;
  &#315;  &#x13B;  &Lcedil;      &#377;  &#x179;  &Zacute;
  &#316;  &#x13C;  &lcedil;      &#378;  &#x17A;  &zacute;
  &#317;  &#x13D;  &Lcaron;      &#381;  &#x17D;  &Zcaron;
  &#318;  &#x13E;  &lcaron;      &#382;  &#x17E;  &zcaron;
  &#323;  &#x143;  &Nacute;
  &#324;  &#x144;  &nacute;
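
As a rough illustration of the "name by analogy" idea behind the appendix, here is a minimal Python sketch that derives the decimal reference, the hexadecimal reference, and a proposed entity name for a character from its Unicode name. The names ACCENT_SUFFIX, proposed_entity and reference_row are hypothetical, and the suffix table is an assumption that simply mirrors the suffixes used in the appendix; no specification defines it. Marks that would still need a convention (breve, ogonek, dot, stroke, double acute) are deliberately left out, so those characters get no proposed name.

```python
import unicodedata

# Assumed accent-to-suffix convention, mirroring the appendix above.
# Marks that still need a convention (breve, ogonek, ...) are omitted.
ACCENT_SUFFIX = {
    "MACRON": "macr",
    "ACUTE": "acute",
    "CIRCUMFLEX": "circ",
    "CARON": "caron",
    "CEDILLA": "cedil",
    "TILDE": "tilde",
    "RING ABOVE": "ring",
}


def proposed_entity(ch):
    """Return a proposed entity name such as 'wcirc', or None if the
    character does not fit the simple 'letter WITH accent' pattern."""
    # e.g. 'LATIN SMALL LETTER W WITH CIRCUMFLEX'
    parts = unicodedata.name(ch).split(" WITH ")
    if len(parts) != 2:
        return None
    head, accent = parts
    words = head.split()
    if len(words) != 4 or words[0] != "LATIN" or words[2] != "LETTER":
        return None
    suffix = ACCENT_SUFFIX.get(accent)
    letter = words[3]
    if suffix is None or len(letter) != 1:
        return None
    base = letter if words[1] == "CAPITAL" else letter.lower()
    return base + suffix


def reference_row(ch):
    """Return (decimal reference, hex reference, proposed entity or None)."""
    cp = ord(ch)
    name = proposed_entity(ch)
    entity = "&" + name + ";" if name is not None else None
    return ("&#%d;" % cp, "&#x%X;" % cp, entity)


if __name__ == "__main__":
    # Print a row for every Latin Extended-A character with an obvious name.
    for cp in range(0x100, 0x180):
        dec, hx, entity = reference_row(chr(cp))
        if entity is not None:
            print(dec, hx, entity)
```

Characters such as the IJ ligature, dotless i, kra, eng and long s have no "WITH accent" part in their Unicode names, so this sketch returns no name for them; they would have to be listed by hand.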
Received on Wednesday, 5 August 1998 10:51:36 UTC