Re: New character entities

Chris Simon writes:

>In the Welsh language, "y" and "w" are vowels and can both 
>take a circumflex accent.  Since Welsh *is* catered for by 
>the LANG attributes ("cy" - "Cymraeg") it seems strange to 
>be missing two characters from the HTML 4.0 spec.

>I would love to see 4 new character entities:

>  &Ycirc;
>  &ycirc;
>  &Wcirc;
>  &wcirc;

	I have a similar wish list involving the Czech characters
&rcaron; and &ccaron;.  All of these characters are part of the
Unicode standard [1], so they can be represented with decimal or
hexadecimal numeric character references; for instance, &ycirc; is
currently represented as &#375; or &#x177; (although the latter,
hexadecimal, reference will not be recognized as valid by the current
generation of validators, despite being legal).
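
	The two numeric forms are just the same code point written in
base 10 and base 16.  A minimal sketch in Python (the function name
numeric_references is my own, chosen only for illustration):

```python
def numeric_references(ch):
    """Return the (decimal, hexadecimal) numeric character references for ch."""
    code_point = ord(ch)  # Unicode code point of the character
    return f"&#{code_point};", f"&#x{code_point:X};"


# U+0177 is y with circumflex, U+0159 is r with caron.
for ch in ("\u0177", "\u0159"):
    decimal, hexadecimal = numeric_references(ch)
    print(decimal, hexadecimal)
```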

	I find character entities superior to numeric references, when
the former exist, since the fallback rendering on a browser which
doesn't recognize them is a lot easier to interpret.  For instance,
"Dvo&rcaron;ak" reads more naturally than "Dvo&#345;ak" or
"Dvo&#x159;ak" and is not ambiguous like "Dvo?ak".

	What's curious is that a few of the Latin Extended-A [2]
Unicode characters (namely, &OElig;, &oelig;, &scaron;, &Scaron; and
&Yuml;) were included as character entity references [3] in the HTML
4.0 spec [4], while many others, which would seem to have obvious
entity names by analogy, such as &rcaron; and &wcirc;, were not.  Was
this simply in the interest of not making the entity file too big, or
what?  I've appended to this message a list of the LXA entities which
seem to have been "left out" and which have obvious names (e.g.,
&wcirc; by analogy to &acirc; etc.; &amacr; is by analogy with the
spacing macron &macr;).  A reasonable feature to build into a future
HTML spec (HTML
4.1 would be nice) would be character entity references for all LXA
Unicode characters.  In addition to the "undisputable" entities
appended to this message, that would require conventions for the
following diacritical marks:

breve
	E.g., &Abreve; for &#258;=&#x102;, although &Abrev; would also
be possible (hence the need for a convention!)

ogonek
	E.g., &Aogon; for &#260;=&#x104;

dot (above)
	Unfortunately, the use of &sdot; for &#8901; (mathematical dot
operator) rules out the most likely choice.  Perhaps &Cdoton; for
&#266;=&#x10A;

middle dot
	Perhaps, by analogy, &Ldotin; for &#319;=&#x13F;

stroke
	E.g., &Dstrok; for &#272;=&#x110;

preceding apostrophe
	(appearing only in &#329;=&#x149;)

double acute
	appearing in e.g. &#336;=&#x150;

as well as for the single letters dotless i, kra, eng and long s,
which might be defined as

Dec   	Hex    	Prop Char
&#305;	&#x131;	ı (or &i;)
&#312;	&#x138;	&kra;
&#330;	&#x14A;	&ENG;
&#331;	&#x14B;	&eng;
&#383;	&#x17F;	&ess; (or &s;)

Defining those accents and entities, and making the generalizations
proposed here, would make all Latin Extended-A Unicode characters
available as character entity references and not just numeric ones.
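
	To see how few Latin Extended-A characters currently have a
named reference, one can scan the block against a table of HTML 4
entity names.  The sketch below assumes Python's html.entities module,
whose codepoint2name dictionary is documented as mapping code points to
HTML 4 entity names; the helper name named_in_html4 is my own.

```python
from html.entities import codepoint2name  # HTML 4 entity names by code point

# Latin Extended-A occupies U+0100 through U+017F.
LATIN_EXTENDED_A = range(0x0100, 0x0180)


def named_in_html4(block=LATIN_EXTENDED_A):
    """Map each code point in block that has an HTML 4 entity name to that name."""
    return {cp: codepoint2name[cp] for cp in block if cp in codepoint2name}


existing = named_in_html4()
for cp, name in sorted(existing.items()):
    print(f"&#{cp};  &#x{cp:X};  &{name};")

# Everything else in the block can only be written as a numeric reference.
print(len(LATIN_EXTENDED_A) - len(existing), "code points without a name")
```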

	There are also a few caron-accented characters in Latin
Extended-B [5], but it seems reasonable to abide by Unicode's
definition of the most commonly used additional Latin entities.

					John Whelan
					whelan@iname.com
					http://www.slack.net/~whelan/

References:
[1] http://charts.unicode.org/Unicode.charts/normal/Unicode.html
[2] http://charts.unicode.org/Unicode.charts/normal/U0100.html
[3] http://www.htmlhelp.com/reference/html40/entities/
[4] http://www.w3.org/TR/REC-html-40
[5] http://charts.unicode.org/Unicode.charts/normal/U0180.html

Appendix: Proposed Generalization of Latin Extended-A Character Entities

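Dec     Hex      Proposed      Dec     Hex      Proposed
&#256;  &#x100;  &Amacr;       &#325;  &#x145;  &Ncedil;
&#257;  &#x101;  &amacr;       &#326;  &#x146;  &ncedil;
&#262;  &#x106;  &Cacute;      &#327;  &#x147;  &Ncaron;
&#263;  &#x107;  &cacute;      &#328;  &#x148;  &ncaron;
&#264;  &#x108;  &Ccirc;       &#332;  &#x14C;  &Omacr;
&#265;  &#x109;  &ccirc;       &#333;  &#x14D;  &omacr;
&#268;  &#x10C;  &Ccaron;      &#340;  &#x154;  &Racute;
&#269;  &#x10D;  &ccaron;      &#341;  &#x155;  &racute;
&#270;  &#x10E;  &Dcaron;      &#342;  &#x156;  &Rcedil;
&#271;  &#x10F;  &dcaron;      &#343;  &#x157;  &rcedil;
&#274;  &#x112;  &Emacr;       &#344;  &#x158;  &Rcaron;
&#275;  &#x113;  &emacr;       &#345;  &#x159;  &rcaron;
&#282;  &#x11A;  &Ecaron;      &#346;  &#x15A;  &Sacute;
&#283;  &#x11B;  &ecaron;      &#347;  &#x15B;  &sacute;
&#284;  &#x11C;  &Gcirc;       &#348;  &#x15C;  &Scirc;
&#285;  &#x11D;  &gcirc;       &#349;  &#x15D;  &scirc;
&#290;  &#x122;  &Gcedil;      &#350;  &#x15E;  &Scedil;
&#291;  &#x123;  &gcedil;      &#351;  &#x15F;  &scedil;
&#292;  &#x124;  &Hcirc;       &#354;  &#x162;  &Tcedil;
&#293;  &#x125;  &hcirc;       &#355;  &#x163;  &tcedil;
&#296;  &#x128;  &Itilde;      &#356;  &#x164;  &Tcaron;
&#297;  &#x129;  &itilde;      &#357;  &#x165;  &tcaron;
&#298;  &#x12A;  &Imacr;       &#360;  &#x168;  &Utilde;
&#299;  &#x12B;  &imacr;       &#361;  &#x169;  &utilde;
&#306;  &#x132;  &IJlig;       &#362;  &#x16A;  &Umacr;
&#307;  &#x133;  &ijlig;       &#363;  &#x16B;  &umacr;
&#308;  &#x134;  &Jcirc;       &#366;  &#x16E;  &Uring;
&#309;  &#x135;  &jcirc;       &#367;  &#x16F;  &uring;
&#310;  &#x136;  &Kcedil;      &#372;  &#x174;  &Wcirc;
&#311;  &#x137;  &kcedil;      &#373;  &#x175;  &wcirc;
&#313;  &#x139;  &Lacute;      &#374;  &#x176;  &Ycirc;
&#314;  &#x13A;  &lacute;      &#375;  &#x177;  &ycirc;
&#315;  &#x13B;  &Lcedil;      &#377;  &#x179;  &Zacute;
&#316;  &#x13C;  &lcedil;      &#378;  &#x17A;  &zacute;
&#317;  &#x13D;  &Lcaron;      &#381;  &#x17D;  &Zcaron;
&#318;  &#x13E;  &lcaron;      &#382;  &#x17E;  &zcaron;
&#323;  &#x143;  &Nacute;
&#324;  &#x144;  &nacute;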
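
	The "by analogy" rule behind this list is mechanical enough to
sketch in code: take the base letter, keep its case, and append the
accent suffix already used by an existing HTML 4.0 entity.  The sketch
below assumes Python's unicodedata module and reads the accent off the
Unicode character name; that is only one possible way to do it.  The
SUFFIXES table and the function name proposed_name are my own
illustrations, not part of any spec.  Ligatures and the diacritics that
still need a convention (breve, ogonek, dot, stroke, double acute) are
deliberately left unnamed.

```python
import unicodedata

# Accent suffixes borrowed from existing HTML 4.0 entity names
# (&acirc;, &scaron;, &eacute;, &ccedil;, &macr;, &atilde;, &aring;).
SUFFIXES = {
    "CIRCUMFLEX": "circ",
    "CARON": "caron",
    "ACUTE": "acute",
    "CEDILLA": "cedil",
    "MACRON": "macr",
    "TILDE": "tilde",
    "RING ABOVE": "ring",
}

CAPITAL = "LATIN CAPITAL LETTER "
SMALL = "LATIN SMALL LETTER "


def proposed_name(ch):
    """Return a name such as 'wcirc', or None when no obvious analogy exists."""
    name = unicodedata.name(ch, "")
    if name.startswith(CAPITAL):
        rest, is_upper = name[len(CAPITAL):], True
    elif name.startswith(SMALL):
        rest, is_upper = name[len(SMALL):], False
    else:
        return None  # e.g. the IJ ligatures
    letter, separator, accent = rest.partition(" WITH ")
    if not separator or len(letter) != 1 or accent not in SUFFIXES:
        return None  # kra, eng, long s, or an accent without a convention
    base = letter if is_upper else letter.lower()
    return base + SUFFIXES[accent]


# Print decimal reference, hexadecimal reference and derived name.
for cp in range(0x0100, 0x0180):
    derived = proposed_name(chr(cp))
    if derived is not None:
        print(f"&#{cp};  &#x{cp:X};  &{derived};")
```

The loop also prints Scaron and scaron, which HTML 4.0 already defines;
the rule reproducing existing names is a useful sanity check on it.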

Received on Wednesday, 5 August 1998 10:51:36 UTC