[XML-Entities] Entity Definitions, STIX fonts and usage analysis

I was pleased to see the "XML Entity Definitions for Characters" document become a W3C Recommendation recently. The STIX fonts also seem to be nearing release. As both the W3C entity definitions and the STIX fonts are intended for use in scientific documents, I'd like scientific publishers to be able to combine the entity definitions, the STIX fonts, and CSS @font-face rules to publish scientific documents as HTML, using Unicode characters (rather than the current image replacements) for non-Latin characters.

I ran two analyses to help with this:

===========

a) An analysis of font support for the Unicode characters defined in w3centities-f.ent: 
http://alf.hubmed.org/2010/04/w3c-entity-definitions/w3c-entities-font-coverage.pdf

Each column shows the character in a particular font, if it exists. The STIXGeneral font shown is RC6.
(After the first page, the column headings are slightly misaligned, probably as a result of printing to PDF from Firefox on OS X, but it should still be clear which column corresponds to which font.)

Summary: there are about 6 characters in w3centities-f.ent that are missing from the STIXGeneral font, but in general it provides excellent coverage and the characters look good.
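
For anyone wanting to repeat this kind of coverage check, here is a minimal sketch in Python. It is not the code used to produce the PDF above. It assumes the entity file declares each entity as one or more hexadecimal character references (declarations in any other form, such as the doubly escaped one for amp, are skipped), and that the set of codepoints a font supports has already been obtained somehow, for example from the font's character map. The function names and the sample data are hypothetical.

```python
import re

# Matches declarations such as: <!ENTITY alpha "&#x003B1;" >
# Assumption: values are one or more hexadecimal character references.
ENTITY_RE = re.compile(r'<!ENTITY\s+(\S+)\s+"((?:&#x[0-9A-Fa-f]+;)+)"')
CHARREF_RE = re.compile(r'&#x([0-9A-Fa-f]+);')


def parse_entities(text):
    """Return {entity name: [codepoint, ...]} for each matching declaration."""
    entities = {}
    for name, value in ENTITY_RE.findall(text):
        entities[name] = [int(h, 16) for h in CHARREF_RE.findall(value)]
    return entities


def missing_from_font(entities, supported):
    """Names of entities with at least one codepoint not in `supported`."""
    return sorted(
        name for name, cps in entities.items()
        if not all(cp in supported for cp in cps)
    )


# Hypothetical sample input, in the style of an entity definition file.
sample = '''
<!ENTITY alpha "&#x003B1;" >
<!ENTITY acE "&#x0223E;&#x00333;" >
<!ENTITY amp "&#38;#38;" >
'''

entities = parse_entities(sample)
supported = {0x3B1, 0x223E}  # stand-in for a real font's codepoint set
print(missing_from_font(entities, supported))
```

An entity that expands to several codepoints is reported as missing if any one of them is absent from the font.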

===========

b) A count of usage of named entities in Nature Publishing Group's article XML: 
http://alf.hubmed.org/2010/04/w3c-entity-definitions/npg-named-entity-usage.pdf

The columns are: 
1) the number of times this named entity has been used; 
2) the entity name; 
3) the Unicode character rendered in the STIXGeneral font (if a mapping exists in w3centities-f.ent); 
4) the path that NPG uses for files corresponding to that named entity (included because it provides a bit of textual description of that character); 
5) the image currently used by NPG to represent that named entity when publishing HTML.
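
A count like the one in column 1 can be approximated with a short script. The sketch below is an illustration, not the script used for the PDF: it scans the raw XML text with a regular expression instead of an XML parser (a parser would expand or reject entities it has no definition for), so references inside comments or CDATA sections would be counted too. The function name and the sample string are hypothetical.

```python
import re
from collections import Counter

# A named entity reference: "&", a name starting with a letter, then ";".
# Numeric references such as &#x3B1; are deliberately not matched.
ENTITY_REF_RE = re.compile(r'&([A-Za-z][A-Za-z0-9._-]*);')


def count_named_entities(xml_text):
    """Count occurrences of each named entity reference in raw XML text."""
    return Counter(ENTITY_REF_RE.findall(xml_text))


# Hypothetical sample input.
sample = "<p>&alpha;-helix and &beta;-sheet; &alpha; again, &#x3B1; ignored</p>"
counts = count_named_entities(sample)

# Most frequently used entities first, as in the usage table.
for name, n in counts.most_common():
    print(n, name)
```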

Summary: there are several named entities that NPG uses frequently that aren't defined in w3centities-f.ent. We'll look at mapping those to the appropriate Unicode codepoints (most of them should be simple, as they're a Latin or Greek character with a circumflex, macron, dot, caron or tilde; others are mostly filled circles or boxes that will probably have to map to images). If anyone's already done this mapping, I'd be interested to see the results.
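
As a starting point for the accented-letter cases, here is a minimal sketch, assuming Unicode NFC normalization is an acceptable way to find a precomposed codepoint where one exists. The table of combining marks is my own assumption about which marks are meant (in particular, "dot" is taken to mean dot above), and the function names are hypothetical.

```python
import unicodedata

# Assumed combining marks for the accents mentioned above.
COMBINING = {
    "circumflex": "\u0302",
    "macron": "\u0304",
    "dot": "\u0307",    # assumption: dot above, not dot below
    "caron": "\u030C",
    "tilde": "\u0303",
}


def compose(base, accent):
    """Return base + accent, precomposed where Unicode defines such a form."""
    return unicodedata.normalize("NFC", base + COMBINING[accent])


def codepoints(s):
    """Format each character of s as U+XXXX."""
    return ["U+%04X" % ord(ch) for ch in s]


print(codepoints(compose("a", "circumflex")))  # single precomposed codepoint
print(codepoints(compose("x", "macron")))      # stays as base + combining mark
```

Where no precomposed character exists (x with macron, for instance), the result remains a base letter followed by a combining mark, so the entity would need to map to a two-codepoint sequence.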

===========

alf

Received on Wednesday, 21 April 2010 12:11:56 UTC