Re: Problem with LANG keyword from David Woolley on 2003-09-24 (www-html@w3.org from September 2003)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Wed, 24 Sep 2003 23:05:35 +0100 (BST)
To: www-html@w3.org
Message-Id: <200309242205.h8OM5Zm14204@djwhome.demon.co.uk>

> 
> Hmmm, nice, I did not think about that. So "&#...;" should actually
> be used for a very specific list of symbols.

&#....; always represents the ISO 10646 (loosely Unicode) code point.

In very old versions of HTML the permitted range was the initial 256
character subset, which is identical to ISO 8859/1.  Most of the control
characters and some other control-like characters are not allowed.  In
particular, although generated by certain common authoring tools, &#146;
and &#147; are control characters and are not permitted.
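
As a rough illustration of the point about &#146; and &#147;, the Python
sketch below checks how those two numbers are classified as ISO 10646 code
points. The helper name is_c1_control is made up for this example. The
comparison with Windows code page 1252 is an assumption about where the
authoring tools get these numbers from; it is not stated above.

```python
import unicodedata


def is_c1_control(code_point: int) -> bool:
    """Return True for the C1 control range U+0080..U+009F (hypothetical helper)."""
    return 0x80 <= code_point <= 0x9F


for n in (146, 147):
    as_code_point = chr(n)                # what &#n; denotes as a code point
    as_cp1252 = bytes([n]).decode("cp1252")  # assumption: what the tool meant
    print(
        n,
        unicodedata.category(as_code_point),  # "Cc" means control character
        is_c1_control(n),
        f"U+{ord(as_cp1252):04X}",
    )
```

If the assumption holds, the references the tool should have written are
&#8217; and &#8220;, the code points of the two curly quotation marks.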

The conceptual process is:

- if the character set is in the real HTTP content-type header, note that;
- otherwise, if the document appears to be in 16 bit Unicode or an ASCII
  superset, scan it for a meta for content type, and extract the 
  character set;
- if neither succeeds in extracting a character set, the document is in
  error, and here the spec contradicts itself by saying that the browser
  must not use a default but suggesting that it may use heuristics
  (to me a default is a heuristic);
- translate the whole document from the character set identified above into
  ISO 10646;
- parse it, including expanding any numeric entities;
- render it;
- convert the result into platform fonts that include the appropriate
  character, using CSS font hinting, but not so as to force a false
  encoding: specifying 5<span style="font-family: Symbol">m</span>V should
  produce five millivolts, not the five microvolts that are likely to
  appear on many browsers. Browsers that handle other fonts correctly are
  likely to deliberately misinterpret Symbol.
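
The first few steps of the process above can be sketched in Python roughly
as follows. This is only an illustration, not how any particular browser
does it. All function names are invented for the example, the meta scan
handles only ASCII supersets (not the 16 bit Unicode case), and numeric
references are expanded with a simple regular expression after decoding
rather than inside a real parser.

```python
import re
from typing import Optional

# Very loose pattern for <meta ... charset=...>; a real parser is stricter.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)
NUMERIC_REF = re.compile(r"&#([0-9]+);")


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Step 1: take the charset parameter of the HTTP Content-Type header."""
    if not content_type:
        return None
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'")
    return None


def charset_from_meta(body: bytes) -> Optional[str]:
    """Step 2: scan the start of an ASCII-superset document for a meta charset."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii") if match else None


def decode_document(body: bytes, content_type: Optional[str] = None) -> str:
    """Steps 1 to 5: find the charset, decode, then expand numeric references."""
    charset = charset_from_header(content_type) or charset_from_meta(body)
    if charset is None:
        # Step 3: no declared charset; this sketch refuses instead of guessing.
        raise ValueError("no character set declared")
    text = body.decode(charset)  # step 4: translate into Unicode code points
    # Step 5 (simplified): &#n; always means code point n, whatever the charset.
    return NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)
```

For example, decode_document(b"caf\xc3\xa9 &#181;V", "text/html; charset=utf-8")
decodes the bytes as UTF-8 and then turns &#181; into the micro sign, so the
numeric reference means the same thing regardless of the document's encoding.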

Received on Wednesday, 24 September 2003 18:12:28 UTC