Re: HTML - i18n / NCR & charsets

Albert Lunde (Albert-Lunde@nwu.edu)
Tue, 26 Nov 1996 16:16:11 -0600 (CST)


Message-Id: <199611262216.AA284246571@casbah.acns.nwu.edu>
Subject: Re: HTML - i18n / NCR & charsets
To: Dirk.vanGulik@jrc.it (Dirk.vanGulik)
Date: Tue, 26 Nov 1996 16:16:11 -0600 (CST)
Cc: www-html@w3.org, dirkx@elec.jrc.it
In-Reply-To: <9611261813.AA08437@ jrc.it> from "Dirk.vanGulik" at Nov 26, 96 07:13:23 pm
From: Albert-Lunde@nwu.edu (Albert Lunde)

Dirk.vanGulik wrote:
> If HTML-i18n is to go ahead, without any signaling about the NCRs
> target charset change (i.e in Unicode rather than the announced
> charset); then IMHO this should at least be mentioned in the draft
> as it breaks existing, widespread, practice, which prior to this
> i18n draft could not be signalled as 'wrong' or 'illegal'.

The interpretation of numeric character references was specified
in the HTML 2.0 RFC (to lay the ground for internationalization).

From section 1.2.1. "Documents" of RFC 1866, the HTML 2.0 spec:

> * Its document character set includes [ISO-8859-1] and
> agrees with [ISO-10646]; that is, each code position listed
> in 13, "The HTML Coded Character Set" is included, and each
> code position in the document character set is mapped to the
> same character as [ISO-10646] designates for that code
> position.
>
>  NOTE - The document character set is somewhat
>  independent of the character encoding scheme used to
>  represent a document. For example, the `ISO-2022-JP'
>  character encoding scheme can be used for HTML
>  documents, since its repertoire is a subset of the
>  [ISO-10646] repertoire. The critical distinction is
>  that numeric character references agree with
>  [ISO-10646] regardless of how the document is
>  encoded.

From section 6.1. "The HTML Document Character Set":

>  NOTE - To support non-western writing systems, a larger character
>  repertoire will be specified in a future version of HTML. The
>  document character set will be [ISO-10646], or some subset that
>  agrees with [ISO-10646]; in particular, all numeric character
>  references must use code positions assigned by [ISO-10646].
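To make the distinction concrete, here is a rough sketch in Python
(not taken from the RFC or from any browser). The function names and
the sample bytes are hypothetical, chosen only for illustration, and
only decimal references of the form &#NNN; are handled. The first
function resolves references against [ISO-10646] code positions, as
the text quoted above requires; the second mimics the practice of
treating the number as a byte in the announced charset.

```python
import re

# Matches decimal numeric character references such as "&#233;".
NCR = re.compile(r"&#([0-9]+);")


def resolve_ncrs(raw: bytes, encoding: str) -> str:
    """Resolve NCRs as ISO 10646 code positions (the RFC 1866 reading).

    The encoding is used only to decode the bytes of the document;
    it has no influence on what the number in a reference means.
    """
    text = raw.decode(encoding)
    return NCR.sub(lambda m: chr(int(m.group(1))), text)


def resolve_ncrs_in_charset(raw: bytes, encoding: str) -> str:
    """Illustrative only: treat the number as a byte in the announced
    charset. This sketch works for single-byte charsets and values
    below 256; anything else raises an exception.
    """
    text = raw.decode(encoding)
    return NCR.sub(
        lambda m: bytes([int(m.group(1))]).decode(encoding), text
    )


sample = b"caf&#233;"

# Same result whatever the encoding: code position 233 is always
# LATIN SMALL LETTER E WITH ACUTE.
print(resolve_ncrs(sample, "iso-8859-1"))
print(resolve_ncrs(sample, "cp1251"))

# Under the charset-relative reading, byte 233 in windows-1251 is
# CYRILLIC SMALL LETTER SHORT I, so the same markup yields a
# different character.
print(resolve_ncrs_in_charset(sample, "cp1251"))
```

With the first reading, the same markup means the same thing however
the document is transcoded in transit; with the second, it does not.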

The "existing practice" you refer to is wrong (and hard to reconcile
with SGML).

This issue was discussed in the html-wg starting at least 6 months
prior to the release of the final internet draft (it's one of the
issues that delayed the spec coming out in, say, February 95), so
people, and especially vendors, have had fair warning.

-- 
    Albert Lunde                      Albert-Lunde@nwu.edu