Re: iso-8859-1-Windows-3.1-Latin-1 from Terje Bless on 2001-05-09 (www-validator@w3.org from May 2001)

From: Terje Bless <link@tss.no>
Date: Wed, 9 May 2001 02:38:51 +0200
To: Thanasis Kinias <tkinias@asu.edu>
Cc: "Bailey, Bruce" <Bruce.Bailey@ed.gov>, "'Liam Quinn'" <liam@htmlhelp.com>, "'gerald et al.'" <www-validator@w3.org>
Message-ID: <20010509024705-b01010701-463db02c@192.168.1.6>

On 08.05.01 at 16:59, Thanasis Kinias <tkinias@asu.edu> wrote:

>I understood that &#146; etc. were not valid SGML (or XML) entities, so
>regardless of your declared charset code using them cannot be valid.  The
>validator's output supports this.  If &#8217; were used instead it would
>be valid, right?  The entity refers to the character set (UNICODE) not the
>encode (Windows-1252, Latin-1, whatever).
>
>OTOH, with the encoding properly declared the character could be input
>directly (not escaped with &#146;) and you would be valid,

That is correct. From <URL:http://www.w3.org/TR/html4/charset.html>:

# 5 HTML Document Representation
# [...]
#
# 5.2.1 Choosing an encoding
#
# Authoring tools (e.g., text editors) may encode HTML documents
# in the character encoding of their choice, and the choice
# largely depends on the conventions used by the system software.
# These tools may employ any convenient encoding that covers most
# of the characters contained in the document, provided the
# encoding is correctly labeled. Occasional characters that fall
# outside this encoding may still be represented by character
# references. These always refer to the document character set,
# not the character encoding.
# [...]
#
# 5.3.1 Numeric character references
#
# Numeric character references specify the code position of a
# character in the document character set. Numeric character
# references may take two forms:
#
# * The syntax "&#D;", where D is a decimal number, refers
#   to the ISO 10646 decimal character number D.
# * The syntax "&#xH;" or "&#XH;", where H is a hexadecimal
#   number, refers to the ISO 10646 hexadecimal character
#   number H. Hexadecimal numbers in numeric character
#   references are case-insensitive.
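
A minimal sketch of the distinction in the excerpt above, in Python. It is not part of the spec; the helper name `describe` is made up for illustration, and Python's `cp1252` codec is assumed to stand in for the windows-1252 encoding.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point, Unicode name and general category of one character."""
    name = unicodedata.name(ch, "<no name>")
    return f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})"

# "&#146;" names code position 146 in the document character set (ISO 10646).
# That position is a C1 control character, whatever encoding the file uses.
ref_146 = chr(146)

# Byte 0x92 in a file labeled windows-1252 is a different thing: the encoding
# maps that byte to code position 8217.
raw_byte_0x92 = bytes([0x92]).decode("cp1252")

# "&#8217;" names code position 8217 directly.
ref_8217 = chr(8217)

print(describe(ref_146))
print(describe(raw_byte_0x92))
print(raw_byte_0x92 == ref_8217)
```

The point is only that the number in a character reference is never looked up in the document's encoding, so `&#146;` and the raw byte 0x92 do not name the same character.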

>OTOH, with the encoding properly declared the character could be input
>directly (not escaped with &#146;) and you would be valid, just not
>cross-platform accessible.  You'd probably be noncompliant with WCAG,
>but not be invalid.

I don't really see why the WCAG would say that, as use of the native charset
of the authoring or hosting platform is the expected mode of operation.
The IANA also requires the publication of a transformation from each
charset into ISO-10646 before accepting a new charset registration[0].
Using IANA registered, and MIME approved, charsets should be unproblematic.

That it is preferable to use UTF-8 or ISO-8859-* where possible is another
issue.

Also please note that this no longer applies in quite the same way when you
move to X(HT)ML.

>If I'm misunderstanding this please set me straight because I've got to
>cover this in a seminar I'm preparing -- I have to explain to folks why
>the validator pukes on &#146; etc.

No, you've got it right, but Bruce wanted to use a specific character from
the iso-8859-1-Windows-3.1-Latin-1 charset and was suffering under the
restriction that the solution be usable with certain older browsers in
addition to being standards compliant. UNICODE (the obvious solution) is
poorly supported even in current browsers, much less older ones. Thus, the
solution is to use raw 8-bit windows-1252 for the document.
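
As a rough illustration of what "raw 8-bit windows-1252" means in practice, here is a sketch in Python. The page fragment is hypothetical, and Python's `cp1252` codec is assumed to match the windows-1252 charset.

```python
# Hypothetical fragment: the charset label and the sentence are made up.
page = (
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1252">\n'
    "<p>It\u2019s valid.</p>\n"
)

# Saved as windows-1252, the quote becomes the single raw byte 0x92,
# with no character reference involved.
raw = page.encode("cp1252")
print(b"It\x92s" in raw)

# A reader that honours the declared charset recovers the same text.
print(raw.decode("cp1252") == page)

# ISO-8859-1 has no such character, so the same page cannot be saved in it.
try:
    page.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

The label has to match the bytes actually written; the sketch says nothing about how any particular browser treats the result.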

Received on Tuesday, 8 May 2001 20:47:17 UTC