Re: Charsets revisited from Tim Greenwood on 1996-01-24 (ietf-http-wg@w3.org from January to March 1996)

From: Tim Greenwood <greenwd@openmarket.com>
Date: Wed, 24 Jan 1996 18:53:45 -0500
To: Nickolay Saukh <nms@nns.ru>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Message-Id: <199601242353.SAA11495@relay.openmarket.com>

At 01:16 AM 1/25/96 +0300, you wrote:
>> >Suppose I would like to get description of chess game in russian.
>> >I know about special chess characters in Unicode. But if Unicode
>> >is not available, then iso8859-5 would be sufficient.
>> 
>> Section 12 of HTTP 1.1 (Nov 22) says "If multiple representations 
>> exist that only vary by Content-Encoding, then the smallest representation
>> (lowest bs) is preferred." 
>
>Well, they vary by special chess characters in Unicode and their
>rough approximation in iso-8859-5.

Content equivalence is both a hard philosophical issue and an easy protocol issue.

For simplicity, consider text-only content. We have

Abstract idea
  transformed to
Language
  transformed to
Text
  transformed to
Character set encoding
  possibly transformed to
Content coding
  possibly transformed to
Transfer coding
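
The last three steps of that chain can be sketched in a few lines of
Python. This is only an illustration under assumptions of mine: the function
names are hypothetical, gzip stands in for the content coding, and chunked
stands in for the transfer coding. None of these choices comes from the draft.

```python
import gzip


def encode_charset(text: str, charset: str) -> bytes:
    """Text -> character set encoding."""
    return text.encode(charset)


def apply_content_coding(body: bytes) -> bytes:
    """Character set encoding -> content coding (gzip assumed here)."""
    return gzip.compress(body)


def apply_transfer_coding(body: bytes, chunk_size: int = 1024) -> bytes:
    """Content coding -> transfer coding (chunked assumed here)."""
    out = bytearray()
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        # Each chunk is its size in hex, CRLF, the data, CRLF.
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length chunk terminates the body
    return bytes(out)


# The same text yields different octets at each layer.
text = "Ход коня"
as_8859_5 = encode_charset(text, "iso8859_5")
as_utf8 = encode_charset(text, "utf-8")
on_the_wire = apply_transfer_coding(apply_content_coding(as_8859_5))
```

Each layer is reversible without touching the layer above it, which is
why the two optional codings do not affect what the text says.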

At any point in this list we have the possibility of multiple representations
of the higher entity. Transformations may be performed by the server on the
fly in response to a request, or multiple transformed representations may
be stored. My understanding of the protocol is that identical URLs denote
identical abstract ideas: multiple representations may be stored,
differentiated on request by HTTP request headers and on response by entity
headers. It is the content provider who makes the claim that the content of
the Entity-Body is identical at the abstract level. For your chess example, if
the content provider has decided that the "rough approximation in iso-8859-5"
and the representation in Unicode are multiple representations of the same
abstract idea, then we have content equivalence and the "lowest bs" rule for
deciding which character set to provide holds. If the content provider
decides that these are two separate abstract ideas, then the two
representations have different URLs and none of this applies. Language
variants distinguished by differing Content-Language entity headers are a more
interesting example of multiple representations of an abstract idea. See
"Godel, Escher, Bach" for a discussion of the equivalence of multiple language
representations of abstract ideas.
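
To make the chess case concrete, here is a rough sketch of how a server
might choose among stored variants of one URL that the provider has declared
equivalent. The function name and the variant records are hypothetical, and
I use UTF-8 as the Unicode encoding purely for illustration. The selection
rule is simply "smallest acceptable representation", my reading of the
"lowest bs" wording quoted above.

```python
def pick_variant(variants, acceptable_charsets):
    """Return the smallest stored variant whose charset the client accepts.

    `variants` is a list of dicts with "charset" and "body" keys
    (a made-up structure for this sketch). Returns None if nothing matches.
    """
    candidates = [v for v in variants if v["charset"] in acceptable_charsets]
    if not candidates:
        return None
    # "Lowest bs": prefer the representation with the fewest bytes.
    return min(candidates, key=lambda v: len(v["body"]))


# Two representations the provider considers equivalent: a Cyrillic-letter
# approximation in ISO 8859-5, and a Unicode chess symbol in UTF-8.
variants = [
    {"charset": "iso-8859-5", "body": "Кf3".encode("iso8859_5")},
    {"charset": "utf-8", "body": "♘f3".encode("utf-8")},
]

chosen = pick_variant(variants, {"iso-8859-5", "utf-8"})
```

If the provider had instead decided the two were different abstract ideas,
they would live at different URLs and no such selection would take place.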

If the transformation is performed on the fly then equivalence is presumed
from the transformation algorithms used. It is interesting that the
algorithms to be used are not necessarily specified by the standard. Take
ISO 8859-5 to Unicode, for example. The Unicode Consortium publishes a set of
tables which I would recommend we consider to be 'the standard', but other,
disputed, conversion tables have been seen, especially for ideographic-based
writing systems.
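
As an illustration of on-the-fly conversion, the sketch below transcodes a
body between character sets using Python's built-in codecs. The function
name is hypothetical, and whichever mapping table the codec happens to
implement is exactly the unspecified choice discussed above. UTF-8 again
stands in for "Unicode" by assumption.

```python
def transcode(body: bytes, source: str, target: str,
              errors: str = "strict") -> bytes:
    """Decode `body` from `source` and re-encode it in `target`.

    `errors` controls what happens to characters the target character
    set cannot represent ("strict" raises, "replace" substitutes "?").
    """
    return body.decode(source).encode(target, errors=errors)


# ISO 8859-5 to Unicode is lossless: every octet has a Unicode mapping.
upgraded = transcode("Кf3".encode("iso8859_5"), "iso8859_5", "utf-8")

# The reverse direction is not: the chess symbol has no ISO 8859-5 octet,
# so the result is only an approximation of the original.
approximated = transcode("♘f3".encode("utf-8"), "utf-8", "iso8859_5",
                         errors="replace")
```

Whether the lossy result still counts as the same abstract idea is the
content provider's call, not something the conversion can decide.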
-------------------------------------
Tim Greenwood        Open Market Inc
617 679 0320         greenwd@openmarket.com

Received on Wednesday, 24 January 1996 15:58:25 UTC