Re: Character encodings in headers [i74][was: Straw-man charter for http-bis]

Mark Nottingham said:
>> UTF-8 has virtually
>> the same footprint in terms of bytes as ISO-8859-1: All bytes
>> above 0x7F may be used. Implementations that have to deal with
>> ISO-8859-1 usually do this by just being 8-bit-transparent;
>> that works for UTF-8, too.
> If utf-8 is a subset of iso-8859-1, it would work; but I don't think  
> that's the case (not that I'm an expert in this area, by any means).

It's not.

Printable text in ISO-8859-n (for all n) consists of a sequence of
characters, each of which is either:

    one octet in the range 20 to 7E
    one octet in the range A0 to FF

Printable text in UTF-8 consists of a sequence of characters, each of which
is either:

    one octet in the range 20 to 7E
    one octet in the range C2 to F4 followed by between 1 and 3 octets
              in the range 80 to BF (the first octet tells you how many [*])

In both cases, 20 to 7E are the ASCII characters. In both cases, codes like
09 (HTAB) and 0A (LF) have the same meaning. In ISO-8859-n the meaning of
codes A0 to FF depends on the value of n. In UTF-8 each sequence has a
unique meaning that never changes.
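
To make that concrete, here is a small Python sketch (my own illustration,
not anything taken from the specs). It reads the single octet E9 under three
different ISO-8859 parts, then shows that the UTF-8 sequence C3 A9 has one
fixed meaning and that a lone E9 is not well-formed UTF-8 at all. The
variable names are invented for the example.

```python
# The same octet, E9, read under three different ISO-8859 parts.
octet = bytes([0xE9])

as_latin1 = octet.decode("iso-8859-1")    # U+00E9, e with acute
as_cyrillic = octet.decode("iso-8859-5")  # U+0449, Cyrillic shcha
as_greek = octet.decode("iso-8859-7")     # U+03B9, Greek iota

# In UTF-8 the sequence C3 A9 always decodes to U+00E9.
as_utf8 = bytes([0xC3, 0xA9]).decode("utf-8")

# A lone E9 announces a three-octet sequence, but nothing follows it,
# so a strict UTF-8 decoder rejects it.
try:
    octet.decode("utf-8")
    lone_e9_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_e9_is_valid_utf8 = False

print([hex(ord(c)) for c in (as_latin1, as_cyrillic, as_greek, as_utf8)])
print("lone E9 valid as UTF-8:", lone_e9_is_valid_utf8)
```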

The syntax in RFC 2616 allows any octet in the range 20 to FF except 7F;
printable text in ISO-8859-n and printable text in UTF-8 are both subsets
of that.
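
A rough Python check of that claim, as a sketch only: the function below
models nothing more than the "20 to FF except 7F" rule stated above (it is
not a full parser for the header grammar), and its name is made up for the
example.

```python
def allowed_octets_only(data: bytes) -> bool:
    """True if every octet is in 20 to FF and none is 7F.

    Assumption: this models only the octet range described above,
    not the rest of the RFC 2616 header syntax.
    """
    return all(0x20 <= b <= 0xFF and b != 0x7F for b in data)


text = "caf\u00e9"                    # "cafe" with an acute accent on the e
as_latin1 = text.encode("iso-8859-1")  # 63 61 66 E9
as_utf8 = text.encode("utf-8")         # 63 61 66 C3 A9

print(allowed_octets_only(as_latin1), allowed_octets_only(as_utf8))
```

Both encodings pass, which is why an 8-bit-transparent implementation can
carry either one; what it cannot do is tell you which of the two it has
been handed.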

(*) To be precise:
     one octet C2 to DF followed by one   octet  in the range 80 to BF, or
     one octet E0 to EF followed by two   octets in the range 80 to BF, or
     one octet F0 to F4 followed by three octets in the range 80 to BF.
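
The "first octet tells you how many" rule can be sketched in Python as
follows. This is an illustration under stated assumptions: it uses the
lead-octet ranges from RFC 3629 (C2 to DF, E0 to EF, F0 to F4), it looks
only at the first octet, and it does not check the octets that follow. The
function name is hypothetical.

```python
def utf8_sequence_length(lead: int) -> int:
    """Total number of octets in the sequence that starts with `lead`.

    Assumption: lead-octet ranges as in RFC 3629; following octets
    are not validated here.
    """
    if 0x20 <= lead <= 0x7E:
        return 1  # printable ASCII, a sequence of one
    if 0xC2 <= lead <= 0xDF:
        return 2  # one following octet in 80 to BF
    if 0xE0 <= lead <= 0xEF:
        return 3  # two following octets
    if 0xF0 <= lead <= 0xF4:
        return 4  # three following octets
    raise ValueError(f"octet {lead:02X} cannot start a printable character")


# Compare against what Python's own encoder produces.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(), utf8_sequence_length(encoded[0]))
```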

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |

Received on Monday, 20 August 2007 10:29:06 UTC