- From: Clive D.W. Feather <clive@demon.net>
- Date: Mon, 20 Aug 2007 11:24:06 +0100
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, Paul Hoffman <phoffman@imc.org>, Apps Discuss <discuss@apps.ietf.org>, Felix Sasaki <fsasaki@w3.org>, "ietf-http-wg@w3.org Group" <ietf-http-wg@w3.org>, Richard Ishida <ishida@w3.org>
Mark Nottingham said:
>> UTF-8 has virtually
>> the same footprint in terms of bytes as ISO-8859-1: All bytes
>> above 0x7F may be used. Implementations that have to deal with
>> ISO-8859-1 usually do this by just being 8-bit-transparent;
>> that works for UTF-8, too.
> If utf-8 is a subset of iso-8859-1, it would work; but I don't think
> that's the case (not that I'm an expert in this area, by any means).

It's not.

Printable text in ISO-8859-n (for all n) consists of a sequence of
characters, each of which is either:
    one octet in the range 20 to 7E
    one octet in the range A0 to FF

Printable text in UTF-8 consists of a sequence of characters, each of
which is either:
    one octet in the range 20 to 7E
    one octet in the range C2 to F4 followed by between 1 and 3 octets
        in the range 80 to BF (the first octet tells you how many [*])

In both cases, 20 to 7E are the ASCII characters. In both cases, codes
like 09 (HTAB) and 0A (LF) have the same meaning. In ISO-8859-n the
meaning of codes A0 to FF depends on the value of n. In UTF-8 each
sequence has a unique meaning that never changes.

The syntax in RFC 2616 allows any octet in the range 20 to FF except 7F;
both of these are subsets of that.

[*] To be precise:
    one octet C2 to DF followed by one octet in the range 80 to BF, or
    one octet E0 to EF followed by two octets in the range 80 to BF, or
    one octet F0 to F4 followed by three octets in the range 80 to BF.
    (The second octet is further restricted after E0, ED, F0 and F4, to
    exclude overlong forms, surrogates, and values above U+10FFFF.)

-- 
Clive D.W. Feather  | Work:  <clive@demon.net>   | Tel:    +44 20 8495 6138
Internet Expert     | Home:  <clive@davros.org>  | Fax:    +44 870 051 9937
Demon Internet      | WWW: http://www.davros.org | Mobile: +44 7973 377646
THUS plc            |                            |
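
Editorial illustration, not part of the original message: a minimal
Python sketch of the "first octet tells you how many" rule described
above. The function names are hypothetical, and the lead-octet ranges
used are an assumption taken from RFC 3629 (C2 to DF, E0 to EF, F0 to
F4). The sketch checks only the octet pattern; it does not apply the
extra second-octet restrictions, and it accepts control codes below 20.

```python
def utf8_sequence_length(lead: int) -> int:
    """Total length of the UTF-8 sequence that starts with `lead`.

    Returns 0 if `lead` cannot start a sequence (a continuation octet
    80 to BF, or one of C0, C1, F5 to FF, which never appear in UTF-8).
    """
    if 0x00 <= lead <= 0x7F:
        return 1  # ASCII, including control codes such as HTAB and LF
    if 0xC2 <= lead <= 0xDF:
        return 2  # lead octet plus one continuation octet
    if 0xE0 <= lead <= 0xEF:
        return 3  # lead octet plus two continuation octets
    if 0xF0 <= lead <= 0xF4:
        return 4  # lead octet plus three continuation octets
    return 0


def follows_utf8_pattern(data: bytes) -> bool:
    """True if `data` is a run of lead octets each followed by the right
    number of continuation octets in the range 80 to BF.

    Simplified: overlong forms and surrogates are not rejected here.
    """
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            return False
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += n
    return True


# "cafe" with an acute accent, encoded two ways.
latin1_bytes = b"caf\xe9"      # ISO-8859-1: one octet, E9
utf8_bytes = b"caf\xc3\xa9"    # UTF-8: two octets, C3 A9

print(follows_utf8_pattern(utf8_bytes))    # True
print(follows_utf8_pattern(latin1_bytes))  # False: E9 needs two more octets
```

Note that the octets C3 A9 are also legal ISO-8859-1 text, where they
read as two separate characters (U+00C3 and U+00A9). An 8-bit-transparent
implementation passes both encodings through unchanged, but it cannot
tell which one it has been given.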
Received on Monday, 20 August 2007 10:29:06 UTC