CSS encoding detection and GSM 03.38 from Benjamin Hawkes-Lewis on 2009-06-06 (www-style@w3.org from June 2009)

From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
Date: Sun, 07 Jun 2009 00:46:19 +0100
To: W3C Emailing list for WWW Style <www-style@w3.org>
Message-ID: <4A2AFFCB.9050408@googlemail.com>

Hello, I am trying to implement CSS 2.1 encoding detection as defined by:

http://www.w3.org/TR/CSS2/syndata.html#charset

(Specifically, I'm using curl to fetch the headers and content of
stylesheets and, when no character encoding can be extracted from the
headers, sniffing the content according to the SHOULD requirements of
CSS 2.1.)

I'm puzzled by the byte stream sniffing defined for GSM 03.38. (I 
haven't looked at the other cases requiring transcoding yet.)

Correspondents as unfamiliar with this encoding as I was may wish to 
consult:

The ETSI spec:
http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf

Plain English explanations:

    * http://www.dreamfabric.com/sms/
    * http://www.atmel.com/dyn/resources/prod_documents/doc8016.pdf

Official Unicode mapping: 
http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT

The byte stream pattern "00 63 68 61 72 73 65 74 20 22 (YY)* 22 3B" 
matches the hexadecimal values of certain codes in GSM 03.38's code page.

For example, '@charset "GSM0338";' is "00 63 68 61 72 73 65 74 20 22 47 
53 4D 30 33 33 38 22 3B".
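
As an illustration only, here is a minimal Python sketch of matching that
pattern against a sequence of already-unpacked GSM 03.38 codes, one code
per byte. The names GSM_CHARSET_RULE and sniff_gsm_charset are made up for
this example, and treating "(YY)*" as "any run of bytes other than 22" is
an assumption about the spec's notation, not something confirmed here.

```python
import re

# Pattern from the CSS 2.1 table: 00 63 68 61 72 73 65 74 20 22 (YY)* 22 3B
# Assumption: (YY)* is read as any run of bytes other than 0x22.
GSM_CHARSET_RULE = re.compile(rb'\x00\x63\x68\x61\x72\x73\x65\x74\x20\x22([^\x22]*)\x22\x3B')


def sniff_gsm_charset(data: bytes):
    """Return the declared name (as raw bytes) if data starts with the
    pattern, otherwise None. Hypothetical helper, for illustration."""
    m = GSM_CHARSET_RULE.match(data)
    return m.group(1) if m else None


# '@charset "GSM0338";' as unpacked GSM 03.38 codes, one per byte.
unpacked = bytes.fromhex(
    "00 63 68 61 72 73 65 74 20 22 47 53 4D 30 33 33 38 22 3B"
)
print(sniff_gsm_charset(unpacked))  # b'GSM0338'
```

This only works because the input is one code per byte; it says nothing
about how those codes are laid out in a real octet stream.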

But as far as I can tell, when actually written, these codes would 
always be encoded as 7 bits packed into octets, such that the 
hexadecimal values in the pattern are actually spread across multiple 
octets.

So the consuming processor would actually see "80 31 3A 2C 9F 97 E9 20 
D1 71 DA 84 CD 66 38 D1 0E", which would not be matched by the pattern. 
Only a further process of unpacking the septets from the octets could 
reveal the hexadecimal values that would match the pattern.
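
The packing step can be sketched in Python as follows. This is a rough
illustration of the septet packing as I understand it (least significant
bits first), not a reference implementation; pack_septets and
unpack_septets are hypothetical helper names.

```python
def pack_septets(septets: bytes) -> bytes:
    """Pack 7-bit codes into octets, filling each octet from its low bits."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial octet, zero-padded
    return bytes(out)


def unpack_septets(octets: bytes, count=None) -> bytes:
    """Reverse of pack_septets. Without an explicit count, trailing padding
    can be misread as one extra 0x00 septet (which is '@' in GSM 03.38)."""
    if count is None:
        count = len(octets) * 8 // 7
    acc = 0
    nbits = 0
    out = bytearray()
    for o in octets:
        acc |= o << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return bytes(out)


septets = bytes.fromhex(
    "00 63 68 61 72 73 65 74 20 22 47 53 4D 30 33 33 38 22 3B"
)
packed = pack_septets(septets)
print(" ".join(f"{b:02X}" for b in packed))
# 80 31 3A 2C 9F 97 E9 20 D1 71 DA 84 CD 66 38 D1 0E
```

Unpacking the 17 packed octets gives back the original 19 codes, which is
the only form in which the spec's pattern can match.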

So does this mean that the term "byte" is used in CSS 2.1 in a general
abstract sense rather than the common specific sense of "octet" (8 bits)?
If so, it would help if this were clarified. The PNG spec clarifies that
it is using "byte" to mean octet:

http://www.w3.org/2003/glossary/keyword/All/?keywords=byte

How are implementations supposed to resolve arbitrary octet streams into 
CSS 2.1 "bytes"? For example, how is an implementation supposed to know 
when to decode a given octet stream into unpacked septets /before/ 
trying to match the patterns from the spec against it?

I've got a few other, more minor questions about GSM 03.38 and CSS:

1. The spec says "The name must be a charset name as described in the
IANA registry". Given that GSM 03.38 doesn't have a name in the IANA
registry, doesn't this mean that "as specified, transcoded from GSM
03.38 to ASCII" is a condition that could never be met, since a
specification like "GSM0338" or "GSM03.38" would violate this MUST
requirement and have to be discarded?

2. I've noticed that Unicode deliberately redefined the meaning of GSM
03.38 hex 09 from capital to lowercase c cedilla. Currently it seems
undefined whether CSS-conforming software should interpret hex 09
according to the GSM specification (capital c cedilla) or Unicode
(lowercase c cedilla). Should it be defined?

3. Does anyone have any examples of CSS encoded as GSM 03.38 7-bit from
the wild? Searching around, I haven't found any reference to this other
than in the CSS spec. (I'd like to test my implementation against
something real.)

That neither the official test suite nor the Microsoft test suite seems
to cover GSM 03.38 makes it especially hard to know what to do:

http://www.w3.org/Style/CSS/Test/CSS2.1/current/

http://samples.msdn.microsoft.com/ietestcenter/css.htm

--
Benjamin Hawkes-Lewis
Received on Saturday, 6 June 2009 23:47:03 UTC