- From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
- Date: Sun, 07 Jun 2009 00:46:19 +0100
- To: W3C Emailing list for WWW Style <www-style@w3.org>
Hello,

I am trying to implement CSS 2.1 encoding detection as defined by:

http://www.w3.org/TR/CSS2/syndata.html#charset

(Specifically, I'm using curl to grab headers and content for stylesheets, then, when no character encoding can be extracted from the headers, sniffing the content according to the SHOULD requirements of CSS 2.1.)

I'm puzzled by the byte stream sniffing defined for GSM 03.38. (I haven't looked at the other cases requiring transcoding yet.)

Correspondents as unfamiliar with this encoding as I was may wish to consult:

The ETSI spec:

http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf

Plain English explanations:

* http://www.dreamfabric.com/sms/
* http://www.atmel.com/dyn/resources/prod_documents/doc8016.pdf

Official Unicode mapping:

http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT

The byte stream pattern "00 63 68 61 72 73 65 74 20 22 (YY)* 22 3B" matches the hexadecimal values of certain codes in GSM 03.38's code page. For example, '@charset "GSM0338";' is "00 63 68 61 72 73 65 74 20 22 47 53 4D 30 33 33 38 22 3B".

But as far as I can tell, when actually written, these codes would always be encoded as 7 bits packed into octets, such that the hexadecimal values in the pattern are actually spread across multiple octets. So the consuming processor would actually see "80 31 3A 2C 9F 97 E9 20 D1 71 DA 84 CD 66 38 D1 0E", which would not be matched by the pattern. Only a further process of unpacking the septets from the octets could reveal the hexadecimal values that would match the pattern.

So does this mean that the term "byte" is used in CSS 2.1 in a general abstract sense rather than in the common specific sense of "octet" (8 bits)? If so, it would help if this were clarified. The PNG spec, for comparison, clarifies that it is using "byte" to mean octet:

http://www.w3.org/2003/glossary/keyword/All/?keywords=byte

How are implementations supposed to resolve arbitrary octet streams into CSS 2.1 "bytes"? For example, how is an implementation supposed to know when to decode a given octet stream into unpacked septets /before/ trying to match the patterns from the spec against it?

I've got a few other, more minor questions about GSM 03.38 and CSS:

1. The spec says "The name must be a charset name as described in the IANA registry". Given that GSM 03.38 doesn't have a name in the IANA registry, why doesn't this mean that "as specified, transcoded from GSM 03.38 to ASCII" is a condition that could never be met, since a specification like "GSM0338" or "GSM03.38" would violate this MUST requirement and have to be discarded?

2. I've noticed that Unicode deliberately redefined the meaning of GSM 03.38 hex 09 from uppercase to lowercase c cedilla. Currently it seems undefined whether CSS-conforming software should interpret hex 09 according to the GSM specification (capital c cedilla) or Unicode (lowercase c cedilla). Should it be defined?

3. Does anyone have any examples of CSS encoded as GSM 03.38 7-bit from the wild? Searching around, I haven't found any reference to this encoding other than in the CSS spec. (I'd like to test my implementation against something real.) That neither the official test suite nor the Microsoft test suite seems to cover GSM 03.38 makes it especially hard to know what to do:

http://www.w3.org/Style/CSS/Test/CSS2.1/current/
http://samples.msdn.microsoft.com/ietestcenter/css.htm

--
Benjamin Hawkes-Lewis
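To make the packing concrete, here is a minimal Python sketch, assuming the standard LSB-first septet packing that GSM 03.38 uses for SMS payloads (the helper names are invented for illustration, not taken from any spec or library). It reproduces the packed octets quoted in the message above and shows that the @charset sniffing pattern matches only after the septets are unpacked:

    # Minimal sketch of GSM 03.38 7-bit packing/unpacking,
    # assuming the standard LSB-first septet packing used for SMS.

    def pack_septets(septets):
        """Pack 7-bit values into octets, least significant bits first."""
        out, carry, carry_bits = [], 0, 0
        for s in septets:
            carry |= (s & 0x7F) << carry_bits
            carry_bits += 7
            while carry_bits >= 8:
                out.append(carry & 0xFF)
                carry >>= 8
                carry_bits -= 8
        if carry_bits:                      # leftover bits, zero-padded
            out.append(carry & 0xFF)
        return bytes(out)

    def unpack_septets(octets, count):
        """Recover `count` septets from LSB-first packed octets."""
        out, carry, carry_bits = [], 0, 0
        for o in octets:
            carry |= o << carry_bits
            carry_bits += 8
            while carry_bits >= 7 and len(out) < count:
                out.append(carry & 0x7F)
                carry >>= 7
                carry_bits -= 7
        return bytes(out)

    # '@charset "GSM0338";' as GSM 03.38 code points ('@' is 0x00; the
    # letters, digits, space, '"' and ';' share their ASCII values).
    septets = [0x00, 0x63, 0x68, 0x61, 0x72, 0x73, 0x65, 0x74, 0x20,
               0x22, 0x47, 0x53, 0x4D, 0x30, 0x33, 0x33, 0x38, 0x22, 0x3B]

    packed = pack_septets(septets)
    print(' '.join('%02X' % b for b in packed))
    # -> 80 31 3A 2C 9F 97 E9 20 D1 71 DA 84 CD 66 38 D1 0E

    # The sniffing prefix "00 63 68 61 72 73 65 74 20 22" is nowhere in
    # the packed octet stream; it reappears only after unpacking.
    prefix = bytes([0x00, 0x63, 0x68, 0x61, 0x72, 0x73, 0x65, 0x74, 0x20, 0x22])
    print(packed.startswith(prefix))                             # False
    print(unpack_septets(packed, len(septets)).startswith(prefix))  # True

Running the sketch prints the 17 packed octets quoted in the message, then False for the raw octet stream and True once the septets have been unpacked, which is exactly the mismatch the message describes.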
Received on Saturday, 6 June 2009 23:47:03 UTC