- From: Benjamin Hawkes-Lewis <bhawkeslewis@googlemail.com>
- Date: Sun, 07 Jun 2009 00:46:19 +0100
- To: W3C Emailing list for WWW Style <www-style@w3.org>
Hello, I am trying to implement CSS 2.1 encoding detection as defined by:
http://www.w3.org/TR/CSS2/syndata.html#charset
(Specifically, I'm using curl to grab headers and content for
stylesheets; when no character encoding can be extracted from the
headers, I sniff the content according to the SHOULD requirements of
CSS 2.1.)
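For concreteness, here is a minimal sketch (Python; the regex and
function name are my own) of the simplest sniffing case, where the
@charset rule is in an ASCII-compatible encoding and the spec's byte
pattern can be matched directly:

```python
import re

# Sketch of the ASCII-compatible case of the CSS 2.1 @charset sniff,
# applied when the HTTP headers carry no character encoding. The spec's
# byte pattern for this case is 40 63 68 61 72 73 65 74 20 22 (YY)* 22 3B,
# i.e. the literal bytes of '@charset "<name>";' at the very start.
CHARSET_RE = re.compile(rb'\A\x40charset \x22([^\x22]*)\x22\x3B')

def sniff_ascii_charset(stylesheet_bytes):
    m = CHARSET_RE.match(stylesheet_bytes)
    return m.group(1).decode('ascii') if m else None
```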
I'm puzzled by the byte stream sniffing defined for GSM 03.38. (I
haven't looked at the other cases requiring transcoding yet.)
Correspondents as unfamiliar with this encoding as I was may wish to
consult:
The ETSI spec:
http://pda.etsi.org/exchangefolder/ts_100900v070200p.pdf
Plain English explanations:
* http://www.dreamfabric.com/sms/
* http://www.atmel.com/dyn/resources/prod_documents/doc8016.pdf
Official Unicode mapping:
http://www.unicode.org/Public/MAPPINGS/ETSI/GSM0338.TXT
The byte stream pattern "00 63 68 61 72 73 65 74 20 22 (YY)* 22 3B"
matches the hexadecimal values of certain codes in GSM 03.38's code page.
For example, '@charset "GSM0338";' is "00 63 68 61 72 73 65 74 20 22 47
53 4D 30 33 33 38 22 3B".
But as far as I can tell, when actually written, these codes would
always be encoded as 7 bits packed into octets, such that the
hexadecimal values in the pattern are actually spread across multiple
octets.
So the consuming processor would actually see "80 31 3A 2C 9F 97 E9 20
D1 71 DA 84 CD 66 38 D1 0E", which would not be matched by the pattern.
Only a further process of unpacking the septets from the octets could
reveal the hexadecimal values that would match the pattern.
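To make the packing concrete, here is a sketch (Python; the function
name is my own) of the standard GSM 7-bit packing, septets packed
LSB-first into octets, which reproduces the packed stream quoted above:

```python
def pack_septets(septets):
    """Pack 7-bit code values into octets, LSB first, per GSM 03.38."""
    acc = nbits = 0
    out = bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                     # zero-pad the final partial octet
        out.append(acc & 0xFF)
    return bytes(out)

# '@' is 0x00 in GSM 03.38; the remaining characters of
# '@charset "GSM0338";' share their code values with ASCII.
codes = bytes([0x00]) + b'charset "GSM0338";'
packed = pack_septets(codes)
# packed.hex() -> '80313a2c9f97e920d171da84cd6638d10e'
```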
So does this mean that the term "byte" is used in CSS 2.1 in a general
abstract sense rather than the common specific sense of "octet" (8 bits)? If
so, it would help if this were clarified. The PNG spec clarifies that it
is using "byte" to mean octet:
http://www.w3.org/2003/glossary/keyword/All/?keywords=byte
How are implementations supposed to resolve arbitrary octet streams into
CSS 2.1 "bytes"? For example, how is an implementation supposed to know
when to decode a given octet stream into unpacked septets /before/
trying to match the patterns from the spec against it?
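(The unpacking step itself is simple enough — a sketch follows, the
inverse of the standard packing, function name my own; the open
question is knowing *when* to apply it.)

```python
def unpack_septets(octets):
    """Recover 7-bit code values from GSM 03.38 data packed LSB-first."""
    acc = nbits = 0
    out = bytearray()
    for b in octets:
        acc |= b << nbits
        nbits += 8
        while nbits >= 7:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return bytes(out)

packed = bytes.fromhex('80313A2C9F97E920D171DA84CD6638D10E')
# unpack_septets(packed) -> b'\x00charset "GSM0338";'
# (0x00 being '@' in GSM 03.38), after which the spec's pattern matches.
```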
I've got a few other, more minor questions about GSM 03.38 and CSS:
1. The spec says "The name must be a charset name as described in the
IANA registry". Given GSM 03.38 doesn't have a name in the IANA
registry, why doesn't this mean that "as specified, transcoded from GSM
03.38 to ASCII" is a condition that could never be met, since a
specification like "GSM0338" or "GSM03.38" would violate this MUST
requirement and have to be discarded?
2. I've noticed that Unicode deliberately redefined the meaning of GSM
03.38 hex 09 from uppercase to lowercase c cedilla. Currently it seems
undefined whether CSS-conforming software should interpret hex 09
according to the GSM specification (capital c cedilla) or Unicode
(lowercase c cedilla). Should it be defined?
3. Does anyone have any examples of CSS encoded as GSM 03.38 7-bit from
the wild? Searching around, I haven't found any reference to this other
than in the CSS spec. (I'd like to test my implementation against
something real.)
That neither the official test suite nor the Microsoft test suite seems
to cover GSM 03.38 makes it especially hard to know what to do:
http://www.w3.org/Style/CSS/Test/CSS2.1/current/
http://samples.msdn.microsoft.com/ietestcenter/css.htm
--
Benjamin Hawkes-Lewis
Received on Saturday, 6 June 2009 23:47:03 UTC