- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 18 Feb 2004 02:44:23 +0000 (UTC)
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- Cc: Bert Bos <bert@w3.org>, www-style@w3.org
On Tue, 17 Feb 2004, Boris Zbarsky wrote:
>
> So in other words, in the proposed setup one would need to do the
> following to be robust wrt new encodings that may appear:
>
> 1) See whether there is a BOM.
> 2) Parse the @charset rule.
> 3) If there is an @charset rule and a BOM, encode U+FEFF using the
>    charset from the @charset rule and see whether this agrees with the
>    BOM's representation. If it does, use the @charset charset. If
>    not, guess a charset based on the BOM's encoding.

That would be compliant, I think. It should be equivalent to the
following algorithm. Follow the steps given until the set of encodings
declared in step zero has only one remaining encoding. If a step would
reduce the number of encodings to zero, skip that step.

0) Set the set of encodings to include all known encodings.

1) If there is an HTTP Content-Type header, reduce the set of encodings
   to the set of encodings that the Content-Type header covered. (e.g.
   if it said "text/css;charset=utf-16" then the set would be UTF-16LE,
   UTF-16BE.)

2) See if you can detect a BOM. If so, use that to reduce the set of
   encodings to the set of encodings that have that BOM.

3) See if you can detect an '@charset' in one of the encodings in the
   set. If so, see if it says that the encoding is one of the encodings
   in the current set. If so, reduce the set to the encodings that
   match what the @charset rule specified.

4) Do the same again, using any metadata from the linking mechanism,
   such as <link charset="">.

5) If there is still more than one encoding in the set, and you have a
   referring document or stylesheet, and it has a known encoding that
   is one of the encodings in the set, then reduce the set to just that
   encoding.

6) Use a UA-dependent mechanism to narrow the set down to one encoding.

Sound right? I believe this is exactly equivalent to what the spec says
now, but in more detail than we want for 2.1. Maybe CSS3 could use a
more explicit algorithm like the above.
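[The six-step set-narrowing algorithm above can be sketched in code. This is a minimal illustration, not anything from the thread or the CSS spec: the function and parameter names (`pick_encoding`, `http_charset`, `link_charset`, `referrer_charset`) are invented for the example, the set of "all known encodings" is cut down to five for brevity, and step 6's UA-dependent tie-break is stood in for by an arbitrary deterministic pick.]

```python
import codecs

# Stand-in for "all known encodings" (step 0); a real UA would have many more.
KNOWN_ENCODINGS = {"utf-8", "utf-16le", "utf-16be", "iso-8859-1", "windows-1252"}

# Map each BOM to the encodings that have that BOM (step 2).
BOMS = {
    codecs.BOM_UTF8: {"utf-8"},
    codecs.BOM_UTF16_LE: {"utf-16le"},
    codecs.BOM_UTF16_BE: {"utf-16be"},
}

def narrow(candidates, subset):
    # "If a step would reduce the number of encodings to zero, skip that step."
    remaining = candidates & subset
    return remaining if remaining else candidates

def pick_encoding(raw, http_charset=None, link_charset=None, referrer_charset=None):
    # Step 0: start with every known encoding.
    candidates = set(KNOWN_ENCODINGS)

    # Step 1: HTTP Content-Type; "utf-16" covers both byte orders.
    if http_charset:
        covered = {"utf-16le", "utf-16be"} if http_charset == "utf-16" else {http_charset}
        candidates = narrow(candidates, covered)

    # Step 2: BOM detection.
    for bom, encs in BOMS.items():
        if raw.startswith(bom):
            candidates = narrow(candidates, encs)
            break

    # Step 3: try to read an @charset rule in each candidate encoding.
    for enc in sorted(candidates):
        text = raw[:64].decode(enc, errors="ignore")
        if text.startswith('@charset "'):
            end = text.find('"', 10)
            if end != -1:
                declared = text[10:end].lower()
                if declared in candidates:
                    candidates = narrow(candidates, {declared})
                    break

    # Step 4: metadata from the linking mechanism, e.g. <link charset="">.
    if link_charset:
        candidates = narrow(candidates, {link_charset.lower()})

    # Step 5: encoding of the referring document or stylesheet.
    if referrer_charset and referrer_charset in candidates:
        candidates = {referrer_charset}

    # Step 6: UA-dependent narrowing; here just a deterministic arbitrary pick.
    return sorted(candidates)[0]
```

For example, a sheet beginning with the ASCII bytes of `@charset "utf-8";` is narrowed to UTF-8 at step 3, while `charset=utf-16` plus a little-endian BOM resolves to UTF-16LE after steps 1 and 2.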
The one thing that that algorithm doesn't say is how to cope with the
case where, in step 3, you detect an @charset, and the given encoding
is in the set, but the set of encodings that would detect the @charset
and the set of encodings that are covered by the given encoding do not
overlap.

>> As far as I know the only encodings that can represent U+FEFF are:
>
> I should clarify that I am by no means an intl expert. Hence all the
> questions. It's just that I tend to assume that any situation like
> the one we have with encodings will deteriorate (that is, more random
> encodings will appear, possibly overlapping existing ones in most
> undesirable ways).

One would hope, given the existence of Unicode, that we will not be
seeing new encodings any more (except in specialist fields such as
Punycode for IDN, but that doesn't really count).

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
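[The step-3 mismatch described above can be made concrete with a small consistency check, not taken from the thread: re-encode the @charset rule in the encoding it declares and see whether those bytes are actually what the sheet starts with. If not, the declaration could not have been produced by the encoding it names. The function name is invented for illustration.]

```python
def charset_agrees_with_bytes(raw: bytes, declared: str) -> bool:
    """Could the declared encoding itself have produced the @charset
    rule we just read at the start of the stylesheet bytes?"""
    rule = '@charset "%s";' % declared
    try:
        expected = rule.encode(declared)
    except LookupError:  # unknown encoding label
        return False
    return raw.startswith(expected)
```

The classic mismatch: a sheet whose first bytes are the ASCII sequence `@charset "utf-16be";`. The rule is detectable via ASCII-compatible encodings, and UTF-16BE may be in the candidate set, but a genuinely UTF-16BE sheet would interleave NUL bytes, so the two sets never overlap and the check fails.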
Received on Tuesday, 17 February 2004 21:44:25 UTC