RE: Handling unrecognized or unsupported charset from Mark Moore on 2004-07-15 (www-style@w3.org from July 2004)

From: Mark Moore <mark.moore@notlimited.com>
Date: Thu, 15 Jul 2004 13:44:45 -0700
To: <www-style@w3.org>
Message-Id: <20040715204726.C1C1DA0719@frink.w3.org>
> From: www-style-request@w3.org [mailto:www-style-request@w3.org] On Behalf
> Of Boris Zbarsky
> Sent: Thursday, July 15, 2004 11:46 AM


> No, it will not.  It will not even be able to tokenize the sheet.  The step
> right before tokenization is to convert the sheet to Unicode and then work
> with the Unicode character stream, not the byte stream.  If the conversion
> to Unicode cannot be performed, tokenization cannot even start.

I appreciate the point you raise, Boris.  I hadn't considered the
requirements of section 4.4.1, "Conforming user agents must correctly map to
Unicode all characters in any character encodings that they recognize (or
they must behave as if they did)."

I don't see how tokenization will be prevented, though, because A) the CR
doesn't specify what should happen if the conforming UA *doesn't* recognize
the character encoding, and B) the UA is only required to "behave as if" it
performed the conversion.

Is there some other requirement I'm missing that specifies conformant UAs
must terminate tokenization when presented with a style sheet in an
unrecognized character encoding?


> The only way to attempt to deal short of discarding the sheet is to assume
> some other charset and use that.  Say take the charset from the next step
> of the charset selection algorithm.

I agree with you absolutely!  This is where I think section 4.4 needs more
clarification.

The *best* strategy (IMHO) would be to specify that conformant UAs "ignore"
style sheets whose character encoding is not recognized by the UA (where the
charset is determined by the documented 5 step prioritization algorithm).

Your idea of allowing (or requiring) the UA to take the next step in the 5
step selection algorithm would be a reasonable "best effort" attempt as long
as the UA was restricted from advancing past step 3.

Any other algorithm that allowed a conformant UA to try to guess which
character encoding might be similar to the unrecognized encoding would be
very unpredictable.

For instance, if the HTTP header specified a valid but unrecognized charset,
and the UA used a UTF-8 mapping to decode the byte stream (as would be
required by step 5), the resultant character and token stream would almost
certainly be bollixed.

The best bet would be to require the UA to toss style sheets with specified,
but unrecognized character encodings.
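
As a rough sketch of the two policies discussed above (discard outright, or
advance to a later step but never past step 3), something like the following
could be used.  Nothing here comes from the CR: the function names, the
`last_retry_step` parameter, and the use of Python's codec registry as a
stand-in for "encodings the UA recognizes" are all assumptions made for
illustration.  The only detail taken from this thread is that step 5 falls
back to UTF-8.

```python
import codecs
from typing import Optional, Sequence


def is_recognized(label: str) -> bool:
    """Assumption: Python's codec registry stands in for the UA's own list."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def choose_encoding(step_labels: Sequence[Optional[str]],
                    last_retry_step: int = 0) -> Optional[str]:
    """Pick an encoding for a style sheet, or return None to discard it.

    step_labels[i] is the label supplied by step i+1 of the selection
    algorithm (steps 1-4), or None if that step supplies nothing.
    last_retry_step=0 means "discard on the first unrecognized label";
    last_retry_step=3 allows falling through, but not past step 3.
    """
    saw_unrecognized = False
    for step, label in enumerate(step_labels, start=1):
        if saw_unrecognized and step > last_retry_step:
            return None          # not allowed to advance any further
        if label is None:
            continue             # this step did not specify anything
        if is_recognized(label):
            return label
        saw_unrecognized = True  # specified, but unknown to this UA
    # Step 5: UTF-8 is used only if nothing at all was specified.
    return None if saw_unrecognized else "utf-8"
```

With the default `last_retry_step=0`, a sheet whose first specified label is
unknown is discarded; with `last_retry_step=3`, a recognized label from step 2
or 3 may still be used, but the UTF-8 default of step 5 is never reached after
an unrecognized label.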


> > In this case, the @charset rule should be considered invalid, and the UA
> > should continue parsing immediately after the terminating semicolon (or
> > block) as described in section 4.1.5.
> 
> This is not specified anywhere in the spec.  Are you suggesting that it be
> specified?

Yes, I am suggesting some added clarification.

If I'm reading the CR correctly, this requirement is partially specified in
sections 4.1.1 and 4.1.5:  "A CSS user agent that encounters an
unrecognized at-rule must ignore the whole of the at-rule and continue
parsing after it."

Specifically, if the @charset rule doesn't parse as the token string
"ATKEYWORD STRING ;", it's hard to see how a conformant UA can do anything
but continue parsing the style sheet immediately after the semicolon or
block that terminates the invalid @charset production, and completely ignore
the unrecognized at-rule (e.g. "@charset 'UTF-8' screen;" might be a
reasonable future extension).
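
To make the "ATKEYWORD STRING ;" check concrete, here is a deliberately
simplified sketch that works on already-decoded text.  The function name and
the regular expression are hypothetical, it only handles a rule terminated by
a semicolon (not by a block), and it does not handle escapes or a semicolon
inside a quoted string, so it is an illustration of the idea rather than a
real tokenizer.

```python
import re
from typing import Optional, Tuple

# "@charset", whitespace, a quoted string without escapes, then ";".
_CHARSET_RULE = re.compile(
    r'@charset\s+(?:"([^"\\\n]*)"|\'([^\'\\\n]*)\')\s*;'
)


def read_charset_rule(sheet: str) -> Tuple[Optional[str], int]:
    """Return (label, offset at which parsing should continue).

    label is None when the sheet does not start with a well-formed
    @charset rule.  A malformed rule is skipped through its terminating
    semicolon (simplification: block-terminated rules are not handled).
    """
    if not sheet.startswith("@charset"):
        return None, 0                      # no @charset rule at all
    match = _CHARSET_RULE.match(sheet)
    if match:
        label = match.group(1) if match.group(1) is not None else match.group(2)
        return label, match.end()
    end = sheet.find(";")                   # ignore the whole invalid rule
    return None, (end + 1 if end != -1 else len(sheet))
```

Given the input `@charset 'UTF-8' screen; body {}`, this returns no label and
an offset just past the semicolon, so parsing would resume at ` body {}`.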

The unspecified case is if the STRING token cannot possibly represent a
valid IANA character set name.  This would be the case if the charset name
was more than 40 characters long, empty, or contained a character outside
the printable US-ASCII character codes.
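
The three conditions above can be checked without consulting any charset
registry at all.  A minimal sketch follows; the function name is made up, and
treating "printable US-ASCII" as the code range 0x20 through 0x7E (including
the space character) is an assumption.

```python
def could_be_iana_charset_name(name: str) -> bool:
    """Cheap syntactic test: can this STRING possibly name an IANA charset?

    Rejects names that are empty, longer than 40 characters, or that
    contain a character outside printable US-ASCII (assumed 0x20-0x7E).
    Passing this test does not mean the name is registered or recognized.
    """
    if not 1 <= len(name) <= 40:
        return False
    return all(0x20 <= ord(ch) <= 0x7E for ch in name)
```

A name that fails this test can be rejected before the UA even looks at its
own list of supported encodings.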

The invalid @charset rule is more interesting (IMHO) than the invalid IANA
character set name since the bad IANA ID is likely to be caught during
development.

On the other hand, unless the CR specifies more completely how conformant
UAs should handle invalid @charset rules, it limits future CSS expansion.

As currently specified, conformant UAs are required to ignore the invalid
@charset rule and continue parsing the remainder of the style sheet using a
character mapping that may or may not be related to the one the style sheet
uses.

At the very least, it's ambiguous whether an invalid @charset rule requires
the UA to continue trying to determine the style sheet's character encoding.

Requiring conformant UAs to discard/ignore style sheets when either the
encoding is unrecognized or the @charset rule is invalid makes things more
predictable and flexible, and allows a very simple implementation.
Received on Thursday, 15 July 2004 16:47:26 UTC