Re: Limiting the size of the @charset byte sequence from Bjoern Hoehrmann on 2014-01-28 (www-style@w3.org from January 2014)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 28 Jan 2014 19:31:28 +0100
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Henri Sivonen <hsivonen@hsivonen.fi>, www-style list <www-style@w3.org>, www International <www-international@w3.org>
Message-ID: <63nfe91bha8ts1abit1ogd9jkdh9gsqbku@hive.bjoern.hoehrmann.de>

* Martin J. Dürst wrote:
>On 2014/01/28 3:41, Simon Sapin wrote:
>> On 27/01/2014 00:20, Henri Sivonen wrote:
>>> It's a terribly bad idea to define an internal character encoding
>>> declaration syntax in such a way that the syntax definition doesn't
>>> guarantee the syntax to fit within a string of bytes shorter than N
>>> bytes with a small value for N. For this reason, it's a bad idea to
>>> allow an arbitrary number of whitespace characters between '@charset'
>>> and the quote. Unfortunately, CSS still fails at making the length of
>>> the declaration bounded, because "get an encoding" trims white space.
>>> Gecko imposes a bound on the length anyway.

>I would put things like what Henri says below into the Character Model
>(http://www.w3.org/TR/charmod/) if I were editing it now. Maybe it's 
>worth filing an erratum, so that the issue is at least documented for 
>the case that a new version of the Character Model ever gets worked on.
>
>I have cc'ed www-i18n-comments@w3.org as requested on
>http://www.w3.org/2005/02/charmod-fundamentals-errata.html.

Arbitrary limits are bad design and often harder to implement correctly
than something without arbitrary limits. The obvious parsing device for
something like the `@charset` rule is a DFA, and a DFA that halts after
N input bytes is a lot more complex than one that does not. Similarly,
if the byte count is maintained separately, you have nasty control logic
to deal with, with many opportunities for bugs that are much harder to
locate than anything found by statically analysing the DFA graph.
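
To make the comparison concrete, here is a rough sketch in Python of
such a scanner. It is an illustration only, not the algorithm from any
CSS specification or browser: the function name `scan_charset`, the
state names, and the optional `limit` parameter are all made up, and it
assumes zero or more white space bytes are allowed before the quote.

```python
# Hypothetical sketch of a state-machine scanner for a leading
# `@charset "label";` rule. Names and the `limit` parameter are
# assumptions for illustration, not taken from any specification.
PREFIX = b"@charset"
WHITESPACE = b" \t\r\n\f"

def scan_charset(data, limit=None):
    """Return the label bytes of a leading @charset rule, or None."""
    state = "prefix"
    matched = 0            # number of PREFIX bytes matched so far
    label = bytearray()
    for count, byte in enumerate(data, start=1):
        # The bounded variant adds this check in front of every
        # transition; the unbounded variant simply omits it.
        if limit is not None and count > limit:
            return None
        if state == "prefix":
            if byte != PREFIX[matched]:
                return None
            matched += 1
            if matched == len(PREFIX):
                state = "space"
        elif state == "space":
            if byte == 0x22:               # opening quote
                state = "label"
            elif byte not in WHITESPACE:
                return None
        elif state == "label":
            if byte == 0x22:               # closing quote
                state = "semicolon"
            else:
                label.append(byte)
        elif state == "semicolon":
            return bytes(label) if byte == 0x3B else None
    return None                            # input ended mid-rule
```

Without `limit`, the white space state and the label state are both
plain self-loops. With it, every state gains an extra exit that depends
on a counter living outside the transition table, which is the separate
control logic referred to above.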

The next thing that would need limiting, then, is the length of labels,
so this would add a class of errors that are not otherwise possible, and
just what would the parser do when encountering

  @charset "toolonglabel";

that is different from processing a short-enough but unrecognised label?
I can't think of anything sensible, and if reading arbitrarily long
encoding labels is not an issue, then skipping arbitrarily long
sequences of white space immediately before them should not be a problem
either. Indeed, insisting that `@charset` comes at the very beginning
is bad for authors, if experience with XML is any indication (a newline
in front of the XML declaration is a common problem authors encounter,
e.g. when they generate the XML from a PHP script).
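
The label question can be sketched the same way. The following Python
fragment is a hypothetical illustration: the table `KNOWN_LABELS` holds
only a few entries, and `encoding_from_label` and its `max_len`
parameter are invented names, not part of any specification.

```python
# Hypothetical sketch: a tiny label table and a lookup with an optional,
# made-up length limit. A real table would be much larger.
KNOWN_LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
}

def encoding_from_label(label, max_len=None):
    """Return an encoding name for `label`, or None to fall back."""
    if max_len is not None and len(label) > max_len:
        return None        # "too long" ends in the same fallback...
    key = label.strip(" \t\n\f\r").lower()
    return KNOWN_LABELS.get(key)   # ...as "not recognised" does here
```

In this sketch `encoding_from_label("toolonglabel")` and
`encoding_from_label("toolonglabel", max_len=8)` both return None, so
the caller cannot tell the two cases apart and the limit only adds a
second path to the same result.
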
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Tuesday, 28 January 2014 18:32:00 UTC