Re: Limiting the size of the @charset byte sequence from Bjoern Hoehrmann on 2014-01-30 (www-international@w3.org from January to March 2014)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 30 Jan 2014 02:45:53 +0100
To: Zack Weinberg <zackw@panix.com>
Cc: www-style list <www-style@w3.org>, www International <www-international@w3.org>
Message-ID: <9t6je9d523ht4j90u4mr8fivc2kilu2mnr@hive.bjoern.hoehrmann.de>

* Zack Weinberg wrote:
>When CSS is delivered over the network as a discrete resource (instead
>of being embedded in a larger document), UAs need to be able to decide
>on the encoding before the entire resource has been delivered, so that
>they can begin parsing the style sheet as quickly as possible.  (If
>you are about to quibble with that presupposition, be aware that
>complex webapps may involve tens of megabytes of machine-generated
>CSS.)  When the encoding directive is in-band, that involves chopping
>off the first N bytes of the document and handing it to the special
>@charset parser.  If the standard does not specify an exact value for
>N, UAs may disagree on the interpretation of style sheets, and worse,
>may be inconsistent *with themselves* based on network latency; i.e.
>encoding directives too deep into the document might be honored on one
>page load, not honored on the next, just because the second packet of
>the HTTP response took too long to arrive on the second load.  I am
>not aware of this having been an actual problem for CSS, but it
>definitely was for HTML, and that is where Henri is coming from (he
>wrote Gecko's current HTML parser).

HTML is a very special case because implementations allowed all sorts of
character data to precede encoding declarations. It is no fun to keep on
parsing for an encoding declaration that never comes, so a limit works
around this defect. That is not needed for formats like CSS and XML,
which require the encoding to be declared prior to any notable character
data (otherwise the document is rejected or any declarations are
ignored).

>Your comments about DFAs, etc. miss the mark because the limit is
>being imposed by the network layer, not the parser.  The
>implementation is something like
>
>   on network receive {
>       append current packet to buffer
>       if (len(buffer) > 1024) {
>           invoke @charset parser on first 1024 bytes
>           begin streaming data to full parser
>       }
>   }

Processing Unicode signatures and `@charset` declarations is a simple
regular pattern match; there is no need to buffer anything other than
the encoding identifier (and that is needed only when the list of
encodings is not statically known). You can process the packets as they
come in. http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ might help to
illustrate the point.
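
To make that concrete, here is a minimal sketch in Python of matching the
declaration packet by packet. It is an illustration only, not taken from
any UA: the class name `CharsetSniffer` and its `feed` method are made up
for this example, it assumes the declaration has the exact form
`@charset "<label>";` at the very start of the stream, it leaves out
Unicode signature handling, and it puts no cap on the label length.

```python
class CharsetSniffer:
    """Hypothetical incremental matcher for  @charset "<label>";  at stream start."""

    PREFIX = b'@charset "'

    def __init__(self):
        self.pos = 0               # number of prefix bytes matched so far
        self.label = bytearray()   # the only data that is ever buffered
        self.state = "prefix"      # prefix -> label -> semicolon -> done / fail
        self.encoding = None       # set only when a full declaration matched

    def feed(self, chunk: bytes) -> bool:
        """Consume one packet; return True once a decision has been reached."""
        for byte in chunk:
            if self.state == "prefix":
                if byte != self.PREFIX[self.pos]:
                    self.state = "fail"          # not an @charset rule at all
                else:
                    self.pos += 1
                    if self.pos == len(self.PREFIX):
                        self.state = "label"
            elif self.state == "label":
                if byte == 0x22:                 # closing double quote
                    self.state = "semicolon"
                elif byte > 0x7F:                # assumption: labels are ASCII
                    self.state = "fail"
                else:
                    self.label.append(byte)
            elif self.state == "semicolon":
                if byte == 0x3B:                 # ';' completes the declaration
                    self.encoding = self.label.decode("ascii")
                    self.state = "done"
                else:
                    self.state = "fail"
            if self.state in ("done", "fail"):
                return True                      # later bytes go to the full parser
        return self.state in ("done", "fail")


sniffer = CharsetSniffer()
for packet in (b'@char', b'set "utf', b'-8";body{}'):
    if sniffer.feed(packet):
        break
print(sniffer.encoding)
```

The decision falls out as soon as the pattern either completes or fails,
regardless of where the packet boundaries land, so the result does not
depend on network timing. A style sheet that does not begin with
`@charset "` is rejected on its first mismatching byte.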
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Thursday, 30 January 2014 01:46:28 UTC