- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 30 Jan 2014 02:45:53 +0100
- To: Zack Weinberg <zackw@panix.com>
- Cc: www-style list <www-style@w3.org>, www International <www-international@w3.org>
* Zack Weinberg wrote:
>When CSS is delivered over the network as a discrete resource (instead
>of being embedded in a larger document), UAs need to be able to decide
>on the encoding before the entire resource has been delivered, so that
>they can begin parsing the style sheet as quickly as possible. (If
>you are about to quibble with that presupposition, be aware that
>complex webapps may involve tens of megabytes of machine-generated
>CSS.) When the encoding directive is in-band, that involves chopping
>off the first N bytes of the document and handing it to the special
>@charset parser. If the standard does not specify an exact value for
>N, UAs may disagree on the interpretation of style sheets, and worse,
>may be inconsistent *with themselves* based on network latency; i.e.
>encoding directives too deep into the document might be honored on one
>page load, not honored on the next, just because the second packet of
>the HTTP response took too long to arrive on the second load. I am
>not aware of this having been an actual problem for CSS, but it
>definitely was for HTML, and that is where Henri is coming from (he
>wrote Gecko's current HTML parser).

HTML is a very special case because implementations allowed all sorts
of character data to precede encoding declarations. It is no fun to
keep on parsing for an encoding declaration that never comes, so a
limit works around this defect. That is not needed for formats like
CSS and XML, which require the encoding to be declared prior to any
notable character data (otherwise the document is rejected or any
declarations are ignored).

>Your comments about DFAs, etc. miss the mark because the limit is
>being imposed by the network layer, not the parser. The
>implementation is something like
>
>  on network receive {
>    append current packet to buffer
>    if (len(buffer) > 1024) {
>      invoke @charset parser on first 1024 bytes
>      begin streaming data to full parser
>    }
>  }

Processing Unicode signatures and `@charset` declarations is a simple
regular pattern match; there is no need to buffer anything other than
the encoding identifier (and that is needed only when the list of
encodings is not statically known). You can process the packets as
they come in. http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ might help
to illustrate the point.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
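
A minimal C sketch of the streaming approach described above: a small
hand-written state machine is fed one byte at a time, exactly as bytes
arrive off the wire, and buffers nothing but the encoding label. The
names (cs_matcher, cs_feed) are invented for this illustration, and it
deliberately omits Unicode signature (BOM) handling and validation of
the label against a known list of encodings.

    #include <stdio.h>

    enum cs_state { CS_PREFIX, CS_NAME, CS_SEMI, CS_DONE, CS_NONE };

    struct cs_matcher {
        enum cs_state state;
        size_t        pos;       /* position within the literal prefix  */
        char          name[64];  /* only the encoding label is buffered */
        size_t        len;
    };

    static const char prefix[] = "@charset \"";

    /* Feed one byte as it arrives; returns 1 once a decision has been
       reached (a declaration was parsed, or one has been ruled out). */
    static int cs_feed(struct cs_matcher *m, unsigned char byte)
    {
        switch (m->state) {
        case CS_PREFIX:
            if (byte != (unsigned char)prefix[m->pos]) {
                m->state = CS_NONE;   /* first bytes differ: no @charset */
                return 1;
            }
            if (++m->pos == sizeof prefix - 1)
                m->state = CS_NAME;   /* literal matched, label follows  */
            return 0;
        case CS_NAME:
            if (byte == '"') {
                m->state = CS_SEMI;
                return 0;
            }
            if (m->len + 1 >= sizeof m->name) {
                m->state = CS_NONE;   /* label implausibly long: give up */
                return 1;
            }
            m->name[m->len++] = (char)byte;
            return 0;
        case CS_SEMI:
            m->name[m->len] = '\0';
            m->state = (byte == ';') ? CS_DONE : CS_NONE;
            return 1;
        default:
            return 1;                 /* already decided */
        }
    }

    int main(void)
    {
        /* Simulate a declaration split across two network packets. */
        const char *packets[] = { "@charset \"ut",
                                  "f-8\";\nbody { color: red }" };
        struct cs_matcher m = { CS_PREFIX, 0, {0}, 0 };

        for (int p = 0; p < 2; p++)
            for (const char *c = packets[p]; *c; c++)
                if (cs_feed(&m, (unsigned char)*c)) {
                    if (m.state == CS_DONE)
                        printf("declared encoding: %s\n", m.name);
                    else
                        printf("no usable @charset declaration\n");
                    return 0;
                }
        printf("stream ended before a decision\n");
        return 0;
    }

Nothing beyond the small label buffer is ever retained, so the decision
is made as soon as the relevant bytes arrive, regardless of how the
response happens to be packetized.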
Received on Thursday, 30 January 2014 01:46:28 UTC