- From: Zack Weinberg <zweinberg@mozilla.com>
- Date: Thu, 6 Aug 2009 15:46:11 -0700
- To: Bert Bos <bert@w3.org>
- Cc: www-style@w3.org
Bert Bos <bert@w3.org> wrote:
>
> I don't think the comment parsing is a problem for a modern computer
> either. If the thing you're parsing is indeed a style sheet with a
> typo, rather than a stream of random bytes whose CSS parse tree
> doesn't interest anybody anyway, then buffering the unclosed comment
> will maybe cost you a few tens of kilobytes of memory. Not something
> to worry about. (If you already don't have enough memory to store the
> style sheet as a text string, you're unlikely to have enough for its
> DOM...)

My concern is not buffering, but the code complexity and performance of
the scanner. If you're using a machine-generated DFA-based scanner, like
those produced by the 'flex' utility, you can improve its performance by
eliminating all instances of "backing up" from the lexicon. Ideally,
every proper prefix of a valid token is itself a valid token; a DFA
scanner can then visit each character in the input exactly once. It is
almost as good if every situation where that's not true can be detected
with fixed-length "trailing context" rules -- these move the overhead
from the scanner's innermost loop (more work for every character
processed) to the actions (more work only when you hit one of the
situations where a few characters of back-up are required).

But trailing context is no good for recovering from invalid URI,
comment, and string tokens, because you would need a general regular
expression in the trailer rather than a fixed-length string. (The flex
manual refers to this as "variable trailing context" and says that it
is even slower than the generic backing-up logic. I admit to not having
measured any of this stuff; I'm taking the flex manual's performance
observations as truth.)

Similar concerns apply to hand-coded scanners like the one Mozilla
actually uses. An arbitrary backup mechanism is always going to be more
complicated and slower than a few calls to ungetc() [or moral
equivalent] sprinkled here and there.

---

It turns out that my original proposal did not go far enough. I should
have actually sat down and fed scanners through 'flex' until I got one
with no backing up. I didn't do that because it's tedious and I thought
I had seen all the problems; I was wrong.

This is not just a performance problem: David Baron has pointed out
that, under my proposal, the error recovery behavior for an unterminated
url() is not what we would really want. Unlike him, I think URI should
remain a single token. For one thing, that would avoid reopening the
issue of whether comments are allowed inside it. For another, if we had
an "unquoted contents of url()" token, that would encourage future
specifications to use it in places other than inside url(); I would
prefer that that never happen (see e.g. my objection to css3-images,
http://lists.w3.org/Archives/Public/www-style/2009Jul/0121.html ). It is
true that Mozilla currently processes url(...) as three tokens, but
that's our bug and we should just fix it.

Anyway, I've now done the necessary grunt work to be confident that this
is the change we should make. An implementor using 'flex' who follows
section 4.1.1 plus these changes (including the advice at the very end)
to the letter will get a scanner that does not need to back up, and
whose error recovery for a URI with a missing close paren is consistent
with its error recovery for a FUNCTION with a missing close paren.
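To make the backing-up point concrete, here is a toy lexicon
(illustrative only -- the file name, token codes, and driver are
invented for this sketch, and none of it is part of the proposed
changes). flex's -b option writes a lex.backup report listing every
state that needs to back up: with only the CDO and catch-all rules it
flags the states reached after "<!" and "<!-", and adding the two
fixed-trailing-context rules should make the report come back clean.
The same check is how an implementor can confirm the no-backup claim
for a full section 4.1.1 scanner.

/* cdo-demo.l -- toy lexicon for illustration; the file name and the
   token codes are invented and are not part of the proposal. */
%option noyywrap
%{
#include <stdio.h>
enum { CDO = 256, S, DELIM };
%}
%%
"<!--"          return CDO;
\</!            return DELIM;  /* '<' followed by '!' but not a full CDO */
\</!-           return DELIM;  /* '<' followed by '!-' but not a full CDO */
[ \t\r\n\f]+    return S;
.               return DELIM;
%%
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d: |%s|\n", tok, yytext);
    return 0;
}

Build it with something like "flex -b cdo-demo.l && cc lex.yy.c -o
cdo-demo"; delete the two trailing-context rules and rerun "flex -b"
to watch the backing-up states reappear in lex.backup.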
(The error recovery for URI with a missing close quote, but a close
paren on the next line, may not be perfectly consistent with the error
recovery for FUNCTION containing an invalid string token with the close
paren on the next line, but I don't think we need to worry about that
too hard.)

* Replace the existing {invalid}, {invalid1} and {invalid2} macros with:

  {badstring1}  \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
  {badstring2}  \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
  {badstring}   {badstring1}|{badstring2}

  (Besides the name change, the only difference is the addition of the
  trailing \\? on the first two.)

* Add the following new macros:

  {badcomment1} \/\*[^*]*\*+([^/*][^*]*\*+)*
  {badcomment2} \/\*[^*]*(\*+[^/*][^*]*)*
  {badcomment}  {badcomment1}|{badcomment2}

  {baduri1}     url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
  {baduri2}     url\({w}{badstring}
  {baduri3}     url\({w}{string}{w}
  {baduri}      {baduri1}|{baduri2}|{baduri3}

* Replace the INVALID production with:

  BAD_STRING   {badstring}
  BAD_COMMENT  {badcomment}
  BAD_URI      {baduri}

  (A reduced sketch of how these macros and productions slot into an
  actual 'flex' input file appears after this list of changes.)

* Add prose to §4.2 explaining that while each of these is an automatic
  syntax error, the recovery behavior is different: BAD_STRING triggers
  the "unexpected end of string" rule, BAD_COMMENT is just discarded,
  and BAD_URI is treated as an unexpected FUNCTION token. [I don't see
  actual text anywhere that says that an unexpected FUNCTION token
  counts as an open parenthesis for pair-matching.]

* Make the analogous changes to Appendix G. (Actually I'd prefer it if
  we threw away the lexical part of Appendix G altogether (or replaced
  it with an *exact* translation to flex syntax of the §4.1.1 rules)
  and rewrote the parser part in terms of the core tokenization rules,
  plus an identifier-to-keyword mapping phase, but I recognize that's a
  huge change to take now.)

* Possibly also add a note somewhere in Appendix G to the effect that
  the following additional 'flex' rules, while not required for
  correctness, may improve scanner performance:

  \</!                 return DELIM; /* partial CDO */
  \</!-                return DELIM;
  [@-]/-               return DELIM; /* CDC, ATKEYWORD */
  [@#-]/\\             return DELIM; /* HASH, IDENT, ATKEYWORD */
  @/-\\                return DELIM; /* ATKEYWORD */
  [0-9]+/\.            return NUM;   /* NUM */
  {num}/[-\\]          return NUM;   /* DIMENSION */
  {num}/-\\            return NUM;   /* DIMENSION */
  u/\+                 return IDENT; /* UNICODE-RANGE */
  u\+[0-9a-f?]{1,6}/-  return UNICODE_RANGE;
  {ident}/\\           return IDENT;
  @{ident}/\\          return ATKEYWORD;
  {num}{ident}/\\      return DIMENSION;
  #{name}/\\           return HASH;
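As promised above, here is the reduced sketch. It covers only the
string and comment pieces; URI/BAD_URI (which additionally need the
{w}, {nonascii} and {string} macros) and all the other tokens are
omitted, the supporting macros are copied from section 4.1.1, and the
token codes and the driver are invented for this example. Because
{badstring} and {badcomment} accept every prefix of a string or a
comment, "flex -b" should report no backing up for this fragment.

/* css-bad-tokens.l -- reduced sketch, not the complete tokenizer.
   Supporting macros are the ones from section 4.1.1; token codes and
   the driver are invented for this example. */
%option noyywrap
%{
#include <stdio.h>
enum { STRING = 256, BAD_STRING, DELIM };
%}
h           [0-9a-f]
nl          \n|\r\n|\r|\f
unicode     \\{h}{1,6}(\r\n|[ \t\r\n\f])?
escape      {unicode}|\\[^\n\r\f0-9a-f]
string1     \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
string2     \'([^\n\r\f\\']|\\{nl}|{escape})*\'
string      {string1}|{string2}
badstring1  \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
badstring2  \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
badstring   {badstring1}|{badstring2}
comment     \/\*[^*]*\*+([^/*][^*]*\*+)*\/
badcomment1 \/\*[^*]*\*+([^/*][^*]*\*+)*
badcomment2 \/\*[^*]*(\*+[^/*][^*]*)*
badcomment  {badcomment1}|{badcomment2}
%%
{comment}       ;                  /* valid comments never reach the parser */
{badcomment}    ;                  /* proposed recovery: discard, like a valid comment */
{string}        return STRING;
{badstring}     return BAD_STRING; /* parser applies the unexpected-end-of-string rule */
[ \t\r\n\f]+    ;                  /* whitespace, ignored in this sketch */
.               return DELIM;      /* everything else, collapsed to DELIM here */
%%
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("%d: |%s|\n", tok, yytext);
    return 0;
}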
zw

Received on Thursday, 6 August 2009 22:46:54 UTC