Re: [CSSWG] Minutes and Resolutions 2009-08-05

Bert Bos <bert@w3.org> wrote:
> 
> I don't think the comment parsing is a problem for a modern computer 
> either. If the thing you're parsing is indeed a style sheet with a 
> typo, rather than a stream of random bytes whose CSS parse tree
> doesn't interest anybody anyway, then buffering the unclosed comment
> will maybe cost you a few tens of kilobytes of memory. Not something
> to worry about. (If you already don't have enough memory to store the
> style sheet as a text string, you're unlikely to have enough for its
> DOM...)

My concern is not buffering, but code complexity and performance of the
scanner.

If you're using a machine-generated DFA-based scanner, like those
produced by the 'flex' utility, you can improve its performance by
eliminating all instances of "backing up" from the lexicon.  Ideally,
every proper prefix of a valid token is itself a valid token; a DFA
scanner can then visit each character in the input exactly once.  It is
almost as good if every situation where that's not true can be
detected with fixed-length "trailing context" rules -- these move the
overhead from the scanner's innermost loop (more work for every
character processed) to the actions (more work only when you hit one of
the situations where a few characters of back-up is required).
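
To make that concrete, here is a toy of mine (not part of the
proposal).  With only the CDO rule and a catch-all, input like "<!x"
drives the DFA partway into "<!--" and then forces it to back up to
the last accepting state.  The two fixed trailing-context rules make
every prefix of "<!--" end in an accepting state, so the backup
machinery can be compiled out:

%option noyywrap
%{
/* Toy scanner, illustration only.  CDO is "<!--"; everything else is
   lumped into DELIM.  The two trailing-context rules consume only the
   '<' but let the DFA accept at "<!" and "<!-", so it never has to
   retreat on inputs like "<!x" and "<!-x". */
#include <stdio.h>
enum { CDO = 1, DELIM };
%}
%%
"<!--"  return CDO;
\</!    return DELIM;  /* '<' with '!' ahead: consume only the '<' */
\</!-   return DELIM;  /* '<' with '!-' ahead: likewise */
.|\n    return DELIM;
%%
int main(void)
{
    int t;
    while ((t = yylex()) != 0)
        printf("%s \"%s\"\n", t == CDO ? "CDO" : "DELIM", yytext);
    return 0;
}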

But trailing context is no good for recovering from invalid URI,
comment, and string tokens, because you would need a general regular
expression in the trailer rather than a fixed-length string.  (The flex
manual refers to this as "variable trailing context" and says that it
is even slower than the generic backing-up logic.  I admit to not
having measured any of this stuff; I'm taking the flex manual's
performance observations as truth.)
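
For illustration (mine, and deliberately cut down): suppose you tried
to decide, while still sitting at "url(", whether a well-formed quoted
body and its close paren actually follow.  Both the leading part and
the trailer of the first rule below match variable-length text, which
is exactly the variable trailing context case:

%option noyywrap
%{
/* Sketch only: {w} and {string} are simplified from the real grammar.
   The first rule consumes "url(" plus the quoted body and looks ahead
   for optional whitespace and the ')'; since neither side of the '/'
   has a fixed length, flex has to fall back on its variable trailing
   context machinery to find the split point. */
#include <stdio.h>
%}
w       [ \t\r\n\f]*
string  \"[^"\n]*\"
%%
url\({w}{string}/{w}\)  puts("url( with a terminated quoted body");
url\(                   puts("url( that needs error recovery");
.|\n                    ;
%%
int main(void) { return yylex(); }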

Similar concerns apply to hand-coded scanners like the one Mozilla
actually uses.  An arbitrary backup mechanism is always going to be more
complicated and slower than a few calls to ungetc() [or moral
equivalent] sprinkled here and there.

---

It turns out that my original proposal did not go far enough.  I should
have actually sat down and fed scanners through 'flex' until I got one
with no backing up.  I didn't do that because it's tedious and I
thought I saw all the problems; I was wrong.  Nor is this only a
performance problem: David Baron has pointed out that, under my
proposal, the error recovery behavior for unterminated url() is not
what we would really want.

 -- Unlike him, I think URI should remain a single token.  For one
thing, that would avoid reopening the issue of whether comments are
allowed inside.  For another, if we had an "unquoted contents of
url()" token, that would encourage future specifications to use it in
places other than inside url(); I would prefer that that never happen
(see e.g. my objection to css3-images,
http://lists.w3.org/Archives/Public/www-style/2009Jul/0121.html ).
It is true that Mozilla currently processes url(...) as three tokens,
but that's our bug and we should just fix it.

Anyway, I've now done the necessary grunt work to be confident that this
is the change we should make.  An implementor using 'flex' who follows
section 4.1.1 plus these changes (including the advice at the very end)
to the letter will get a scanner that does not need to back up, and
whose error recovery for URI with a missing close paren is consistent
with its error recovery for FUNCTION with a missing close paren.  (The
error recovery for URI with a missing close quote but a close paren on
the next line may not be perfectly consistent with the error recovery
for FUNCTION containing an invalid string token with the close paren on
the next line, but I don't think we need to worry too hard about that.)

* Replace the existing {invalid}, {invalid1} and {invalid2} macros with:

{badstring1} \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
{badstring2} \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
{badstring} {badstring1}|{badstring2}

(Besides the name change, the only difference is the addition of the
trailing \\? on the first two, so that a string cut off by end of file
just after a backslash is still consumed as a single token.)

* Add the following new macros:

{badcomment1} \/\*[^*]*\*+([^/*][^*]*\*+)*
{badcomment2} \/\*[^*]*(\*+[^/*][^*]*)*
{badcomment} {badcomment1}|{badcomment2}

{baduri1} url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
{baduri2} url\({w}{badstring}
{baduri3} url\({w}{string}{w}
{baduri} {baduri1}|{baduri2}|{baduri3}

* Replace the INVALID production with:

BAD_STRING  {badstring}
BAD_COMMENT {badcomment}
BAD_URI     {baduri}

  (A toy scanner exercising these three productions appears at the end
  of this message.)

* Add prose to §4.2 explaining that while each of these is an automatic
  syntax error, the recovery behavior is different: BAD_STRING triggers
  the "unexpected end of string" rule, BAD_COMMENT is just discarded,
  and BAD_URI is treated as an unexpected FUNCTION token.

  [ I don't see actual text anywhere that says that an unexpected
    FUNCTION token counts as an open parenthesis for pair-matching. ]

* Analogous changes to Appendix G.  (Actually I'd prefer it if we threw
  away the lexical part of Appendix G altogether (or replaced it with
  an *exact* translation to flex syntax of the §4.1.1 rules) and rewrote
the parser part in terms of the core tokenization rules, plus an
  identifier-to-keyword mapping phase, but I recognize that's a huge
  change to take now.)

* Possibly also add a note somewhere in Appendix G to the effect that
  the following additional 'flex' rules, while not required for
  correctness, may improve scanner performance:

\</!                return DELIM;  /* partial CDO */
\</!-               return DELIM;
[@-]/-              return DELIM;  /* CDC, ATKEYWORD */
[@#-]/\\            return DELIM;  /* HASH, IDENT, ATKEYWORD */
@/-\\               return DELIM;  /* ATKEYWORD */

[0-9]+/\.           return NUM;    /* NUM */
{num}/[-\\]         return NUM;    /* DIMENSION */
{num}/-\\           return NUM;    /* DIMENSION */

u/\+                return IDENT;  /* UNICODE-RANGE */
u\+[0-9a-f?]{1,6}/- return UNICODE_RANGE;

{ident}/\\          return IDENT;
@{ident}/\\         return ATKEYWORD;
{num}{ident}/\\     return DIMENSION;
#{name}/\\          return HASH;
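
For anyone who wants to watch the new recovery tokens come out of
yylex without wiring up all of 4.1.1, here is a cut-down spec (mine;
the token codes and the driver are invented for the demo, and IDENT,
NUM, and friends are deliberately omitted, so this toy is *not* itself
backup-free):

%option noyywrap
%{
/* Toy driver: the bad* macros below are exactly the ones proposed
   above; the supporting macros are the 4.1.1 core ones. */
#include <stdio.h>
enum { COMMENT = 1, BAD_COMMENT, STRING, BAD_STRING, URI, BAD_URI, DELIM };
static const char *name[] = { "", "COMMENT", "BAD_COMMENT", "STRING",
                              "BAD_STRING", "URI", "BAD_URI", "DELIM" };
%}
h           [0-9a-f]
nonascii    [\200-\377]
unicode     \\{h}{1,6}(\r\n|[ \t\r\n\f])?
escape      {unicode}|\\[^\r\n\f0-9a-f]
nl          \n|\r\n|\r|\f
w           [ \t\r\n\f]*
string1     \"([^\n\r\f\\"]|\\{nl}|{escape})*\"
string2     \'([^\n\r\f\\']|\\{nl}|{escape})*\'
string      {string1}|{string2}
badstring1  \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
badstring2  \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
badstring   {badstring1}|{badstring2}
badcomment1 \/\*[^*]*\*+([^/*][^*]*\*+)*
badcomment2 \/\*[^*]*(\*+[^/*][^*]*)*
badcomment  {badcomment1}|{badcomment2}
baduri1     url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
baduri2     url\({w}{badstring}
baduri3     url\({w}{string}{w}
baduri      {baduri1}|{baduri2}|{baduri3}
comment     \/\*[^*]*\*+([^/*][^*]*\*+)*\/
uri         url\({w}{string}{w}\)|url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}\)
%%
{comment}       return COMMENT;
{badcomment}    return BAD_COMMENT;  /* unterminated; runs to EOF */
{string}        return STRING;
{badstring}     return BAD_STRING;   /* stops before the newline */
{uri}           return URI;
{baduri}        return BAD_URI;      /* no ')': recover as FUNCTION */
[ \t\r\n\f]+    ;                    /* S tokens, discarded here */
.               return DELIM;
%%
int main(void)
{
    int t;
    while ((t = yylex()) != 0)
        printf("%-11s |%s|\n", name[t], yytext);
    return 0;
}

The mechanical check for the no-backup claim is flex's -b switch: it
writes a lex.backup file listing every state that needs the backup
machinery, and running the complete tokenizer (the performance rules
above included) through it should leave that report empty, i.e.
something like "No backing up."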

zw
