[CSSWG] Minutes Tokyo F2F 2013-06-06 Thu AM II/PM I: Syntax

CSS3 Syntax

   - Discussed @charset handling and removing unused encodings.
     Conclusion to leave in issue and ask for feedback whether to
     add additional encoding patterns.

   - RESOLVED: Charset propagation from linking document is same-origin.
               Leave issue open until we're more sure of this being stable.

   - RESOLVED: NUL gets turned into Replacement char.

   - Confirmed that 'non-ascii' is updated to include all non-ASCII chars

   - RESOLVED: loosen the rules describing the way comments are serialized

   - "Safe but pointless" change to UNICODE-RANGE token was discussed.
     No resolution recorded.

   - Reviewed addition of attribute-matching tokens for reducing lookahead.

   - Reviewed addition of COMMA token.

   - Reviewed change to NUMBER token (to include sign).

   - Reviewed change to bracket-matching of url() notations.

   - dbaron asked whether CDO/CDC were correctly fixed. No response
     was recorded in the minutes.

   - RESOLVED: No spacing changes in An+B syntax

   - RESOLVED: escaping in An+B falls out of tokenization

   - Syntax is missing section that actually defines interpreting CSS
     syntax into style rules, at rules, etc.

   - RESOLVED: Add Simon Sapin as co-editor.

====== Full minutes below ======

CSS3 Syntax Part I: Overview and @charset handling
Scribe: fantasai

   TabAtkins: Rewriting from grammar approach to parser approach
   <jerenkrantz> http://dev.w3.org/csswg/css-syntax/
   TabAtkins: grammar tried to match everything, not just well-formed things,
              and was too complicated and didn't quite handle everything anyway
   TabAtkins: Syntax defines Tokenizer, then Parser
   TabAtkins: Includes hooks for other specs to include a certain type of thing
   TabAtkins: I would like to review some things with group, then ask for FPWD
   TabAtkins: Probably as correct as 2.1 at this point
   TabAtkins: WebKit uses augmented grammar to parse, and it's horribly broken.

   TabAtkins: Wanted to go over changes from 2.1
   <dbaron> we're reviewing http://dev.w3.org/csswg/css-syntax/#changes as of 4ce7b66b553a

   TabAtkins: First batch of changes are about parsing
   TabAtkins: 2.1 defined some interesting rules for detecting @charset
   TabAtkins: Brought into line with charset handling in rest of platform
   r12a: Does that mean it doesn't handle UTF16?
   TabAtkins: Can use UTF16 by having BOM, but won't recognize @charset
   fantasai: How interoperable is this, compared to 2.1?
   TabAtkins: Dropping some obscure encodings probably ok, since implementations
              don't support them
   TabAtkins: Wrt web-compat, should be just fine
   TabAtkins: dropping things like EBCDIC
   liam: Did you check for corporate intranet stuff?
   TabAtkins: I think likelihood of EBCDIC is close to zero
   r12a: I thought same thing wrt bidi things, but ppl came back with
         supercomputers and stuff in Hebrew that were using old encoding
   <jerenkrantz> FWIW, IBM keeps ensuring that EBCDIC support is working in httpd.
                 Someone at IBM would be able to give an idea of its reach.

   Bert: It was there. You're trying to remove it?
   TabAtkins: Browsers didn't recognize it?
   Bert: You can't remove things.
   Bert: This is supposed to be stable
   Bert: We promised not to make versions to CSS, just to add things
   Bert: Don't think we should remove features just because not used on the Web.
   Bert: Remove features when they're wrong.
   [discussion of changes]
   Bert: Not an argument to keep on making mistakes.
   r12a: When HTML5 did some charset stuff, wanted to do it to stop people
         using non-UTF-8.
   r12a: Is that your motivation?
   TabAtkins: Highly sympathetic to moving to UTF-8, but main issue is
              making this much simpler
   glazou: If all style sheets were UTF-8, would be much easier for authoring

   dbaron: Checking for UTF-16 rules were widely tested.
   dbaron: We changed recently. Ran into no web-compat problems, but ran into
           certification problems.
   dbaron: Sorry, what I said is opposite.
   dbaron: When we implemented what this draft says, this caused us to
           start passing the GCF certification suite
   dbaron: It has a test of a UTF-16 file that claims an encoding name that
           doesn't exist.
   <dbaron> the thing I was just describing was
            being fixed by
   <dbaron> er, actually, the encoding name does exist but it's an alias
            for UTF-16BE, while the file is UTF-16LE

   SimonSapin: Would like to point out that this is only about @charset.
               Can do anything you want with HTTP headers.
   SimonSapin: Also, relevant part of CSS2.1 gives a table of byte patterns.
               Allows UAs to remove some, or to add some.
   r12a: What did you say was the first step in detecting charset?
   r12a: Thought we changed so that BOM is first
   TabAtkins: BOM is checked first by decoding algorithm, before @charset
   TabAtkins: this list finds the fallback encoding
   r12a: Do you forbid use of @charset in UTF-16 documents?
   TabAtkins: No, you just ignore it
   TabAtkins: Never invalid to put @charset
   plinss: What if UTF-16 without BOM but with @charset?
   TabAtkins: Defer to encoding standard...
   r12a: you can do it, but say you shouldn't
   TabAtkins: Specifying anything other than ascii-compatible encoding
              is not useful.
   r12a: Since already defined, why not keep them?
   r12a: I share Bert's unease here
   r12a: I agree it would be great to move to UTF-8 everywhere.
   TabAtkins: If implementations get bug reports on things should ask for
              standard to be updated.
   plinss: Would prefer to include this table
   plinss: I know there are implementations that implement that entire table,
           because I made one
   TabAtkins: Gecko doesn't anymore, don't think WebKit does either
   <glazou> WeasyPrint implements that table
   dbaron: Gecko only implemented ascii-compatible cases and UTF-16 cases.
   dbaron: never did UTF-32 or EBCDIC
   dbaron: Might've done UTF-32 long ago, but that code ripped out a long
           time ago
   dbaron: ripped out support for UTF-32 entirely
   TabAtkins: My preferred approach is to keep it as-is right now.
              If there's a problem, we'll see bug reports.
   TabAtkins: Alter as necessary.
   plinss: Not really happy with that approach. Break it and see who complains?
   TabAtkins: Don't want to include rows that browsers don't implement.
   plinss: But some that are significant

   TabAtkins: Leave this in, with an issue maybe?
   TabAtkins: Issue that we may need to add additional charset encoding patterns?
   plinss: Maybe add onto that that we're explicitly requesting feedback on this
   <Bert> (The issue can ask which @charset lines can be deprecated.)

   <SimonSapin> WeasyPrint implements part of that table because I didn't
                know any better. Some cursing was involved.
   plinss: Agree we probably don't need EBCDIC, and a private browser could
           implement that for an intranet.
   plinss: But more concerned wrt UTF-16 issues
   plinss: Web is huge. Small percentage is still a lot of pages
   TabAtkins: encoding will detect
   plinss: Shouldn't be sniffing it
   plinss: If there's a document that doesn't have BOM and has @charset,
           shouldn't be sniffing it. Should use @charset if it's there.
   SimonSapin: Can also rely on HTTP headers
   plinss: We have problems e.g. in our test suite, files that work on the
           server, but not locally.
   plinss: CSS should not, don't think that we require style sheet to be
           served over HTTP.

   TabAtkins: Step 3&4, where you take charset from referring document.
              Should that only work for same-origin?
   r12a: Value of 4 is people who don't understand charsets or setting http
   TabAtkins: So, take charset from referring document only if it's same-origin.
   TabAtkins: Ok, I'll say yes for now, object later
   Bert: what's issue?
   TabAtkins: Sometimes if you can force a document into a different encoding,
              can extract info from it.
   TabAtkins: This would prevent that kind of thing
   Bert: What could you do with a style sheet in wrong encoding?
   TabAtkins: Don't have specific example, but this kind of problem has been
              an issue in other technologies.
   Bert: But you're only changing your own document, not someone else's.
   dbaron: Suppose has wiki, where input CSS that should be sanitized
   dbaron: Could have fun encoding, maybe UTF-7 or Shift-JIS or something,
           that can create cross-site scripting vulnerabilities
   dbaron: Encode control characters for programming languages as benign
           ascii chars
   TabAtkins: Can't give a specific attack scenario, just know there's
              problems in other languages
   Bert: Concerned about people putting stylesheet on one server, seems to
         work, then moves it to other server, doesn't work.
   TabAtkins: If you have a charset problem, use UTF-8
   dbaron: Either way I'd prefer to leave an issue in there until we're
           confident it's stable.
   <dbaron> (w.r.t. the same-origin thing)
   RESOLVED: Make it same-origin, leave issue
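
For readers following the discussion above, here is a minimal Python sketch of
the fallback order as it was described in the room: BOM first, then transport
metadata, then an ASCII-compatible @charset rule, then the referring document's
encoding (same-origin only), then UTF-8. The function name pick_encoding, its
parameters, and the exact ordering of the middle steps are assumptions made for
illustration; they are not quoted from the draft. Treating a UTF-16 label found
in ASCII-readable bytes as UTF-8 is likewise an assumption.

```python
import codecs
import re

# Byte order marks are checked before anything else.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Only the ASCII-compatible byte pattern for @charset is recognised, so a
# BOM-less UTF-16 file never matches here.
_CHARSET_RE = re.compile(rb'@charset "([\x20\x21\x23-\x7e]+)";')


def pick_encoding(data, http_charset=None, referrer_encoding=None,
                  same_origin=False):
    """Return an encoding label for a style sheet's raw bytes (hypothetical)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if http_charset:
        return http_charset
    m = _CHARSET_RE.match(data)
    if m:
        label = m.group(1).decode("ascii")
        try:
            codec_name = codecs.lookup(label).name
        except LookupError:
            codec_name = None  # unknown label: fall through to later steps
        if codec_name is not None:
            # Assumption: the bytes were readable as ASCII, so a UTF-16
            # label cannot be right; use UTF-8 instead.
            if codec_name.startswith("utf-16"):
                return "utf-8"
            return label
    # Per the resolution: inherit from the referring document only when
    # it is same-origin.
    if referrer_encoding and same_origin:
        return referrer_encoding
    return "utf-8"
```

With this sketch, a BOM-less UTF-16 sheet carrying @charset falls through to
UTF-8, which is the case plinss raised concerns about.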

<br type=lunch/>
<!--#include Fonts Part I -->

CSS3 Syntax Part II: Other Changes from 2.1
Scribe: Bert

NUL chars in CSS

   TabAtkins: tokenization changes
   TabAtkins: To match HTML, NUL chars are converted into replacement chars.
   TabAtkins: Don't know why HTML does it, but at least it means CSS inside
              HTML is same as CSS in a file.
   fantasai: Replacement char is valid identifier char, could be weird.
   TabAtkins: Nobody puts NUL in a style sheet....
   * fantasai probably would have turned it into a space, less intrusive
   RESOLVED: NUL gets turned into Replacement char.
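
A small Python sketch of the resolved behavior, assuming it runs as a
preprocessing pass over the decoded text before tokenization. The names
preprocess and is_non_ascii are hypothetical. The second helper illustrates
fantasai's point: U+FFFD is itself non-ASCII, so it counts as an identifier
character.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(css_text):
    """Replace every NUL with U+FFFD before the tokenizer sees the text."""
    return css_text.replace("\x00", REPLACEMENT)


def is_non_ascii(ch):
    """True for any code point outside ASCII (the widened 'non-ascii' class)."""
    return ord(ch) >= 0x80
```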

non-ascii Token

   tab: non-ascii range
   SimonSapin: We already resolved on that.
   glenn: Isn't the name non-ascii wrong then?
   TabAtkins: No, it now includes *all* non-ascii.

Tokenizing Comments

   TabAtkins: comments
   TabAtkins: tokenizer never emits them.
   plinss: What about comments between idents?
   TabAtkins: Yes, serializer forces an empty comment there.
   plinss: OK, but does it change parsing behavior anywhere?
   [tab draws on whiteboard]
   tab: It should just be an internal simplification
   Bert: Not sure what you mean.
   Bert: there is no tokenization process
   [argument between Bert and Tab/peterl/etc.]
   bert: [worried about how people interpret "not emit"]
   TabAtkins: It describes how you serialize.
   glazou: We said in the past that one way to preserve comments was to
           preserve them combined at the end.
   glazou: Important for editors.
   glazou: comments should be in original position, but cannot always be done.
   RESOLVED: loosen the rules describing the way comments are serialized

Unicode-Range Tokenization

   <dbaron> http://dbaron.org/css/test/2013/urange-token
   TabAtkins: CSS 2 defines unicode range tokens in lazy way
   fantasai: The difference is detectable.
   dbaron: By means of counter increment.
   fantasai: No real reason to not keep the same.
   fantasai: Do not change the token.
   fantasai: Under your rules some old declarations are no longer valid,
             and it is detectable.
   tab: No, not detectable.
   tab: U+2?3?4?

   fantasai: We have to do range checking to determine validity anyway,
             so it's not like tokenization guarantees validity.
   TabAtkins: No, [describes three cases]
   dbaron: css3-fonts doesn't say it gets thrown out.
   jdaggett: what gets thrown out?
   <dbaron> U+400-3ff
   dbaron: fonts spec not clear
   <dbaron> unicode-range: U+100-1ff, U+400-3ff
   dbaron: Is [above] a syntax error?
   dbaron: Is that empty range or invalid and throw away whole line?
   [tab and fantasai disagree on previous discussions]
   TabAtkins: Invalidate whole descriptor.
   dbaron: Is it just an empty range or invalid?
   jdaggett: Wording to say that range has to contain valid chars.
   dbaron: No implementation conformance requirement.
   <fantasai> http://lists.w3.org/Archives/Public/www-style/2013May/0564.html
   jdaggett: you say wording is unclear.
   dbaron: I can send comment about that, later.

   TabAtkins: The only relevance of my change is for mixture of question marks
              and digits.
   TabAtkins: CSS 2.1 created invalid range, which was then thrown away.
   TabAtkins: Final behavior is the same.
   fantasai: Why change. Let's not do tokenization changes.
   TabAtkins: Easier to parse.
   bert: depends on parsing system.
   tab: In mine token is always correct.
   TabAtkins: Unicode range token only valid in one property.
   plinss: Then I'm not so bothered.
   Bert: But why change it.
   fantasai: Current is fine.
   dbaron: I don't believe the change is undetectable.
   <dbaron> I also don't mind the change.
   <fantasai> Gecko implements the 2.1 definition

   Tab: Maybe in future.
   plinss: That is why it gets important.
   TabAtkins: But I have to write more text to accept the old syntax.
   * fantasai doesn't like changing things like this just because someone
              felt like it, should only change it if there's a real
              benefit to it imho

   glenn: Font family quoted string that is empty: seems not prohibited.
   dbaron: You might have a font with that name.
   glenn: Far fetched, but it is an answer...
   TabAtkins: Not a syntax question, question for fonts.

   plinss: Whether you can test the unicode range syntax difference now
           is not the question.
   plinss: Your change maybe gives us more flexibility to not add white
           space in future properties.
   fantasai: Not really, some unicode ranges end in digits, so not helpful.
   dbaron: No, not helpful, there are too many different kinds of unicode
           range tokens. Often require a space anyway.
   plinss: Then the change is irrelevant.

   <dbaron> I agree the change is safe, and I agree it's pointless.
   bert: But why change it if is not broken?
   [3 hands for tab's proposal 2 against]
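
To make the disputed case concrete, here is a hedged Python sketch contrasting
the two tokenizations on a mix of digits and question marks. lazy_urange
follows the CSS 2.1 token pattern; strict_urange is an assumed reading of the
proposal (hex digits followed only by trailing question marks, six characters
at most, or a digit-only range) and is not quoted from the draft. Both function
names are hypothetical.

```python
import re

# CSS 2.1 pattern: digits and '?' may be mixed freely in the first part.
_LAZY = re.compile(r"u\+[0-9a-f?]{1,6}(?:-[0-9a-f]{1,6})?", re.IGNORECASE)
_HEX = re.compile(r"u\+([0-9a-f]{0,6})", re.IGNORECASE)
_RANGE_END = re.compile(r"-[0-9a-f]{1,6}", re.IGNORECASE)


def lazy_urange(text):
    """Return the text consumed by the CSS 2.1 style token, or None."""
    m = _LAZY.match(text)
    return m.group(0) if m else None


def strict_urange(text):
    """Return the text consumed when '?' may only trail the hex digits."""
    m = _HEX.match(text)
    if not m:
        return None
    digits = len(m.group(1))
    pos = m.end()
    wildcards = 0
    while digits + wildcards < 6 and pos < len(text) and text[pos] == "?":
        wildcards += 1
        pos += 1
    if digits + wildcards == 0:
        return None
    if wildcards == 0:
        # An explicit end point is only considered for digit-only starts.
        end = _RANGE_END.match(text, pos)
        if end:
            pos = end.end()
    return text[:pos]
```

On Tab's whiteboard example U+2?3?4?, the lazy form consumes the whole string
as one token, while the strict form stops after U+2? and leaves the rest to be
tokenized separately. For inputs such as U+400-3ff the two agree, so whether
that range is empty or invalid remains a question for the fonts spec.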


   TabAtkins: bad_url token and bad_string token.
   plinss: What we talked about yesterday?
   TabAtkins: Yes.
   liam: Current UAs.
   TabAtkins: Our parsing is wrong I believe.
   glenn: doesn't seem right.
   glenn: Wouldn't call it valid.
   plinss: It is not gray. We know when to throw a property away.

Attribute-Matching Tokens

   TabAtkins: New attribute matching operators are imported into tokenizer.
   Bert: The new attribute operators are not tokens, they are two tokens.
   Tab: But selectors defined them as tokens.
   Bert: No, selectors defined how to parse selectors, not how to parse css.
   plinss: Don't agree with bert's argument, but accept the point.
   plinss: They will always be a token if we take tab's proposal, and we
           remove the possibility of using them as part of different syntax
   TabAtkins: it doesn't reduce possibilities, but it might complicate grammars
              in the future.
   plinss: There is an impact.
   plinss: There is a general point about selector syntax leaking into rest
           of syntax.
   TabAtkins: "||" is used in selectors 4
   TabAtkins: Made it into a token.
   TabAtkins: Because it clashes with namespace selectors.
   plinss: Are we giving up too much to avoid 2-token look-ahead?
   TabAtkins: I don't know.
   TabAtkins: I heard 1-token was nice.
   plinss: How important is it. You build it and forget about it.
   dbaron: It is slower.
   dbaron: There's a small performance cost to more lookahead, and I don't
           think it's worth it.

COMMA token

   TabAtkins: I added a comma token.
   TabAtkins: There was a colon and a semicolon already.
   [Discussion about where COMMA is used. It is used in other modules,
    not in syntax itself]

Number tokens

   TabAtkins: numbers now include sign, scientific notation is allowed
   <Bert> (I don't see why we need tokenizer for number)

Invalid url() Parsing

   <TabAtkins> url(foo bar)
   TabAtkins: bad uri
   <TabAtkins> bad-url("url(foo ")
   <TabAtkins> url(foo bar[baz)
   TabAtkins: this case [above] will be different.
   TabAtkins: this is unlikely to have any bad effects.
   TabAtkins: Have no bug reports.
   fantasai: You're parsing invalid stuff anyway.
   fantasai: So if someone found a problem with your behavior, they'd be
             fixing their style sheet, not filing a bug.
   dbaron: It is about parens *after* the piece that already makes it invalid.
   plinss: I want the token to close according to error recovery rules.
           Respect paren matching.
   dbaron: this is recent -- something we were fiddling with to quite late
           in CSS 2.1, but it simplifies.
   dbaron: Matches some implementations.
   Bert: I don't understand yet...
   [dbaron explains]
   <dbaron> url(foo bar[)
   <dbaron> url(foo "bar[)
   plinss: ok, I'm happy with it too
   [discussion about where the error recovery picks up]
   <fantasai> Ok, this makes sense to me seeing this example
   Bert: I don't care about error recovery, so I'm fine, but be aware that
         you change the meaning of existing style sheets.
   TabAtkins: Yes, we had no bug reports, so don't think it is a problem.

   TabAtkins: CSS 2 grammar doesn't cover all inputs.
   TabAtkins: Error recovery not defined.
   TabAtkins: So I'm defining error recovery.
   dbaron: I'd like to review this more.


   dbaron: Did you fix the cases where CSS 2.1 didn't allow CDO and CDC
           in some places?
   Bert: As long as it is only about error recovery for CDO/CDC in more
         places, fine with me.

An+B Notation
Scribe: fantasai

   TabAtkins: The notation is incompatible with CSS tokenization
   TabAtkins: For example, 5n-3 tokenizes as a dimension
   TabAtkins: But we need to split it up into 5, n, -, 3
   TabAtkins: Implementations right now have to guess where it ends ...
              so far easy because of close-parens
   TabAtkins: Then reserialize and reparse
   TabAtkins: Wanted to redefine an+b in terms of CSS tokens
   TabAtkins: I can get it almost identical to defined behavior
   TabAtkins: But there are two small changes

   dbaron: Think the WS change is not small
   TabAtkins: First change is that +n or +n+2 ok
   TabAtkins: but + n not ok
   TabAtkins: Similarly -n vs. - n
   TabAtkins: Issue is that when you tokenize these, +n becomes 2 tokens
   TabAtkins: so does + n
   TabAtkins: -n and - n tokenize differently
   TabAtkins: Property grammars ignore white space
   TabAtkins: So can't distinguish
   fantasai: I don't think this needs to change.
   dbaron: Tab just made up this idea of property grammars
   TabAtkins: propdef grammars don't talk about ws
   dbaron: I don't want to change the space rules here
   TabAtkins asks how parsers work
   dbaron says Gecko has a flag about whether ws is paid attention to
   In WebKit, apparently WS tokens are omitted automatically when you ask
     for a token
   fantasai: How can that be possible if you correctly parse selectors?
   <dbaron> div*p
   Bert: Also numbers, numbers with a sign can't have space between number
         and sign
   [side discussion of NUMBER tokens]
   dino: White space is gone by the time you try to parse things
   TabAtkins: How do you parse calc() and selectors then?
   krit: [ ... something about whitespace and ident in WebKit ... ]
   RESOLVED: No spacing changes in An+B syntax

   TabAtkins: +n-\33
   TabAtkins: parses as "+" "n-3"
   TabAtkins: which looks like An+B
   TabAtkins: So, change to allow escaping in cases where the tokenization
              derives from idents.
   TabAtkins: Which probably matches implementations
   Bert: So escaping before reparsing?
   TabAtkins: no reparsing
   TabAtkins: escaping was handled much earlier in parsing process
   TabAtkins: Disallowing it would create a major layering violation
   TabAtkins: By the time you care about whether something is an+b,
              you've lost the original representation
   plinss: So this scenario produces a valid ident, with content "n-3"
           as the string
   plinss: How does that become an an+b?
   dbaron: Spec says if you have ident "-n-" followed by digits, then
           it's valid an+b
   dbaron: I think this is far saner than the route we went down with URL,
           where we made it a separate token
   TabAtkins: All of these are equivalent, except they tokenize
              completely differently:
               n - 3
               n- 3
               n -3
   [Discussion of parsing mechanics among Bert, plinss, and Tab]
   RESOLVED: escaping in An+B falls out of tokenization
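
The layering argument can be illustrated with a short Python sketch. It assumes
that escapes are decoded while the identifier token is built, so the An+B check
only ever sees the decoded value. The names unescape_ident and anb_from_ident
are hypothetical, and the second function covers only the single form dbaron
mentioned (an ident of "n-" or "-n-" followed by digits), not the whole An+B
grammar.

```python
import re

# A backslash followed by 1-6 hex digits (plus one optional whitespace
# character), or a backslash followed by any other single character.
_ESCAPE = re.compile(r"\\(?:([0-9a-fA-F]{1,6})[ \t\n]?|(.))")
_N_MINUS_DIGITS = re.compile(r"(-?)n-([0-9]+)$")


def unescape_ident(source):
    """Decode CSS escapes the way a tokenizer would when building an ident."""
    def repl(m):
        if m.group(1) is None:
            return m.group(2)
        cp = int(m.group(1), 16)
        if cp == 0 or cp > 0x10FFFF:
            return "\ufffd"  # out-of-range escapes become U+FFFD
        return chr(cp)
    return _ESCAPE.sub(repl, source)


def anb_from_ident(value):
    """Return (a, b) for an ident value like 'n-3' or '-n-3', else None."""
    m = _N_MINUS_DIGITS.match(value)
    if not m:
        return None
    a = -1 if m.group(1) else 1
    return (a, -int(m.group(2)))
```

Because unescape_ident(r"n-\33") yields the same value as the literal n-3, the
later An+B check cannot tell the two apart, which is why the escaped form falls
out of tokenization rather than needing a special rule.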


   TabAtkins: That's all the syntax changes that we've reviewed now
   TabAtkins: So, can I publish yet?
   dbaron: I think the efforts here are worthwhile, but I would like to
           review the bracket-matching part before FPWD. I haven't had time yet
   dbaron: I'm worried because I think once we have FPWD, people will
           only look at this, not at CSS2.1
   dbaron: So if we don't fix something now, it will be hard to fix later
   dbaron: And I'm worried because it was originally reverse-engineered
           from the worst implementation (WebKit)
   TabAtkins: I won't deny that.

   Bert: I can't understand this draft, apart from the railroad diagrams.
         Can you put more of those, or put grammar rules?
   TabAtkins: I think I have diagrams for everything in the spec
   <Bert> (Seems all the production rules have diagrams.)

   fantasai: does your spec define what a style rule is, what a declaration
             is, what these things mean?
   TabAtkins: yes
   SimonSapin: No, it doesn't
   fantasai: I think we need to add such a section, since this part of 2.1
             logically belongs here.
   TabAtkins: Maybe I'll expand the Description of CSS's Syntax section
   SimonSapin: And make it normative
   fantasai: We need something that defines how to interpret a CSS style sheet.

   plinss: You asked for FPWD, but there's an existing css3-syntax draft
   fantasai: From process POV, no. But effectively it's a new draft.
   plinss: Just making sure that we plan to replace what's out there and
           is 10 years old
   TabAtkins: yes


   TabAtkins: I propose adding Simon as Syntax co-editor.
   RESOLVED: Simon Sapin is Syntax co-editor.

Received on Wednesday, 3 July 2013 00:23:14 UTC