- From: fantasai <fantasai.lists@inkedblade.net>
- Date: Tue, 02 Jul 2013 17:22:45 -0700
- To: "www-style@w3.org" <www-style@w3.org>
CSS3 Syntax
-----------

- Discussed @charset handling and removing unused encodings. Conclusion was to leave in an issue and ask for feedback on whether to add additional encoding patterns.
- RESOLVED: Charset propagation from linking document is same-origin. Leave issue open until we're more sure of this being stable.
- RESOLVED: NUL gets turned into Replacement char.
- Confirmed that 'non-ascii' is updated to include all non-ASCII chars.
- RESOLVED: loosen the rules describing the way comments are serialized
- "Safe but pointless" change to UNICODE-RANGE token was discussed. No resolution recorded.
- Reviewed addition of attribute-matching tokens for reducing lookahead.
- Reviewed addition of COMMA token.
- Reviewed change to NUMBER token (to include sign).
- Reviewed change to bracket-matching of url() notations.
- dbaron asked whether CDO/CDC were correctly fixed. No response was recorded in the minutes.
- RESOLVED: escaping in An+B falls out of tokenization
- Syntax is missing a section that actually defines how to interpret CSS syntax into style rules, at-rules, etc.
- RESOLVED: Add Simon Sapin as co-editor.

====== Full minutes below ======

CSS3 Syntax Part I: Overview and @charset handling
==================================================

Scribe: fantasai

TabAtkins: Rewriting from grammar approach to parser approach
<jerenkrantz> http://dev.w3.org/csswg/css-syntax/
TabAtkins: grammar tried to match everything, not just well-formed things, and was too complicated and didn't quite handle everything anyway
TabAtkins: Syntax defines Tokenizer, then Parser
TabAtkins: Includes hooks for other specs to include a certain type of thing
TabAtkins: I would like to review some things with group, then ask for FPWD
TabAtkins: Probably as correct as 2.1 at this point
TabAtkins: WebKit uses augmented grammar to parse, and it's horribly broken.
TabAtkins: Wanted to go over changes from 2.1
<dbaron> we're reviewing http://dev.w3.org/csswg/css-syntax/#changes as of 4ce7b66b553a https://dvcs.w3.org/hg/csswg/raw-file/4ce7b66b553a/css-syntax/Overview.html
TabAtkins: First batch of changes are about parsing
TabAtkins: 2.1 defined some interesting rules for detecting @charset
TabAtkins: Brought into line with charset handling in rest of platform
r12a: Does that mean it doesn't handle UTF16?
TabAtkins: Can use UTF16 by having BOM, but won't recognize @charset
fantasai: How interoperable is this, compared to 2.1?
TabAtkins: Dropping some obscure encodings probably ok, since implementations don't support them
TabAtkins: Wrt web-compat, should be just fine
TabAtkins: dropping things like EBCDIC
liam: Did you check for corporate intranet stuff?
TabAtkins: I think likelihood of EBCDIC is close to zero
r12a: I thought same thing wrt bidi things, but ppl came back with supercomputers and stuff in Hebrew that were using old encoding systems
<jerenkrantz> FWIW, IBM keeps ensuring that EBCDIC support is working in httpd. Someone at IBM would be able to give an idea of its reach.
Bert: It was there. You're trying to remove it?
TabAtkins: Browsers didn't recognize it?
Bert: You can't remove things.
Bert: This is supposed to be stable
Bert: We promised not to make versions to CSS, just to add things
Bert: Don't think we should remove features just because not used on the Web.
Bert: Remove features when they're wrong.
[discussion of changes]
Bert: Not an argument to keep on making mistakes.
r12a: When HTML5 did some charset stuff, wanted to do it to stop people using non-UTF-8.
r12a: Is that your motivation?
TabAtkins: Highly sympathetic to moving to UTF-8, but main issue is making this much simpler
glazou: If all style sheets were UTF-8, would be much easier for authoring environments.
dbaron: Checking for UTF-16 rules were widely tested.
dbaron: We changed recently. Ran into no web-compat problems, but ran into certification problems.
dbaron: Sorry, what I said is opposite.
dbaron: We implemented what this draft says, and this caused us to start passing the GCF certification suite
dbaron: It has a test of a UTF-16 file that claims an encoding name that doesn't exist.
<dbaron> the thing I was just describing was https://bugzilla.mozilla.org/show_bug.cgi?id=859706 being fixed by https://bugzilla.mozilla.org/show_bug.cgi?id=796882
<dbaron> er, actually, the encoding name does exist but it's an alias for UTF-16BE, while the file is UTF-16LE
SimonSapin: Would like to point out that this is only about @charset. Can do anything you want with HTTP headers.
SimonSapin: Also, relevant part of CSS2.1 gives a table of byte patterns. Allows UAs to remove some, or to add some.
r12a: What did you say was the first step in detecting charset?
r12a: Thought we changed so that BOM is first
TabAtkins: BOM is checked first by decoding algorithm, before @charset
TabAtkins: this list finds the fallback encoding ...
r12a: Do you forbid use of @charset in UTF-16 documents?
TabAtkins: No, you just ignore it
TabAtkins: Never invalid to put @charset
plinss: What if UTF-16 without BOM but with @charset?
TabAtkins: Defer to encoding standard...
r12a: you can do it, but say you shouldn't
TabAtkins: Specifying anything other than ascii-compatible encoding is not useful.
r12a: Since already defined, why not keep them?
r12a: I share Bert's unease here
r12a: I agree it would be great to move to UTF-8 everywhere.
TabAtkins: If implementations get bug reports on things should ask for standard to be updated.
plinss: Would prefer to include this table
plinss: I know there are implementations that implement that entire table, because I made one
TabAtkins: Gecko doesn't anymore, don't think WebKit does either
<glazou> WeasyPrint implements that table
dbaron: Gecko only implemented ascii-compatible cases and UTF-16 cases.
dbaron: never did UTF-32 or EBCDIC
dbaron: Might've done UTF-32 long ago, but that code ripped out a long time ago
dbaron: ripped out support for UTF-32 entirely
TabAtkins: My preferred approach is to keep it as-is right now. If there's a problem, we'll see bug reports.
TabAtkins: Alter as necessary.
plinss: Not really happy with that approach. Break it and see who complains?
TabAtkins: Don't want to include rows that browsers don't implement.
plinss: But some that are significant
TabAtkins: Leave this in, with an issue maybe?
TabAtkins: Issue that we may need to add additional charset encoding patterns?
plinss: Maybe add onto that that we're explicitly requesting feedback on this
<Bert> (The issue can ask which @charset lines can be deprecated.)
<SimonSapin> WeasyPrint implements part of that table because I didn’t know any better. Some cursing was involved.
plinss: Agree we probably don't need EBCDIC, and a private browser could implement that for an intranet.
plinss: But more concerned wrt UTF-16 issues
plinss: Web is huge. Small percentage is still a lot of pages
TabAtkins: encoding will detect
plinss: Shouldn't be sniffing it
plinss: If there's a document that doesn't have BOM and has @charset, shouldn't be sniffing it. Should use @charset if it's there.
SimonSapin: Can also rely on HTTP headers
plinss: We have problems e.g. in our test suite, files that work on the server, but not locally.
plinss: CSS should not, don't think that we require style sheet to be served over HTTP.
TabAtkins: Step 3&4, where you take charset from encoding document. Should that only work for same-origin?
r12a: Value of 4 is people who don't understand charsets or setting http headers ...
TabAtkins: So, take charset from referring document only if it's same-origin. Yay/nay?
TabAtkins: Ok, I'll say yes for now, object later
Bert: what's issue?
TabAtkins: Sometimes if you can force a document into a different encoding, can extract info from it.
TabAtkins: This would prevent that kind of thing
Bert: What could you do with a style sheet in wrong encoding?
TabAtkins: Don't have specific example, but this kind of problem has been an issue in other technologies.
Bert: But you're only changing your own document, not someone else's.
dbaron: Suppose has wiki, where input CSS that should be sanitized
dbaron: Could have fun encoding, maybe UTF-7 or Shift-JIS or something, that can create cross-site scripting vulnerabilities
dbaron: Encode control characters for programming languages as benign ascii chars
TabAtkins: Can't give a specific attack scenario, just know there's problems in other languages
Bert: Concerned about people putting stylesheet on one server, seems to work, then moves it to other server, doesn't work.
TabAtkins: If you have a charset problem, use UTF-8
dbaron: Either way I'd prefer to leave an issue in there until we're confident it's stable.
<dbaron> (w.r.t. the same-origin thing)
RESOLVED: Make it same-origin, leave issue

<br type=lunch/>
<!--#include Fonts Part I -->

CSS3 Syntax Part II: Other Changes from 2.1
===========================================

Scribe: Bert

NUL chars in CSS
----------------

TabAtkins: tokenization changes
TabAtkins: To match HTML, NUL chars are converted into replacement chars.
TabAtkins: Don't know why HTML does it, but at least it means CSS inside HTML is same as CSS in a file.
fantasai: Replacement char is valid identifier char, could be weird.
TabAtkins: Nobody puts NUL in a style sheet....
* fantasai probably would have turned it into a space, less intrusive
RESOLVED: NUL gets turned into Replacement char.

non-ascii Token
---------------

tab: non-ascii range
SimonSapin: We already resolved on that.
glenn: Isn't the name non-ascii wrong then?
TabAtkins: No, it now includes *all* non-ascii.

Tokenizing Comments
-------------------

TabAtkins: comments
TabAtkins: tokenizer never emits them.
plinss: What about comments between idents?
TabAtkins: Yes, serializer forces an empty comment there.
plinss: OK, but does it change parsing behavior anywhere?
[tab draws on whiteboard]
tab: It should just be an internal simplification
Bert: Not sure what you mean.
Bert: there is no tokenization process
[argument between Bert and Tab/peterl/etc.]
bert: [worried about how people interpret "not emit"]
TabAtkins: It describes how you serialize.
glazou: We said in the past that one way to preserve comments was to preserve them combined at the end.
glazou: Important for editors.
glazou: comments should be in original position, but cannot always be done.
RESOLVED: loosen the rules describing the way comments are serialized

Unicode-Range Tokenization
--------------------------

<dbaron> http://dbaron.org/css/test/2013/urange-token
TabAtkins: CSS 2 defines unicode range tokens in lazy way
fantasai: The difference is detectable.
dbaron: By means of counter increment.
fantasai: No real reason to not keep the same.
fantasai: Do not change the token.
fantasai: Under your rules some old declarations are no longer valid, and it is detectable.
tab: No, not detectable.
tab: U+2?3?4?
fantasai: We have to do range checking to determine validity anyway, so it's not like tokenization guarantees validity.
TabAtkins: No, [describes three cases]
dbaron: css3-fonts doesn't say it gets thrown out.
jdaggett: what gets thrown out?
<dbaron> U+400-3ff
dbaron: fonts spec not clear
<dbaron> unicode-range: U+100-1ff, U+400-3ff
dbaron: Is [above] a syntax error?
dbaron: Is that empty range or invalid and throw away whole line?
[tab and fantasai disagree on previous discussions]
TabAtkins: Invalidate whole descriptor.
dbaron: Is it just an empty range or invalid?
jdaggett: Wording to say that range has to contain valid chars.
dbaron: No implementation conformance requirement.
<fantasai> http://lists.w3.org/Archives/Public/www-style/2013May/0564.html
jdaggett: you say wording is unclear.
dbaron: I can send comment about that, later.
TabAtkins: The only relevance of my change is for mixture of question marks and digits.
TabAtkins: CSS 2.1 created invalid range, which was then thrown away.
TabAtkins: Final behavior is the same.
fantasai: Why change? Let's not do tokenization changes.
TabAtkins: Easier to parse.
bert: depends on parsing system.
tab: In mine token is always correct.
TabAtkins: Unicode range token only valid in one property.
plinss: Then I'm not so bothered.
Bert: But why change it?
fantasai: Current is fine.
dbaron: I don't believe the change is undetectable.
<dbaron> I also don't mind the change.
<fantasai> Gecko implements the 2.1 definition
Tab: Maybe in future.
plinss: That is why it gets important.
TabAtkins: But I have to write more text to accept the old syntax.
* fantasai doesn't like changing things like this just because someone felt like it, should only change it if there's a real benefit to it imho
glenn: Font family quoted string that is empty: seems not prohibited.
dbaron: You might have a font with that name.
glenn: Far fetched, but it is an answer...
TabAtkins: Not a syntax question, question for fonts.
plinss: Whether you can test the unicode range syntax difference now is not the question.
plinss: Your change maybe gives us more flexibility to not add white space in future properties.
fantasai: Not really, some unicode ranges end in digits, so not helpful.
dbaron: No, not helpful, there are too many different kinds of unicode range tokens. Often require a space anyway.
plinss: Then the change is irrelevant.
<dbaron> I agree the change is safe, and I agree it's pointless.
bert: But why change it if is not broken?
[3 hands for tab's proposal 2 against]

bad_url/bad_string
------------------

TabAtkins: bad_url token and bad_string token.
plinss: What we talked about yesterday?
TabAtkins: Yes.
liam: Current UAs.
TabAtkins: Our parsing is wrong I believe.
glenn: doesn't seem right.
glenn: Wouldn't call it valid.
plinss: It is not gray. We know when to throw a property away.

Attribute-Matching Tokens
-------------------------

TabAtkins: New attribute matching operators are imported into tokenizer.
Bert: The new attribute operators are not tokens, they are two tokens.
Tab: But selectors defined them as tokens.
Bert: No, selectors defined how to parse selectors, not how to parse css.
plinss: Don't agree with bert's argument, but accept the point.
plinss: They will always be a token if we take tab's proposal, and we remove the possibility of using them as part of different syntax
TabAtkins: it doesn't reduce possibilities, but it might complicate grammars in the future.
plinss: There is an impact.
plinss: There is a general point about selector syntax leaking into rest of syntax.
TabAtkins: "||" is used in selectors 4
TabAtkins: Made it into a token.
TabAtkins: Because it clashes with namespace selectors.
plinss: Are we giving up too much to avoid 2-token look-ahead?
TabAtkins: I don't know.
TabAtkins: I heard 1-token was nice.
plinss: How important is it? You build it and forget about it.
dbaron: It is slower.
dbaron: There's a small performance cost to more lookahead, and I don't think it's worth it.

COMMA token
-----------

TabAtkins: I added a comma token.
TabAtkins: There was a colon and a semicolon already.
[Discussion about where COMMA is used. It is used in other modules, not in syntax itself]

Number tokens
-------------

TabAtkins: numbers now include sign, scientific notation is allowed
<Bert> (I don't see why we need tokenizer for number)

Invalid url() Parsing
---------------------

<TabAtkins> url(foo bar)
TabAtkins: bad uri
<TabAtkins> bad-url("url(foo ")
<TabAtkins> url(foo bar[baz)
TabAtkins: this case [above] will be different.
TabAtkins: this is unlikely to have any bad effects.
TabAtkins: Have no bug reports.
fantasai: You're parsing invalid stuff anyway.
fantasai: So if someone found a problem with your behavior, they'd be fixing their style sheet, not filing a bug.
dbaron: It is about parens *after* the piece that already makes it invalid.
plinss: I want the token to close according to error recovery rules. Respect paren matching.
dbaron: this is recent -- something we were fiddling with until quite late in CSS 2.1, but it simplifies.
dbaron: Matches some implementations.
Bert: I don't understand yet...
[dbaron explains]
<dbaron> url(foo bar[)
<dbaron> url(foo "bar[)
plinss: ok, I'm happy with it too
[discussion about where the error recovery picks up]
<fantasai> Ok, this makes sense to me seeing this example
Bert: I don't care about error recovery, so I'm fine, but be aware that you change the meaning of existing style sheets.
TabAtkins: Yes, we had no bug reports, so don't think it is a problem.
TabAtkins: CSS 2 grammar doesn't cover all inputs.
TabAtkins: Error recovery not defined.
TabAtkins: So I'm defining error recovery.
dbaron: I'd like to review this more.

CDO/CDC
-------

dbaron: Did you fix the cases where CSS 2.1 didn't allow CDO and CDC in some places?
Bert: As long as it is only about error recovery for CDO/CDC in more places, fine with me.

An+B Notation
-------------

Scribe: fantasai

TabAtkins: The notation is incompatible with CSS tokenization
TabAtkins: For example, 5n-3 tokenizes as a dimension
TabAtkins: But we need to split it up into 5, n, -, 3
TabAtkins: Implementations right now have to guess where it ends ... so far easy because of close-parens
TabAtkins: Then reserialize and reparse
TabAtkins: Wanted to redefine an+b in terms of CSS tokens
TabAtkins: I can get it almost identical to defined behavior
TabAtkins: But there are two small changes
dbaron: Think the WS change is not small
TabAtkins: First change is that +n or +n+2 ok
TabAtkins: but + n not ok
TabAtkins: Similarly -n vs. - n
TabAtkins: Issue is that when you tokenize these, +n becomes 2 tokens
TabAtkins: so does + n
TabAtkins: -n and - n tokenize differently
TabAtkins: Property grammars ignore white space
TabAtkins: So can't distinguish
fantasai: I don't think this needs to change.
dbaron: Tab just made up this idea of property grammars
TabAtkins: propdef grammars don't talk about ws
dbaron: I don't want to change the space rules here
TabAtkins asks how parsers work
dbaron says Gecko has a flag about whether ws is paid attention to
In WebKit, apparently WS tokens are omitted automatically when you ask for a token
fantasai: How can that be possible if you correctly parse selectors?
<dbaron> div*p
Bert: Also numbers, numbers with a sign can't have space between number and sign
[side discussion of NUMBER tokens]
dino: White space is gone by the time you try to parse things
TabAtkins: How do you parse calc() and selectors then?
krit: [ ... something about whitespace and ident in WebKit ... ]
RESOLVED: No spacing changes in An+B syntax
TabAtkins: +n-\33
TabAtkins: parses as "+" "n-3"
TabAtkins: which looks like An+B
TabAtkins: So, change to allow escaping in cases where the tokenization derives from idents.
TabAtkins: Which probably matches implementations
Bert: So escaping before reparsing?
TabAtkins: no reparsing
TabAtkins: escaping was handled much earlier in parsing process
TabAtkins: Disallowing it would create a major layering violation
TabAtkins: By the time you care about whether something is an+b, you've lost the original representation
plinss: So this scenario produces a valid ident, with content "n-3" as the string
plinss: How does that become an an+b?
dbaron: Spec says if you have ident "-n-" followed by digits, then it's valid an+b
dbaron: I think this is far saner than the route we went down with URL, where we made it a separate token
TabAtkins: All of these are equivalent, except they tokenize completely differently:
    n - 3
    n- 3
    n -3
    n-3
[Discussion of parsing mechanics among Bert, plinss, and Tab]
RESOLVED: escaping in An+B falls out of tokenization

Overall
-------

TabAtkins: That's all the syntax changes that we've reviewed now
TabAtkins: So, can I publish yet?
dbaron: I think the efforts here are worthwhile, but I would like to review the bracket-matching part before FPWD. I haven't had time yet
dbaron: I'm worried because I think once we have FPWD, people will only look at this, not at CSS2.1
dbaron: So if we don't fix something now, it will be hard to fix later
dbaron: And I'm worried because it was originally reverse-engineered from the worst implementation (WebKit)
TabAtkins: I won't deny that.
Bert: I can't understand this draft, apart from the railroad diagrams. Can you put more of those, or put grammar rules?
TabAtkins: I think I have diagrams for everything in the spec
<Bert> (Seems all the production rules have diagrams.)
fantasai: does your spec define what a style rule is, what a declaration is, what these things mean?
TabAtkins: yes
SimonSapin: No, it doesn't
fantasai: I think we need to add such a section, since this part of 2.1 logically belongs here.
TabAtkins: Maybe I'll expand the Description of CSS's Syntax section
SimonSapin: And make it normative
fantasai: We need something that defines how to interpret a CSS style sheet.
plinss: You asked for FPWD, but there's an existing css3-syntax draft
fantasai: From process POV, no. But effectively it's a new draft.
plinss: Just making sure that we plan to replace what's out there and is 10 years old
TabAtkins: yes

Editorship
----------

TabAtkins: I propose adding Simon as Syntax co-editor.
RESOLVED: Simon Sapin is Syntax co-editor.
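
Illustration for the An+B discussion above: a minimal sketch of why forms such as "5n-3", "n - 3", "n- 3", "n -3" and "n-3" produce different token streams. The regular expressions are a deliberately simplified assumption (no escapes, no non-ASCII, no scientific notation) and are not the css-syntax tokenizer; the names NUM, IDENT, TOKEN_RE and tokenize are hypothetical and appear nowhere in the draft.

```python
import re

# Simplified, assumed token patterns (not the css-syntax algorithm).
NUM = r"[+-]?(?:\d+\.\d+|\.\d+|\d+)"        # number with optional sign
IDENT = r"-?[A-Za-z_][A-Za-z0-9_-]*"        # identifier, may start with "-"

# Alternatives are tried in order, so a number followed directly by an
# identifier is consumed as a single dimension token.
TOKEN_RE = re.compile(
    rf"(?P<ws>\s+)"
    rf"|(?P<dimension>{NUM}{IDENT})"
    rf"|(?P<number>{NUM})"
    rf"|(?P<ident>{IDENT})"
    rf"|(?P<delim>.)"
)


def tokenize(text):
    """Return a list of (kind, text) pairs for a small CSS-like input."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "ws":
            # Collapse any run of white space into one whitespace token.
            tokens.append(("whitespace", " "))
        else:
            tokens.append((kind, match.group()))
    return tokens


# The forms mentioned in the minutes.
for source in ["5n-3", "n - 3", "n- 3", "n -3", "n-3", "+n", "-n", "- n"]:
    print(repr(source), "->", tokenize(source))
```

Under these simplified rules, "5n-3" comes out as one dimension, "n-3" as one ident, "+n" as a delimiter followed by an ident, and "-n" as a single ident, which is the behaviour the discussion relies on when it says the An+B forms "tokenize completely differently".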
Received on Wednesday, 3 July 2013 00:23:14 UTC