[CSSWG] Minutes Tokyo F2F 2013-06-06 Thu AM II/PM I: Syntax from fantasai on 2013-07-03 (www-style@w3.org from July 2013)

From: fantasai <fantasai.lists@inkedblade.net>
Date: Tue, 02 Jul 2013 17:22:45 -0700
To: "www-style@w3.org" <www-style@w3.org>
Message-ID: <51D36ED5.3060206@inkedblade.net>

CSS3 Syntax
-----------

- Discussed @charset handling and removing unused encodings.
Conclusion to leave in issue and ask for feedback whether to
add additional encoding patterns.

- RESOLVED: Charset propagation from linking document is same-origin.
Leave issue open until we're more sure of this being stable.

- RESOLVED: NUL gets turned into Replacement char.

- Confirmed that 'non-ascii' is updated to include all non-ASCII chars

- RESOLVED: loosen the rules describing the way comments are serialized

- "Safe but pointless" change to UNICODE-RANGE token was discussed.
Straw poll: 3 for Tab's proposal, 2 against. No resolution recorded.

- Reviewed addition of attribute-matching tokens for reducing lookahead.

- Reviewed addition of COMMA token.

- Reviewed change to NUMBER token (to include sign).

- Reviewed change to bracket-matching of url() notations.

- dbaron asked whether CDO/CDC were correctly fixed. No response
was recorded in the minutes.

- RESOLVED: No spacing changes in An+B syntax

- RESOLVED: escaping in An+B falls out of tokenization

- Syntax is missing section that actually defines interpreting CSS
syntax into style rules, at rules, etc.

- RESOLVED: Add Simon Sapin as co-editor.

====== Full minutes below ======

CSS3 Syntax Part I: Overview and @charset handling
==================================================
Scribe: fantasai

TabAtkins: Rewriting from grammar approach to parser approach
<jerenkrantz> http://dev.w3.org/csswg/css-syntax/
TabAtkins: grammar tried to match everything, not just well-formed things,
and was too complicated and didn't quite handle everything anyway
TabAtkins: Syntax defines Tokenizer, then Parser
TabAtkins: Includes hooks for other specs to include a certain type of thing
TabAtkins: I would like to review some things with group, then ask for FPWD
TabAtkins: Probably as correct as 2.1 at this point
TabAtkins: WebKit uses augmented grammar to parse, and it's horribly broken.

TabAtkins: Wanted to go over changes from 2.1
<dbaron> we're reviewing http://dev.w3.org/csswg/css-syntax/#changes as of 4ce7b66b553a
https://dvcs.w3.org/hg/csswg/raw-file/4ce7b66b553a/css-syntax/Overview.html

TabAtkins: First batch of changes are about parsing
TabAtkins: 2.1 defined some interesting rules for detecting @charset
TabAtkins: Brought into line with charset handling in rest of platform
r12a: Does that mean it doesn't handle UTF16
TabAtkins: Can use UTF16 by having BOM, but won't recognize @charset
fantasai: How interoperable is this, compared to 2.1?
TabAtkins: Dropping some obscure encodings probably ok, since implementations
don't support them
TabAtkins: Wrt web-compat, should be just fine
TabAtkins: dropping things like EBCDIC
liam: Did you check for corporate intranet stuff?
TabAtkins: I think likelihood of EBCDIC is close to zero
r12a: I thought same thing wrt bidi things, but ppl came back with
supercomputers and stuff in Hebrew that were using old encoding
systems
<jerenkrantz> FWIW, IBM keeps ensuring that EBCDIC support is working in httpd.
Someone at IBM would be able to give an idea of its reach.

Bert: It was there. You're trying to remove it?
TabAtkins: Browsers didn't recognize it?
Bert: You can't remove things.
Bert: This is supposed to be stable
Bert: We promised not to make versions to CSS, just to add things
Bert: Don't think we should remove features just because not used on the Web.
Bert: Remove features when they're wrong.
[discussion of changes]
Bert: Not an argument to keep on making mistakes.
r12a: When HTML5 did some charset stuff, wanted to do it to stop people
using non-UTF-8.
r12a: Is that your motivation?
TabAtkins: Highly sympathetic to moving to UTF-8, but main issue is
making this much simpler
glazou: If all style sheets were UTF-8, would be much easier for authoring
environments.

dbaron: Checking for UTF-16 rules were widely tested.
dbaron: We changed recently. Ran into no web-compat problems, but ran into
certification problems.
dbaron: Sorry, what I said is opposite.
dbaron: When we implemented what this draft says, this caused us to
start passing the GCF certification suite
dbaron: It has a test of a UTF-16 file that claims an encoding name that
doesn't exist.
<dbaron> the thing I was just describing was
https://bugzilla.mozilla.org/show_bug.cgi?id=859706
being fixed by
https://bugzilla.mozilla.org/show_bug.cgi?id=796882
<dbaron> er, actually, the encoding name does exist but it's an alias
for UTF-16BE, while the file is UTF-16LE

SimonSapin: Would like to point out that this is only about @charset.
Can do anything you want with HTTP headers.
SimonSapin: Also, relevant part of CSS2.1 gives a table of byte patterns.
Allows UAs to remove some, or to add some.
r12a: What did you say was the first step in detecting charset?
r12a: Thought we changed so that BOM is first
TabAtkins: BOM is checked first by decoding algorithm, before @charset
TabAtkins: this list finds the fallback encoding
...
r12a: Do you forbid use of @charset in UTF-16 documents?
TabAtkins: No, you just ignore it
TabAtkins: Never invalid to put @charset
plinss: What if UTF-16 without BOM but with @charset?
TabAtkins: Defer to encoding standard...
r12a: you can do it, but say you shouldn't
TabAtkins: Specifying anything other than ascii-compatible encoding
is not useful.
r12a: Since already defined, why not keep them?
r12a: I share Bert's unease here
r12a: I agree it would be great to move to UTF-8 everywhere.
TabAtkins: If implementations get bug reports on things should ask for
standard to be updated.
plinss: Would prefer to include this table
plinss: I know there are implementations that implement that entire table,
because I made one
TabAtkins: Gecko doesn't anymore, don't think WebKit does either
<glazou> WeasyPrint implements that table
dbaron: Gecko only implemented ascii-compatible cases and UTF-16 cases.
dbaron: never did UTF-32 or EBCDIC
dbaron: Might've done UTF-32 long ago, but that code ripped out a long
time ago
dbaron: ripped out support for UTF-32 entirely
TabAtkins: My preferred approach is to keep it as-is right now.
If there's a problem, we'll see bug reports.
TabAtkins: Alter as necessary.
plinss: Not really happy with that approach. Break it and see who complains?
TabAtkins: Don't want to include rows that browsers don't implement.
plinss: But some that are significant

TabAtkins: Leave this in, with an issue maybe?
TabAtkins: Issue that we may need to add additional charset encoding patterns?
plinss: Maybe add onto that that we're explicitly requesting feedback on this
<Bert> (The issue can ask which @charset lines can be deprecated.)

<SimonSapin> WeasyPrint implements part of that table because I didn’t
know any better. Some cursing was involved.
plinss: Agree we probably don't need EBCDIC, and a private browser could
implement that for an intranet.
plinss: But more concerned wrt UTF-16 issues
plinss: Web is huge. Small percentage is still a lot of pages
TabAtkins: encoding will detect
plinss: Shouldn't be sniffing it
plinss: If there's a document that doesn't have BOM and has @charset,
shouldn't be sniffing it. Should use @charset if it's there.
SimonSapin: Can also rely on HTTP headers
plinss: We have problems e.g. in our test suite, files that work on the
server, but not locally.
plinss: CSS should not, don't think that we require style sheet to be
served over HTTP.
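
Illustration (not from the meeting): a minimal Python sketch of the
lookup order described above, where a BOM wins, then transport-level
information, then an ASCII-compatible @charset rule, then whatever the
environment supplies, then UTF-8. The function name sniff_encoding, its
parameters, the 1024-byte window, and the relative order of the HTTP
header and @charset are assumptions made for this sketch; the draft and
the Encoding Standard remain the authority.

```python
import codecs
import re

# An @charset rule is only recognised as these exact ASCII bytes at the
# very start of the sheet (assumption: no leading whitespace allowed).
_CHARSET_RE = re.compile(rb'^@charset "([^"\x80-\xff]*)";')

# UTF-32 and EBCDIC patterns are deliberately absent, as discussed.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data, http_charset=None, environment_encoding=None):
    """Return the name of the encoding to decode `data` (bytes) with."""
    # 1. A BOM is checked first and overrides everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Transport-level information, if any (hypothetical parameter).
    if http_charset:
        return http_charset
    # 3. An @charset rule written in ASCII-compatible bytes.
    match = _CHARSET_RE.match(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Encoding inherited from the environment, else UTF-8.
    return environment_encoding or "utf-8"
```

Note that a UTF-16 sheet without a BOM never matches step 3 in this
sketch, which is the case plinss is concerned about.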

TabAtkins: Step 3&4, where you take charset from referring document.
Should that only work for same-origin?
r12a: Value of 4 is people who don't understand charsets or setting http
headers
...
TabAtkins: So, take charset from referring document only if it's same-origin.
Yay/nay?
TabAtkins: Ok, I'll say yes for now, object later
Bert: what's issue?
TabAtkins: Sometimes if you can force a document into a different encoding,
can extract info from it.
TabAtkins: This would prevent that kind of thing
Bert: What could you do with a style sheet in wrong encoding?
TabAtkins: Don't have specific example, but this kind of problem has been
an issue in other technologies.
Bert: But you're only changing your own document, not someone else's.
dbaron: Suppose has wiki, where input CSS that should be sanitized
dbaron: Could have fun encoding, maybe UTF-7 or Shift-JIS or something,
that can create cross-site scripting vulnerabilities
dbaron: Encode control characters for programming languages as benign
ascii chars
TabAtkins: Can't give a specific attack scenario, just know there's
problems in other languages
Bert: Concerned about people putting stylesheet on one server, seems to
work, then moves it to other server, doesn't work.
TabAtkins: If you have a charset problem, use UTF-8
dbaron: Either way I'd prefer to leave an issue in there until we're
confident it's stable.
<dbaron> (w.r.t. the same-origin thing)
RESOLVED: Make it same-origin, leave issue
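
Illustration (not from the meeting): a small Python sketch of the
resolution, in which the referring document's encoding is only handed
to the style sheet when both share an origin. The helper names origin
and environment_encoding are hypothetical, and the origin comparison is
simplified to scheme, host, and port.

```python
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return a simplified (scheme, host, port) origin tuple for `url`."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or _DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def environment_encoding(sheet_url, referrer_url, referrer_encoding):
    """Propagate the referrer's encoding only to same-origin sheets."""
    if origin(sheet_url) == origin(referrer_url):
        return referrer_encoding
    # Cross-origin: the linking document cannot force an encoding.
    return None
```

With this shape, a page cannot push a third-party sheet into an
encoding such as UTF-7 just by being served in that encoding itself.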

CSS3 Syntax Part II: Other Changes from 2.1
===========================================
Scribe: Bert

NUL chars in CSS
----------------

TabAtkins: tokenization changes
TabAtkins: To match HTML, NUL chars are converted into replacement chars.
TabAtkins: Don't know why HTML does it, but at least it means CSS inside
HTML is same as CSS in a file.
fantasai: Replacement char is valid identifier char, could be weird.
TabAtkins: Nobody puts NUL in a style sheet....
* fantasai probably would have turned it into a space, less intrusive
RESOLVED: NUL gets turned into Replacement char.
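
Illustration (not from the meeting): the resolution amounts to a
one-line preprocessing step before tokenization. The function name
preprocess is hypothetical.

```python
def preprocess(text):
    """Replace every NUL with U+FFFD REPLACEMENT CHARACTER."""
    return text.replace("\x00", "\ufffd")
```

As fantasai notes, U+FFFD is itself a valid identifier character, so
"a\x00b" ends up as the single identifier a, U+FFFD, b rather than two
identifiers.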

non-ascii Token
---------------

tab: non-ascii range
SimonSapin: We already resolved on that.
glenn: Isn't the name non-ascii wrong then?
TabAtkins: No, it now includes *all* non-ascii.
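
Illustration (not from the meeting): a Python sketch of what "all
non-ascii" means for the first character of an identifier, assuming
that non-ascii covers every code point from U+0080 upward. CSS 2.1's
nonascii class started later, so code points such as U+0085 are the
ones affected. The function name is hypothetical.

```python
def is_name_start(ch):
    """True if `ch` may start an identifier (escapes not considered)."""
    if ord(ch) >= 0x80:
        # Every non-ASCII code point qualifies, including U+FFFD.
        return True
    # For ASCII input, isalpha() is true only for A-Z and a-z.
    return ch.isalpha() or ch == "_"
```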

Tokenizing Comments
-------------------

TabAtkins: comments
TabAtkins: tokenizer never emits them.
plinss: What about comments between idents?
TabAtkins: Yes, serializer forces an empty comment there.
plinss: OK, but does it change parsing behavior anywhere?
[tab draws on whiteboard]
tab: It should just be an internal simplification
Bert: Not sure what you mean.
Bert: there is no tokenization process
[argument between Bert and Tab/peterl/etc.]
bert: [worried about how people interpret "not emit"]
TabAtkins: It describes how you serialize.
glazou: We said in the past that one way to preserve comments was to
preserve them combined at the end.
glazou: Important for editors.
glazou: comments should be in original position, but cannot always be done.
RESOLVED: loosen the rules describing the way comments are serialized
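
Illustration (not from the meeting): a Python sketch of why a
serializer has to force an empty comment between adjacent identifiers.
The source a/**/b tokenizes as two identifiers with no comment token,
and writing them back as "ab" would re-tokenize as one. The (type,
value) tuples and the function name are assumptions made for this
sketch.

```python
def serialize(tokens):
    """Serialize (type, value) tokens so that they re-tokenize the same."""
    out = []
    prev_type = None
    for type_, value in tokens:
        if prev_type == "ident" and type_ == "ident":
            # Two idents in a row would fuse into one without a separator.
            out.append("/**/")
        out.append(value)
        prev_type = type_
    return "".join(out)
```

Under the loosened rules a serializer that keeps the original comments
in place (glazou's editor case) would be equally acceptable.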

Unicode-Range Tokenization
--------------------------

<dbaron> http://dbaron.org/css/test/2013/urange-token
TabAtkins: CSS 2 defines unicode range tokens in lazy way
fantasai: The difference is detectable.
dbaron: By means of counter increment.
fantasai: No real reason to not keep the same.
fantasai: Do not change the token.
fantasai: Under your rules some old declarations are no longer valid,
and it is detectable.
tab: No, not detectable.
tab: U+2?3?4?

fantasai: We have to do range checking to determine validity anyway,
so it's not like tokenization guarantees validity.
TabAtkins: No, [describes three cases]
dbaron: css3-fonts doesn't say it gets thrown out.
jdaggett: what gets thrown out?
<dbaron> U+400-3ff
dbaron: fonts spec not clear
<dbaron> unicode-range: U+100-1ff, U+400-3ff
dbaron: Is [above] a syntax error?
dbaron: Is that empty range or invalid and throw away whole line?
[tab and fantasai disagree on previous discussions]
TabAtkins: Invalidate whole descriptor.
dbaron: Is it just an empty range or invalid?
jdaggett: Wording to say that range has to contain valid chars.
dbaron: No implementation conformance requirement.
<fantasai> http://lists.w3.org/Archives/Public/www-style/2013May/0564.html
jdaggett: you say wording is unclear.
dbaron: I can send comment about that, later.

TabAtkins: The only relevance of my change is for mixture of question marks
and digits.
TabAtkins: CSS 2.1 created invalid range, which was then thrown away.
TabAtkins: Final behavior is the same.
fantasai: Why change. Let's not do tokenization changes.
TabAtkins: Easier to parse.
bert: depends on parsing system.
tab: In mine token is always correct.
TabAtkins: Unicode range token only valid in one property.
plinss: Then I'm not so bothered.
Bert: But why change it.
fantasai: Current is fine.
dbaron: I don't believe the change is undetectable.
<dbaron> I also don't mind the change.
<fantasai> Gecko implements the 2.1 definition

Tab: Maybe in future.
plinss: That is why it gets important.
TabAtkins: But I have to write more text to accept the old syntax.
* fantasai doesn't like changing things like this just because someone
felt like it, should only change it if there's a real
benefit to it imho

glenn: Font family quoted string that is empty: seems not prohibited.
dbaron: You might have a font with that name.
glenn: Far fetched, but it is an answer...
TabAtkins: Not a syntax question, question for fonts.

plinss: Whether you can test the unicode range syntax difference now
is not the question.
plinss: Your change maybe gives us more flexibility to not add white
space in future properties.
fantasai: Not really, some unicode ranges end in digits, so not helpful.
dbaron: No, not helpful, there are too many different kinds of unicode
range tokens. Often require a space anyway.
plinss: Then the change is irrelevant.

<dbaron> I agree the change is safe, and I agree it's pointless.
bert: But why change it if it is not broken?
[3 hands for tab's proposal 2 against]
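
Illustration (not from the meeting): a Python sketch of the cases
argued over above. It accepts a CSS 2.1-shaped unicode-range string and
rejects the two problem inputs, question marks mixed with digits
(U+2?3?4?) and a descending range (U+400-3ff). Returning None follows
Tab's "invalidate whole descriptor" reading; whether a descending range
is instead an empty range is the open question dbaron raises, and this
sketch does not settle it. The function name is hypothetical.

```python
import re

# CSS 2.1-style shape: up to six hex digits or "?", then an optional end.
_URANGE_RE = re.compile(r"^[Uu]\+([0-9A-Fa-f?]{1,6})(?:-([0-9A-Fa-f]{1,6}))?$")


def parse_urange(text):
    """Return (start, end) as integers, or None if the range is unusable."""
    match = _URANGE_RE.match(text)
    if not match:
        return None
    first, second = match.groups()
    if "?" in first:
        # Wildcards must be trailing and cannot be combined with an end.
        if second is not None or not re.fullmatch(r"[0-9A-Fa-f]*\?+", first):
            return None
        start = int(first.replace("?", "0"), 16)
        end = int(first.replace("?", "F"), 16)
    else:
        start = int(first, 16)
        end = int(second, 16) if second is not None else start
    if start > end:
        return None
    return (start, end)
```

Either way the final result for U+2?3?4? is the same, which is Tab's
point: the only difference is whether the tokenizer or the later range
check does the rejecting.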

bad_url/bad_string
------------------

TabAtkins: bad_url token and bad_string token.
plinss: What we talked about yesterday?
TabAtkins: Yes.
liam: Current UAs.
TabAtkins: Our parsing is wrong I believe.
glenn: doesn't seem right.
glenn: Wouldn't call it valid.
plinss: It is not gray. We know when to throw a property away.

Attribute-Matching Tokens
-------------------------

TabAtkins: New attribute matching operators are imported into tokenizer.
Bert: The new attribute operators are not tokens, they are two tokens.
Tab: But selectors defined them as tokens.
Bert: No, selectors defined how to parse selectors, not how to parse css.
plinss: Don't agree with bert's argument, but accept the point.
plinss: They will always be a token if we take tab's proposal, and we
remove the possibility of using them as part of different syntax
TabAtkins: it doesn't reduce possibilities, but it might complicate grammars
in the future.
plinss: There is an impact.
plinss: There is a general point about selector syntax leaking into rest
of syntax.
TabAtkins: "||" is used in selectors 4
TabAtkins: Made it into a token.
TabAtkins: Because it clashes with namespace selectors.
plinss: Are we giving up too much to avoid 2-token look-ahead?
TabAtkins: I don't know.
TabAtkins: I heard 1-token was nice.
plinss: How important is it. You build it and forget about it.
dbaron: It is slower.
dbaron: There's a small performance cost to more lookahead, and I don't
think it's worth it.
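
Illustration (not from the meeting): a Python sketch of the trade-off
being discussed. If the tokenizer recognizes the two-character
operators itself, the parser never needs a second token of lookahead to
tell ^= from ^ followed by =. The token names and the function name are
illustrative assumptions, and everything that is not one of these
operators is emitted as a one-character delim token for brevity.

```python
_MATCH_TOKENS = {
    "~=": "include-match",
    "|=": "dash-match",
    "^=": "prefix-match",
    "$=": "suffix-match",
    "*=": "substring-match",
    "||": "column",
}


def tokenize_operators(text):
    """Split `text` into operator tokens and single-character delims."""
    tokens = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in _MATCH_TOKENS:
            # One character of lookahead inside the tokenizer is enough.
            tokens.append((_MATCH_TOKENS[pair], pair))
            i += 2
        else:
            tokens.append(("delim", text[i]))
            i += 1
    return tokens
```

The cost plinss points at is visible here too: once ^= is always one
token, no future grammar can treat the two characters separately.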

COMMA token
-----------

TabAtkins: I added a comma token.
TabAtkins: There was a colon and a semicolon already.
[Discussion about where COMMA is used. It is used in other modules,
not in syntax itself]

Number tokens
-------------

TabAtkins: numbers now include sign, scientific notation is allowed
<Bert> (I don't see why we need tokenizer for number)
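
Illustration (not from the meeting): a Python sketch of a number token
that carries its own sign and allows scientific notation. The regular
expression and the function name are assumptions made for this sketch,
not the draft's algorithm.

```python
import re

# Optional sign, then digits with an optional fraction, then an optional
# exponent that must contain at least one digit.
_NUMBER_RE = re.compile(r"[+-]?(?:[0-9]+\.[0-9]+|\.[0-9]+|[0-9]+)(?:[eE][+-]?[0-9]+)?")


def consume_number(text, pos=0):
    """Return (value, end_position), or None if no number starts at pos."""
    match = _NUMBER_RE.match(text, pos)
    if not match:
        return None
    rep = match.group(0)
    is_integer = not any(c in rep for c in ".eE")
    value = int(rep) if is_integer else float(rep)
    return (value, match.end())
```

Because the exponent needs a digit, 1em still yields the number 1
followed by the unit, while 1e3 is consumed whole.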

Invalid url() Parsing
---------------------

<TabAtkins> url(foo bar)
TabAtkins: bad uri
<TabAtkins> bad-url("url(foo ")
<TabAtkins> url(foo bar[baz)
TabAtkins: this case [above] will be different.
TabAtkins: this is unlikely to have any bad effects.
TabAtkins: Have no bug reports.
fantasai: You're parsing invalid stuff anyway.
fantasai: So if someone found a problem with your behavior, they'd be
fixing their style sheet, not filing a bug.
dbaron: It is about parens *after* the piece that already makes it invalid.
plinss: I want the token to close according to error recovery rules.
Respect paren matching.
dbaron: this is recent -- something we were fiddling with to quite late
in CSS 2.1, but it simplifies.
dbaron: Matches some implementations.
Bert: I don't understand yet...
[dbaron explains]
<dbaron> url(foo bar[)
<dbaron> url(foo "bar[)
plinss: ok, I'm happy with it too
[discussion about where the error recovery picks up]
<fantasai> Ok, this makes sense to me seeing this example
Bert: I don't care about error recovery, so I'm fine, but be aware that
you change the meaning of existing style sheets.
TabAtkins: Yes, we had no bug reports, so don't think it is a problem.
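
Illustration (not from the meeting): a Python sketch of one reading of
the behavior discussed above. Once an unquoted url() has gone bad,
everything up to the first unescaped ")" is swallowed into a bad-url
token, and brackets or quotes that appear after the error are not
matched. The function name, the tuple shape, and the simplified escape
handling (a backslash just protects the next character) are assumptions
made for this sketch.

```python
_WS = " \t\n"


def consume_url(text, pos):
    """Consume an unquoted url starting just after 'url(' at index pos.

    Returns (token_type, value, new_position).
    """
    n = len(text)
    i = pos
    while i < n and text[i] in _WS:
        i += 1
    value = []
    while i < n:
        ch = text[i]
        if ch == ")":
            return ("url", "".join(value), i + 1)
        if ch in _WS:
            j = i
            while j < n and text[j] in _WS:
                j += 1
            if j >= n or text[j] == ")":
                # Trailing whitespace before the close is fine.
                return ("url", "".join(value), min(j + 1, n))
            i = j
            break  # whitespace in the middle makes the url bad
        if ch in "\"'(":
            break  # these characters are not allowed unescaped
        if ch == "\\" and i + 1 < n:
            value.append(text[i + 1])
            i += 2
            continue
        value.append(ch)
        i += 1
    else:
        return ("url", "".join(value), n)  # end of input closes the url
    # Bad url: skip to the first unescaped ")", ignoring "[" and quotes.
    while i < n:
        if text[i] == "\\":
            i += 2
            continue
        if text[i] == ")":
            return ("bad-url", None, i + 1)
        i += 1
    return ("bad-url", None, n)
```

For url(foo bar[baz) this ends the bad token at the ")", whereas
bracket-matching recovery would keep looking for a "]".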

TabAtkins: CSS 2 grammar doesn't cover all inputs.
TabAtkins: Error recovery not defined.
TabAtkins: So I'm defining error recovery.
dbaron: I'd like to review this more.

CDO/CDC
-------

dbaron: Did you fix the cases where CSS 2.1 didn't allow CDO and CDC
in some places?
Bert: As long as it is only about error recovery for CDO/CDC in more
places, fine with me.

An+B Notation
-------------
Scribe: fantasai

TabAtkins: The notation is incompatible with CSS tokenization
TabAtkins: For example, 5n-3 tokenizes as a dimension
TabAtkins: But we need to split it up into 5, n, -, 3
TabAtkins: Implementations right now have to guess where it ends ...
so far easy because of close-parens
TabAtkins: Then reserialize and reparse
TabAtkins: Wanted to redefine an+b in terms of CSS tokens
TabAtkins: I can get it almost identical to defined behavior
TabAtkins: But there are two small changes

dbaron: Think the WS change is not small
TabAtkins: First change is that +n or +n+2 ok
TabAtkins: but + n not ok
TabAtkins: Similarly -n vs. - n
TabAtkins: Issue is that when you tokenize these, +n becomes 2 tokens
TabAtkins: so does + n
TabAtkins: -n and - n tokenize differently
TabAtkins: Property grammars ignore white space
TabAtkins: So can't distinguish
fantasai: I don't think this needs to change.
dbaron: Tab just made up this idea of property grammars
TabAtkins: propdef grammars don't talk about ws
dbaron: I don't want to change the space rules here
TabAtkins asks how parsers work
dbaron says Gecko has a flag about whether ws is paid attention to
In WebKit, apparently WS tokens are omitted automatically when you ask
for a token
fantasai: How can that be possible if you correctly parse selectors?
<dbaron> div*p
Bert: Also numbers, numbers with a sign can't have space between number
and sign
[side discussion of NUMBER tokens]
dino: White space is gone by the time you try to parse things
TabAtkins: How do you parse calc() and selectors then?
krit: [ ... something about whitespace and ident in WebKit ... ]
RESOLVED: No spacing changes in An+B syntax

TabAtkins: +n-\33
TabAtkins: parses as "+" "n-3"
TabAtkins: which looks like An+B
TabAtkins: So, change to allow escaping in cases where the tokenization
derives from idents.
TabAtkins: Which probably matches implementations
Bert: So escaping before reparsing?
TabAtkins: no reparsing
TabAtkins: escaping was handled much earlier in parsing process
TabAtkins: Disallowing it would create a major layering violation
TabAtkins: By the time you care about whether something is an+b,
you've lost the original representation
plinss: So this scenario produces a valid ident, with content "n-3"
as the string
plinss: How does that become an an+b?
dbaron: Spec says if you have ident "-n-" followed by digits, then
it's valid an+b
dbaron: I think this is far saner than the route we went down with URL,
where we made it a separate token
TabAtkins: All of these are equivalent, except they tokenize
completely differently:
n - 3
n- 3
n -3
n-3
[Discussion of parsing mechanics among Bert, plinss, and Tab]
RESOLVED: escaping in An+B falls out of tokenization
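
Illustration (not from the meeting): a string-level Python sketch of
the two resolutions, assuming the existing Selectors whitespace rules,
in which spaces are allowed around the sign that separates An from B
but not between a leading sign and n. This is not the token-based
definition Tab is proposing; it only shows the accepted and rejected
spellings. Escapes are resolved before the An+B check, so +n-\33 is
seen as +n-3. Both function names are hypothetical.

```python
import re

_ANB_RE = re.compile(
    r"^\s*(?:"
    r"(?P<a>[+-]?[0-9]*)n(?:\s*(?P<sign>[+-])\s*(?P<b>[0-9]+))?"
    r"|(?P<int>[+-]?[0-9]+)"
    r"|(?P<kw>odd|even)"
    r")\s*$",
    re.IGNORECASE,
)


def unescape_hex(text):
    """Resolve CSS hex escapes such as \\33 (simplified: hex form only)."""
    return re.sub(
        r"\\([0-9a-fA-F]{1,6})\s?",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


def parse_anb(text):
    """Return (a, b) for an An+B expression, or None if it is invalid."""
    match = _ANB_RE.match(text)
    if not match:
        return None
    if match.group("kw"):
        return (2, 1) if match.group("kw").lower() == "odd" else (2, 0)
    if match.group("int") is not None:
        return (0, int(match.group("int")))
    a_text = match.group("a")
    # A bare "n", "+n" or "-n" means a coefficient of 1 or -1.
    a = int(a_text + "1") if a_text in ("", "+", "-") else int(a_text)
    b = 0
    if match.group("b") is not None:
        b = int(match.group("sign") + match.group("b"))
    return (a, b)
```

All four spellings Tab lists (n - 3, n- 3, n -3, n-3) come out as
(1, -3) here, while + n and - n are rejected.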

Overall
-------

TabAtkins: That's all the syntax changes that we've reviewed now
TabAtkins: So, can I publish yet?
dbaron: I think the efforts here are worthwhile, but I would like to
review the bracket-matching part before FPWD. I haven't had time yet
dbaron: I'm worried because I think once we have FPWD, people will
only look at this, not at CSS2.1
dbaron: So if we don't fix something now, it will be hard to fix later
dbaron: And I'm worried because it was originally reverse-engineered
from the worst implementation (WebKit)
TabAtkins: I won't deny that.

Bert: I can't understand this draft, apart from the railroad diagrams.
Can you put more of those, or put grammar rules?
TabAtkins: I think I have diagrams for everything in the spec
<Bert> (Seems all the production rules have diagrams.)

fantasai: does your spec define what a style rule is, what a declaration
is, what these things mean?
TabAtkins: yes
SimonSapin: No, it doesn't
fantasai: I think we need to add such a section, since this part of 2.1
logically belongs here.
TabAtkins: Maybe I'll expand the Description of CSS's Syntax section
SimonSapin: And make it normative
fantasai: We need something that defines how to interpret a CSS style sheet.

plinss: You asked for FPWD, but there's an existing css3-syntax draft
fantasai: From process POV, no. But effectively it's a new draft.
plinss: Just making sure that we plan to replace what's out there and
is 10 years old
TabAtkins: yes

Editorship
----------

TabAtkins: I propose adding Simon as Syntax co-editor.
RESOLVED: Simon Sapin is Syntax co-editor.

Received on Wednesday, 3 July 2013 00:23:14 UTC