- From: <bugzilla@jessica.w3.org>
- Date: Sun, 12 Sep 2010 13:13:55 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9659

--- Comment #12 from Henri Sivonen <hsivonen@iki.fi> 2010-09-12 13:13:55 ---
(In reply to comment #11)
> WebKit does what the spec says:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L2491
>
> If any replacement characters arrive at the tree builder, they don't set
> framesetOk to false. I don't understand the issues you're complaining about.
> It doesn't matter how the replacement characters were generated. They just no
> longer flip the framesetOk bit.

Swallowing nulls in the modes related to the start of the document is sufficient for achieving the desired Web compat effect. Thus, with implementations swallowing nulls, making REPLACEMENT CHARACTER special for the purpose of frameset-ok is useless.

Making the REPLACEMENT CHARACTER a parser-sensitive character is not just useless, it is harmful for two reasons:

1) The parsing algorithm was (until the change Hixie made here) designed to make decisions based only on Basic Latin characters. This means that the parsing algorithm had the property that implementations can make all the decisions they need to make by examining a single code unit regardless of the choice of internal Unicode representation (UTF-8, UTF-16 or UTF-32). Even though Validator.nu and Gecko use UTF-16 internally, I was planning on enabling the reuse of the parser core so that it used UTF-8 internally. I'm very unhappy about a property of the parsing algorithm that I had been counting on (being able to dispatch on a single code unit always) changing, especially when the change is useless given U+0000 swallowing.

2) The output of the HTML parsing algorithm is not guaranteed to be a tree that's a well-formed XML Infoset. The Validator.nu HTML parser offers a feature (already shipped) that alters the output of the parser minimally to coerce it into a well-formed Infoset for compatibility with XML-oriented stages down the processing pipeline.
This feature works by mapping Basic Multilingual Plane characters that are banned in XML into the REPLACEMENT CHARACTER. Before the change made when resolving this bug, this mapping of characters ahead of tokenization didn't change the decisions the tree builder would make. Now, mapping additional characters to REPLACEMENT CHARACTER would change the resulting tree in drastic ways in some cases. I'm also very unhappy that this spec change made it substantially harder to support infoset coercion (which is a shipped feature) in a way that doesn't cause changes to the output that aren't strictly necessary to achieve XML-compatibility.

Therefore, I think the null swallowing that both Gecko and WebKit already do should be standardized and then the REPLACEMENT CHARACTER should be made non-special in the tree builder, as Web compat considerations would no longer require it to be special when nulls have already been swallowed.
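
The two objections in this comment can be illustrated with a small Python sketch. This is not code from Validator.nu, Gecko or WebKit: the function names are hypothetical, and `frameset_ok_after` is a deliberately simplified toy model of the frameset-ok flag, assumed here only to show how the decision can change.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Bytes per code unit for each encoding form used below.
UNIT_SIZE = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch, encoding):
    """Return how many code units one character occupies in `encoding`."""
    return len(ch.encode(encoding)) // UNIT_SIZE[encoding]


def xml_allowed(cp):
    """True if the code point matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def coerce_for_xml(text):
    """Toy infoset coercion: map characters banned in XML to U+FFFD."""
    return "".join(ch if xml_allowed(ord(ch)) else REPLACEMENT for ch in text)


def frameset_ok_after(text, replacement_is_special):
    """Toy model (an assumption, not the spec algorithm): any
    non-whitespace character clears frameset-ok, except that U+FFFD
    is exempt when `replacement_is_special` is True."""
    ok = True
    for ch in text:
        if ch in "\t\n\x0c\r ":
            continue
        if replacement_is_special and ch == REPLACEMENT:
            continue
        ok = False
    return ok


# Point 1: Basic Latin is one code unit in every encoding form,
# but U+FFFD needs three code units in UTF-8.
for enc in UNIT_SIZE:
    print(enc, code_units("<", enc), code_units(REPLACEMENT, enc))

# Point 2: coercing U+FFFF to U+FFFD ahead of tokenization changes the
# toy frameset-ok decision only when U+FFFD is treated as special.
raw = "\uffff"
print(frameset_ok_after(raw, True))
print(frameset_ok_after(coerce_for_xml(raw), True))
print(frameset_ok_after(coerce_for_xml(raw), False))
```

In this toy model, with U+FFFD non-special, the coerced and uncoerced inputs lead to the same frameset-ok decision, which is the property the comment asks to restore.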
Received on Sunday, 12 September 2010 13:13:57 UTC