[Bug 9659] Initial U+0000 should not set frameset-ok to "not ok" from bugzilla@jessica.w3.org on 2010-09-12 (public-html-bugzilla@w3.org from September 2010)

From: <bugzilla@jessica.w3.org>
Date: Sun, 12 Sep 2010 13:13:55 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1OumNX-0005b9-L7@jessica.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9659





--- Comment #12 from Henri Sivonen <hsivonen@iki.fi>  2010-09-12 13:13:55 ---
(In reply to comment #11)
> WebKit does what the spec says:
> 
> http://trac.webkit.org/browser/trunk/WebCore/html/parser/HTMLTreeBuilder.cpp#L2491
> 
> If any replacement characters arrive at the tree builder, they don't set
> framesetOk to false.  I don't understand the issues you're complaining about. 
> It doesn't matter how the replacement characters were generated.  They just no
> longer flip the framesetOk bit.

Swallowing nulls in the modes related to the start of the document is
sufficient for achieving the desired Web compat effect. Thus, with
implementations swallowing nulls, making REPLACEMENT CHARACTER special for the
purpose of frameset-ok is useless.

Making the REPLACEMENT CHARACTER a parser-sensitive character is not just
useless, it is harmful for two reasons:

 1) The parsing algorithm was (until the change Hixie made here) designed to
make decisions based only on Basic Latin characters. This means that the
parsing algorithm had the property that implementations could make all the
decisions they need to make by examining a single code unit regardless of the
choice of internal Unicode representation (UTF-8, UTF-16 or UTF-32). Even
though Validator.nu and Gecko use UTF-16 internally, I was planning on enabling
the reuse of the parser core so that it used UTF-8 internally. I'm very unhappy
that a property of the parsing algorithm that I had been counting on (always
being able to dispatch on a single code unit) has changed, especially when the
change is useless given U+0000 swallowing.

 2) The output of the HTML parsing algorithm is not guaranteed to be a tree
that's a well-formed XML Infoset. The Validator.nu HTML parser offers a feature
(already shipped) that alters the output of the parser minimally to coerce it
into a well-formed Infoset for compatibility with XML-oriented stages down the
processing pipeline. This feature works by mapping Basic Multilingual Plane
characters that are banned in XML into the REPLACEMENT CHARACTER. Before the
change made when resolving this bug, this mapping of characters ahead of
tokenization didn't change the decisions the tree builder would make. Now,
mapping additional characters to REPLACEMENT CHARACTER would change the
resulting tree in drastic ways in some cases. I'm also very unhappy that this
spec change made it substantially harder to support infoset coercion (which is
a shipped feature) in a way that doesn't cause changes to the output that
aren't strictly necessary to achieve XML-compatibility.
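
As an illustration of point 1, the Python sketch below counts how many code
units one character occupies in each internal representation. The names
UNIT_BYTES and code_units are made up for this example and are not taken from
any parser. It only shows that a Basic Latin character is a single code unit
everywhere, while the REPLACEMENT CHARACTER is not a single code unit in UTF-8.

```python
# Bytes per code unit for each encoding form (little-endian variants are used
# so that Python does not prepend a byte order mark).
UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch: str, encoding: str) -> int:
    """Return how many code units `ch` needs in the given encoding form."""
    return len(ch.encode(encoding)) // UNIT_BYTES[encoding]


for name, ch in (("LESS-THAN SIGN", "<"), ("REPLACEMENT CHARACTER", "\ufffd")):
    counts = {enc: code_units(ch, enc) for enc in UNIT_BYTES}
    print(f"U+{ord(ch):04X} {name}: {counts}")
```

A parser core working on UTF-8 would therefore have to recognize the
three-byte sequence EF BF BD to treat U+FFFD specially, instead of dispatching
on one byte.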

Therefore, I think the null swallowing that both Gecko and WebKit already do
should be standardized, and then the REPLACEMENT CHARACTER should be made
non-special in the tree builder, since Web compat considerations would no
longer require it to be special once nulls have already been swallowed.
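
The interaction between the frameset-ok rule and infoset coercion (point 2)
can be shown with a toy model in Python. This is only a sketch under stated
assumptions: it is not the spec's tree builder and not Validator.nu's
coercion code. The functions is_xml_char, coerce_to_infoset and
still_frameset_ok are hypothetical. The model assumes that an unswallowed
U+0000 reaches the tree builder as U+FFFD, and that the coercion step leaves
U+0000 to the parser's own null handling.

```python
REPLACEMENT = "\ufffd"
HTML_WHITESPACE = "\t\n\f\r "


def is_xml_char(ch: str) -> bool:
    """True if `ch` matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or cp >= 0x10000)


def coerce_to_infoset(text: str) -> str:
    """Toy coercion: map XML-banned characters to U+FFFD before tokenization.

    Assumption of this sketch: U+0000 is left alone here and handled by the
    parser itself.
    """
    return "".join(
        ch if ch == "\0" or is_xml_char(ch) else REPLACEMENT for ch in text
    )


def still_frameset_ok(text: str, *, swallow_nulls: bool,
                      fffd_special: bool) -> bool:
    """Toy model: is frameset-ok still "ok" after these character tokens?"""
    for ch in text:
        if ch == "\0":
            if swallow_nulls:
                continue          # null is dropped and never seen
            ch = REPLACEMENT      # assumed: null arrives as U+FFFD
        if ch in HTML_WHITESPACE:
            continue              # whitespace never flips the flag
        if ch == REPLACEMENT and fffd_special:
            continue              # U+FFFD exempted from flipping the flag
        return False              # any other character sets "not ok"
    return True


SPEC_AS_CHANGED = dict(swallow_nulls=False, fffd_special=True)
PROPOSAL = dict(swallow_nulls=True, fffd_special=False)

for label, rules in (("spec as changed", SPEC_AS_CHANGED),
                     ("proposal", PROPOSAL)):
    plain = still_frameset_ok("\x01", **rules)
    coerced = still_frameset_ok(coerce_to_infoset("\x01"), **rules)
    print(f"{label}: plain={plain} coerced={coerced}")
```

In this model, both rule sets keep frameset-ok at "ok" for a leading U+0000,
but only the proposal gives the same answer for U+0001 with and without
coercion.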


Received on Sunday, 12 September 2010 13:13:57 UTC