[Bug 11298] Surrogate catching doesn't belong in input stream preprocessing from bugzilla@jessica.w3.org on 2011-01-04 (public-html-bugzilla@w3.org from January 2011)

From: <bugzilla@jessica.w3.org>
Date: Tue, 04 Jan 2011 09:09:47 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1Pa2tn-0000ZP-HJ@jessica.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11298

--- Comment #2 from Henri Sivonen <hsivonen@iki.fi> 2011-01-04 09:09:46 UTC ---
Considering established practice, the spec makes a conceptual error when it
pretends that the parser operates on Unicode characters. In the real world, the
parser (in applications that support document.write) operates on UTF-16 code
units and document.write writes UTF-16 code units. If document.write writes
unpaired surrogates, they pass through the parser unchanged and unpaired
surrogates end up in the DOM. It's not worthwhile to prevent this as long as
scripted DOM manipulation can put unpaired surrogates in the DOM.
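
To make "unpaired surrogate" concrete, here is a minimal Python sketch that
models text as a list of 16-bit code units, roughly the way a
document.write-capable parser sees it. The helper names (is_high, is_low,
unpaired_surrogate_indices) are made up for this illustration and do not come
from the spec or from any implementation.

```python
def is_high(u: int) -> bool:
    """True for a high (lead) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    """True for a low (trail) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def unpaired_surrogate_indices(units: list[int]) -> list[int]:
    """Return the indices of surrogates that are not part of a high+low pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2  # well-formed pair: skip both code units
            continue
        if is_high(u) or is_low(u):
            bad.append(i)  # lone high or lone low surrogate
        i += 1
    return bad


# "a", a lone high surrogate, "b", then U+1F600 as a proper pair.
units = [0x0061, 0xD83D, 0x0062, 0xD83D, 0xDE00]
print(unpaired_surrogate_indices(units))
```

Nothing in a sequence like this is rejected when a script hands it to
document.write, which is why the lone code unit can end up in the DOM as-is.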

The conceptually realistic setup is thus:
 1) The parser operates on UTF-16 code units.
 2) The parser is responsible for munging U+0000 and carriage return.
 3) The parser is *not* responsible for touching unpaired surrogates.
 4) document.write writes UTF-16 code units (potentially including unpaired
surrogates).
 5) When the input is a byte stream, the process that converts input bytes into
UTF-16 code units is responsible for replacing bogus byte sequences with
U+FFFD. When the input byte stream is encoded in a flavor of UTF-16, unpaired
surrogates constitute bogus byte sequences.
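
The split of responsibilities in the list above can be sketched in Python as
two separate functions. This is only an illustration under assumptions: the
names decode_utf16le and parser_preprocess are hypothetical, the input is
assumed to be little-endian UTF-16 without a BOM, a truncated trailing byte is
assumed to count as a bogus sequence, U+0000 handling is left out to keep the
sketch short, and the whole input is processed as one buffer rather than in
document.write-sized chunks.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def decode_utf16le(data: bytes) -> list[int]:
    """Item 5: bytes -> code units; unpaired surrogates become U+FFFD."""
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data) - 1, 2)]
    if len(data) % 2:
        units.append(REPLACEMENT)  # assumption: odd trailing byte is bogus
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out += [u, nxt]  # valid pair passes through unchanged
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # lone surrogate in the byte stream
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def parser_preprocess(units: list[int]) -> list[int]:
    """Items 2 and 3: normalize CR and CRLF to LF, leave surrogates alone."""
    out = []
    i = 0
    while i < len(units):
        if units[i] == 0x000D:
            out.append(0x000A)
            if i + 1 < len(units) and units[i + 1] == 0x000A:
                i += 1  # swallow the LF of a CRLF pair
        else:
            out.append(units[i])  # includes any unpaired surrogate
        i += 1
    return out


# document.write path: code units reach the parser directly.
written = [0x0061, 0xD83D, 0x000D, 0x000A]
print([hex(u) for u in parser_preprocess(written)])

# Byte stream path: the decoder runs first, then the parser.
raw = bytes([0x61, 0x00, 0x3D, 0xD8, 0x62, 0x00])
print([hex(u) for u in parser_preprocess(decode_utf16le(raw))])
```

The point of keeping the two functions separate is that the same lone
surrogate survives on the document.write path but is replaced on the byte
stream path, without the parser itself ever inspecting surrogates.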


Received on Tuesday, 4 January 2011 09:09:48 UTC