Re: [css-syntax] Defining "character" from Simon Pieters on 2013-08-13 (www-style@w3.org from August 2013)

From: Simon Pieters <simonp@opera.com>
Date: Tue, 13 Aug 2013 11:28:19 +0200
To: "Zack Weinberg" <zackw@panix.com>
Cc: "Simon Sapin" <simon.sapin@exyr.org>, "Tab Atkins Jr." <jackalmage@gmail.com>, "www-style list" <www-style@w3.org>
Message-ID: <op.w1q0lhqzidj3kv@simons-macbook-pro.local>

On Mon, 12 Aug 2013 22:09:47 +0200, Zack Weinberg <zackw@panix.com> wrote:

> On Mon, Aug 12, 2013 at 11:47 AM, Simon Pieters <simonp@opera.com> wrote:
>> On Mon, 12 Aug 2013 19:36:37 +0200, Tab Atkins Jr. <jackalmage@gmail.com> wrote:
>>>
>>> If implementations are willing to change, I'm fine with specifying
>>> that unpaired surrogates get transformed into U+FFFD at CSS parse
>>> time.
>
> I wouldn't hesitate to make that change in Gecko.  We use UTF-16
> internally for everything (alas), so it would be a little fiddly, but
> not *that* fiddly.
>
>> Doing that seems like a slight perf cost and basically no benefit. The
>> DOM API and document.write in HTML just let lone surrogates through.
>> I'd say we do that in CSS for stuff coming from CSSOM also.
>
> Is that intentional in HTML5 or just an oversight?  If it's
> intentional, I suppose we ought to do the same for overall
> consistency's sake.

It is intentional. The HTML spec's parser was previously defined to
operate on code points, but implementations never actually worked that
way, and at least Henri Sivonen refused to implement it in Gecko's HTML
parser [1], so the spec was changed to let lone surrogates from
document.write through.

[1] http://lists.w3.org/Archives/Public/public-whatwg-archive/2011Nov/0020.html
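
To make the two options discussed in this thread concrete, here is a
minimal sketch in TypeScript, whose strings are sequences of UTF-16 code
units. The function names replaceLoneSurrogates and passThrough are
hypothetical and come from no spec or browser source. The first shows what
"transform unpaired surrogates into U+FFFD" could look like, and the second
is the "let lone surrogates through" behaviour, which amounts to doing
nothing.

```typescript
// Hypothetical helper for illustration only; not taken from any spec
// or implementation. Replaces each unpaired UTF-16 surrogate code unit
// with U+FFFD and leaves well-formed surrogate pairs untouched.
function replaceLoneSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < input.length) {
      const next = input.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Well-formed pair: copy both code units and skip the low half.
        out += input.charAt(i) + input.charAt(i + 1);
        i++;
        continue;
      }
    }
    // Any surrogate that reaches this point is unpaired.
    out += isHigh || isLow ? "\uFFFD" : input.charAt(i);
  }
  return out;
}

// The alternative: hand the code units on unchanged (assumed to model
// what "letting lone surrogates through" means for a UTF-16 string).
const passThrough = (input: string): string => input;
```

The replacement variant has to inspect every code unit of the input, which
is where the perf cost mentioned above would come from.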

-- 
Simon Pieters
Opera Software

Received on Tuesday, 13 August 2013 09:23:29 UTC