
Re: Non-character NCRs treated differently from literal non-characters

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 10 Jun 2009 09:40:40 +0300
Cc: "public-html@w3.org WG" <public-html@w3.org>
Message-Id: <762F096F-F27B-4A5E-A0D7-1A85525D6A9B@iki.fi>
To: Ian Hickson <ian@hixie.ch>
On Jun 10, 2009, at 07:49, Ian Hickson wrote:

> On Mon, 18 May 2009, Henri Sivonen wrote:
>> Literal non-characters don't turn into REPLACEMENT CHARACTER
> As far as I can tell, they do:
> # Bytes or sequences of bytes in the original byte stream that could not
> # be converted to Unicode characters must be converted to U+FFFD
> (Unless I'm misunderstanding what you mean?)

To me, the part of the spec you quoted is talking about malformed byte  
sequences. If you meant it to also talk about valid UTF-8 byte  
sequences that expand to e.g. U+FFFF, the spec text has failed to  
convey the intention.

However, I object to adding more cases where code points decoded from  
valid UTF-8 streams are made arbitrarily magic. It's bad enough that
U+0000 and CR are magic. They both have a disproportionate impact on  
the design of a parser implementation given that they don't enable any  
useful features.

Having more magic characters in an implementation that inlines the  
reading of the input buffer into the tokenizer loop would unreasonably  
bloat the code size. OTOH, not inlining the read action would add a  
per-code unit function call, which isn't nice, either. Furthermore, it  
seems that mapping non-characters to U+FFFD has no basis in supporting  
existing content. It seems to be a new theoretical purity thing.

Even further, making non-characters require action after a vanilla  
character encoding decoder has done its thing violates the design  
principle (not written in Design Principles) that the parser only ever  
needs to dispatch on characters in the Basic Latin range. (Obviously,  
requiring astral non-characters to be replaced poses a similar problem  
to UTF-16-based tokenizers that the failure to keep actionable  
characters in Basic Latin poses to UTF-8-based tokenizers. Aside: Even  
XML isn't theoretically pure enough to require parsers to check for  
astral non-characters.)

>> But why, then, do non-character NCRs turn into REPLACEMENT CHARACTER?
> For consistency.

In that case, I suggest letting non-characters end up in the DOM in  
both cases as is traditional.

Henri Sivonen
Received on Wednesday, 10 June 2009 06:41:19 UTC
