- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 10 Jun 2009 09:40:40 +0300
- To: Ian Hickson <ian@hixie.ch>
- Cc: "public-html@w3.org WG" <public-html@w3.org>
On Jun 10, 2009, at 07:49, Ian Hickson wrote:

> On Mon, 18 May 2009, Henri Sivonen wrote:
>>
>> Literal non-characters don't turn into REPLACEMENT CHARACTER
>
> As far as I can tell, they do:
>
> # Bytes or sequences of bytes in the original byte stream that could not
> # be converted to Unicode characters must be converted to U+FFFD
> # REPLACEMENT CHARACTER code points.
>
> (Unless I'm misunderstanding what you mean?)

To me, the part of the spec you quoted is talking about malformed byte
sequences. If you meant it to also talk about valid UTF-8 byte
sequences that expand to e.g. U+FFFF, the spec text has failed to
convey the intention.

However, I object to adding more cases where code points decoded from
valid UTF-8 streams are made arbitrarily magic. It's bad enough that
U+0000 and CR are magic. They both have a disproportionate impact on
the design of a parser implementation given that they don't enable any
useful features.

Having more magic characters in an implementation that inlines the
reading of the input buffer into the tokenizer loop would unreasonably
bloat the code size. OTOH, not inlining the read action would add a
per-code unit function call, which isn't nice, either.

Furthermore, it seems that mapping non-characters to U+FFFD has no
basis in supporting existing content. It seems to be a new theoretical
purity thing.

Even further, making non-characters require action after a vanilla
character encoding decoder has done its thing violates the design
principle (not written in Design Principles) that the parser only ever
needs to dispatch on characters in the Basic Latin range. (Obviously,
requiring astral non-characters to be replaced poses a similar problem
to UTF-16-based tokenizers that the failure to keep actionable
characters in Basic Latin poses to UTF-8-based tokenizers. Aside: Even
XML isn't theoretically pure enough to require parsers to check for
astral non-characters.)
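
To make the distinction above concrete, here is a minimal sketch in
Python. It uses Python's built-in UTF-8 decoder as a stand-in for a
"vanilla character encoding decoder"; the helper name is_noncharacter
is hypothetical and is not taken from the spec or from any parser. The
sketch only illustrates that malformed bytes become U+FFFD while valid
encodings of non-characters pass through, so replacing them would need
a separate check outside the Basic Latin range.

```python
def is_noncharacter(cp: int) -> bool:
    """Hypothetical helper: True for the 66 Unicode noncharacters.

    These are U+FDD0..U+FDEF plus the last two code points of each of
    the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFF).
    """
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Valid UTF-8 for U+FFFF: the decoder passes it through unchanged.
valid = b"\xef\xbf\xbf".decode("utf-8", errors="replace")

# Malformed input (0xFF never occurs in UTF-8): replaced by U+FFFD.
malformed = b"\xff".decode("utf-8", errors="replace")

# Valid UTF-8 for the astral noncharacter U+1FFFF: also passed through.
astral = b"\xf0\x9f\xbf\xbf".decode("utf-8", errors="replace")

for label, text in (("valid", valid), ("malformed", malformed), ("astral", astral)):
    cp = ord(text)
    print(f"{label}: U+{cp:04X} noncharacter={is_noncharacter(cp)}")
```

Note that the check for the astral case has to look at a full code
point, which in a UTF-16-based tokenizer means inspecting a surrogate
pair rather than a single code unit.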

>> But why, then, do non-character NCRs turn into REPLACEMENT CHARACTER?
>
> For consistency.

In that case, I suggest letting non-characters end up in the DOM in
both cases as is traditional.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 10 June 2009 06:41:19 UTC