- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 22 Apr 2008 11:18:33 +0000 (UTC)
On Fri, 1 Dec 2006, Elliotte Harold wrote:
>
> In 9.1.3 we see
>
> Text must consist of valid Unicode characters other than U+0000. Text should
> not contain control characters other than space characters.
>
>
> Later in 9.2.3.1 we find:
>
> If the number is not a valid Unicode character (e.g. if the number is higher
> than 1114111), or if the number is zero, then return a character token for the
> U+FFFD REPLACEMENT CHARACTER character instead.
>
>
> I do not think the Unicode spec defines the notion of a "valid Unicode
> character". (It does define a valid Unicode code unit sequence, but that's a
> little different. A code unit sequence generally consists of more than one
> character.) Thus I suggest we need to be more precise here about what is and
> is not a valid Unicode character.

The spec is much more precise now. Is it ok?


> In particular:
>
> 1. Are private use characters allowed?

Yes.

> 2. Are control characters allowed (probably yes, based on other parts of
>    the spec).

Not as raw characters. Control characters that aren't in U+80-U+9F are
allowed as entities.

> 3. Are surrogate characters allowed? (probably no)

No.

> 4. Are non-characters beyond 10FFFF allowed (no)

No.

> 5. Are reserved but currently undefined characters allowed (yes)

Yes.

> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> 7. Are the noncharacters from the last two characters of each plane
>    allowed (?)

Not as raw characters but, for now, as entities yes.


On Sun, 3 Dec 2006, Henri Sivonen wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> >
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML. In particular, the form feed character is
> > pretty popular,
> >
> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
>
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without dataloss. This is important in order
> to be able to promise that an "XML tool chain" can be used for
> processing *conforming* HTML5 by sticking an HTML5 parser in front of
> the processing pipeline (for *non-browser* use cases like data mining,
> content management or conformance checking where scripts aren't executed
> nor CSS rendering performed). The motivation is to make processing HTML5
> in non-browser apps less expensive without giving an incentive for the
> solutions to violate the spec ad hoc on their own.
>
> For example, an "XML tool chain" is important enough for my conformance
> checking service that if at this point the assumption of *conforming*
> HTML5 being convertible to XHTML5 was broken in corner cases, I'd
> probably come up with ad hoc trickery for masking it instead of throwing
> away the tool chain. I'd prefer not having to do that and not having to
> explain to everyone else who finds an "XML tool chain" to be of value
> what tricks I needed to pull off to fake it.
>
> I am not suggesting that HTML5 browsers halt and catch fire upon finding
> a form feed. And it is obvious that lossless conversion of all possible
> non-conforming HTML5 documents to XML is impossible anyway, so making
> that a goal would not be worthwhile.
>
> But what legitimate and popular use would a form feed have in HTML5? Why
> can't we call it non-conforming? Are there use cases other than
> converting .txt RFCs to HTML with regexps without bothering to get rid
> of the form feeds?

I don't think that it would be valuable to make that use case raise
errors.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 22 April 2008 04:18:33 UTC
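
The numbered answers in the message above amount to a small classification of code points, which may be easier to follow as code. The following is only a rough sketch of those answers, not the spec's actual wording or algorithm. All function names are hypothetical. It assumes that "space characters" means tab, LF, FF, CR and space (taken from later HTML5 drafts, not from this message), that the answer given after question 7 covers question 6 as well, and that surrogates count as "not a valid Unicode character" for numeric references.

```python
MAX_CODE_POINT = 0x10FFFF          # 1114111, the limit quoted from 9.2.3.1
REPLACEMENT_CHARACTER = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER

# Assumption: the HTML5 "space characters" (tab, LF, FF, CR, space).
SPACE_CHARACTERS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}


def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_control(cp: int) -> bool:
    # C0 controls, DEL, and the C1 controls U+0080..U+009F.
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def in_unicode_range(cp: int) -> bool:
    # Answers 3 and 4: no surrogates, nothing beyond U+10FFFF.
    return 0 <= cp <= MAX_CODE_POINT and not is_surrogate(cp)


def allowed_raw(cp: int) -> bool:
    """May this code point appear literally in the text of a document?"""
    if cp == 0 or not in_unicode_range(cp):
        return False
    if is_noncharacter(cp):                       # answers 6 and 7
        return False
    if is_control(cp) and cp not in SPACE_CHARACTERS:
        return False                              # answer 2
    return True   # private use (1) and unassigned (5) code points pass


def allowed_as_reference(cp: int) -> bool:
    """May this code point be written as a numeric character reference?"""
    if cp == 0 or not in_unicode_range(cp):
        return False
    if 0x80 <= cp <= 0x9F:                        # answer 2: C1 excluded
        return False
    return True   # other controls and, "for now", noncharacters


def resolve_numeric_reference(number: int) -> str:
    """Parser-side rule quoted from 9.2.3.1: bad numbers become U+FFFD."""
    if number == 0 or not in_unicode_range(number):
        return REPLACEMENT_CHARACTER
    return chr(number)
```

Under these assumptions a form feed (U+000C) passes `allowed_raw`, which matches the closing remark that the RFC-conversion case need not raise errors, while U+FDD0 fails `allowed_raw` but passes `allowed_as_reference`.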