[whatwg] Space characters

On Jun 15, 2007, at 04:09, Ian Hickson wrote:

> On Mon, 6 Nov 2006, Henri Sivonen wrote:
>> On Nov 6, 2006, at 07:34, Ian Hickson wrote:
>>> On Sun, 5 Nov 2006, Henri Sivonen wrote:
>>>> Is there a reason why the definition of space characters does not
>>>> match the XML 1.0 and RELAX NG definition of white space (space,
>>>> tab, CR, LF) but also includes (line tabulation and form feed)? Is
>>>> the deviation from XML 1.0 needed for backwards compatibility with
>>>> text/html UAs?
>>> I made the parser consider VT and FF as being whitespace based on, as
>>> I recall, a complete examination of every Unicode character's
>>> behaviour in the parsers I was testing. The definition of "space
>>> characters" matches the parser's behaviour for consistency.
>>> The definition of "space characters" doesn't affect the XML parser
>>> stage as far as I can recall, only attribute parsing and DOM
>>> conformance.
>> The potential problem with it affecting DOM conformance is that it may
>> have ripple effects to running XML tooling inside a browser engine.
>> Gecko has an XPath implementation. Disruptive Innovations has created a
>> RELAX NG implementation for Gecko. Running the schemas from
>> syntax.whattf.org on a DOM inside Gecko would be interesting, since it
>> would allow checking DOM snapshots modified by scripts. There may be
>> other reasons to run XML machinery on an HTML DOM in a browser. Both
>> XPath and RELAX NG assume that white space-separated tokens follow the
>> XML notion of white space. Not being able to use the native XPath and
>> RELAX NG notions of splitting on white space would be seriously uncool.
>> Of course, a browser engine might get away with tampering with the XPath
>> or RELAX NG notions of white space since the additional characters don't
>> occur in XML. But does it make sense to inflict the cost of such
>> tweaking on the XML parts of browser engines?
>> Would there be serious compatibility problems if the HTML5 parsing
>> algorithm required VT and FF to be mapped to space (after expanding
>> NCRs) and the higher-level parts of the spec defined white space as
>> space, tab, CR and LF?
> Well, I don't much care about VT, but I really think we should round-trip
> form feed. Consider, for instance, RFCs, which have form feeds. I don't
> like the idea of dropping them on the floor when you convert RFCs to HTML
> and back to text again.

I see. If there has to be a special XML incompatibility case for FF
anyway, the second one for VT comes for free.

I'm addressing this issue in my tokenizer implementation by allowing
the user of the library to opt to make the tokenizer non-conforming
but XML 1.0-compatible, either by treating FF and VT as fatal errors
or by mapping them to U+0020. As far as I can tell, in this case even
the fatal error treatment is non-conforming, because FF and VT
haven't been defined as parse errors.
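
To illustrate the two opt-in behaviours, here is a minimal sketch in
Python. It is not the tokenizer's actual code: the names
XmlCompatPolicy, FatalCharacterError, apply_policy and xml_tokens are
hypothetical, and the sketch assumes it runs on text in which character
references have already been expanded.

```python
import re
from enum import Enum

FF = "\u000C"  # form feed
VT = "\u000B"  # line tabulation


class XmlCompatPolicy(Enum):
    """Hypothetical option names, not taken from any real library."""
    CONFORMING = "conforming"      # leave FF and VT untouched
    FATAL = "fatal"                # refuse input containing FF or VT
    MAP_TO_SPACE = "map-to-space"  # replace FF and VT with U+0020


class FatalCharacterError(Exception):
    """Raised under the FATAL policy when FF or VT is seen."""


def apply_policy(text: str, policy: XmlCompatPolicy) -> str:
    """Apply the chosen FF/VT handling to already-decoded text."""
    if policy is XmlCompatPolicy.CONFORMING:
        return text
    if policy is XmlCompatPolicy.FATAL:
        for offset, ch in enumerate(text):
            if ch in (FF, VT):
                raise FatalCharacterError(
                    f"U+{ord(ch):04X} at offset {offset}"
                )
        return text
    # MAP_TO_SPACE: both characters become a plain space.
    return text.translate({ord(FF): " ", ord(VT): " "})


def xml_tokens(text: str) -> list[str]:
    """Split on the XML 1.0 notion of white space: space, tab, CR, LF."""
    return [tok for tok in re.split(r"[ \t\r\n]+", text) if tok]
```

With MAP_TO_SPACE, a value such as "a<FF>b c" splits into three tokens
under the XML definition of white space; left untouched, the same value
splits into only two, because FF is not XML white space.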

Henri Sivonen
hsivonen at iki.fi

Received on Sunday, 17 June 2007 08:26:17 UTC