CR and LF in the input stream / as NCRs (detailed review of parsing algorithm)

(This is part of my detailed review of the parsing algorithm.)

In http://www.whatwg.org/specs/web-apps/current-work/#consume the spec
states that &#13; is a parse error. Is this intentional?


The handling of &#13;, &#10;, CRs and LFs, and their combinations, seems
to differ a bit between browsers.

    http://simon.html5.org/test/html/parsing/tokenisation/entities/carriage-return/demo.htm


In Opera, CRs and LFs are preserved in the DOM as they were written. CR is
inserted for &#13; and LF for &#10;. A CRLF pair in the DOM is rendered as
a single linebreak.

In IE, CRLF pairs are converted to a single CR, and the remaining LFs are
converted to CRs. It doesn't matter whether they came from real characters
in the input stream or from NCRs.

In Safari, a LF character in the input stream is ignored if the previous
character was a CR (whether real or NCR). CRs (both real and NCRs) are
then converted to LFs. LFs are inserted for both &#13; and &#10;.
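
A rough Python sketch of the Safari behaviour as described above. It models this description only, not WebKit's actual code; the names TOKEN and safari_model are made up for illustration, and only the decimal forms &#13; and &#10; are recognised.

```python
import re

# Split the markup into tokens: the two decimal NCRs, or any single character.
TOKEN = re.compile(r"&#13;|&#10;|.", re.DOTALL)

def safari_model(markup: str) -> str:
    """Hypothetical model of the Safari behaviour described in this post."""
    out = []
    prev_was_cr = False
    for tok in TOKEN.findall(markup):
        if tok == "\n" and prev_was_cr:
            # A real LF right after a CR (real or NCR) is dropped.
            prev_was_cr = False
            continue
        if tok in ("\r", "&#13;"):
            # CRs, real or NCR, end up as LF in the DOM.
            out.append("\n")
            prev_was_cr = True
        elif tok == "&#10;":
            # An LF written as an NCR is always inserted.
            out.append("\n")
            prev_was_cr = False
        else:
            out.append(tok)
            prev_was_cr = False
    return "".join(out)
```

Under this model a real CRLF and "&#13;" followed by a real LF both give one LF, while "&#13;&#10;" gives two.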

In Firefox, CRLF pairs in the input stream are converted to LF and
remaining CRs to LFs. LFs are inserted for both &#13; and &#10;.
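
The Firefox behaviour can be sketched the same way. This is a minimal model under one assumption: newline normalisation sees only real characters, so an NCR never pairs up with a neighbouring CR or LF. The name firefox_model is hypothetical, and only the decimal NCR forms are handled.

```python
def firefox_model(markup: str) -> str:
    """Hypothetical model of the Firefox behaviour described in this post."""
    # Step 1: normalise real newlines in the input stream
    # (CRLF becomes LF, then any remaining CR becomes LF).
    text = markup.replace("\r\n", "\n").replace("\r", "\n")
    # Step 2: both NCRs expand to LF, independently of their neighbours
    # (an assumption of this sketch).
    return text.replace("&#13;", "\n").replace("&#10;", "\n")
```

With that assumption, "&#13;" followed by a real LF yields two LFs here, which is where this model would differ from the Safari description.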


The spec currently matches Firefox, AFAICT. Rendering-wise, there is  
interop between IE and Opera. I think the spec should require what IE  
does, except use LFs instead of CRs.
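
A minimal sketch of that suggestion, assuming "what IE does" means that NCRs and real characters are treated alike before newlines are collapsed. The name proposed_model is made up for illustration, and only the decimal NCR forms are handled.

```python
def proposed_model(markup: str) -> str:
    """Hypothetical model: IE's rules, but producing LF instead of CR."""
    # Treat NCRs exactly like the real characters they stand for.
    text = markup.replace("&#13;", "\r").replace("&#10;", "\n")
    # Collapse each CRLF pair to a single LF, then turn leftover CRs into LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Here "&#13;&#10;", a real CRLF, and any mix of the two all come out as a single LF, so the DOM never contains a CR.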

-- 
Simon Pieters
Opera Software

Received on Tuesday, 31 July 2007 01:04:42 UTC