CR and LF in the input stream / as NCRs (detailed review of parsing algorithm) from Simon Pieters on 2007-07-31 (public-html@w3.org from July 2007)

From: Simon Pieters <simonp@opera.com>
Date: Tue, 31 Jul 2007 03:04:26 +0200
To: public-html <public-html@w3.org>
Message-ID: <op.twa09oa6idj3kv@hp-a0a83fcd39d2>

(This is part of my detailed review of the parsing algorithm.)

In http://www.whatwg.org/specs/web-apps/current-work/#consume the spec  
states that &#13; is a parse error. Is this intentional?


The handling of &#10;, &#13;, CRs and LFs, and their combinations, seems  
to differ somewhat between browsers.

    http://simon.html5.org/test/html/parsing/tokenisation/entities/carriage-return/demo.htm


In Opera, CRs and LFs are preserved in the DOM as they were written. CR is  
inserted for &#13; and LF for &#10;. A CRLF pair in the DOM is rendered as  
a single linebreak.

In IE, CRLF pairs are converted to a single CR, and the remaining LFs are  
converted to CRs. It doesn't matter whether they came from real characters  
in the input stream or from NCRs.

In Safari, an LF character in the input stream is ignored if the previous  
character was a CR (whether real or NCR). CRs (both real and NCRs) are  
then converted to LFs. LFs are inserted for both &#10; and &#13;.

In Firefox, CRLF pairs in the input stream are converted to LF and  
remaining CRs to LFs. LFs are inserted for both &#10; and &#13;.
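
To make the differences easier to compare, here is a rough Python sketch of the four behaviours as described above. It is only a model of this description, not of any browser's actual code: the function names and the (char, from_ncr) token representation are hypothetical, only the decimal forms &#10; and &#13; are decoded, and the Safari rule is read as "a real LF is dropped after any CR, real or NCR".

```python
import re

# Hypothetical tokeniser: decodes only &#10; and &#13;, everything else is literal.
_TOKEN = re.compile(r"&#(10|13);|.", re.DOTALL)

def tokens(src):
    """Return a list of (char, from_ncr) pairs for the source string."""
    out = []
    for m in _TOKEN.finditer(src):
        if m.group(1):
            out.append((chr(int(m.group(1))), True))
        else:
            out.append((m.group(0), False))
    return out

def opera(src):
    # Everything is preserved as written; NCRs map to the character they name.
    return "".join(ch for ch, _ in tokens(src))

def ie(src):
    # Real characters and NCRs are treated alike: CRLF -> CR, remaining LF -> CR.
    text = "".join(ch for ch, _ in tokens(src))
    return text.replace("\r\n", "\r").replace("\n", "\r")

def safari(src):
    # A real LF is dropped after a CR (real or NCR); every CR then becomes LF.
    out, prev_cr = [], False
    for ch, from_ncr in tokens(src):
        if ch == "\n" and not from_ncr and prev_cr:
            prev_cr = False
            continue
        prev_cr = ch == "\r"
        out.append("\n" if ch == "\r" else ch)
    return "".join(out)

def firefox(src):
    # Only real CRLF pairs collapse; any other CR (real or NCR) becomes LF.
    toks, out, i = tokens(src), [], 0
    while i < len(toks):
        ch, from_ncr = toks[i]
        if ch == "\r":
            if not from_ncr and i + 1 < len(toks) and toks[i + 1] == ("\n", False):
                i += 1  # swallow the real LF of a real CRLF pair
            out.append("\n")
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Under this model, an input such as "&#13;" followed by a real LF is where Safari and Firefox part ways: the Safari function drops the LF, while the Firefox function keeps it as a second linebreak.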


The spec currently matches Firefox, AFAICT. Rendering-wise, there is  
interop between IE and Opera. I think the spec should require what IE  
does, except use LFs instead of CRs.
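
Read literally, that suggestion would amount to something like the following Python sketch. The name proposed_normalize is hypothetical, only the decimal forms &#10; and &#13; are handled, and the treatment of a lone CR (converted to LF) is an assumption that follows from swapping CR for LF in the IE behaviour.

```python
import re

def proposed_normalize(src):
    """IE-like handling with LF as the result: real characters and NCRs are
    treated alike, CRLF collapses to one LF, and a lone CR becomes LF."""
    # Assumption: decode the two NCRs first so they behave like real characters.
    text = src.replace("&#13;", "\r").replace("&#10;", "\n")
    # A CR optionally followed by LF becomes a single LF; lone LFs are untouched.
    return re.sub(r"\r\n?", "\n", text)
```

With this sketch, a CRLF pair produces one LF whether it is written as real characters, as NCRs, or as a mix of the two.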

-- 
Simon Pieters
Opera Software

Received on Tuesday, 31 July 2007 01:04:42 UTC