[Bug 11298] New: Surrogate catching doesn't belong in input stream preprocessing from bugzilla@jessica.w3.org on 2010-11-11 (public-html@w3.org from November 2010)

From: <bugzilla@jessica.w3.org>
Date: Thu, 11 Nov 2010 11:55:58 +0000
To: public-html@w3.org
Message-ID: <bug-11298-2495@http.www.w3.org/Bugs/Public/>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11298

           Summary: Surrogate catching doesn't belong in input stream
                    preprocessing
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: hsivonen@iki.fi
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


The spec says:
"Code points in the range U+D800 to U+DFFF in the input must be replaced by
U+FFFD REPLACEMENT CHARACTERs."

This doesn't really belong in the parser. Text inserted via document.write()
is already UTF-16 and should not be subject to lone surrogate replacement:
doing so would add complexity without any backwards-compatibility need.

Instead, the spec should have a note saying that character decoders for UTF-8,
UTF-16 and similar encodings (GB18030 maybe?) are required to emit U+FFFD for
bogus byte sequences, and that byte sequences decoding to surrogates in UTF-8,
as well as lone surrogates in UTF-16, count as bogus.
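
To illustrate the proposed split, here is a rough sketch in Python rather
than in any browser's actual code. It assumes that Python's built-in codecs
with errors="replace" behave like the decoders the note would describe. The
helper names decode_network_bytes and has_surrogate are hypothetical, and
script_text merely stands in for a string passed to document.write().

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_network_bytes(data: bytes, encoding: str) -> str:
    """Decode a byte stream, emitting U+FFFD for bogus sequences.

    Assumption: Python's codecs stand in for the decoders discussed above.
    With errors="replace" they substitute U+FFFD for malformed input,
    including surrogates encoded in UTF-8 and lone surrogates in UTF-16.
    """
    return data.decode(encoding, errors="replace")


def has_surrogate(text: str) -> bool:
    """Return True if any code point lies in the range U+D800 to U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# U+D800 encoded as if it were a scalar value in UTF-8 (bogus).
utf8_surrogate = b"a\xed\xa0\x80b"
# A lone high surrogate (0xD800) between "a" and "b" in UTF-16LE (bogus).
utf16_lone = b"a\x00\x00\xd8b\x00"

from_utf8 = decode_network_bytes(utf8_surrogate, "utf-8")
from_utf16 = decode_network_bytes(utf16_lone, "utf-16-le")

# Stand-in for script-inserted text: it never passes through a decoder,
# so under this proposal the lone surrogate would be left untouched.
script_text = "a\ud800b"

print(has_surrogate(from_utf8), has_surrogate(from_utf16))
print(has_surrogate(script_text))
```

In this sketch the surrogate handling lives entirely in the decoding step, so
a parser consuming the decoded text would need no surrogate check of its own.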

Received on Thursday, 11 November 2010 11:56:00 UTC