[Bug 24104] Clarify how encoders should deal with lone surrogates from bugzilla@jessica.w3.org on 2014-03-28 (www-international@w3.org from January to March 2014)

From: <bugzilla@jessica.w3.org>
Date: Fri, 28 Mar 2014 12:01:05 +0000
To: www-international@w3.org
Message-ID: <bug-24104-4285-H9osyrsQkC@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=24104

Anne <annevk@annevk.nl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bzbarsky@mit.edu,
                   |                            |hsivonen@hsivonen.fi,
                   |                            |simon.sapin@exyr.org

--- Comment #3 from Anne <annevk@annevk.nl> ---
I analyzed too quickly. In Gecko and Chrome, lone surrogates either never
reach the utf-8 encoder (they are replaced by U+FFFD beforehand) or are
replaced as part of the encoder. They do not result in an error, as that would
cause something of the form &#...; to be emitted rather than a straight U+FFFD.

Boris, Henri, Simon, do you have any preferences on how we arrange the encoder
setup? Should all encoders replace lone surrogates in the input stream with
U+FFFD or should we make encoders only take Unicode scalar values and let a
layer before handle the lone surrogates?
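
To make the second option concrete, here is a minimal sketch of a layer that
sits in front of an encoder and turns a sequence of UTF-16 code units into
Unicode scalar values, replacing lone surrogates with U+FFFD. The names
to_scalar_values and utf8_encode are hypothetical and only for illustration;
this is not how Gecko or Chrome structure their code.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def to_scalar_values(code_units):
    """Convert UTF-16 code units to scalar values.

    Well-formed surrogate pairs are combined; any surrogate that is not
    part of a pair (a lone surrogate) becomes U+FFFD.
    """
    out = []
    i = 0
    n = len(code_units)
    while i < n:
        unit = code_units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            trail = code_units[i + 1]
            out.append(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            out.append(REPLACEMENT)  # lone surrogate
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


def utf8_encode(scalar_values):
    """Encoder that only ever sees scalar values (assumed precondition)."""
    return "".join(chr(v) for v in scalar_values).encode("utf-8")
```

With this arrangement the encoder itself never has to know about surrogates:
utf8_encode(to_scalar_values([0x61, 0xD800, 0x62])) yields the bytes for "a",
U+FFFD, "b".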

It seems more pragmatic to have encoders take code points. Maybe I should
introduce a special lone surrogate error that replaces them with U+FFFD?
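
As a rough sketch of that alternative, an encoder that accepts arbitrary code
points could treat a surrogate code point as a special case and emit U+FFFD
for it directly, instead of going through the normal error path. The function
name utf8_encode_code_points is hypothetical, and the inline check merely
stands in for the "lone surrogate error" idea; it is not specification text.

```python
def utf8_encode_code_points(code_points):
    """Encode code points as utf-8, replacing surrogates with U+FFFD.

    Input is assumed to be a sequence of code points, so any surrogate
    seen here is by definition a lone surrogate.
    """
    out = bytearray()
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            # Stand-in for a dedicated "lone surrogate" error:
            # substitute U+FFFD rather than signalling a regular error.
            cp = 0xFFFD
        out += chr(cp).encode("utf-8")
    return bytes(out)
```

Here the replacement is part of the encoder, so callers can pass code points
straight through without a separate sanitizing step.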

-- 
You are receiving this mail because:
You are on the CC list for the bug.

Received on Friday, 28 March 2014 12:01:08 UTC