[Bug 24104] Clarify how encoders should deal with lone surrogates

From: <bugzilla@jessica.w3.org>
Date: Fri, 28 Mar 2014 11:52:34 +0000
To: www-international@w3.org
Message-ID: <bug-24104-4285-UZ9VhgGUrT@http.www.w3.org/Bugs/Public/>

--- Comment #2 from Anne <annevk@annevk.nl> ---
I tested this:

<meta charset=windows-1252>
<form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
<input name=a>
<script> document.querySelector("input").value = "\ud801" </script>
<input type=submit>
</form>

Gecko submits U+FFFD, Chrome gives back U+D801 (encoded as per <form> error
mode, as windows-1252 can express neither).

Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as utf-8
bytes, percent-encoded).

utf-16 gives the same result as utf-8, as expected.
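
The utf-8 result can probably be reproduced without a form, since TextEncoder
only emits utf-8 and converts its input to scalar values first. A minimal
sketch, assuming an environment with a global TextEncoder (a browser or
Node.js); percentEncode is a hypothetical helper name, not part of any API, and
it escapes every byte, which is simpler than what real form submission does:

```javascript
// Encode a string that consists of a single lone high surrogate.
const bytes = new TextEncoder().encode("\ud801");

// Hypothetical helper: write every byte as %XX (uppercase hex).
function percentEncode(byteArray) {
  return Array.from(
    byteArray,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const encoded = percentEncode(bytes);
console.log(encoded); // %EF%BF%BD, i.e. U+FFFD as utf-8 bytes
```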

So either each encoder's handler needs to catch the surrogate range and return
error with U+FFFD (Gecko), or it does not (Chrome). Gecko's behavior is slightly
saner, I suspect. I'll fix utf-8 and utf-16 to do this right away. I'm not sure
whom to consult about how we should change the rest.

Received on Friday, 28 March 2014 11:52:36 UTC