
[Bug 24104] Clarify how encoders should deal with lone surrogates

From: <bugzilla@jessica.w3.org>
Date: Fri, 28 Mar 2014 11:52:34 +0000
To: www-international@w3.org
Message-ID: <bug-24104-4285-UZ9VhgGUrT@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=24104

--- Comment #2 from Anne <annevk@annevk.nl> ---
I tested this:

<meta charset=windows-1252>
<form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
<input name=a> <script> document.querySelector("input").value = "\ud801"
</script>
<input type=submit>
</form>

Gecko emits U+FFFD, while Chrome gives back U+D801 (each encoded as per the
<form> error mode, as windows-1252 can express neither).
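
To make the two results concrete, here is a rough sketch of what the echoed
value would look like, assuming the <form> error mode writes an unencodable
code point as a decimal numeric character reference and the urlencoded
serializer then percent-encodes it. The helper name formErrorModeFallback is
made up for this illustration and is not taken from any spec or browser.

```javascript
// Hypothetical helper (not from any spec or browser source): builds the
// decimal numeric character reference for a code point the target encoding
// cannot represent, then percent-encodes it the way a urlencoded form body
// would carry it.
function formErrorModeFallback(codePoint) {
  return encodeURIComponent("&#" + codePoint + ";");
}

// Gecko-style: the lone surrogate is replaced by U+FFFD before encoding.
const geckoStyle = formErrorModeFallback(0xFFFD);

// Chrome-style: the lone surrogate U+D801 is passed through as-is.
const chromeStyle = formErrorModeFallback(0xD801);

console.log(geckoStyle, chromeStyle);
```

Under those assumptions the two values differ only in the number inside the
reference (65533 versus 55297).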

Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as
percent-encoded utf-8 bytes).

utf-16 gives the same result as utf-8, as expected.

So each encoder's handler either needs to catch the surrogate range and return
error with U+FFFD (Gecko), or not (Chrome). Gecko's behavior is slightly saner,
I suspect. I'll fix utf-8 and utf-16 to do this right away. Not sure whom to
consult on how we should change the rest.
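
As a minimal sketch of the Gecko-like option for utf-8 (not the spec text and
not either browser's code), lone surrogates can be caught while turning
JavaScript code units into code points, so the byte encoder never sees one.
The function names toScalarValues, utf8Bytes and percentEncodedUtf8 are
invented for this example, and for simplicity every byte is percent-encoded,
including ASCII, which a real form serializer would not do.

```javascript
// Convert a JavaScript string (UTF-16 code units) into code points,
// replacing any lone surrogate with U+FFFD.
function toScalarValues(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) {
      // Lead surrogate: valid only if a trail surrogate follows.
      const d = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (d >= 0xDC00 && d <= 0xDFFF) {
        out.push(0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00));
        i++; // consume the trail surrogate as well
      } else {
        out.push(0xFFFD);
      }
    } else if (c >= 0xDC00 && c <= 0xDFFF) {
      // Trail surrogate with no preceding lead surrogate.
      out.push(0xFFFD);
    } else {
      out.push(c);
    }
  }
  return out;
}

// Encode one code point (assumed not to be a surrogate) as utf-8 bytes.
function utf8Bytes(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Simplification: percent-encode every byte, ASCII included.
function percentEncodedUtf8(str) {
  return toScalarValues(str)
    .flatMap(utf8Bytes)
    .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodedUtf8("\ud801"));
```

With the test input from above, "\ud801", this yields the three utf-8 bytes of
U+FFFD, while a well-formed pair such as "\ud801\udc00" is encoded normally.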

Received on Friday, 28 March 2014 11:52:36 UTC
