Re: [whatwg] StringEncoding open issues from Glenn Maynard on 2012-08-07 (public-whatwg-archive@w3.org from August 2012)

From: Glenn Maynard <glenn@zewt.org>
Date: Mon, 6 Aug 2012 19:06:05 -0500
To: Joshua Bell <jsbell@chromium.org>
Cc: WHAT Working Group <whatwg@lists.whatwg.org>
Message-ID: <CABirCh89joixGkwziK9b4FZTvxYGMn+vXRhupuvW_MX8K-nz9w@mail.gmail.com>

I agree with Jonas that encoding should just use a replacement character
(U+FFFD for Unicode encodings, '?' otherwise), and that we should put off
other modes (eg. exceptions and user-specified replacement characters)
until there's a clear need.

My intuition is that encoding DOMString to UTF-16 should never have errors;
if there are dangling surrogates, pass them through unchanged.  There's no
point in using a placeholder that says "an error occurred here", when the
error can be passed through in exactly the same form (not possible with eg.
DOMString->SJIS).  I don't feel strongly about this only because outputting
UTF-16 is so rare to begin with.
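
A minimal sketch of that pass-through, assuming little-endian output and a
hypothetical helper named encodeUtf16LE (not part of any proposed API).
Each 16-bit code unit of the string is written out as-is, so a dangling
surrogate comes out as the same two bytes it would occupy anyway:

```javascript
// Hypothetical helper for illustration only.
function encodeUtf16LE(str) {
  var bytes = new Uint8Array(str.length * 2);
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);   // one 16-bit code unit, paired or not
    bytes[2 * i] = unit & 0xff;     // low byte first (little-endian)
    bytes[2 * i + 1] = unit >> 8;   // high byte second
  }
  return bytes;
}

// "a", a lone high surrogate, "b"
var out = encodeUtf16LE("a\uD800b");
// bytes: 61 00 00 D8 62 00
```

No error path is needed because every code unit has a direct two-byte
form; this is the property that DOMString->SJIS lacks.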

On Mon, Aug 6, 2012 at 1:29 PM, Joshua Bell <jsbell@chromium.org> wrote:

> - if the document is encoded in UTF-8, UTF-16LE or UTF-16BE and includes
> the byte order mark (the encoding-specific serialization of U+FEFF).

This rarely detects the wrong type, but that doesn't mean it's not the
wrong answer.  If my input is meant to be UTF-8, and someone hands me
BOM-marked UTF-16, I want it to fail in the same way it would if someone
passed in SJIS.  I don't want it silently translated.

On the other hand, it probably does make sense for UTF-16 to switch to
UTF-16BE, since that's by definition the original purpose of the BOM.

The convention iconv uses, which I think is a useful one, is decoding from
"UTF-16" means "try to figure out the encoding from the BOM, if any", and
"UTF-16LE" and "UTF-16BE" mean "always use this exact encoding".

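That convention could be sketched roughly as follows.  The function name,
the return shape, and the fallback used when "UTF-16" input has no BOM are
all assumptions made for illustration; the thread does not settle what the
no-BOM default should be.

```javascript
// Assumption: which byte order plain "UTF-16" falls back to without a BOM.
var FALLBACK = "utf-16le";

// Hypothetical helper: decide how to decode `bytes` given the requested label.
function pickUtf16Decoding(label, bytes) {
  var name = label.toLowerCase();
  // Exact labels never sniff: the caller asked for this byte order.
  if (name === "utf-16le" || name === "utf-16be") {
    return { encoding: name, bomLength: 0 };
  }
  if (name !== "utf-16") {
    throw new Error("not a UTF-16 label: " + label);
  }
  // Generic label: let the BOM, if any, choose the byte order.
  if (bytes.length >= 2) {
    if (bytes[0] === 0xff && bytes[1] === 0xfe) {
      return { encoding: "utf-16le", bomLength: 2 };
    }
    if (bytes[0] === 0xfe && bytes[1] === 0xff) {
      return { encoding: "utf-16be", bomLength: 2 };
    }
  }
  return { encoding: FALLBACK, bomLength: 0 };
}
```

With this shape, big-endian BOM-marked input decoded as "UTF-16" switches
to UTF-16BE, while the same input decoded as "UTF-16LE" is not switched and
is left to fail or produce garbage like any other mismatched input.
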
> * If one of the UTF encodings is specified AND the BOM matches then the
> leading BOM character (U+FEFF) MUST NOT be emitted in the output character
> sequence (i.e. it is silently consumed)

It's a little weird that

var data = readFile("user-supplied-file.txt"); // shortcutting for brevity
var s = new TextDecoder("utf-16").decode(data); // or utf-8
s = s.replace("a", "b");
var data2 = new TextEncoder("utf-16").encode(s);
writeFile("user-supplied-file.txt", data2);

causes the BOM to be quietly stripped away.  Normally if you're modifying a
file, you want to pass through the BOM (or lack thereof) untouched.

One way to deal with this could be:

var decoder = new TextDecoder("utf-16");
var s = decoder.decode(data);
s = s.replace("a", "b");
var data2 = new TextEncoder(decoder.encoding).encode(s);

where decoder.encoding is eg. "UTF-16LE-BOM" if a BOM was present, thus
preserving both the BOM and (for UTF-16) endianness.  I don't actually like
this, though, because I don't like the idea of decoder.encoding changing
after the decoder has already been constructed.

I think I agree with just stripping it, and people who want to preserve
BOMs on write-through can jump through the hoops manually (which aren't
terribly hard).
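
As a rough illustration of those manual hoops, the BOM can be split off the
raw bytes before decoding and glued back on after encoding.  splitBom and
joinBom are hypothetical helper names, and the sketch assumes the input is a
Uint8Array, that the encoder does not emit a BOM of its own, and that it
writes the same byte order the original BOM indicated:

```javascript
// Byte sequences of U+FEFF in each encoding discussed here.
var BOMS = [
  [0xef, 0xbb, 0xbf], // UTF-8
  [0xff, 0xfe],       // UTF-16LE
  [0xfe, 0xff]        // UTF-16BE
];

// Hypothetical helper: separate a leading BOM (if any) from the payload.
function splitBom(bytes) {
  for (var i = 0; i < BOMS.length; i++) {
    var bom = BOMS[i];
    var match = bytes.length >= bom.length;
    for (var j = 0; match && j < bom.length; j++) {
      if (bytes[j] !== bom[j]) match = false;
    }
    if (match) {
      return {
        bom: bytes.subarray(0, bom.length),
        body: bytes.subarray(bom.length)
      };
    }
  }
  return { bom: bytes.subarray(0, 0), body: bytes }; // no BOM present
}

// Hypothetical helper: put the saved BOM back in front of re-encoded bytes.
function joinBom(bom, body) {
  var out = new Uint8Array(bom.length + body.length);
  out.set(bom, 0);
  out.set(body, bom.length);
  return out;
}
```

The write-through case then becomes: split, decode parts.body, edit the
string, encode it, and write joinBom(parts.bom, encoded).  A file that had
no BOM gets an empty parts.bom, so it is written back without one.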

Another issue is "new TextDecoder('ascii').encoding" (and ISO-8859-1)
giving .encoding = "windows-1252".  That's strange, even when you know why
it's happening.

Is there any reason to expose the actual "primary" names?  It's not clear
that the "name" column in the Encoding spec is even intended to be exposed
to APIs; they look more like labels for specs to refer to internally.
(Anne?)  If there's no pressing reason to expose this, I'd suggest that the
.encoding attribute simply return the name that was passed to the
constructor.
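
To make the two options concrete, here is a toy sketch.  The table is a
tiny hand-picked excerpt of the label mapping rather than the real list,
and describeDecoder and its return fields are hypothetical names used only
for illustration:

```javascript
// Excerpt only: several labels map onto one "primary" name.
var LABEL_TO_NAME = {
  "ascii": "windows-1252",
  "iso-8859-1": "windows-1252",
  "windows-1252": "windows-1252",
  "utf-8": "utf-8"
};

// Hypothetical helper showing what .encoding could report under each option.
function describeDecoder(label) {
  var key = label.toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(LABEL_TO_NAME, key)) {
    throw new Error("unknown label: " + label);
  }
  return {
    primaryName: LABEL_TO_NAME[key], // option 1: expose the spec's name
    requestedLabel: label            // option 2: echo the constructor argument
  };
}

var d = describeDecoder("ascii");
// d.primaryName is "windows-1252"; d.requestedLabel is "ascii"
```

Under the second option the decoding behavior is unchanged; only the
string reported back to the caller differs.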

It's still not ideal (it's weird that asking for ASCII gives you something
other than ASCII in the first place), but it at least seems a bit less
strange.  The "nice" fix would be to implement actual ASCII, ISO-8859-1,
ISO-8859-9, etc. charsets, but that just means extra implementation work
(and some charset proliferation) without use cases.

-- 
Glenn Maynard

Received on Tuesday, 7 August 2012 00:07:06 UTC