Re: [whatwg] StringEncoding open issues from Glenn Maynard on 2012-08-16 (public-whatwg-archive@w3.org from August 2012)

From: Glenn Maynard <glenn@zewt.org>
Date: Wed, 15 Aug 2012 19:30:09 -0500
To: Joshua Bell <jsbell@chromium.org>
Cc: WHAT Working Group <whatwg@lists.whatwg.org>
Message-ID: <CABirCh9vnuR=R5Hykp-FU5cThSWBy2fH7pHp1Vkm_gjY-D7Gsw@mail.gmail.com>

On Tue, Aug 14, 2012 at 12:34 PM, Joshua Bell <jsbell@chromium.org> wrote:

>    - Create a decoder with TextDecoder() and if present a BOM will be
>    respected (and consumed) otherwise default to UTF-8
>

Let's not default to "autodetect Unicode formats".  It encourages people to
support UTF-16 when they may not mean to.  If BOM detection for both UTF-8
and UTF-16 is wanted, I'd suggest something explicit, like "utf-*".

If the argument to the ctor is optional, I think the default should be
purely UTF-8.
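
As a rough illustration of the difference, here is a minimal sketch of what
explicit, opt-in BOM detection might look like. The helper name
sniffUnicodeBOM and the Sniffed type are hypothetical and not part of the
proposed API; a caller that never invokes it gets plain UTF-8.

```typescript
// Hypothetical helper (not part of any proposed API): opt-in BOM sniffing.
// Callers that want "UTF-8 only" simply never call it.
type Sniffed = {
  encoding: "utf-8" | "utf-16le" | "utf-16be";
  bomLength: number; // number of leading bytes occupied by the BOM
};

function sniffUnicodeBOM(bytes: Uint8Array): Sniffed {
  // UTF-8 BOM: EF BB BF
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return { encoding: "utf-8", bomLength: 3 };
  }
  // UTF-16 little-endian BOM: FF FE
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return { encoding: "utf-16le", bomLength: 2 };
  }
  // UTF-16 big-endian BOM: FE FF
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return { encoding: "utf-16be", bomLength: 2 };
  }
  // No BOM: fall back to UTF-8, matching the suggested default.
  return { encoding: "utf-8", bomLength: 0 };
}
```

The point of keeping this separate is that supporting UTF-16 becomes a
visible decision in the calling code rather than a side effect of the
default.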

>  This gets easier if we restrict to encoding UTF-8 which typically doesn't
> include BOMs. But it's looking like there's enough desire to keep UTF-16
> encoding at the moment. Agree with just stripping it for now.
>

UTF-8 sometimes does have a BOM, especially on Windows, where applications
sometimes use it to distinguish UTF-8 from ACP (ANSI code page) text files
(which are just as common as ever--Windows has made no move away from
legacy encodings whatsoever).  Stripping the BOM can cause those
applications to misinterpret the files as ACP.

Anyway, even if the encoding API gives a "helper" for this, figuring out
how that works would probably be more effort for developers than just
peeking at the ArrayBuffer for the BOM and adding it back in manually.
(I'm pretty sure anybody who knows enough to pay attention to this in the
first place will have no trouble doing that.)  So, yeah, let's not worry
about this.
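
For reference, the manual approach could be sketched roughly as follows. The
function names hasUtf8BOM and prependUtf8BOM are made up for illustration,
and the sketch only covers the UTF-8 case discussed above.

```typescript
// Hypothetical helpers (names are illustrative only): check for a UTF-8 BOM
// in the input, and put one back on re-encoded output.
const UTF8_BOM = [0xef, 0xbb, 0xbf];

function hasUtf8BOM(buffer: ArrayBuffer): boolean {
  // Look only at the first three bytes (fewer if the buffer is shorter).
  const head = new Uint8Array(buffer, 0, Math.min(3, buffer.byteLength));
  return head.length === 3 && UTF8_BOM.every((b, i) => head[i] === b);
}

function prependUtf8BOM(encoded: Uint8Array): Uint8Array {
  // Allocate room for the BOM plus the payload, then copy both in.
  const out = new Uint8Array(encoded.length + UTF8_BOM.length);
  out.set(UTF8_BOM, 0);
  out.set(encoded, UTF8_BOM.length);
  return out;
}
```

A caller would check hasUtf8BOM() on the original buffer before decoding and,
if it returned true, call prependUtf8BOM() on the bytes it writes back out,
so that applications relying on the BOM still recognize the file as UTF-8.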

-- 
Glenn Maynard

Received on Thursday, 16 August 2012 00:30:38 UTC