- From: Glenn Maynard <glenn@zewt.org>
- Date: Wed, 15 Aug 2012 19:15:32 -0500
- To: Andrea Marchesini <amarchesini@mozilla.com>
- Cc: whatwg@whatwg.org
On Wed, Aug 15, 2012 at 6:14 AM, Henri Sivonen <hsivonen@iki.fi> wrote:

> As for the filenames, after an off-list discussion, I think the best
> solution is that UTF-8 is tried first but the ArchiveReader
> constructor takes an optional second argument that names a character
> encoding from the Encoding Standard. This will be known as the
> fallback encoding. If no fallback encoding is provided by the caller
> of the constructor, "Windows-1252" is set as the fallback encoding.
> When it ArchiveReader processes a filename from the zip archive, it
> first tests if the byte string is a valid UTF-8 string. If it is, the
> byte string is interpreted as UTF-8 when converting to UTF-16. If the
> filename is not a valid UTF-8 string, it is decoded into UTF-16 using
> the fallback encoding.

This would misinterpret filenames as UTF-8. For example, "黴雨.jpg" in a
CP932 (SJIS) ZIP is also legal UTF-8. This would happen even though the
user explicitly specified an encoding, and even though UTF-8 is
exceptionally rare in ZIPs (all Windows ZIP software outputs filenames
in the user's ACP, and many don't support UTF-8 at all).

On Wed, Aug 15, 2012 at 6:17 AM, Andrea Marchesini
<amarchesini@mozilla.com> wrote:

> I agree. I was thinking that the default encoding for filenames is:
> UTF-8. If filename is not a valid UTF-8 string we can use the
> caller-supplied encoding:

I hate to argue against defaulting to UTF-8, but very few ZIPs are
actually UTF-8. CP1252 as a default will at least often be correct, but
UTF-8 will almost never be. (The only straightforward way I know to
create a ZIP with UTF-8 filenames is with a *nix commandline client, and
most Windows software won't understand it.)

> var reader = new ArchiveReader(blob, "Windows-1252");
>
> If this fails, this filename/file will be excluded from the results.

There's no need. Decode with proper error handling, as specified in the
Encoding spec: http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html.
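
[Editorial illustration] The two behaviours discussed above (UTF-8 first with a fallback, versus always decoding with the caller's encoding and letting errors become U+FFFD) can be sketched with the `TextDecoder` API from the Encoding Standard. This is only a rough sketch under stated assumptions: `decodeFilename`, `isValidUtf8` and `decodeUtf8First` are hypothetical helper names, not part of any ArchiveReader proposal, and the byte arrays are made-up Windows-1252 examples rather than the CP932 filename mentioned in the message. It also assumes a runtime whose `TextDecoder` supports the `windows-1252` label (browsers, or Node.js built with full ICU).

```javascript
// Hypothetical helper: decode raw filename bytes with a named encoding.
// Non-fatal mode means malformed sequences become U+FFFD instead of throwing.
function decodeFilename(bytes, encoding) {
  return new TextDecoder(encoding, { fatal: false }).decode(bytes);
}

// Hypothetical helper: true if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false; // fatal mode throws a TypeError on malformed input
  }
}

// Sketch of the "UTF-8 first, then fallback" rule from the quoted proposal.
function decodeUtf8First(bytes, fallbackEncoding) {
  return isValidUtf8(bytes)
    ? decodeFilename(bytes, "utf-8")
    : decodeFilename(bytes, fallbackEncoding);
}

// "café.txt" as a Windows-1252 tool would store it (0xE9 is "é").
const legacyName = new Uint8Array([0x63, 0x61, 0x66, 0xE9, 0x2E, 0x74, 0x78, 0x74]);
const asUtf8 = decodeFilename(legacyName, "utf-8");          // "caf\uFFFD.txt"
const asCp1252 = decodeFilename(legacyName, "windows-1252"); // "café.txt"

// 0xC3 0xA9 is "Ã©" in Windows-1252, but it is also well-formed UTF-8 ("é").
const ambiguous = new Uint8Array([0xC3, 0xA9, 0x2E, 0x74, 0x78, 0x74]);
const guessed = decodeUtf8First(ambiguous, "windows-1252");  // "é.txt"
const explicit = decodeFilename(ambiguous, "windows-1252");  // "Ã©.txt"
```

In the second case the UTF-8-first rule overrides the encoding the caller asked for, which is the kind of misinterpretation objected to above. In the first case nothing has to be dropped: a wrong guess still yields a usable string containing U+FFFD.
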
This will give placeholder characters (U+FFFD); even if the whole
filename comes out unreadable, the file can still be read, selected from
a list, shown in a thumbnail view, and so on. Lots of uses aren't
dependent on filenames.

> > It should be possible to get the CRC32 of files, which ZIP stores in
> > the central directory. This both allows the user to perform checksum
> > verification himself if wanted, and all the other variously useful
> > things about being able to get a file's checksum without having to
> > read the whole file.
>
> can we have 'generic' archive API supporting CRC32?

Do you actually have any concrete plans for other archive formats? The
only others commonly used are TAR and RAR. TAR is unsuitable for
non-archive use (you have to scan the whole file to construct a file
list), and RAR is proprietary.

You could design a checksum API that uses the algorithm for a particular
format, but that's severe overdesign if it never supports anything but
ZIP. I wouldn't worry about this.

-- 
Glenn Maynard
Received on Thursday, 16 August 2012 00:16:03 UTC