Re: File hashing and progressive computation from Ryan Sleevi on 2012-11-04 (public-webcrypto-comments@w3.org from November 2012)

From: Ryan Sleevi <sleevi@google.com>
Date: Sun, 4 Nov 2012 03:45:12 -0800
To: Artem Skoretskiy <tonn81@gmail.com>
Cc: "public-webcrypto-comments@w3.org" <public-webcrypto-comments@w3.org>
Message-ID: <CACvaWvb_mrZW+Os7aABxgixZb8L1ddwBpGT8mTQ5Cz_JQcbjjg@mail.gmail.com>
On Nov 4, 2012 11:22 AM, "Artem Skoretskiy" <tonn81@gmail.com> wrote:
>
> Dear W3C community!
>
> I have some feedback regarding the current WebCryptoAPI draft (
> http://www.w3.org/TR/WebCryptoAPI/ or
> http://www.w3.org/2012/webcrypto/WebCryptoAPI/).
>
> I think the proposal is really great. There are some libraries that allow
> you to do similar things, but they are slow (being written in JavaScript)
> and non-standard (no support, and every implementation has a different
> interface). At the same time, crypto is required for recent data-rich web
> applications.
>
> I have some considerations from the developer's point of view, having
> worked a lot with cryptography in the browser for different purposes.
>
> One of the stated use cases is Data integrity protection (
> http://www.w3.org/2012/webcrypto/WebCryptoAPI/#data-integrity-protection).
>
> However, three key features are missing or unclear in the current
> documentation:
>
>   1. Support *MD5 hash* as digest algorithm
>   2. Support for *Blob as an input* parameter for
>   CryptoOperation.processData
>   3. *Progressive hash calculation*.
>
>
> Let me explain each of them and why they are *critical* for the stated
> use case.
>
> 1. *MD5 digest*.
>
> MD5 is the most popular algorithm nowadays for checking data integrity. It
> is fast, produces reasonable results and -- most importantly -- it is used
> in many systems now. Just as an example -- the online giant Amazon uses MD5
> to check the integrity of uploaded files and even returns an MD5 hash for
> each stored file. That is why -- for interoperability and real usage --
> MD5 should be added as one of the hashing algorithms. If you consider MD5
> too weak for signing and encryption -- allow only digest for this method.
>
>

I agree with the idea in general, but support for weak algorithms -
particularly those that are easily misused by developers - has been a point
of contention. Every algorithm specified is a tacit encouragement of use.

Are you sure that, at this point, the differences between MD5 and SHA-1 are
not minor to non-existent? Alternatively, are you sure a simple CRC is not
what is desired here?
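
For comparison, here is a small Python sketch (the language of the example
quoted further down) that computes all three kinds of check over the same
bytes. It uses only the standard hashlib and zlib modules; the payload and
variable names are illustrative assumptions, and nothing here reflects the
WebCrypto draft's own API. The CRC is a 32-bit checksum that only guards
against accidental corruption.

```python
import hashlib
import zlib

# Illustrative payload; in practice this would be the uploaded file's bytes.
payload = b"example upload contents"

checks = {
    "md5": hashlib.md5(payload).hexdigest(),    # 128-bit digest, 32 hex chars
    "sha1": hashlib.sha1(payload).hexdigest(),  # 160-bit digest, 40 hex chars
    # zlib.crc32 returns an integer; format it as 8 hex chars.
    "crc32": format(zlib.crc32(payload) & 0xFFFFFFFF, "08x"),
}

for name, value in checks.items():
    print(name, value)
```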

> 2. *Blob as input parameter* for CryptoOperation.processData
>
> There are two typical use cases for a digest -- getting the digest of
> generated data or getting the digest of a file. The second one is pretty
> common: an application would like to calculate a file digest before
> submission -- to ensure this file wasn't uploaded before or to ensure the
> file was not truncated or corrupted during submission.
>
> If you have a big file, the application will have to split the file into
> parts and feed the hasher piece by piece. Bearing in mind that both
> FileReader and CryptoOperation are asynchronous, it will take a lot of code
> to implement (even just to iterate over a file). It would be *REALLY* handy
> if you could just provide a Blob to the function.

There is already an open issue on this (the reference is not handy). It was
originally included, but removed for now due to various canonicalization
ambiguities regarding Blob that do not exist with ArrayBuffer.

As an implementer, yes, the desire is to support Blob/File as well.

>
>
> 3. *Progressive hash calculation*.
>
> There should be a way to generate a hash for a file that is bigger than
> the RAM or just too big to fit into browser memory. Our customers already
> use multi-gigabyte files and they are looking to use 60 GB files in the
> browser.
>
> The typical approach is the following:
> 1) initialize an empty hasher
> 2) iterate over chunks of your data and feed hasher with a chunk
> 3) get resulting hash
>
> As sample Python code:
>
> import hashlib
> hasher = hashlib.sha1()
> for chunk in [b"a", b"b"]:  # update() takes bytes
>     hasher.update(chunk)
> print(hasher.hexdigest())  #=> da23614e02469a0d7c7bd1bdab5c9c474b1904dc
>
> So progressive hash generation is a *MUST for production* usage.

This is already supported. That is exactly what processData is for - the
equivalent of update() - for hashing, encrypt, decrypt, and verify.
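
A minimal sketch of that update()-style pattern applied to a large file,
written in Python to match the quoted example rather than against the
WebCrypto draft itself. The helper name hash_file and the 1 MiB default
chunk size are illustrative assumptions. Only one chunk is held in memory
at a time, so the input does not need to fit in RAM.

```python
import hashlib
import io

def hash_file(stream, algorithm="sha1", chunk_size=1024 * 1024):
    """Hash a binary file-like object incrementally, one chunk at a time."""
    hasher = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()

# Usage: an in-memory stream stands in for a file opened with open(path, "rb").
data = b"ab" * 1000
digest = hash_file(io.BytesIO(data), chunk_size=64)
```

Feeding the data in 64-byte pieces yields the same digest as hashing it in
one call, which is the property the chunked approach relies on.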

>
> --
> Truly yours,
> Artem Skoretskiy
Received on Sunday, 4 November 2012 11:45:40 UTC