File hashing and progressive computation from Artem Skoretskiy on 2012-11-04 (public-webcrypto-comments@w3.org from November 2012)

From: Artem Skoretskiy <tonn81@gmail.com>
Date: Sun, 4 Nov 2012 11:22:23 +0100
To: "public-webcrypto-comments@w3.org" <public-webcrypto-comments@w3.org>
Message-Id: <4ADBB514-E403-4D7D-A073-E03D8CE6FC88@gmail.com>

Dear W3C community!

I have some feedback regarding the current WebCryptoAPI draft (
http://www.w3.org/TR/WebCryptoAPI/ or
http://www.w3.org/2012/webcrypto/WebCryptoAPI/).

I think the proposal is really great. There are some libraries that allow
you to do similar things, but they are slow (being written in JavaScript)
and non-standard (no support, and every implementation has a different
interface). At the same time, crypto is required for modern data-rich web
applications.

I have some considerations from a developer's point of view, having worked
a lot with cryptography in the browser for different purposes.

One of the stated use cases is Data integrity protection (
http://www.w3.org/2012/webcrypto/WebCryptoAPI/#data-integrity-protection).

However, three key features are missing or unclear in the current
documentation:

1. Support for *MD5* as a digest algorithm
2. Support for *Blob as an input* parameter for
CryptoOperation.processData
3. *Progressive hash calculation*.

Let me explain each of them and why they are *critical* for the stated use
case.

1. *MD5 digest*.

MD5 is the most popular algorithm nowadays for checking data integrity. It
is fast, produces reasonable results and, most importantly, is used in
many systems today. Just as an example, the online giant Amazon uses MD5 to
check the integrity of uploaded files and even returns an MD5 hash for each
stored file. That is why, for interoperability and real-world usage, MD5
should be added as one of the hashing algorithms. If you consider MD5 too
weak for signing and encryption, allow it only for digests.
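
To illustrate the integrity check described above, here is a minimal
Python sketch. The helper name md5_matches and the sample payload are
made up for illustration, and the "reported" checksum is computed locally
as a stand-in for whatever value a storage service would return.

```python
import hashlib


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the MD5 digest of `data` equals `expected_hex`."""
    return hashlib.md5(data).hexdigest() == expected_hex.strip().lower()


payload = b"example file contents"
# Stand-in for the checksum a storage service would report after upload.
reported_md5 = hashlib.md5(payload).hexdigest()

print(md5_matches(payload, reported_md5))       # intact payload
print(md5_matches(payload[:-1], reported_md5))  # truncated payload
```

The first call reports a match and the second does not, which is all the
client needs in order to detect a truncated or corrupted upload.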

2. *Blob as input parameter* for CryptoOperation.processData

There are two typical use cases for a digest: computing it for generated
data or computing it for a file. The second one is pretty common: an
application would like to calculate a file digest before submission, to
ensure that the file wasn't uploaded before or that it was not truncated
or corrupted during submission.

If you have a big file, the application has to split it into parts and
feed the hasher piece by piece. Since both FileReader and CryptoOperation
are asynchronous, it takes a lot of code to implement this (even just to
iterate over a file). It would be *REALLY* handy if you could just pass a
Blob to the function.
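
For comparison, this is roughly what such a one-call helper looks like in
Python, where file reads are synchronous and the loop stays short. It is
only a sketch: digest_file, its parameters and the 1 MiB default chunk
size are assumptions for illustration, not part of the WebCryptoAPI draft.

```python
import hashlib


def digest_file(path: str, algorithm: str = "sha1",
                chunk_size: int = 1024 * 1024) -> str:
    """Hash the file at `path` without loading it into memory at once."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as source:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: source.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The caller passes a file and gets a digest back; the splitting into parts
happens inside the helper. Accepting a Blob in processData would give
browser code the same convenience.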

3. *Progressive hash calculation*.

There should be a way to generate a hash for a file that is bigger than
RAM or just too big to fit into browser memory. Our customers already use
multi-gigabyte files and are looking to use 60 GB files in the browser.

The typical approach is as follows:
1) initialize an empty hasher
2) iterate over chunks of your data, feeding each chunk to the hasher
3) get the resulting hash

As sample Python code:

import hashlib

hasher = hashlib.sha1()              # 1) initialize an empty hasher
for chunk in [b"a", b"b"]:           # 2) feed the hasher chunk by chunk
    hasher.update(chunk)
print(hasher.hexdigest())  #=> da23614e02469a0d7c7bd1bdab5c9c474b1904dc
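
A slightly larger Python 3 sketch of the same three steps shows the two
properties that matter here: the data never has to be held in memory as a
whole, and the result does not depend on where the chunk boundaries fall.
The names zero_chunks and progressive_digest and the 10 MiB of zero bytes
are assumptions chosen for illustration.

```python
import hashlib


def zero_chunks(total_size: int, chunk_size: int):
    """Lazily yield zero-filled chunks that add up to `total_size` bytes."""
    remaining = total_size
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield bytes(size)
        remaining -= size


def progressive_digest(chunks, algorithm: str = "sha1") -> str:
    hasher = hashlib.new(algorithm)  # 1) initialize an empty hasher
    for chunk in chunks:             # 2) feed it chunk by chunk
        hasher.update(chunk)
    return hasher.hexdigest()        # 3) get the resulting hash


total = 10 * 1024 * 1024  # 10 MiB of simulated data
small = progressive_digest(zero_chunks(total, 4096))
large = progressive_digest(zero_chunks(total, 1024 * 1024))
one_shot = hashlib.sha1(bytes(total)).hexdigest()
print(small == large == one_shot)  # True
```

Only the one-shot call needs the whole input in memory; the progressive
calls hold a single chunk at a time.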

So progressive hash generation is a *MUST for production* usage.

--
Truly yours,
Artem Skoretskiy

Received on Sunday, 4 November 2012 10:22:24 UTC