Re: [WebSQLDatabase] Adding a vacuum() call from Shane Harrelson on 2010-03-13 (public-webapps@w3.org from January to March 2010)

From: Shane Harrelson <shane@sqlite.org>
Date: Fri, 12 Mar 2010 23:25:55 -0500
To: public-webapps <public-webapps@w3.org>
Message-ID: <969b6a981003122025v6f075267u87ae9946dda53ed@mail.gmail.com>
On Fri, Mar 12, 2010 at 4:15 PM, Dumitru Daniliuc <dumi@google.com> wrote:

> thanks for your comments, shane. a few questions:
>
>
>> (1) VACUUM is never "necessary", though can sometimes reduce the size of
>> the database file and provide a performance improvement if the database is
>> large and fragmented.
>>
>
> large and fragmented databases is exactly the use case that made us think
> about a vacuum() call.
>
>
>> (2) Use PRAGMA freelist_count to estimate how much VACUUM will shrink a
>> database.
>>
>
> if i understand the docs correctly, PRAGMA freelist_count will only return
> the number of empty pages. so if we had many pages with very little data on
> each one, vacuuming the database would considerably reduce its size, but
> PRAGMA freelist_node would return 0. is this correct? is there a better way
> to estimate the amount of space we'd get back if we vacuumed the database?
> in particular, is there a way to figure out how fragmented each page is?
>
>

Without looking at the specific page structure on the database, there's not
a good way to figure this out.   However, the B-Tree balancing algorithm
used by SQLite will attempt to merge pages with neighbors when there space
utilization drops below certain thresholds. Minimum average fill for intkey
leaves is 50%. For other pages I think it's 66%.



> (3) VACUUM requires O(N) time where N is the size of the database file, and
>> can require up to 2N temporary space while processing.
>>
>
> i'm not so worried about the space: hard drives are much much bigger than
> any DB a web app will have. the time it takes to vacuum a DB though is
> something that UAs would probably want to take into account. do you have any
> data on how long it takes to vacuum a fragmented database of 1MB? 10MB?
> 100MB? 1GB? i'm not looking for a benchmark, just some approximate data
> (even though it would be great if you had a benchmark that you could share
> with us).
>
>

VACUUM is largely io-bound and so the speed greatly depends on speed of the
underlying filesystem.  We generally use the rule of thumb of 1 second per
MB on basic systems.



> Our recommendation is that the use of VACUUM be discouraged.   The
>> performance improvement only comes up with very large databases that are
>> highly fragmented.  Where space reclamation is an issue, AUTO_VACUUM would
>> be more appropriate.
>>
>
> the spec i proposed in my previous email allows UAs to ignore vacuum()
> calls whenever they feel like it, so the UAs could just decide that it's not
> worth vacuuming small databases. in cases of apps like offline gmail though,
> the databases will be large, and i would guess that after a while they'd be
> pretty fragmented, due to constantly deleting older emails and inserting
> newer ones, so some kind of vacuuming would be needed.
>
> AUTO_VACUUM: if i understand the docs correctly, auto-vacuum does not go
> inside pages. it only automatically removes the pages that are entirely
> free. is that correct? if so, then it seems to me that auto-vacuuming
> doesn't really solve the fragmentation problem. also, what's the performance
> overhead of having auto-vacuum on? and do we have any control over when it
> kicks in? going back to my offline gmail example, i think it would be much
> better to have the app call vacuum() when the user writes an email and the
> app is otherwise idle, for example, then to have auto-vacuuming cause
> user-visible delays when the app is syncing with the server or doing some
> other intensive work.
>
>
When combined with the basic page merging mentioned above, AUTO_VACUUM does
a very good job, and the performance overhead is minimal.   There's no
control over when it kicks in.



> thanks,
> dumi
>
>
>
>> -Shane
>>
>>
>>
>> On Thu, Mar 11, 2010 at 7:33 PM, Dimitri Glazkov <dglazkov@chromium.org>wrote:
>>
>>> I like the completion callback idea.
>>>
>>> Also like the notion of some sort of protection from "over-eager
>>> vacuum-calling syndrome".
>>>
>>> :DG<
>>>
>>> On Thu, Mar 11, 2010 at 4:20 PM, Michael Nordman <michaeln@google.com>
>>> wrote:
>>> > Instead of calling back on success only, maybe call back on completion
>>> > regardless of success or failure. This way the caller would know when
>>> the
>>> > potentially lengthy operation was done, regardless of the outcome.
>>> >
>>> > 2010/3/11 Dumitru Daniliuc <dumi@chromium.org>
>>> >>
>>> >> joao,
>>> >>
>>> >> it looks like we mostly agree on this feature, so i was wondering if
>>> we
>>> >> could formally agree on a spec. here's what i propose:
>>> >>
>>> >> 1. name: vacuum. to the best of my knowledge, all current
>>> WebSQLDatabases
>>> >> implementations use SQLite, and in SQLite the command is called
>>> VACUUM. so
>>> >> it seems to me that we might as well call the new function vacuum().
>>> what do
>>> >> you think?
>>> >>
>>> >> 2. spec: no need for an error callback.
>>> >>
>>> >> interface Database {
>>> >>   // the methods and properties currently in the spec
>>> >>   void vacuum(in optional SQLVoidCallback successCallback);
>>> >> };
>>> >>
>>> >> 3. what the call should do: the purpose of this call is to allow apps
>>> to
>>> >> vacuum/compact/defragment/clean up their databases whenever they see
>>> fit.. a
>>> >> call to vacuum() could take a considerable amount of time (especially
>>> on
>>> >> larger or highly fragmented databases); therefore, it is not
>>> recommended for
>>> >> web apps to call this method during periods of high activity.
>>> >>
>>> >> how to process a vacuum() call:
>>> >>
>>> >> if the UA does not support this call (mobile browsers?), jump to step
>>> 3.
>>> >> queue up a task to vacuum/compact/defragment/clean up the database..
>>> >> if the task succeeds, and a success callback is provided, queue up a
>>> task
>>> >> to invoke the success callback; in all other cases (the task failed,
>>> or no
>>> >> success callback was provided), do nothing: proceed to the next task
>>> in the
>>> >> queue.
>>> >>
>>> >> does this seem acceptable? we (google engineers interested in this)
>>> feel
>>> >> that UAs should either not implement the vacuum() call, or they should
>>> >> respect it (rather than taking it as a hint). it is ok for UAs to keep
>>> track
>>> >> of things like system idleness or databases closing to do more
>>> vacuuming
>>> >> that the apps asked for, if they want to. however, we feel that a
>>> vacuum()
>>> >> request by an app should not be postponed, simply because sometimes
>>> apps
>>> >> know better than UAs when the best time to vacuum is (it might be nice
>>> to
>>> >> give apps more information on how fragmented their databases are, but
>>> that's
>>> >> a separate discussion).
>>> >>
>>> >> thanks,
>>> >> dumi
>>> >>
>>> >>
>>> >> 2010/3/9 Jeremy Orlow <jorlow@google.com>
>>> >>>
>>> >>> On Mon, Mar 8, 2010 at 8:47 PM, Dumitru Daniliuc <dumi@google.com>
>>> wrote:
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Mar 8, 2010 at 3:39 AM, João Eiras <joaoe@opera.com> wrote:
>>> >>>>>
>>> >>>>>>> I don't see how the callbacks are useful though. Vacuum works
>>> >>>>>>> transparently, its effects are not visible, and what should the
>>> page
>>> >>>>>>> do in
>>> >>>>>>> case of error ?
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>> i was thinking of something like:
>>> >>>>>>
>>> >>>>>> db.defragment(errorCallback, successCallback);
>>> >>>>>> showAPrettyImageAndAskTheUserToWait();
>>> >>>>>>
>>> >>>>>> function errorCallback(error) {}
>>> >>>>>> function successCallback() {
>>> >>>>>>  getRidOfThePrettyImageAndRestartTheApp();
>>> >>>>>> }
>>> >>>>>>
>>> >>>>>> just like you, i'm not sure if the error callback is useful at
>>> all,
>>> >>>>>> but i
>>> >>>>>> thought i'd add it to make the defragment() call look more like a
>>> >>>>>> transaction. maybe we don't need it.
>>> >>>>>>
>>> >>>>>
>>> >>>>> True, but this is a kind of operation that could very well just run
>>> on
>>> >>>>> the background, with a single optional callback when it's done (the
>>> webpage
>>> >>>>> can't do anything if an error is detected anyway).
>>> >>>>
>>> >>>> ok, so let's drop the errorCallback: vacuum([optional]
>>> successCallback);
>>> >>>>
>>> >>>>>
>>> >>>>> The user agent would need to queue any subsequent transactions if a
>>> >>>>> vacuum is running. I would consider it as an hint, and after all
>>> webpages
>>> >>>>> that own references to the underlying data files are closed, would
>>> do a
>>> >>>>> vacuum. So, if you have many tabs on gmail, and that a single gmail
>>> instance
>>> >>>>> tries to do multiple vacuums, it would equiv to one single vacuum
>>> operation.
>>> >>>>
>>> >>>> what do we do if some databases are opened for the entire life of
>>> the
>>> >>>> browser? for example, i open my browser which has myfavoriteapp.comset as
>>> >>>> its homepage. myfavoriteapp.com immediately opens a DB, and i only
>>> close
>>> >>>> that app when i close the browser. when would the browser vacuum
>>> >>>> myfavoriteapp's DBs in this case?
>>> >>>>
>>> >>>> i think it's ok for the UA to vacuum some DBs automatically when it
>>> >>>> thinks it's a good time to do so; however, if a platform supports
>>> the
>>> >>>> vacuum/defrag call (i.e. if it doesn't treat it is a no-op), then i
>>> think a
>>> >>>> vacuum call coming from the app should be immediately scheduled
>>> (yes, the
>>> >>>> subsequent transactions would have to wait for the vacuuming to
>>> finish
>>> >>>> running). in some cases, the apps know better than the UA when to
>>> vacuum
>>> >>>> their DBs.
>>> >>>>
>>> >>>> by the way, we should probably agree on a name for this call. which
>>> one
>>> >>>> do you prefer? vacuum, defragment, defrag, something else? i don't
>>> have a
>>> >>>> strong opinion.
>>> >>>
>>> >>> I think vacuum is fine since the spec is already tied to the SQLite
>>> SQL
>>> >>> dialect.
>>> >>> collectGarbage() is another possibility
>>> >>> Go with whatever you think is most clear and accurate though.
>>> >>> J
>>> >
>>> >
>>>
>>>
>>>
>>
>
Received on Saturday, 13 March 2010 04:26:28 UTC