Re: Q about reconciliation query batch size

Hi Kevin!

1.  The Reconciliation API is for ANY client, not just for OpenRefine.  That
is why we decided to start a W3C community group to collaborate with the
community on creating a standard (currently a working draft).

2. As for where OpenRefine uses limits within reconciliation processes: the
limits are set or used in a few areas.

OpenRefine's StandardReconConfig - API documented here:
https://reconciliation-api.github.io/specs/0.1/#structure-of-a-reconciliation-query

OpenRefine's GuessTypesOfColumn - a sample size hardcoded in the OpenRefine client
<https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/commands/recon/GuessTypesOfColumnCommand.java#L119>
limits it to the first 10 rows, which are inspected and sent to the service to
guess the column type.

OpenRefine's Suggest Service flyout pane - limits provided by service
provider - API documented here:
https://reconciliation-api.github.io/specs/0.1/#suggest-services

Data Extension Property - API proposal documented here:
https://reconciliation-api.github.io/specs/0.1/#data-extension-service
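
To make the first two items concrete, here is a rough sketch in Python (not
OpenRefine's actual Java code) of how a client might build a sample batch from
the first 10 rows of a column, with a per-query "limit" as described in the
query structure section of the draft. The function names, the default limit of
3, and the choice to skip blank cells are my own assumptions for illustration.

```python
import json

SAMPLE_SIZE = 10  # the sample size mentioned above for guessing column types


def build_sample_batch(column_values, sample_size=SAMPLE_SIZE, limit=3):
    """Build a batch of reconciliation queries keyed q0, q1, ...

    Hypothetical helper: takes at most `sample_size` non-blank values
    from the start of the column. `limit` caps the number of candidates
    requested per query; it does not cap the size of the batch itself.
    """
    batch = {}
    for value in column_values:
        if len(batch) >= sample_size:
            break
        if value is None or not str(value).strip():
            continue  # assumption: blank cells are not worth sending
        batch[f"q{len(batch)}"] = {"query": str(value).strip(), "limit": limit}
    return batch


def encode_queries(batch):
    """Wrap the batch as a form field named "queries" holding JSON text."""
    return {"queries": json.dumps(batch)}
```

Note that "limit" here only bounds the candidates returned for each query.
Nothing in this sketch bounds how many queries a client puts in one batch;
that is entirely up to the client.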

Let me know if that helps or you have further questions.

Thad
https://www.linkedin.com/in/thadguidry/


On Thu, Jun 11, 2020 at 11:00 AM Ford, Kevin <kevinford@loc.gov> wrote:

> Hello all:
>
>
>
> I presume this is the best place to ask this question, which I’ve harbored
> for years but which for a variety of reasons I’ve never had real occasion
> to ask until now.
>
>
>
> Do I understand correctly that there is no limit to, and no way to enforce
> a limit of, the reconciliation query batch size?
>
>
>
> This sentence from the documentation on the GitHub site – “OpenRefine
> queries the reconciliation service in batch mode
> <https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#multiple-query-mode>
> on the first ten items of the column to be reconciled.” [1] -  /might/
> suggest the size of batches is 10, but I believe we’re to understand that
> this particular call basically represents a test before the real, full
> reconciliation kicks off.  Yes?
>
>
>
> The “Note” under this section of the W3C specification work [2] seems to
> make it abundantly clear that there is no restriction on the length of
> query batches.
>
>
>
> I didn’t see a clear way to do this via the service manifest.
>
>
>
> If there is no limit on the size, is there a way for a service provider to
> impose a limit?  If so, how?  If not, why not?
>
>
>
> Assuming it is not possible to impose a limit, how does one protect a
> service from becoming overwhelmed by one extremely large reconciliation
> request or a number of big ones?  It seems that this opens up the service
> to a DoS attack, but perhaps I am mistaken.  Even if that risk is perhaps
> marginal, it still seems that a provider could nevertheless experience a
> considerable performance penalty having to field requests with huge query
> batch sizes.
>
>
>
> I’m familiar in an academic sense with OpenRefine, but not whether it
> might control the size of query batches to ensure a provider is not
> overwhelmed.  That said, if this work is to become a more generic way to
> provide reconciliation or suggest services to be used by software other
> than OpenRefine, then it still seems this should be an
> advertisable/controllable value since one cannot always count on the
> client being responsible.
>
>
>
> Yours,
>
> Kevin
>
>
>
> [1]
> https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#workflow-overview
>
> [2]
> https://reconciliation-api.github.io/specs/0.1/#sending-reconciliation-queries-to-a-service
>
>
>
> --
>
> Kevin Ford
>
> Library of Congress
>
> Washington, DC
>
>
>

Received on Thursday, 11 June 2020 16:47:34 UTC