Re: Ideas around using the reconciliation API in R

Hi team,

Yes, several strategies have been discussed at R meetups around the
nation.

The consensus has mostly favored rapid data exchange through *a
common backend* rather than confronting harder problems such as deep
integration in either RStudio or OpenRefine, as discussed in our existing
bikeshedding issue: `Edit cells > Transform > Language` support for R ·
Issue #1226 · OpenRefine/OpenRefine (github.com)
<https://github.com/OpenRefine/OpenRefine/issues/1226>.

I.e., solve the data exchange problem using the KISS method, and then just
let each tool do what it does best.

A few candidates for a common backend have emerged as beneficial to both
sides, R and OpenRefine:

- SQL and relational databases are ubiquitous and could be taken advantage
  of. See dbplyr 2.2.0 (tidyverse.org)
  <https://www.tidyverse.org/blog/2022/06/dbplyr-2-2-0/>
- Apache Spark is a great data storage backend (and can itself store data in
  multiple on-disk formats). OpenRefine has already put effort into using
  Apache Spark as a possible means of data storage for projects (the 4.0
  branch in our GitHub).

Hadley Wickham has been instrumental in pushing forward new tooling to
support better data exchange in the R ecosystem with his 'tidyverse'
suite of packages.  Here's one as an example: tidyverse/dplyr:
dplyr: A grammar of data manipulation (github.com)
<https://github.com/tidyverse/dplyr>
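
To make the "common backend" idea concrete, here is a minimal sketch of two tools exchanging a table through a shared SQL database. It is written in Python with the standard-library sqlite3 module purely for illustration; on the R side the same round trip would normally go through DBI (dbWriteTable() / dbReadTable()) or dbplyr. The helper names write_table and read_table, the table name, and the sample rows are all assumptions made for this example, not anything OpenRefine provides today.

```python
import sqlite3


def write_table(conn, table, columns, rows):
    """Replace `table` with the given rows (the 'write to table' side)."""
    # All columns are stored as TEXT to keep the sketch short (an assumption).
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )


def read_table(conn, table):
    """Return (column names, rows) for `table` (the 'read from table' side)."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


# An in-memory database stands in for the shared backend.
conn = sqlite3.connect(":memory:")
write_table(
    conn,
    "countries",
    ["code", "name"],
    [("AD", "Andorra"), ("AE", "United Arab Emirates")],
)
columns, rows = read_table(conn, "countries")
```

Neither side needs to know anything about the other here; both only agree on the table name and the database location, which is the KISS point made above.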

With data exchange solved, a quick means of data refresh is then needed,
along with a notification mechanism for both human eyes and machine eyes,
especially for teams that split the roles between statisticians working in R
and data preparers/cleaners.  There it becomes useful to know whether data
has changed in a project (Apache Spark gives this for free) and, for human
eyes, to offer some kind of UI presentation in RStudio that shows when a
project was modified and needs a refresh (kind of like git).  Many users
have said that, if the data exchange is solved, a sort of 'git push/pull' in
RStudio would suffice, or simply 'read/write to table', and again the
'tidyverse' ecosystem has the tooling to support this.  The UI in RStudio
can be customized, and that is just the extra bit of work necessary for
users to be able to 'read/refresh remote table' or 'write to remote table'
in one click.
This could be something users discuss with Hadley and others on the R
mailing list or, better yet, at RStudio Community
<https://community.rstudio.com/>.
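
As a rough illustration of the "has the project changed, do I need to refresh?" check, the sketch below fingerprints a shared table and compares it with the fingerprint seen at the last read. This is a generic stand-in, not how Apache Spark or OpenRefine do it; the functions table_fingerprint and needs_refresh are hypothetical names, and it is again Python with sqlite3 standing in for the common backend.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table):
    """Return a SHA-256 digest of the table's rows, read in rowid order."""
    digest = hashlib.sha256()
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid'):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def needs_refresh(conn, table, last_seen):
    """True if the table no longer matches the fingerprint we last read."""
    return table_fingerprint(conn, table) != last_seen


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "countries" (code TEXT, name TEXT)')
conn.execute('INSERT INTO "countries" VALUES (?, ?)', ("AD", "Andorra"))

# The analyst reads the table and remembers its fingerprint.
seen = table_fingerprint(conn, "countries")
unchanged = needs_refresh(conn, "countries", seen)

# The data cleaner then edits a cell; the next check reports a change.
conn.execute(
    'UPDATE "countries" SET name = ? WHERE code = ?',
    ("Principality of Andorra", "AD"),
)
changed = needs_refresh(conn, "countries", seen)
```

Hashing every row is only reasonable for small tables; a backend that tracks versions or modification times natively would make this check far cheaper, and a one-click refresh button in RStudio would only need the boolean.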

Alternative data exchange mechanisms have been considered and are perhaps
still viable for some:

- CSV on the Web: you simply pass the URL to read_csv() in RStudio to load
  it, but that's a one-way street and doesn't support team workflows very
  well (message passing is necessary to let someone know that something has
  changed and that they need to read it again).
- A further data transformation is sometimes needed, such as into the Data
  Cube vocabulary, but this has waned a bit in recent years, I think.  See:
  CSV into the DataCube vocabulary
  <https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/#data-cube>

It might still be advantageous in some simple scenarios to expose
a CSV on the Web endpoint directly from OpenRefine (data and metadata).  We
do have an embedded Jetty server for local networking, where users might
want to quickly share an OpenRefine project as a CSV (all the caveats of
punching through a firewall or internal proxy in a university or company
would apply here).  However, the sharing use case was deemed not as
advantageous as direct collaboration through a common backend.  If
folks wanted to do that, all that would be needed is for OpenRefine to
expose the data and metadata using CSV on the Web standards (the
tableSchema is served separately, like
http://localhost:3333/<project_id>/schema/countries.json).
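
For a sense of what the metadata side might look like, here is a small sketch that builds a CSV on the Web style metadata document for a table. The property names ("@context", "url", "tableSchema", "columns", "name", "titles", "datatype") follow the CSVW vocabulary as I understand it; the function csvw_metadata, the relative URLs, and the column list are assumptions for illustration, and nothing here reflects an existing OpenRefine endpoint.

```python
import json


def csvw_metadata(csv_url, columns, schema_url=None):
    """Build a CSVW-style metadata dict for one table.

    `columns` is a list of (name, title, datatype) tuples. If `schema_url`
    is given, the schema is referenced by URL (served separately) rather
    than embedded inline.
    """
    schema = {
        "columns": [
            {"name": name, "titles": title, "datatype": datatype}
            for name, title, datatype in columns
        ]
    }
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": schema_url if schema_url else schema,
    }
    return metadata, schema


columns = [
    ("code", "Country code", "string"),
    ("name", "Country name", "string"),
    ("population", "Population", "integer"),
]

# Schema embedded in the metadata document.
inline_meta, schema = csvw_metadata("countries.csv", columns)

# Schema referenced by URL, closer to the arrangement described above.
linked_meta, _ = csvw_metadata(
    "countries.csv", columns, schema_url="schema/countries.json"
)

metadata_json = json.dumps(linked_meta, indent=2)
schema_json = json.dumps(schema, indent=2)
```

The two JSON strings are what a server would hand out next to the CSV itself; anyone building this for real should check the exact requirements against the CSVW specifications.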
Anyway, Jeni Tennison at ODI or others in the CSV on the Web community
could give some more background here (since I have only read blogs and
publications), and their community is now here:  CSV on the Web Community
Group (w3.org) <https://www.w3.org/community/csvw/>

Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/


On Wed, Jun 8, 2022 at 6:32 AM Antonin Delpeuch <antonin@delpeuch.eu> wrote:

> Hello,
>
> Jonathan Stoneman (Cc:) has put some thought into what reconciliation in
> R could look like. I think it would be amazing to have such an
> integration in R, so with his permission I am sharing those thoughts
> here (attached). Perhaps people more familiar than me with the R
> ecosystem want to give it a go?
>
> Cheers,
>
> Antonin
>
>

Received on Wednesday, 8 June 2022 14:50:15 UTC