Re: dataCommons.org

On Thu, 18 Oct 2018 at 14:20, Guha <guha@google.com> wrote:

> In May 2018, we introduced datacommons.org, an initiative for the
> open sharing of data, and released the first fact check corpus to
> help academia and practitioners study misinformation.
>
> We are now taking the next step in the evolution of datacommons.org.
>

A few notes to follow up on Guha's dataCommons announcement, oriented
towards the Schema.org community.

(These are potential FAQs that nobody has actually asked yet, so I'm
trying to anticipate a few likely questions in advance.)

Firstly, folks in the Search Marketing community will be wondering what
this means for SEO. At this point, I'd suggest this is "one to watch".
The dataCommons effort is in large part about making it easier to use
Schema.org data, re-exposing an integrated view of data that is
represented in Schema.org markup. For those in the SEO world interested
in engaging, it probably makes most sense to continue to focus on the
vocabulary already used by search engines, e.g. for Google,
https://developers.google.com/search/docs/guides/search-gallery
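
(For concreteness, the kind of Schema.org markup in question looks like
the fragment below; this is an illustrative snippet with placeholder
values, not anything dataCommons prescribes:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Dataset",
      "name": "Example public dataset",
      "description": "Illustrative markup only; values are placeholders."
    }
    </script>

Search engines consume markup like this today; dataCommons aggregates
and re-exposes the same underlying data.)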

For those more in the standards world, you may be wondering "hey, what
about W3C RDF Data Cube, SKOS, SPARQL 1.1, PROV, SHACL, ShEx, CSVW,
JSON-LD contexts, Linked Data Platform, etc. etc.?"... or many other
interesting and standardized approaches. At this stage dataCommons is
taking a step back, concentrating on mechanisms (e.g. workflow,
implementation, APIs) built around the core graph data model. In a W3C
setting, this roughly means RDF. The dataCommons approach to Knowledge
Graphs highlights a common issue with RDF that has also been
encountered in many related efforts, from Freebase and Wikidata to
Schema.org itself: the need to represent fine-grained provenance and
qualifications alongside each piece of factual data. Historically this
is difficult in standard RDF without (ab)using SPARQL named graphs as a
representational mechanism. W3C's upcoming workshop on bridging RDF,
Property Graph and SQL standards for Graph Data is therefore highly
relevant (https://www.w3.org/Data/events/data-ws-2019/cfp.html).
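
To make the named-graph workaround concrete, here is a minimal sketch
in Python using the rdflib library; the URIs and the example fact are
illustrative assumptions of mine, not anything dataCommons prescribes:

    # Sketch of per-statement provenance via named graphs: each factual
    # triple goes into its own named graph, and provenance statements
    # are then made *about* that graph. All URIs/values are examples.
    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    PROV = Namespace("http://www.w3.org/ns/prov#")

    ds = Dataset()

    # The bare fact, isolated in its own named graph.
    claim = URIRef("http://example.org/claims/1")
    ds.graph(claim).add((EX.Paris, EX.population, Literal(2141000)))

    # Provenance attached to the graph that holds the fact.
    meta = ds.graph(URIRef("http://example.org/provenance"))
    meta.add((claim, PROV.wasDerivedFrom, EX.exampleCensusDataset))

    print(ds.serialize(format="trig"))

The awkwardness here (one graph per statement) is exactly the kind of
thing the workshop is well placed to address.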

For data scientists, journalists and those working with public
datasets, we have been exploring (see http://datacommons.org/colab) the
use of Python/Jupyter notebooks (including a protocol-backed Python
API) as a way to expose data for exploration via Pandas data frames.
There are a few directions this could take, and the Python wrapper is
effectively another way to avoid prematurely fixing our approach to
query language, provenance, etc. The current approach drafts some
dedicated domain-specific schemas to reflect more explicitly what some
public statistics datasets are telling us, and we are looking at ways
of bridging this to more generic representations (like RDF Data Cube)
which offer weaker data integration but may scale better for the long
tail of public data.
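
As an illustration of the notebook pattern only (the function name and
the rows below are hypothetical placeholders; the real protocol-backed
API is the one documented at http://datacommons.org/colab):

    # Sketch: query results flow into a Pandas data frame for
    # interactive exploration in a Jupyter notebook.
    import pandas as pd

    def query_datacommons(query):
        """Hypothetical stand-in for the protocol-backed Python API:
        send `query` to an endpoint, return result rows as dicts."""
        return [
            {"city": "Mountain View", "population": 80447},  # placeholder
            {"city": "Palo Alto", "population": 67024},      # placeholder
        ]

    df = pd.DataFrame(query_datacommons("population of Bay Area cities"))
    print(df.head())

Keeping the wrapper thin like this is part of what lets us defer
decisions about query language and provenance representation.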


cheers,


Dan

Received on Thursday, 18 October 2018 21:26:52 UTC