Re: dataCommons.org: Data Commons Knowledge Graph (DCKG) from Dan Brickley on 2018-11-21 (public-schemaorg@w3.org from November 2018)

From: Dan Brickley <danbri@google.com>
Date: Tue, 20 Nov 2018 20:38:37 -0800
To: Paola Di Maio <paoladimaio10@googlemail.com>
Cc: "R.V. Guha" <guha@google.com>, elwinlhq@gmail.com, "schema.org Mailing List" <public-schemaorg@w3.org>, support@datacommons.org
Message-ID: <CAK-qy=7vKmLQVryHGx5UPgVffSZsW4KZMVZiciq==Z4iDvPu2Q@mail.gmail.com>
On Tue, 20 Nov 2018, 19:37 Paola Di Maio <paola.dimaio@gmail.com> wrote:

> So, dear Guha et al.,
> are you excluding the possibility of any direct correlation between
> size of the graph and utility?
> I'd like to see the hypothesis evaluated either way
>

Counting triples is a very crude tool. Pragmatic tweaks to the modeling
structure can radically change the triple count without changing data
meaningfully.
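
A minimal sketch of this point, assuming made-up identifiers (these are not actual DCKG dcids) and using the "Santa Clara County is contained in California" statement quoted later in this thread. The same fact is modeled once as a direct edge and once with RDF-style reification; the helper `facts` is hypothetical and only serves to show that both modelings carry the same information:

```python
# Hypothetical identifiers for illustration; not actual DCKG dcids.
DIRECT = {
    ("SantaClaraCounty", "containedInPlace", "California"),
}

# The same fact with RDF-style reification: one statement node, four triples.
REIFIED = {
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "SantaClaraCounty"),
    ("stmt1", "rdf:predicate", "containedInPlace"),
    ("stmt1", "rdf:object", "California"),
}


def facts(triples):
    """Collapse reified statements back into plain (s, p, o) facts."""
    statements = {
        s for s, p, o in triples if p == "rdf:type" and o == "rdf:Statement"
    }
    parts = {}    # statement node -> {predicate: object}
    plain = set() # facts expressed as direct edges
    for s, p, o in triples:
        if s in statements:
            parts.setdefault(s, {})[p] = o
        else:
            plain.add((s, p, o))
    for d in parts.values():
        plain.add((d["rdf:subject"], d["rdf:predicate"], d["rdf:object"]))
    return plain


print(len(DIRECT), len(REIFIED))        # triple counts differ: 1 vs 4
print(facts(DIRECT) == facts(REIFIED))  # the information is the same: True
```

The triple count goes from 1 to 4 while the set of facts stays identical, which is why a raw count says little on its own.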

At least we can be sure that zero-triple and infinitely large KGs are
equally useless; everything else needs more careful consideration...

Dan


> Dr Paola Di Maio
> Artificial Intelligence Knowledge Representation
> Special Issue, Systems MDPI
>
> On Wed, Nov 21, 2018 at 11:04 AM Guha <guha@google.com> wrote:
> >
> > Elwin,
> >
> >  We are not trying to 'improve' the size of the data commons KG. Merely
> > its utility. In that spirit, I have to respectfully keep our knowledge base
> > size out of the discussions you reference.
> >
> >  guha
> >
> > On Tue, Nov 20, 2018 at 5:11 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
> >>
> >> I totally agree with you that "what really matters is the utility of the
> >> data". However, it is important to note that building high-quality
> >> Knowledge Graphs (KGs) depends on the structure and on identifying
> >> reliable data sources. "What gets measured, gets improved".
> >>
> >> I am just curious about these numbers for two reasons. First, to
> >> date, some authors have estimated the cost of KG curation and noted that
> >> creating large-scale KGs can be hard and costly [1]. Moreover, whichever
> >> approach is taken for constructing a KG, the resulting KG is not always
> >> correct [2]. Second, KG cleaning is an important part of the KG creation
> >> lifecycle, where external sources (e.g., the Data Commons Knowledge
> >> Graph, DCKG) can be used as input for training data, i.e., a machine
> >> learning approach may require a sufficient quantity of training data.
> >>
> >> thank you for your time!
> >>
> >> Elwin
> >>
> >> [1] Paulheim, H. (2018). How much is a triple? Estimating the cost of
> >> knowledge graph creation. In Proceedings of the ISWC 2018 Posters &
> >> Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th
> >> International Semantic Web Conference (ISWC 2018), Monterey, USA, October
> >> 8th to 12th, 2018.
> >> [2] Bordes, A. and Gabrilovich, E. (2014). Constructing and mining
> >> web-scale knowledge graphs: KDD 2014 tutorial. In Proceedings of the 20th
> >> ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
> >> KDD ’14, pages 1967–1967, New York, NY, USA. ACM.
> >>
> >> On Mon, 19 Nov 2018 at 22:35, Guha <guha@google.com> wrote:
> >>>
> >>> Elwin,
> >>>
> >>>  These numbers are rapidly changing and as we have all learnt, what
> >>> matters really is the utility of the data, not the size of the graph.
> >>>
> >>>  So, could you give us some context?
> >>>
> >>> guha
> >>>
> >>> On Mon, Nov 19, 2018 at 11:47 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
> >>>>
> >>>> Hey all,
> >>>>
> >>>> I was challenged last week to provide info (in rough numbers) about
> >>>> the Data Commons Knowledge Graph (DCKG), which was constructed by
> >>>> synthesizing many different data sources into a single Knowledge
> >>>> Graph [1]. What I am especially looking for is:
> >>>>
> >>>> How many entities or nodes does the DCKG have? I understand that a dcid
> >>>> (DataCommons identifier) is a unique identifier assigned to each entity in
> >>>> the knowledge graph, and that entities are represented by nodes [2].
> >>>> How many data sources does the DCKG have? It currently contains data
> >>>> from Wikipedia, the US Census, NOAA, FBI, etc. [3].
> >>>> How many nodes and relations does the DCKG have, and how many statements
> >>>> does it have?
> >>>>
> >>>> For example, the statement "Santa Clara County is contained in the
> >>>> State of California" is represented in the graph as two nodes: "Santa Clara
> >>>> County" and "California" with an edge labeled "containedInPlace" pointing
> >>>> from Santa Clara to California.
> >>>>
> >>>> What is the current size of the vocabulary used in the DCKG, taking
> >>>> into account that dataCommons.org builds upon the vocabularies defined
> >>>> by Schema.org [4]?
> >>>> These are potential FAQs for future researchers (of course there are
> >>>> more).
> >>>>
> >>>> Could you help me?
> >>>>
> >>>> cheers,
> >>>> Elwin Huaman
> >>>>
> >>>>
> >>>> [1] https://browser.datacommons.org/
> >>>> [2] https://colab.research.google.com/drive/1vffnWktZyffk7pNfpuXrTsCpp-od5W47
> >>>> [3] https://datacommons.org/
> >>>> [4] https://datacommons.org/faq
> >>>>
> >>
> >>
> >> --
> >> _____________________________
> >> Elwin Huaman
> >> Web Engineer
> >> +34 671 529 151
> >> +43 0677 631 129 47
>
Received on Wednesday, 21 November 2018 04:39:14 UTC