Re: dataCommons.org: Data Commons Knowledge Graph (DCKG) from Elwin Huaman on 2018-11-20 (public-schemaorg@w3.org from November 2018)

From: Elwin Huaman <elwinlhq@gmail.com>
Date: Tue, 20 Nov 2018 14:10:57 +0100
To: "R.V. Guha" <guha@google.com>
Cc: public-schemaorg@w3.org, Dan Brickley <danbri@google.com>, support@datacommons.org
Message-ID: <CABhN3mx22b5aogQ3MK=b0vRQsppva5CnR_o8uchDKfCBRp6o1g@mail.gmail.com>

I totally agree with you that "what really matters is the utility of the
data". However, it is important to note that building high-quality Knowledge
Graphs (KGs) depends on their structure and on identifying reliable data
sources. "What gets measured, gets improved".

I am just curious about these numbers for two reasons. First, some authors
have estimated the cost of KG curation and shown that creating a
large-scale KG can be hard and costly [1]. Moreover, whichever approach
is taken to construct a KG, the resulting KG is not always correct [2].
Second, KG cleaning is an important step in the KG creation lifecycle, where
external sources (e.g., the Data Commons Knowledge Graph, DCKG) can be used
as training data, since a machine learning approach may require a
sufficient quantity of training data.

Thank you for your time!

Elwin

[1] Paulheim, H. (2018). How much is a triple? Estimating the cost of
knowledge graph creation. In Proceedings of the ISWC 2018 Posters &
Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th
International Semantic Web Conference (ISWC 2018), Monterey, USA, October
8th to 12th, 2018.
[2] Bordes, A. and Gabrilovich, E. (2014). Constructing and mining
web-scale knowledge graphs: KDD 2014 tutorial. In Proceedings of the 20th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’14, pages 1967–1967, New York, NY, USA. ACM.

On Mon, 19 Nov 2018 at 22:35, Guha <guha@google.com> wrote:

> Elwin,
>
>  These numbers are rapidly changing and as we have all learnt, what
> matters really is the utility of the data, not the size of the graph.
>
>  So, could you give us some context?
>
> guha
>
> On Mon, Nov 19, 2018 at 11:47 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
>
>> Hey all,
>>
>> I was challenged last week to provide info (in rough numbers) about the
>> Data Commons Knowledge Graph (DCKG), which was constructed by synthesizing
>> many different data sources into a single Knowledge Graph [1]. What I am
>> especially looking to know is:
>>
>>    - *How many entities or nodes does the DCKG have?* Here a
>>    *dcid* (DataCommons identifier) is a unique identifier assigned to
>>    each entity in the knowledge graph, and entities are represented by
>>    nodes [2].
>>    - *How many data sources does the DCKG have?* It currently contains
>>    data from Wikipedia, the US Census, NOAA, FBI, *etc.* [3].
>>    - *How many nodes and relations does the DCKG have, and how many
>>    statements does it have?*
>>       - For example, the statement "Santa Clara County is contained in
>>       the State of California" is represented in the graph as two nodes: "Santa
>>       Clara County" and "California" with an edge labeled "containedInPlace"
>>       pointing from Santa Clara to California.
>>    - *What is the current size of the vocabulary used in the DCKG?*,
>>    taking into account that dataCommons.org builds upon the vocabularies
>>    defined by Schema.org [4].
>>    - *These are potential FAQs* for future researchers (of course there
>>    are more).
>>
>> Could you help me?
>>
>> cheers,
>> Elwin Huaman
>>
>>
>> [1] https://browser.datacommons.org/
>> [2]
>> https://colab.research.google.com/drive/1vffnWktZyffk7pNfpuXrTsCpp-od5W47
>> [3] https://datacommons.org/
>> [4] https://datacommons.org/faq
>>
>>
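
To make the quoted questions concrete, here is a minimal sketch (not the
Data Commons API) that models the "Santa Clara County" example as a
(subject, predicate, object) statement and counts nodes, statements and
distinct predicates over a list of such statements. The names Statement and
graph_stats are hypothetical, and the only data is the single example
statement from the quoted message.

```python
from typing import Dict, Iterable, NamedTuple


class Statement(NamedTuple):
    """One edge of the graph: subject --predicate--> obj (hypothetical type)."""
    subject: str
    predicate: str
    obj: str


# The single example statement from the quoted message.
statements = [
    Statement("Santa Clara County", "containedInPlace", "California"),
]


def graph_stats(stmts: Iterable[Statement]) -> Dict[str, int]:
    """Count distinct nodes, distinct statements and distinct predicates."""
    unique = set(stmts)
    nodes = set()
    predicates = set()
    for s in unique:
        # Both ends of an edge are nodes.
        nodes.add(s.subject)
        nodes.add(s.obj)
        # The edge label is part of the vocabulary in use.
        predicates.add(s.predicate)
    return {
        "nodes": len(nodes),
        "statements": len(unique),
        "predicates": len(predicates),
    }


if __name__ == "__main__":
    print(graph_stats(statements))
```

On the example this gives two nodes, one statement and one predicate; the
same counting would apply to any list of statements exported from a graph.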

-- 
_____________________________
*Elwin Huaman *
*Web Engineer*
*+34 671 529 151*
*+43 0677 631 129 47*

Received on Tuesday, 20 November 2018 13:11:30 UTC