- From: Guha <guha@google.com>
- Date: Tue, 20 Nov 2018 19:00:39 -0800
- To: Elwin Huaman <elwinlhq@gmail.com>
- Cc: public-schemaorg@w3.org, Dan Brickley <danbri@google.com>, support@datacommons.org
- Message-ID: <CAPAGhv9nqtmsxnyHyLUEXmQyg=v7tOOkDANEcDYK5Z61zDCASQ@mail.gmail.com>
Elwin,

We are not trying to 'improve' the size of the data commons KG. Merely its utility.

In that spirit, I have to respectfully keep our knowledge base size out of the discussions you reference.

guha

On Tue, Nov 20, 2018 at 5:11 AM Elwin Huaman <elwinlhq@gmail.com> wrote:

> I totally agree with you, "what really matters is the utility of the
> data", However, is important to note that for building high-quality Knowledge
> Graphs(KGs) depends on the structure and identified reliable data
> sources. "What gets measured, gets improved".
>
> I am just curious about these numbers because two reasons: First, to date,
> some authors have estimated the cost of the KG curation and that create a
> large-scale KGs can be hard and costly [1]. As well as, whichever approach
> is taken for constructing a KG, the KG resulted is not always correct [2].
> Second, the importance of KG cleaning in the KG creation lifecycle, where
> external sources(e.g., Data Commons Knowledge Graph DCKG ) can be used as
> input to training data i.e., machine learning approach may require a
> sufficient quantity of training data.
>
> thank you for your time!
>
> Elwin
>
> [1] Paulheim, H. (2018). How much is a triple? estimating the cost of
> knowledge graph creation. In Proceedings of the ISWC 2018 Posters &
> Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th
> International Semantic Web Conference (ISWC 2018), Monterey, USA, October
> 8th - to - 12th, 2018.
> [2] Bordes, A. and Gabrilovich, E. (2014). Constructing and mining
> web-scale knowledge graphs: Kdd 2014 tutorial. In Proceedings of the 20th
> ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
> KDD ’14, pages 1967–1967, New York, NY, USA. ACM.
>
> On Mon, 19 Nov 2018 at 22:35, Guha <guha@google.com> wrote:
>
>> Elwin,
>>
>> These numbers are rapidly changing and as we have all learnt, what
>> matters really is the utility of the data, not the size of the graph.
>>
>> So, could you give us some context?
>>
>> guha
>>
>> On Mon, Nov 19, 2018 at 11:47 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I was challenged last week to provide info(in rough numbers) about the
>>> Data Commons Knowledge Graph(DCKG), which was constructed by synthesizing
>>> in a single Knowledge Graph from many different data sources[1]. What I am
>>> looking for especially is to know:
>>>
>>>    - *How many entities or nodes the DCKG has?*, understanding that
>>>    *dcid* (DataCommons identifier) is a unique identifier assigned to
>>>    each entity in the knowledge graph, furthermore entities are represented by
>>>    nodes[2].
>>>    - *How many data sources the DCKG has?*, because currently contains
>>>    data from Wikipedia, the US Census, NOAA, FBI, *etc?*[3].
>>>    - *How many nodes and relations the DCKG has? and **How many
>>>    statements it has?*
>>>       - For example, the statement "Santa Clara County is contained in
>>>       the State of California" is represented in the graph as two nodes: "Santa
>>>       Clara County" and "California" with an edge labeled "containedInPlace"
>>>       pointing from Santa Clara to California.
>>>    - *What is the current size of the used vocabulary in the DCKG?*,
>>>    taking into account that dataCommons.org builds upon on the vocabularies
>>>    defined by Schema.org[4]
>>>    - *These are potential FAQs* for future researchers (of course there
>>>    are more)
>>>
>>> Could you help me?
>>>
>>> cheers,
>>> Elwin Huaman
>>>
>>>
>>> [1] https://browser.datacommons.org/
>>> [2]
>>> https://colab.research.google.com/drive/1vffnWktZyffk7pNfpuXrTsCpp-od5W47
>>> [3] https://datacommons.org/
>>> [4] https://datacommons.org/faq
>>>
>>>
>
> --
> _____________________________
> *Elwin Huaman *
> *Web Engineer*
> *+34 671 529 151*
> *+43 0677 631 129 47*
>
Received on Wednesday, 21 November 2018 03:01:15 UTC