- From: Dan Brickley <danbri@google.com>
- Date: Tue, 20 Nov 2018 20:38:37 -0800
- To: Paola Di Maio <paoladimaio10@googlemail.com>
- Cc: "R.V. Guha" <guha@google.com>, elwinlhq@gmail.com, "schema.org Mailing List" <public-schemaorg@w3.org>, support@datacommons.org
- Message-ID: <CAK-qy=7vKmLQVryHGx5UPgVffSZsW4KZMVZiciq==Z4iDvPu2Q@mail.gmail.com>
On Tue, 20 Nov 2018, 19:37 Paola Di Maio <paola.dimaio@gmail.com> wrote:

> So, dear Guha et al.,
> are you excluding the possibility of any direct correlation between
> size of the graph and utility?
> I'd like to see the hypothesis evaluated either way

Counting triples is a very crude tool. Pragmatic tweaks to the modeling
structure can radically change the triple count without changing data
meaningfully.

At least we can be sure that zero-triple and infinitely large KGs are
equally useless; everything else needs more careful consideration...

Dan

> Dr Paola Di Maio
> Artificial Intelligence Knowledge Representation
> Special Issue, Systems MDPI
> *Cfp accepting manuscripts
> A bit about me
>
> On Wed, Nov 21, 2018 at 11:04 AM Guha <guha@google.com> wrote:
> >
> > Elwin,
> >
> > We are not trying to 'improve' the size of the data commons KG.
> > Merely its utility. In that spirit, I have to respectfully keep our
> > knowledge base size out of the discussions you reference.
> >
> > guha
> >
> > On Tue, Nov 20, 2018 at 5:11 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
> >>
> >> I totally agree with you: "what really matters is the utility of
> >> the data". However, it is important to note that building
> >> high-quality Knowledge Graphs (KGs) depends on the structure and on
> >> identified reliable data sources. "What gets measured, gets
> >> improved".
> >>
> >> I am just curious about these numbers for two reasons. First, some
> >> authors have estimated the cost of KG curation and noted that
> >> creating large-scale KGs can be hard and costly [1]. Moreover,
> >> whichever approach is taken for constructing a KG, the resulting KG
> >> is not always correct [2]. Second, there is the importance of KG
> >> cleaning in the KG creation lifecycle, where external sources
> >> (e.g., the Data Commons Knowledge Graph, DCKG) can be used as input
> >> for training data, i.e., a machine learning approach may require a
> >> sufficient quantity of training data.
> >>
> >> Thank you for your time!
> >>
> >> Elwin
> >>
> >> [1] Paulheim, H. (2018). How much is a triple? Estimating the cost
> >> of knowledge graph creation. In Proceedings of the ISWC 2018
> >> Posters & Demonstrations, Industry and Blue Sky Ideas Tracks
> >> co-located with 17th International Semantic Web Conference (ISWC
> >> 2018), Monterey, USA, October 8th to 12th, 2018.
> >> [2] Bordes, A. and Gabrilovich, E. (2014). Constructing and mining
> >> web-scale knowledge graphs: KDD 2014 tutorial. In Proceedings of
> >> the 20th ACM SIGKDD International Conference on Knowledge Discovery
> >> and Data Mining, KDD ’14, pages 1967–1967, New York, NY, USA. ACM.
> >>
> >> On Mon, 19 Nov 2018 at 22:35, Guha <guha@google.com> wrote:
> >>>
> >>> Elwin,
> >>>
> >>> These numbers are rapidly changing and as we have all learnt, what
> >>> matters really is the utility of the data, not the size of the
> >>> graph.
> >>>
> >>> So, could you give us some context?
> >>>
> >>> guha
> >>>
> >>> On Mon, Nov 19, 2018 at 11:47 AM Elwin Huaman <elwinlhq@gmail.com> wrote:
> >>>>
> >>>> Hey all,
> >>>>
> >>>> I was challenged last week to provide info (in rough numbers)
> >>>> about the Data Commons Knowledge Graph (DCKG), which was
> >>>> constructed by synthesizing a single Knowledge Graph from many
> >>>> different data sources [1]. What I am especially looking to know
> >>>> is:
> >>>>
> >>>> How many entities or nodes does the DCKG have? I understand that
> >>>> a dcid (DataCommons identifier) is a unique identifier assigned
> >>>> to each entity in the knowledge graph, and that entities are
> >>>> represented by nodes [2].
> >>>> How many data sources does the DCKG have? It currently contains
> >>>> data from Wikipedia, the US Census, NOAA, FBI, etc. [3].
> >>>> How many nodes and relations does the DCKG have, and how many
> >>>> statements?
> >>>>
> >>>> For example, the statement "Santa Clara County is contained in
> >>>> the State of California" is represented in the graph as two
> >>>> nodes: "Santa Clara County" and "California" with an edge labeled
> >>>> "containedInPlace" pointing from Santa Clara to California.
> >>>>
> >>>> What is the current size of the vocabulary used in the DCKG,
> >>>> taking into account that dataCommons.org builds upon the
> >>>> vocabularies defined by Schema.org [4]?
> >>>> These are potential FAQs for future researchers (of course there
> >>>> are more).
> >>>>
> >>>> Could you help me?
> >>>>
> >>>> cheers,
> >>>> Elwin Huaman
> >>>>
> >>>>
> >>>> [1] https://browser.datacommons.org/
> >>>> [2] https://colab.research.google.com/drive/1vffnWktZyffk7pNfpuXrTsCpp-od5W47
> >>>> [3] https://datacommons.org/
> >>>> [4] https://datacommons.org/faq
> >>>>
> >>
> >>
> >> --
> >> _____________________________
> >> Elwin Huaman
> >> Web Engineer
> >> +34 671 529 151
> >> +43 0677 631 129 47
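The point made at the top of this thread, that modeling choices can change the triple count without changing the data, can be illustrated with the "containedInPlace" example quoted above. The following is only a minimal sketch in plain Python: the identifiers (`SantaClaraCounty`, `stmt1`) are made up for illustration and are not actual DCKG dcids, and the second form uses standard RDF reification terms as one possible alternative modeling, not something the thread says Data Commons does.

```python
# Illustrative identifiers only; these are NOT real DCKG dcids.

# Modeling A: the fact as a single direct edge (1 triple).
direct = {
    ("SantaClaraCounty", "containedInPlace", "California"),
}

# Modeling B: the same fact as a reified RDF statement (4 triples).
reified = {
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "SantaClaraCounty"),
    ("stmt1", "rdf:predicate", "containedInPlace"),
    ("stmt1", "rdf:object", "California"),
}


def facts_from_reified(triples):
    """Collapse reified statements back into plain (s, p, o) facts."""
    by_stmt = {}
    for s, p, o in triples:
        by_stmt.setdefault(s, {})[p] = o
    return {
        (props["rdf:subject"], props["rdf:predicate"], props["rdf:object"])
        for props in by_stmt.values()
        if props.get("rdf:type") == "rdf:Statement"
    }


# Both modelings carry the same fact, yet the triple counts differ.
assert facts_from_reified(reified) == direct
print(len(direct), len(reified))
```

Under these assumptions, the reified form holds four times as many triples as the direct form while expressing the same single fact, which is why a raw triple count says little about how much a graph actually contains.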
Received on Wednesday, 21 November 2018 04:39:14 UTC