- From: Mike Bergman <mike@mkbergman.com>
- Date: Wed, 17 Jun 2020 19:57:59 -0500
- To: paoladimaio10@googlemail.com
- Cc: W3C AIKR CG <public-aikr@w3.org>
- Message-ID: <e30d9fe6-f22f-e9f8-1ae8-865de442b281@mkbergman.com>
Hi Paola,

You always ask the big questions. ;) So, I will try to limit my response to big answers.

As for Wikipedia, I know or suspect there is bias and falsity in some of the information. I see little of it directly; it shows up more as errors of omission or viewpoint than as outright falsehoods. My suspicion is that the actual percentage of unreliable information is quite low, though the information may still be incomplete.

One point worth making has to do with the so-called 'gold standards' that are essential to all science-based assessments, particularly with regard to human language or knowledge. Studies often see interannotator agreement in the 75-80% range, and only very widely used standards (like WordNet or various language corpora) reach agreement in the 90-95% range. This is an actual error term, so when one sees F1 stats or similar, perhaps of 80% or whatever, you need to discount that figure by the interannotator agreement percentage. Many NLP tests claiming 85-90% agreement are actually closer to 64% to 85% once we adjust for interannotator agreement (e.g., an 85% result scored against a gold standard with 75% interannotator agreement is effectively about 0.85 × 0.75 ≈ 64%). Is 35% to 15% of the information on Wikipedia bad?

As a general matter, I am extremely leery of fact-checking services: what are the standards? Who are the annotators? What is their interannotator agreement? These are science-based concerns, and I have ethical ones as well.

As for the information in KBpedia, we tend to check most if not all of our links with each release (releases have averaged every 4-6 months or so). That is perhaps not frequent enough, but we also tend to tie into the more central or structural concepts in these external sources, rather than the leaves, which are more dynamic. The way KBpedia works is to tie into a key linkage point in an external source, and then use that linkage point to retrieve current instances from that source. That is one reason why there are only 58 K concepts in KBpedia, yet they tie into tens of millions of instances as maintained by the external sources.

The reasoning we do is of the traditional deductive kinds (consistency, satisfiability, and subsumption) using reasoners like Pellet or HermiT, plus inductive reasoning based on various supervised machine learning approaches. We are not using abductive reasoning, but one reason for trying to follow the insights of Charles Peirce is that we then have a means to get into that hypothesis-generating and -screening logic, which Peirce did more than anyone to explicate. It is an area I personally want to pursue further.

The management of information follows a triple/quad store that handles the overall reasoning knowledge graph, with direct retrievals of instance data from the source knowledge bases (the seven specifically mentioned, plus another score of minor ones). Thus, KBpedia is not a massive, centralized system, but a rather lightweight one with distributed access and retrieval from its contributor sources. Of course, this kind of Web-oriented architecture with all resources identified by IRIs is one of the reasons semantic technologies make such great sense.

Lastly, in terms of big useful lessons, I would point to the power of having "correct" KR distinctions, on the noun side, between instances (individuals), types (generals or concepts), events, and attributes (monadic characteristics like color or shape); and, on the verb side, relations that split between attributes, direct relations, and representations (indexes and denotations). Look at any top-level ontology or knowledge graph and ask yourself whether and how it handles these distinctions.
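To make that concrete, here is a toy sketch in Python with the Owlready2 library. All of the class, property, and individual names below are invented for illustration; none of this is KBpedia or KKO code.

# A toy sketch only -- the names are made up for illustration and are not KBpedia's.
from owlready2 import get_ontology, Thing, DataProperty, ObjectProperty

onto = get_ontology("http://example.org/kr-sketch.owl")

with onto:
    class Product(Thing):             # a type (a general, or concept)
        pass

    class SaleEvent(Thing):           # an event, kept distinct from ordinary types
        pass

    class has_color(DataProperty):    # an attribute: a monadic characteristic
        domain = [Product]
        range  = [str]

    class sold_in(ObjectProperty):    # a direct relation between individuals
        domain = [Product]
        range  = [SaleEvent]

# Instances (individuals) are distinct from the types they instantiate.
widget  = Product("widget_001")
auction = SaleEvent("auction_2020_06")

widget.has_color = ["red"]            # attribute assertion
widget.sold_in   = [auction]          # relation assertion

onto.save(file="kr_sketch.owl", format="rdfxml")

The point is simply that the type, its instances, the monadic attribute, and the direct relation each get their own construct, rather than everything being flattened into generic "concepts".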
Most top-level ontologies do not, or they only hand-wave. The distinctions that we use on these matters again come from the insights of Charles Sanders Peirce.

Best, Mike

On 6/16/2020 6:49 PM, Paola Di Maio wrote:
> Thank you Mike, looks like a big interesting project, congrats for the release.
>
> Now, the problem I have with Wikipedia is that in addition to containing good articles sometimes, it is not fact checked; there is a lot of rubbish/false information (true, there is quite a lot of rubbish outside of Wikipedia too).
>
> A few questions: how often is the data pulled/updated from these databases? Is the data stored in SQL, or how? How does the system manage the integration of different data sets/data structures? Can you share the design of the inference model/reasoning architecture? What are the implications/useful lessons for KR we can learn from this project?
>
> On Tue, Jun 16, 2020 at 10:27 PM Mike Bergman <mike@mkbergman.com <mailto:mike@mkbergman.com>> wrote:
>
> To All,
>
> I am pleased to announce that we have released KBpedia <http://kbpedia.org/> v 2.50 with e-commerce and logistics capabilities, as well as significant other refinements. This upgrade comes from adding the entire top structure and the most common products and services of the United Nations Standard Products and Services Code. UNSPSC <https://en.wikipedia.org/wiki/UNSPSC> is a comprehensive, multi-lingual taxonomy for products and services, organized into four levels, with third-party crosswalks to economic and demographic data sources. It is a leading standard for many industrial and economic applications. UNSPSC is KBpedia's seventh core knowledge base, joining the public knowledge bases of Wikipedia <https://en.wikipedia.org/wiki/Wikipedia>, Wikidata <https://en.wikipedia.org/wiki/Wikidata>, GeoNames <https://en.wikipedia.org/wiki/GeoNames>, DBpedia <https://en.wikipedia.org/wiki/DBpedia>, schema.org <https://en.wikipedia.org/wiki/Schema.org>, and OpenCyc <https://en.wikipedia.org/wiki/Cyc> already integrated into the system.
>
> KBpedia is a knowledge graph that provides a coherent scaffolding to achieve its twin goals of data interoperability and knowledge-based artificial intelligence (KBAI <http://www.mkbergman.com/category/kbai/>). KBpedia now contains more than 58,000 reference concepts and nearly 200,000 unique mappings to its knowledge bases, enabling links to more than 40 million entities. It is written in the standard OWL 2 <https://en.wikipedia.org/wiki/Web_Ontology_Language> semantic language from the W3C <https://en.wikipedia.org/wiki/World_Wide_Web_Consortium>.
>
> KBpedia consists of 73 mostly disjoint typologies organized under an upper KBpedia Knowledge Ontology (KKO), which is designed according to the universal categories and knowledge representation insights of the great American 19th-century scientist, logician, and polymath, Charles Sanders Peirce <https://en.wikipedia.org/wiki/Charles_Sanders_Peirce>. KBpedia, KKO, and all of its mappings and files are open source under the Creative Commons Attribution 4.0 International (CC BY 4.0) <https://creativecommons.org/licenses/by/4.0/> license.
>
> For more details, see the release announcement <http://kbpedia.org/resources/news/kbpedia-adds-ecommerce/> or go to GitHub <https://github.com/Cognonto/kbpedia/blob/master/versions/2.50/> to download <http://kbpedia.org/resources/downloads/> the distro.
>
> Thanks, Mike
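As a quick way to poke at the release described in the announcement above, the OWL 2 distro can be loaded and classified with Owlready2 and its bundled HermiT reasoner. This is a minimal sketch, not KBpedia's own tooling, and the file path is a placeholder for wherever you unpack the download.

# Minimal sketch, assuming a local copy of the KBpedia OWL 2 distro; the
# filename below is a placeholder, not necessarily the name in the download.
from owlready2 import get_ontology, sync_reasoner, default_world

kb = get_ontology("file:///path/to/kbpedia_reference_concepts.owl").load()

print(len(list(kb.classes())), "classes loaded")  # on the order of the ~58,000 reference concepts

# sync_reasoner() runs the HermiT reasoner bundled with Owlready2 (Java required):
# it checks consistency and computes the subsumption (class) hierarchy.
# Unsatisfiable classes end up equivalent to owl:Nothing.
with kb:
    sync_reasoner()

print("unsatisfiable classes:", list(default_world.inconsistent_classes()))

This is the same consistency/satisfiability/subsumption trio mentioned in the reply; Pellet can be swapped in via Owlready2's sync_reasoner_pellet().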
Received on Thursday, 18 June 2020 00:58:14 UTC