How to represent theories? from Amirouche Boubekki on 2019-11-01 (public-aikr@w3.org from November 2019)

From: Amirouche Boubekki <amirouche.boubekki@gmail.com>
Date: Fri, 1 Nov 2019 15:45:56 +0100
To: W3C AIKR CG <public-aikr@w3.org>
Message-ID: <CAL7_Mo8YHYaWKk2vJLsBeWzsktw9h8fg89HkumhHcAEym8gbHA@mail.gmail.com>

While working on vnstore (formerly fstore), I stumbled upon an
interesting problem: how to represent several theories produced by an
algorithm in the context of a versioned, branchable database (like
git).

Consider, for instance, a gazetteer-based entity-resolution system as
described in the following question:

https://stackoverflow.com/q/52046394/140837

Here is the code:

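# Tokenized input sentence ("tokens" avoids shadowing the builtin input()).
tokens = 'new york is the big apple'.split()


def spans(lst):
    """Yield every way to split lst into contiguous sub-lists."""
    if not lst:
        return
    for index in range(1, len(lst)):
        # Fix the first segment, then recurse on the remaining tokens.
        for span in spans(lst[index:]):
            yield [lst[0:index]] + span
    # The whole list taken as a single segment.
    yield [lst]


knowledgebase = [
    ['new', 'york'],
    ['big', 'apple'],
]

out = []
scores = []

for span in spans(tokens):
    # Score = number of segments that exactly match a known entity.
    score = 0
    for candidate in span:
        for entity in knowledgebase:
            if candidate == entity:
                score += 1
    out.append(span)
    scores.append(score)

# Ascending order: the best-scoring segmentations are printed last.
leaderboard = sorted(zip(out, scores), key=lambda x: x[1])

for span, score in leaderboard:
    print(score, ' ~ ', span)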

The above (naive?) algorithm guesses multiple probable ways to link
a sentence to the knowledge base. A deterministic scoring heuristic
filters out many alternatives, but several can remain tied, for
example the following two:

  [['new', 'york'], ['is'], ['the'], ['big', 'apple']]
  [['new', 'york'], ['is', 'the'], ['big', 'apple']]

Those are two possible ways to link the input sentence "new york is the
big apple".

What I want to show is an example where a deterministic algorithm
cannot come up with a single result and must keep several "theories"
around downstream, so that another algorithm or knowledge acquired
later can eliminate zero or more of them.
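
To make the "keep theories, eliminate later" idea concrete, here is a
minimal sketch. It keeps every segmentation tied for the best score and
then filters them with a rule supplied later. The helper names
(score, theories, eliminate) and the "later knowledge" rule are
assumptions made up for illustration.

```python
def spans(tokens):
    """Yield every way to cut tokens into contiguous segments."""
    for index in range(1, len(tokens)):
        for rest in spans(tokens[index:]):
            yield [tokens[:index]] + rest
    yield [tokens]


def score(segmentation, knowledgebase):
    """Count the segments that match a knowledge base entry exactly."""
    return sum(1 for segment in segmentation if segment in knowledgebase)


def theories(tokens, knowledgebase):
    """Return all segmentations tied for the best score."""
    scored = [(score(s, knowledgebase), s) for s in spans(tokens)]
    best = max(value for value, _ in scored)
    return [s for value, s in scored if value == best]


def eliminate(candidates, is_plausible):
    """Drop the theories rejected by knowledge acquired later."""
    return [c for c in candidates if is_plausible(c)]


knowledgebase = [['new', 'york'], ['big', 'apple']]
tokens = 'new york is the big apple'.split()

# Two theories tie, so the scoring alone cannot pick a single result.
kept = theories(tokens, knowledgebase)

# Hypothetical later knowledge: 'is the' is never a single unit.
survivors = eliminate(kept, lambda s: ['is', 'the'] not in s)

for theory in survivors:
    print(theory)
```

The point of the split is that theories() never has to decide alone:
it hands every tied candidate downstream, and eliminate() may remove
none, some, or all but one of them.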

In the versioned nstore (vnstore), one can represent theories using
branches (as in git) OR using an abstraction on top of the nstore.
Representing theories in the vnstore requires access to the history
and branch information, along with some data to tie together the set
of theories that relate to a given problem. Theories on top of the
nstore, by contrast, require only that tying data, but they need extra
care to make sure one theory does not leak into another.

Using the nstore approach means that there is yet another structure,
the structure of alternative theories, on top of the nstore, and it is
very similar to the vnstore. It gives more freedom but it also leads
to a more complex system.
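
As a rough sketch of what "theories on top of the store" could look
like: every fact is tagged with a theory identifier, and a few extra
tuples tie the theories to their problem. TupleStore below is a toy
in-memory stand-in written for this example; it is not the nstore API,
and the identifiers ('problem-1', 'theory-a', ...) are made up.

```python
class TupleStore:
    """Toy in-memory set of tuples; a stand-in, not the nstore API."""

    def __init__(self):
        self._tuples = set()

    def add(self, *items):
        self._tuples.add(items)

    def remove_where(self, predicate):
        self._tuples = {t for t in self._tuples if not predicate(t)}

    def query(self, predicate):
        return sorted(t for t in self._tuples if predicate(t))


def facts_of(store, theory):
    """Read one theory only; forgetting this filter is how theories leak."""
    return store.query(lambda t: t[0] == theory)


def discard(store, theory):
    """Remove a theory's facts and the tuple linking it to its problem."""
    store.remove_where(lambda t: theory in (t[0], t[2]))


store = TupleStore()

# Data that ties the theories to the problem they try to solve.
store.add('problem-1', 'has-theory', 'theory-a')
store.add('problem-1', 'has-theory', 'theory-b')

# Every fact carries its theory identifier as the first item.
for segment in ['new york', 'is', 'the', 'big apple']:
    store.add('theory-a', 'segment', segment)
for segment in ['new york', 'is the', 'big apple']:
    store.add('theory-b', 'segment', segment)

before = facts_of(store, 'theory-b')
discard(store, 'theory-b')
remaining = store.query(lambda t: t[1] == 'has-theory')
print(remaining)
```

Nothing in the store enforces the isolation here: it only holds as
long as every read and write goes through helpers that filter on the
theory identifier, which is the "extra care" mentioned above.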

It seems to me that the vnstore already solves the problem of
"alternative theories": as in git, branches are alternative versions
of a piece of software. But it also seems that re-using the vnstore
abstraction for theories made by algorithms will lead to more complex
code.
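
For comparison, the branch idea can be illustrated with the standard
library alone. In this sketch each theory is a writable overlay on top
of shared facts, built with collections.ChainMap. It only illustrates
branch isolation: it has no history, and it is not how vnstore is
implemented. The keys and theory names are made up.

```python
from collections import ChainMap

# Shared facts that every theory agrees on (the "trunk").
trunk = {('sentence-1', 'text'): 'new york is the big apple'}

# Each branch is an overlay: writes land in the branch's own dict,
# reads fall through to the trunk, and the trunk is never modified.
branches = {
    'theory-a': ChainMap({}, trunk),
    'theory-b': ChainMap({}, trunk),
}

branches['theory-a'][('sentence-1', 'segments')] = (
    'new york', 'is', 'the', 'big apple')
branches['theory-b'][('sentence-1', 'segments')] = (
    'new york', 'is the', 'big apple')

# Eliminating a theory amounts to dropping its branch.
del branches['theory-b']

for name, branch in branches.items():
    print(name, branch[('sentence-1', 'segments')])
```

Here the isolation comes from the structure itself instead of from a
filter on every query, at the price of having to know which branch to
read from.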

What do you think? How do you handle alternative theories in your work?


-- 
Amirouche ~ https://hyper.dev

Received on Friday, 1 November 2019 14:46:10 UTC