Re: Word sense discrimination

Hi Dave (/list)

I started working on an email response, but it's turning into a sort of
paper expanding on my introduction to this group given the topic, so I
will post it separately; it's still a work in progress.


On Tue, 24 Aug 2021 at 00:52, Dave Raggett <dsr@w3.org> wrote:

> I hope you have had a pleasant summer.  I’ve been doing some background
> reading, and am getting closer to starting some further implementation
> experiments. The aim is to develop a good enough solution for transforming
> natural language into graph representations of the meaning. By good enough,
> I mean good enough to enable work on using natural language in experiments
> on human-like reasoning and learning.
>
> Natural language understanding can be broken down into sub-tasks, such as,
> part of speech tagging, phrase structure analysis, semantic and pragmatic
> analysis. The difficulties mainly occur with the semantic processing. Many
> words can have multiple meanings, but humans effortlessly understand which
> meaning is intended in any given case. Semantic processing is also needed
> to figure out prepositional attachments, and  determine what pronouns, and
> other kinds of noun phrases, are referring to.
>

I'm led to believe some of the inferencing aspects link to semiotics
(https://en.wikipedia.org/wiki/Semiotics).

Back in 2001, I had a minor involvement in a project to catalogue Digibeta
tapes and make them searchable as phonetically transcribed MPEG-2 files in
a database.  From memory, https://en.wikipedia.org/wiki/Nuance_Communications
was the leader in phonetic analysis at that time.

Early last decade I learned of https://www.mico-project.eu/ and, with that,
SPARQL-MM, which I hoped could provide an open-standards-based methodology /
tooling / reference platform. I am not presently aware of more advanced
work done since.

QUESTION: How may the outcome support 'freedom of thought', and how does
that relate to the W3C patent pool related mandates / membership interests,
etc.?

>
> Two decades ago work on word sense disambiguation focused on n-gram
> statistics for word collocations. More recently, artificial neural networks
> have proved to be very effective at unsupervised learning of statistical
> language models for predicting what text is likely to follow on from a
> given text passage. Unfortunately, marvellous as this is, it isn’t
> transferrable to tasks such as word sense disambiguation and measuring
> semantic consistency for deciding on prepositional attachments etc.
>
> I am therefore still looking for practical ways to exploit natural
> language corpora to determine word senses in context. The intended sense of
> a word is correlated to the words with which it appears in any given
> utterance. The accompanying words vary in their specificity for
> discriminating particular word senses. However, strongly discriminating
> words may be found several words away from the word in question. A simple
> n-gram model would require an impractical amount of memory to capture such
> dependencies.  We therefore need a way to learn which words/features to pay
> attention to, and what can be safely forgotten as a means to limit the
> demand on memory.
>
> I rather like the 1995 paper by David Yarowsky “Unsupervised word sense
> disambiguation rivalling supervised methods”. This assumes that words have
> one sense per discourse and one sense per collocation, and exploits this in
> an iterative bootstrapping procedure. Other papers exploit linguistic
> resources like WordNet. I am now hoping to experiment with Yarowsky’s ideas
> using loose parsing for longer range dependencies, together with heuristics
> for discarding collocation data with weak discrimination.
>
> I’ve downloaded free samples of large corpora from www.corpusdata.org as
> a basis for experimentation.
>

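A minimal sketch of the collocation-scoring idea quoted above, assuming two
hypothetical sense labels "A" and "B" and a handful of already-labelled seed
examples. It scores each context word within a window by a smoothed
log-likelihood ratio, in the spirit of the decision lists in Yarowsky's paper,
and drops weakly discriminating words. The function name, window size,
smoothing constant and threshold are illustrative assumptions, not anything
specified in the thread.

```python
import math
from collections import Counter, defaultdict

def collocation_scores(examples, target, window=5, alpha=0.1, min_score=1.0):
    """Rank context words by how strongly they discriminate two senses.

    examples: list of (tokens, sense) pairs; sense is "A" or "B"
              (hypothetical labels for the two senses of `target`).
    Returns [(word, score)] sorted by |score|; positive favours "A",
    negative favours "B". Words with |score| < min_score are discarded.
    """
    counts = defaultdict(Counter)  # context word -> Counter({sense: count})
    for tokens, sense in examples:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]][sense] += 1

    scored = []
    for word, c in counts.items():
        # Smoothed log-likelihood ratio; alpha avoids division by zero.
        score = math.log((c["A"] + alpha) / (c["B"] + alpha))
        if abs(score) >= min_score:
            scored.append((word, score))
    return sorted(scored, key=lambda ws: -abs(ws[1]))

# Toy seed data: "A" = living plant, "B" = industrial plant.
examples = [
    ("the plant needs water and light".split(), "A"),
    ("water the plant every day".split(), "A"),
    ("the plant employs many workers".split(), "B"),
    ("workers at the plant went home".split(), "B"),
]
for word, score in collocation_scores(examples, "plant"):
    print(f"{word:10s} {score:+.2f}")
```

In this toy data "the" occurs equally often with both senses, so it scores
zero and is dropped, while "water" and "workers" are kept as strong cues.
The bootstrapping loop (relabelling unlabelled examples and re-scoring) is
deliberately left out.
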
Perhaps creating some sort of GitHub file or repository that references
an array of open resources could be useful?



> Each word is given with its lemma and part of speech, e.g. "announced",
> "announce", "vvn". This will enable me to apply shift-reduce parsing to
> build phrase structures as an input to computing collocations. Further work
> would address the potential for utilising prior knowledge, e.g. from
> WordNet, and how to compute measures of semantic consistency for resolving
> noun phrases and attachment of prepositions as verb arguments.  An open
> question is whether this can be done effectively without resorting to
> artificial neural networks.
>
> Anyone interested in helping?
>

Yes, but 'not ready yet' (personally)...  noting I do not have all the
skills necessary to support the underlying scope of work without help /
cooperative collaboration with others, etc.
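
As a small starting point on the parsing step quoted above, here is a rough
sketch of greedy shift-reduce chunking over (word, lemma, part-of-speech)
triples. The tag names follow the style of the "vvn" example; the coarse
category map and the three reduction rules are assumptions made up for
illustration, and there is no lookahead or ambiguity handling.

```python
# Coarse categories for a few CLAWS-style tags (an assumed subset only).
COARSE = {"at": "DET", "jj": "ADJ", "nn1": "N", "nn2": "N",
          "vvd": "V", "vvn": "V", "ii": "P"}

# Reduction rules as (right-hand side, left-hand side), tried in order.
RULES = [(("ADJ", "N"), "N"), (("DET", "N"), "NP"), (("P", "NP"), "PP")]

def shift_reduce(triples):
    """Return a stack of (category, [lemmas]) phrases built greedily."""
    stack = []
    for _word, lemma, pos in triples:
        # Shift: push the next token under its coarse category.
        stack.append((COARSE.get(pos, pos.upper()), [lemma]))
        # Reduce: keep applying rules to the top of the stack.
        reduced = True
        while reduced:
            reduced = False
            for rhs, lhs in RULES:
                n = len(rhs)
                if len(stack) >= n and tuple(c for c, _ in stack[-n:]) == rhs:
                    lemmas = [l for _, ls in stack[-n:] for l in ls]
                    del stack[-n:]
                    stack.append((lhs, lemmas))
                    reduced = True
                    break
    return stack

sentence = [
    ("the", "the", "at"), ("minister", "minister", "nn1"),
    ("announced", "announce", "vvd"),
    ("the", "the", "at"), ("new", "new", "jj"), ("policy", "policy", "nn1"),
    ("in", "in", "ii"), ("the", "the", "at"), ("budget", "budget", "nn1"),
]
for category, lemmas in shift_reduce(sentence):
    print(category, lemmas)
```

The prepositional phrase is left unattached on the stack: deciding whether
"in the budget" belongs to the verb or to "policy" is the semantic step the
quoted text describes, and this sketch does not attempt it.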

Also, isn't this sort of stuff computationally intensive?  How are
experiments funded, or how could they be?  Is there a schema for how
projects are defined by the scope that is incorporated and the aspects
set aside?

Q: What in the proposed specification supports temporal considerations,
and how?  Including, but not limited to, cases where an inference depends
on an API and/or third-party query service...

Therein, inferences based on 'half truths' (in simple language) are likely
to differ from inferences / determinations (causality) made with a better
means of forming opinions. As with mindfulness / consciousness and related
facets, this isn't necessarily about results that are bad for the purposes
intended by others, but sometimes that is the case.  The underlying concept
links to the idea of 'the status of the observer':
https://www.youtube.com/watch?v=ZYPjXz1MVv0&list=PLCbmz0VSZ_voTpRK9-o5RksERak4kOL40&index=4&t=5s
My much longer writing (I've been working on it for a couple of hours so
far, expressly for this group's work) will go into the considerations /
deliberations in more detail. Suffice to say for now that, IMO, it's quite
complicated stuff...

I see here:
https://github.com/w3c/cogai/blob/master/demos/decision-tree/rules.chk
a series of considerations about a 'way of thinking' for a particular
illustrated underlying concept.  It seems obvious that some such examples
are based on physics or similar (i.e. gravity, amongst others), while
others may be more subjective (i.e. linked to religious / worship-related /
spiritual beliefs, or to medical procedures, including but not limited to
OSCEs:
https://en.wikipedia.org/wiki/Objective_structured_clinical_examination ).
Does the present scope of work have a concept of 'libraries' or 'sources'
or similar?  The sci-fi example would be Neo uploading knowledge
( https://www.youtube.com/watch?v=w_8NsPQBdV0 ); the more pragmatic example
would be virus signature libraries uploaded (or downloaded, depending on
how you think about it) into anti-virus programs...

Part of the underlying thought is about 'computational load', which will
likely affect how solutions can be deployed (how well they may be
'democratised', or similar).

Also, what consideration has been given to storing resources on DLTs (i.e.
blockchains, DHTs, cryptographically signed (tamper-evident), decentralised
resources)?

Timothy Holborn.


> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things

Received on Tuesday, 24 August 2021 12:22:59 UTC