Re: Something on grammar

Hi Dave,

> Many thanks for your thoughtful comments and pointers.
>

;)

> As far as I am aware, current approaches to applying artificial neural
> networks (e.g. BERT, GPT-3) focus on syntax, with meaning indirectly
> expressed in terms of statistics over the co-occurrence of words. That
> doesn’t get us very far when it comes to reasoning about meanings.
>

I don't think neural models should be seen as models of the mind, but they
can be trained to produce representations in accordance with any particular
representation framework.

>  Complexity of symbolic parsing. Notoriously slow when it comes to larger
> dictionaries
>
> Can you please expand on that as it isn’t obvious to me. Perhaps this is
> something to do with the kind of parsers they’ve used?
>
>
> Kind of. There are linear-time (O(n)) parsers, but achieving linear time
> means limiting context awareness and ignoring (or postponing) the resolution
> of ambiguities. Shift-reduce parsers with backtracking can be exponential.
> In between, everything is possible, but realistically, the more
> expressive grammar formalisms are mildly context-sensitive and can be
> parsed in polynomial time (e.g., Öttl et al. 2015). So, we're talking
> about something between O(n^2) or O(n^3) and O(n^6) time complexity. Here,
> n is the length of the sentence, but in practice, the size of the ruleset
> (grammar+lexicon) has an immense impact, too (think of it as a constant
> factor multiplying the polynomial in n). Large coverage requires large
> rulesets, so this will work nicely for closed-domain applications, but
> beyond that, you have a tradeoff between scalability and coverage -- or, if
> you take a more restricted parsing formalism, context awareness.
>
>
> I think you’re describing the family of statistical phrase structure
> parsers which make heavy use of backtracking,
>

No. Heavy use of backtracking is exponential; those parsers aim, if you
will, at minimizing it. Realistically, a Left-Right Bottom-Up Shift-Reduce
(LR-BU-SR) parser can be linear only if every word is attached either to the
stack element that directly precedes it or to the next word. (As soon as we
allow ourselves not to attach immediately, we need repeated passes over the
current stack of partial parses, and we end up quadratic.) This works in
many cases, but it will often simply be incorrect. Think of "Nothing unheard
of.": "of nothing" is a phrase that is an argument of "unheard". An LR-BU-SR
parser would probably do something like the following:

- Step 0: read "Nothing" => SHIFT, stack is [nothing]
- Step 1: read "unheard", recognize it as a verb, attach Nothing as
argument to the verb => REDUCE, stack is [unheard(nothing)]
- Step 2: read "of", recognize it as preposition, expect it to come before
a noun phrase => SHIFT, stack is [unheard(nothing), of]
- Step 3: read end of sentence, no SHIFT possible (nothing follows), no
REDUCE possible (the last element of the stack cannot be attached -- for
that, we would need a rule for a free prepositional argument of "unheard")
=> FAIL

(Slightly simplified.)
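
To make the failure mode concrete, here is a toy sketch in Python (the
mini-lexicon and reduce rules are hypothetical and purely for illustration,
not tied to any particular formalism) of a greedy LR-BU-SR parser without
backtracking running on that sentence:

LEXICON = {"nothing": "N", "unheard": "V", "of": "P"}

# (category just below the top, category on top) -> category after REDUCE;
# deliberately no rule that attaches a stranded preposition to a verb
REDUCE_RULES = {
    ("N", "V"): "V",   # a verb takes the preceding noun as its argument
    ("P", "N"): "PP",  # a preposition combines with a following noun phrase
}

def greedy_sr_parse(words):
    stack = []
    for word in words:
        stack.append(LEXICON[word])                    # SHIFT
        while len(stack) >= 2 and (stack[-2], stack[-1]) in REDUCE_RULES:
            right = stack.pop()
            left = stack.pop()
            stack.append(REDUCE_RULES[(left, right)])  # REDUCE immediately
        print(f"after {word!r}: stack = {stack}")
    return stack

final = greedy_sr_parse(["nothing", "unheard", "of"])
# after 'nothing': stack = ['N']
# after 'unheard': stack = ['V']     ("nothing" already reduced into the verb)
# after 'of': stack = ['V', 'P']     (nothing follows, no rule fires: FAIL)
print("SUCCESS" if final == ["V"] else f"FAIL: {final}")

Attaching greedily as soon as a rule fires is exactly what keeps the parser
linear, and exactly what makes the error unrecoverable without backtracking
or reanalysis.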

There is not really a way to circumvent this particular error with a
linear-time parser. Going polynomial is not really a problem, though --
unless we try to have good-coverage grammars, because those come with large
rule sets. If we eliminate the grammar (and use a classifier instead),
things can speed up a lot (with a certain error rate), but then we don't use
rules anymore -- see the sketch below.
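
For contrast, a hedged sketch of that classifier variant (the scoring
function stands in for any trained model and is purely hypothetical): the
hand-written rule lookup above is replaced by a learned decision over local
features.

def classifier_step(stack, buffer, score_action):
    """Pick the next transition from local features instead of grammar rules.
    `score_action` is a hypothetical learned scorer: (action, features) -> float."""
    features = (
        stack[-1] if stack else "<empty>",           # category on top of the stack
        stack[-2] if len(stack) > 1 else "<empty>",  # category below it
        buffer[0] if buffer else "<end>",            # next input word
    )
    return max(("SHIFT", "REDUCE"), key=lambda a: score_action(a, features))

The decision is made in constant time per word, so the parser stays fast,
but any mistake the scorer makes simply becomes part of the error rate.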


> and use statistics to rank different parse trees given the syntactic
> ambiguities that are so prevalent in human languages. This backtracking can
> be avoided by incremental concurrent processing of syntax and semantics,
> where syntax guides semantics and vice versa. This is something I am trying
> to realise.
>

Incremental parsing works, and backtracking can even be eliminated
completely if multiple (all?) possible analyses are pursued simultaneously.
But that does not solve the time complexity problem, because of the growth
of the search space (so we need to search the alternatives rather than start
over again). A solution is to pursue only a fixed number of alternative
parse hypotheses at a time (e.g., using beam search), but there is no
guarantee that you pick the right hypotheses. The smaller the beam width
(and the more efficient the parser), the more likely the errors. A beam
width of 1 is linear.
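
A minimal sketch of that idea (the `expand` and `score` functions are
placeholders for whatever transition system and scoring model one uses;
nothing here is specific to a particular parser):

import heapq

def beam_parse(words, initial_state, expand, score, beam_width=4):
    """Keep only the `beam_width` best partial analyses after each word."""
    beam = [initial_state]
    for word in words:
        candidates = []
        for state in beam:
            candidates.extend(expand(state, word))  # all SHIFT/REDUCE successors
        # prune: keep the highest-scoring hypotheses only -- this is where the
        # correct analysis can be lost, and why there are no guarantees
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return max(beam, key=score) if beam else None

With beam_width=1 this degenerates into the greedy linear-time parser
sketched above; larger beams trade speed for a better chance of keeping the
correct analysis alive.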


> As an example, one step is to identify the part of speech and word sense
> for a word given the current state of processing an utterance together with
> the next word. This can take the statistics for part of speech sequences
> into account, along with the meaning of the previous words, and semantic
> models of meaning using spreading activation as originally proposed by
> Quillian.
>
> Here is an example of previous work on this using ACT-R:
>
> “A cognitive approach to word sense disambiguation”, Sudakshina Dutta &
> Anupam Basu
> https://link.springer.com/chapter/10.1007/978-3-642-28604-9_18
>
> And here is an account of top-down syntactic parsing using ACT-R:
>
>  “The Basics of Syntactic Parsing in ACT-R”, Adrian Brasoveanu & Jakub
> Dotlačil
> https://link.springer.com/chapter/10.1007/978-3-030-31846-8_3
>

Thank you, I'm remotely familiar with ACT-R (Shravan Vasishth was on my PhD
committee) -- but only remotely. I find it very hard to find any information
about the *computational* complexity of parsing with ACT-R, mostly because
they talk a lot about cognitive complexity, which is a completely different
beast.

> I want to support out-of-vocabulary words in a way that reflects how
> children are seen to deal with them using the syntactic and semantic
> context.
>

It may also be interesting to see what philologists and lexicographers do
when they see an unknown word and are not native speakers themselves:
basically, they check collocates and possible substitutes in these contexts
to approximate the meaning. This is pretty much the same insight that
underlies the use of word embeddings (which are basically collocation lists
plus dimensionality reduction), as sketched below.
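
A small sketch of that insight (toy two-sentence corpus, illustration only):
count co-occurrences within a window, then reduce the dimensionality of the
count matrix with a truncated SVD -- essentially count-based word embeddings.

import numpy as np

corpus = [
    "the philologist checks the collocates of the unknown word".split(),
    "the lexicographer checks possible substitutes in context".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# symmetric co-occurrence counts within a window of +/- 2 words
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# dimensionality reduction: keep the strongest left singular directions
U, S, _ = np.linalg.svd(counts, full_matrices=False)
embeddings = U[:, :5] * S[:5]   # 5-dimensional vectors, one row per word
print(embeddings[index["collocates"]])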

> Overall, the history of symbolic parsing points to the existence of two
> major challenges:
>
> - How to acquire the necessary knowledge (rules, etc.) ?
> - How to make processing performant?
>
>
> Applications for modest closed domains are quite practical using manually
> developed knowledge,
>

Yes, that's a realistic goal.


> but to go beyond that we need to model how humans learn in terms of a
> toolkit of techniques, including metacognition.
>
> Processing speed is a lesser challenge. Concurrent syntactic and semantic
> processing makes parsing relatively easy. Graph algorithms can be speeded
> up using hardware acceleration, and rule conditions can be compiled into
> discrimination networks that mimic sub-cortical regions of the brain.
>

In theory, yes. But if we end up with anything exponential (which our brain
apparently *can* handle because of massive parallelization), the approach
won't technically scale. On the other hand, scalability is not so important
if we can restrict the input, for example, to a fixed maximum length, say,
10 words per sentence. This just means that a certain percentage of input
sentences cannot be processed. With 10 words, however, that would be
something like 90%.
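
A tiny sketch of that tradeoff (the sentence lengths below are made up; in
practice they would come from whatever corpus the application targets): what
share of input is lost when we cap the sentence length?

def share_rejected(sentence_lengths, max_len=10):
    """Fraction of sentences longer than `max_len` words, i.e. not processed."""
    return sum(1 for n in sentence_lengths if n > max_len) / len(sentence_lengths)

print(share_rejected([4, 12, 23, 8, 31, 17, 9, 25], max_len=10))  # 0.625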

> Regardless of whether CogAI solutions are based on these earlier lines
> of research or are developed from scratch, these are the challenges to be
> expected. For closed domains and demos, all can work nicely. For anything
> beyond that, we need to think about how to address them. The approach of
> NLP has largely been to abandon symbolic parsing, but this development is
> less driven by scientific insight than by the prospect of trading human
> expertise for computing power (remember "Every time I fire a linguist ...").
>
>
> We have seen symbolic approaches replaced by statistical approaches that
> are free of semantics, and now we need to figure out how to focus on
> human-style semantic reasoning and to combine symbolic and statistical
> approaches. This is where there is a great deal to gain from the work
> across the cognitive sciences.
>

Yes.

> In any case, I'm not sure I fully understand the chunk mechanism, so it's
> likely I miss something obvious.
>
>
> I’d be happy to answer any questions you have. If you have the interest
> and time, there is an extensive body of literature on ACT-R.
>

A pointer to any complexity analysis would be helpful -- in particular, an
assessment of the effect of the ruleset size on parsing time, just to see
whether it's feasible at all to scale the approach.


>
>
> Cognitive parsers should be able to make some sense of incomplete or
> ungrammatical utterances. This also relates to the potential for learning
> new grammar.
>
>
> That would be ideal. How would that work in practice?
>
>
> The shift-reduce parser is essentially bottom up and the grammar is
> implicit in the shift-reduce rules. The syntax-semantics mapping rules try
> to make sense out of what phrase structure is constructed. This in turn is
> subject to higher level reasoning in the context of the agent’s current
> goals.
>

Coming back to the example above, I guess what you're hinting at is that
the parser may dynamically create a new "repair" rule stating that
prepositions without noun phrases can attach to verbs. This is not
necessarily wrong (it is in this case, but not for particle verbs like
"looking for"), but it means that every sequence of verb+preposition now has
two possible interpretations. Using this strategy as an inference mechanism
will massively increase the size of the ruleset (and thus parsing time) and
the degree of ambiguity (even if weighted by frequency). Because of the
prevalence of Zipf distributions in language, there will be a very long
tail, and creating more and more rules to improve grammar coverage in
increasingly smaller steps will become intractable at some point (see the
toy calculation below). It must be constrained in some way.
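
A toy calculation of the Zipf argument (synthetic 1/rank frequencies, not
real grammar data): how much token coverage further rules actually buy.

import numpy as np

n_constructions = 100_000
ranks = np.arange(1, n_constructions + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()  # classic Zipf: frequency ~ 1/rank

coverage = np.cumsum(probs)
for n_rules in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n_rules:>7} rules cover {coverage[n_rules - 1]:.1%} of tokens")
# each individual new rule covers less and less, so it takes roughly ten
# times as many rules to gain the same ~19% of additional coverage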


> What’s much less clear is how the agent learns new grammar as it involves
> extensions to the lexicon, shift-reduce rules and the syntax-semantic
> rules. I am looking for ways to apply weakly supervised learning of new
> rules from even single examples, along with means to generalise or
> specialise competing hypotheses. The way that children learn should provide
> rich insights as to what is needed, along with a suite of heuristics to
> apply.
>

I guess what we actually do on a cognitive level is some form of implicitly
weighted parallel processing (be it for reasoning or parsing): the
unconscious parallel creation of several more or less plausible parsing
alternatives and the concurrent reanalysis of previously seen input. If so,
a cognitively plausible technical solution is massive parallelization. That
would also explain why we can do exponential-time operations in real time
(i.e., linear time). I'm just not sure we'll ever be as good as the brain on
that ;)

Best regards,
Christian


>
> Kind regards,
>
> Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
> W3C Data Activity Lead & W3C champion for the Web of things

Received on Wednesday, 20 January 2021 17:57:38 UTC