Querying tools. Was: The AIDA Dashboard: a new tool for analysing research in Computer Science

(AIDA looks really nice - response to Sarven's sub-topic.)

Training is the problem, innit.
Over the years I have needed to get this sort of information out of documents (often the PDFs that Sarven so rightly dislikes).
When I tried to deploy typical NLP tooling to extract the sort of knowledge I wanted from them (to put into RDF relations, of course), it seemed that I needed to get interested in what the NLP system was doing, and to tune and train it for the subject matter, often down to the level of the research domain, with training sets and so on.
I really didn’t want to do that - that was not the focus of my work or expertise.
I want to deliver my whole system to be run unsupervised by end users.

What I needed was a system that I could simply show a document and ask:
“Who wrote this?”, “What is it about?”, “What is the conclusion?”
and then construct the appropriate triples from the answers.

It turns out there are such systems.
Large language models (LLMs) seem to have reached amazing levels now, as is widely reported.
And they can be set up as QA systems.
So why not simply ask such a system the questions you want, and use the answers to populate your RDF metadata?

This is what I have been experimenting with recently.
I tripped over https://huggingface.co/spaces/impira/docquery
(Download also available at https://github.com/impira/docquery )
which seemed to fit the bill.
It isn’t designed for academic papers as it stands; it is aimed more at business documents and other structured and semi-structured material.
But I have had some success with its standard configuration, asking it questions about letters, academic papers, summaries and freeform descriptions of documents.
Moving beyond the metadata that is routinely available for most papers is unreliable, but can bear useful fruit, although I haven’t done any large-scale experiments yet.
It was trained on OCRed documents and PDFs, so it may well be ideal for me (it is historical archives I am interested in at the moment).
It doesn’t seem to deal with hand-written documents very well.

So now I hope to have a service hooked up to my KA system that I can fire arbitrary documents at and get RDF triples back.
I would expect that adjustments to the system and the choice of model (others are available to DocQuery) will improve results; I need to see how far this goes.

It may be that there is an existing NLP web service or downloadable tool that already does this for me and that I have missed (I haven’t looked for a while).
But this way of doing things is certainly an option.
And it may be that others are already doing this with more success than I have managed.
It would be great to hear about better approaches.

Best
Hugh

> On 14 Sep 2022, at 09:48, Sarven Capadisli <info@csarven.ca> wrote:
> 
> On 2022-09-14 10:11, Angelo Salatino wrote:
>> As a joint effort between Springer Nature, the Open University, and the University of Cagliari, we recently launched the AIDA Dashboard [1], https://w3id.org/aida/dashboard, an innovative tool for exploring and making sense of the dynamics of research topics, scientific conferences and journals in Computer Science.
> 
> 
> Is there an innovative tool for querying or exploring significant units of information in research findings and making sense of the dynamics of research topics, scientific conferences and journals in Computer Science?
> 
> Is it possible to discover problem statements, motivation, hypothesis, arguments, workflow steps, methodology, design, results, evaluation, conclusions, future challenges, as well as all inline semantic citations (to name a few) where they are uniquely identified and related to other data?
> 
> If not, why not?
> 
> -Sarven
> https://csarven.ca/#i
