Re: SemWeb + LLMs etc, here or a new group?

Thank you for sharing that information about graph retrieval-augmented generation (graph RAG).

With respect to the question-answering use case, I am pleased to share that I am presently revising the Wikianswers proposal<https://meta.wikimedia.org/wiki/Wikianswers>. Wikianswers would answer users' natural-language questions using a combination of resources: Wikipedia, Wikidata, and Commons. I am exploring how best to expand the Wikidata interoperability<https://meta.wikimedia.org/wiki/Wikianswers#Wikidata> section.

In this regard, I am thinking about the generation of SPARQL queries from natural-language questions. Some of these topics were discussed here in January and February of this year, in the thread "ChatGPT, ontologies and SPARQL". It is noteworthy that, e.g., using self-ask with search, LLMs can decompose complex natural-language questions into simpler ones during the answering process. Given these capabilities, being able to transform even simple user- and AI-originated natural-language questions into SPARQL queries would be tremendously useful.
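
For illustration, here is a minimal sketch of such a transformation in Python, assuming the pre-1.0 openai client that was current at the time of writing; the prompt wording, model choice, and example question are placeholders of mine rather than parts of the proposal:

# Minimal sketch: natural-language question -> SPARQL for the Wikidata
# endpoint. Assumes the pre-1.0 openai Python client (which reads the
# OPENAI_API_KEY environment variable); prompt and question are
# illustrative placeholders only.
import openai

SYSTEM_PROMPT = (
    "Translate the user's question into a single SPARQL query for the "
    "Wikidata endpoint (https://query.wikidata.org/sparql). "
    "Return only the SPARQL query."
)

def question_to_sparql(question: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(question_to_sparql("Which rivers flow through Prague?"))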

Also of interest to the proposal are wiki-technology topics and related human-curation scenarios, e.g., users being able to edit or correct AI-generated SPARQL queries in LLM-based or agent-based transcripts, which could be retained and stored alongside cached questions and answers. Corrected SPARQL queries could then serve as training data with which to improve AI systems.
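
One possible shape for such retained transcript entries, sketched in Python (the field names here are merely illustrative, not part of the proposal):

# Sketch of a record that retains an AI-generated SPARQL query alongside
# the cached question and answer, plus any human correction, so that
# corrected queries can later be exported as training data. All field
# names are hypothetical.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class QueryTranscript:
    question: str                            # user's natural-language question
    generated_sparql: str                    # query as produced by the AI system
    answer: str                              # cached answer shown to the user
    corrected_sparql: Optional[str] = None   # human-edited query, if any

    def training_pair(self) -> Tuple[str, str]:
        """Return a (question, query) pair for training, preferring the
        human-corrected query when one exists."""
        return (self.question, self.corrected_sparql or self.generated_sparql)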

The proposal also broaches interoperability with Commons, e.g., storing AI-generated multimedia content and its accompanying metadata there so that the content could subsequently be sought, retrieved, and reused (examples of such multimedia content include 3D models, animations, audio, charts, diagrams, figures, graphs, images, infographics, maps, mathematical formulas, photographs, tables, and video).

I hope that the proposal is interesting to the group. I welcome any comments, feedback, and ideas on the proposal's talk page<https://meta.wikimedia.org/wiki/Talk:Wikianswers>, here, or in a new Community Group!


Best regards,
Adam

________________________________
From: Danny Ayers <danny.ayers@gmail.com>
Sent: Tuesday, September 26, 2023 8:59 AM
To: Melvin Carvalho <melvincarvalho@gmail.com>
Cc: Semantic Web <semantic-web@w3.org>; Dan Brickley <danbri@danbri.org>
Subject: Re: SemWeb + LLMs etc, here or a new group?

Hiya Melvin. Yeah, I think you're right about the practical use cases being hard to pin down. But I suspect this is an occasion where the tech appears first and its utility emerges only later, praxis or whatever.

Look at the way folks in the dev community/industry at large are scurrying around looking for applications of GPT they can monetize. I've lost count of the number of '#1 AI Coding Assistant' apps I've seen.

(Incidentally, I heard on the radio about a new book suggesting that Neanderthals may have been a lot more creative than us: few of their stone tools share a pattern, unlike our clunky copied efforts of the same period. Their thinking patterns might have been useful now.)

On the web tech side, I reckon you hit the nail on the head in mentioning the follow-your-nose protocol. As with the web already, that's where the real value comes from, specific formats etc. being at best secondary. I just found out that Claude can gobble fairly sizeable quantities of docs, a 10k context window or somesuch?
That's going to seem like nothing in a few years.

But a value point of using typed links etc. is that they can provide a faster route to more relevant info. So (hands beginning to wave) although you can make the machines smarter with sheer bulk of data, RDF & co offer a leaner, more efficient kind of discovery. A turbo button, if you will.
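
To wave the hands slightly less, something like the following Python sketch with rdflib (the FOAF URI is just an example of a vocabulary whose server happens to serve RDF on dereference):

# Sketch of the "turbo button": instead of crawling bulk text, dereference
# a typed link and go straight to its machine-readable definition.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

prop = URIRef("http://xmlns.com/foaf/0.1/knows")

g = Graph()
g.parse(str(prop))  # HTTP GET, content-negotiated to RDF

# One hop later we have the property's documented meaning as plain text,
# ready to hand to an LLM.
for comment in g.objects(prop, RDFS.comment):
    print(comment)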

From a practical point of view, at this point in time a volume-based approach is almost certainly going to produce good results faster (I am not an analyst, but it also seems to be where most of the funding is being directed: both MS & Amazon appear to be tying their efforts to their existing Big Data/big cloud systems).

But we (broad circular hand gesture) have experience of how the web does/can operate, and have a lot of proven tools (formalizations, specs, all the way down to runnable code) in this domain.
TL;DR - we just need to glue it all together.

Cheers,
Danny.


On Tue, 26 Sept 2023, 04:04 Melvin Carvalho <melvincarvalho@gmail.com> wrote:


On Tue, 26 Sept 2023 at 3:01, Danny Ayers <danny.ayers@gmail.com> wrote:
Something big & new has arrived, but it is at a tangent to regular business, so maybe a new Community Group/list or whatever should be considered. I don't know what folks think about boundaries. Let me tell you my story...

I'm asking because I've been playing with it a bit, from one specific angle: using LlamaIndex to run Graph Retrieval-Augmented Generation against OpenAI's GPT API. Sorry, I don't have links at hand, but the papers on RAG, Graph RAG, and Graph of Thoughts are on arXiv. I have very naive code that runs at:
https://github.com/danja/llama_index/blob/main/docs/examples/graph_stores/graph-rag-sparql-mini.py
(Isn't ego great? Found my own thing immediately).

My immediate conclusions are that conceptually it's a no-brainer to attach such systems to Linked Data (naturally a moderate pain in practice). LLMs expect verbal content, so you give them that. A RAG system has RDF graph URLs as pointers: it looks them up over HTTP, chases the schema definition, pulls in the definition of the property or class, and then it has a sentence comparable to the texts it was trained on. It's very floppy, but I believe potentially useful.
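
In rough Python, the idea looks something like this (a sketch only, not my actual script; the fallback to the URI tail when a vocabulary supplies no label is a simplification of mine):

# Sketch: turn RDF triples into sentences an LLM can digest, chasing each
# term's schema over HTTP for a human-readable label. Assumes rdflib.
from rdflib import Graph, Literal
from rdflib.namespace import RDFS

_labels = {}

def label_for(term) -> str:
    """Dereference a URI and use its rdfs:label; fall back to the URI tail."""
    if isinstance(term, Literal):
        return str(term)
    if term not in _labels:
        lbl = None
        try:
            defs = Graph()
            defs.parse(str(term))  # follow your nose
            lbl = defs.value(term, RDFS.label)
        except Exception:
            pass
        _labels[term] = str(lbl) if lbl else str(term).split("#")[-1].rsplit("/", 1)[-1]
    return _labels[term]

def verbalise(graph_url: str):
    """Yield one plain-text sentence per triple in the graph at graph_url."""
    g = Graph()
    g.parse(graph_url)
    for s, p, o in g:
        # Each sentence is comparable to the text the LLM was trained on.
        yield f"{label_for(s)} {label_for(p)} {label_for(o)}."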

(I spent a long time bugged by this - surely we can just give URIs to the LLM as some kind of first class token? I still haven't a clue, but for now there are easier ways in).

I accidentally came up with a TED Talk-style analogy that might work for the big picture. For something unrelated I typed 'warp start' when I meant 'yarn start'. How I giggled! But yeah, the Web (very strongly including Linked Data, as much OWLishness as you want) is a clear Warp, which the AI bits can fill out with a Weft of information fabric. (Apologies to Tim re. book-naming.)

So yeah, in a rambly way, do you see why I think another group is something to bear in mind? Personally I'm happy either way, as long as the W3C keeps its eye on the ball. Blockchain/Web3, maybe not so much. But LLMs, I'd say, are in scope, here or somewhere parallel.

LLMs are useful.  Perhaps early versions of Enquire were ahead of their time.

However, they can equally use plain text, JSON, 1-5 star linked data, and RDF.

If we were to call RDF 5* linked data, in what use cases would that give you an advantage over, say, 1-4.5* linked data?

Perhaps when full follow-your-nose capabilities are added it may yield some interesting results.

But I have yet to figure out use cases for RDF that *significantly* outperform text analysis, or a website with schema.org sprinkles.


Cheers,
Danny.

--
----

https://hyperdata.it/danja

Received on Tuesday, 26 September 2023 20:56:58 UTC