Re: Semantics and Embedding Vectors

Hi, Adam and Carlos -

Here's my 2c

Carlos - this is a good point. However, it may be that some specific 
embedding sets (e.g., the set produced by T5 from Wikipedia as of Jan 1, 
2020) could be selected as standards and represented by objects in the 
ontology. Using such an ontology would be the AI equivalent of 
consulting a particular human expert. Human experts give different 
answers according to their individual knowledge and experience; the 
standard embeddings would give different answers depending on the models 
used to generate them.
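
To make this concrete, here is a rough sketch of how one such standard 
embedding set might be described as an ontology object. The "emb:" 
vocabulary and every class and property name below are hypothetical, 
purely to illustrate the shape of the description:

    # A minimal sketch using rdflib; the emb: vocabulary is invented
    # for illustration and is not a standardized ontology.
    from rdflib import RDF, Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    EMB = Namespace("http://example.org/embedding-ontology#")  # hypothetical

    g = Graph()
    g.bind("emb", EMB)

    # The standard set: T5 embeddings over Wikipedia as of Jan 1, 2020.
    std = URIRef("http://example.org/embedding-sets/t5-wikipedia-2020-01-01")
    g.add((std, RDF.type, EMB.StandardEmbeddingSet))
    g.add((std, EMB.model, Literal("T5")))
    g.add((std, EMB.corpus, Literal("Wikipedia")))
    g.add((std, EMB.corpusSnapshot, Literal("2020-01-01", datatype=XSD.date)))

    print(g.serialize(format="turtle"))

A client that knows which standard set a vector came from can then 
interpret it the way it would interpret the answer of that particular 
expert.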

Adam - imho the machine-usable lexicons should include the embeddings, 
and clients should search them using nearest-neighbor algorithms to find 
the embeddings closest to the ones they have extracted (from research 
papers, for example) using the same models that were used to produce the 
standard embedding sets.
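
As a toy example of that client-side lookup (the sense IDs and vectors 
below are made up, and a real client would use a proper vector index 
rather than a linear scan):

    # A minimal nearest-neighbor lookup over a lexicon's embeddings,
    # using cosine similarity; all data is invented for illustration.
    import numpy as np

    lexicon = {  # sense ID -> standard embedding (toy 4-d vectors)
        "bank%1:14:00": np.array([0.1, 0.9, 0.0, 0.2]),
        "bank%1:17:01": np.array([0.8, 0.1, 0.3, 0.0]),
    }

    def nearest_senses(query, k=1):
        """Return the k sense IDs most cosine-similar to the query."""
        q = query / np.linalg.norm(query)
        scored = [
            (sense, float(vec @ q / np.linalg.norm(vec)))
            for sense, vec in lexicon.items()
        ]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

    # e.g., a vector extracted from a research paper with the same model
    query = np.array([0.75, 0.15, 0.25, 0.05])
    print(nearest_senses(query, k=2))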

Carlos Bobed wrote on 10/10/2022 10:19:
>
> Hi Adam,
>
> On 09/10/2022 at 9:07, Adam Sobieski wrote:
>>
>> Semantic Web Interest Group,
>>
>> Embedding vectors can represent many things: words [1], sentences 
>> [2], paragraphs, documents, percepts, concepts, multimedia data, 
>> users, and so forth.
>>
>> A few months ago, I started a discussion on GitHub about formal 
>> ontologies for describing these vectors and their models [3]. There, 
>> I also indicated that MIME types for these vectors could be created, 
>> e.g., “embedding/gpt-3” or “vector/gpt-3”.
>>
>> For discussion and brainstorming, I would like to share some ideas 
>> with the group.
>>
>> Firstly, we can envision machine-utilizable lexicons which, for each 
>> sense of each lexeme, include, refer to, or hyperlink to embedding 
>> vectors.
>>
> My two cents: it can be extremely tricky. Embedding vectors by 
> themselves are meaningless if you don't provide all the information 
> about their source: the model, the training dataset, and the task it 
> was trained for (and perhaps the tasks it was fine-tuned for). First 
> of all, I think we should define more precisely what information is to 
> be shared.
>
> I can see the point if you are aiming at providing entry points into 
> fixed embedding spaces (for example, "this sense S is represented in 
> Nasari as this vector x") so as to align external elements using a 
> well-known, shared model as an anchor. If the goal is to share the 
> embeddings by themselves... even with the model and the dataset fixed, 
> different training runs can yield different resulting spaces.
>
> BTW, the above holds for static vectors; if you have 
> dynamic/contextual ones (e.g., BERT-* ones), then the entry point 
> would be somewhat meaningless, as the vector will change depending on 
> the accompanying context.
>
> Best,
>
> Carlos
>
>

-- 
Regards,
Chris
++++

Chris Harding
Chief Executive, Lacibus Ltd <http://www.lacibus.com>
