Re: Semantics and Embedding Vectors from Christian Chiarcos on 2022-10-10 (semantic-web@w3.org from October 2022)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Mon, 10 Oct 2022 12:59:32 +0200
To: Adam Sobieski <adamsobieski@hotmail.com>
Cc: "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <CAC1YGdjno=h0Bz3nFjsUo5t8LZmOkNzadQQQ+9xQ65H5SkfzCA@mail.gmail.com>
Dear Adam, dear all,

in the context of the W3C CG Ontology-Lexica, we're currently working on
the development of an OntoLex module for "Frequency, Attestation and
Corpus-Based Information" (OntoLex-FrAC). OntoLex is a vocabulary to
publish lexical data and/or lexical annotations of an existing ontology in
RDF and as Linked Data, and has become widely used in digital lexicography.

One aspect of corpus-based information is embeddings (for ontolex:Forms ~
word embeddings, ontolex:LexicalEntries ~ lemma embeddings, ontolex:Sense =
sense embeddings, ontolex:LexicalConcepts ~ embeddings for SKOS concepts,
for frac:Attestations ~ contextualized [word/sense/...] embeddings à la
BERT, etc.), and you might want to take a look at
https://aclanthology.org/2021.semdeep-1.3.pdf. In case you plan to attend
COLING 2022, we'll present an overview of FrAC there this week.

FrAC is close to finalization and planned to be published as a W3C
community report and as a companion vocabulary to OntoLex. It is designed
to be broadly applicable and slim, so the data structures it provides for
embeddings are relatively generic. For embeddings, these are:

property frac:embedding:
- domain: frac:Observable (anything about which an observation can be made
in a corpus); can be any URI
- range: frac:Embedding

class frac:Embedding, esp. sub-class frac:FixedSizeVector
- property rdf:value (literal representation of the embedding, e.g., as
String or JSON array)
- property dc:description (*human*-readable metadata about how the
embedding was created)
- property frac:corpus (URI of the underlying corpus/source data)
- property dc:extent (for frac:FixedSizeVector: length of the vector)
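
As a rough illustration of the properties listed above, the following Python
sketch renders a single fixed-size vector as Turtle, using a JSON array as the
literal. The namespace URIs (including binding dc: to DCMI Terms), the ex:
prefix, the corpus URI, and the helper name embedding_to_turtle are
assumptions made for this example and should be checked against the FrAC
draft.

```python
import json

# Assumed prefix declarations; verify the namespace URIs against the FrAC draft.
PREFIXES = """\
@prefix frac: <http://www.w3.org/ns/lemon/frac#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/> .
"""


def embedding_to_turtle(observable, vector, corpus, description):
    """Render one frac:embedding statement (hypothetical helper)."""
    # Inner dumps: vector -> JSON array; outer dumps: quote it as a string literal.
    value = json.dumps(json.dumps(vector))
    lines = [
        f"{observable} frac:embedding [",
        "    a frac:FixedSizeVector ;",
        f"    rdf:value {value} ;",
        f"    dc:extent {len(vector)} ;",
        f"    frac:corpus <{corpus}> ;",
        f"    dc:description {json.dumps(description)}",
        "] .",
    ]
    return "\n".join(lines)


turtle = PREFIXES + "\n" + embedding_to_turtle(
    "ex:cat_form",                      # hypothetical observable
    [0.1, -0.2, 0.3],                   # toy vector
    "http://example.org/corpus/demo",   # hypothetical corpus URI
    "toy 3-dimensional vector, for illustration only",
)
print(turtle)
```

Here dc:extent is derived from the vector itself, so the stated length cannot
drift from the serialized value.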

class frac:Attestation: for any observable, an attestation is (a pointer
to) an example drawn from a frac:Corpus
- property frac:corpus (URI of the underlying corpus/source data)
- property frac:locus (URI for the exact occurrence in the corpus, can be
modelled with Web Annotation or the NLP Interchange Format)
- property rdf:value (string value of the context for the observable)
- property frac:attestationEmbedding (contextualized embedding of the
observable *in the context represented by the attestation*, can be used for
BERT-style embeddings)

property frac:attestation: assigns to an observable one or more contexts
in which it occurs and/or information about these contexts.
- domain: frac:Observable
- range: frac:Attestation
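
To make the attestation side more concrete, here is a minimal Python sketch
that attaches two contexts, each with its own contextualized vector, to the
same observable. The Attestation dataclass, the helper name, the example URIs
and character offsets, and typing the nested node as frac:FixedSizeVector are
assumptions for illustration; prefix declarations are omitted.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Attestation:
    corpus: str          # frac:corpus, URI of the source corpus
    locus: str           # frac:locus, URI of the exact occurrence
    context: str         # rdf:value, string value of the context
    vector: List[float]  # frac:attestationEmbedding, contextualized vector


def attestation_to_turtle(observable: str, att: Attestation) -> str:
    """Render one frac:attestation statement (hypothetical helper)."""
    vec = json.dumps(json.dumps(att.vector))  # JSON array as a quoted literal
    return "\n".join([
        f"{observable} frac:attestation [",
        "    a frac:Attestation ;",
        f"    frac:corpus <{att.corpus}> ;",
        f"    frac:locus <{att.locus}> ;",
        f"    rdf:value {json.dumps(att.context)} ;",
        "    frac:attestationEmbedding [",
        "        a frac:FixedSizeVector ;",  # assumed type of the nested node
        f"        rdf:value {vec} ;",
        f"        dc:extent {len(att.vector)}",
        "    ]",
        "] .",
    ])


CORPUS = "http://example.org/corpus/demo"  # hypothetical corpus URI
attestations = [
    Attestation(CORPUS, CORPUS + "#char=0,26", "She sat on the river bank.", [0.9, 0.1]),
    Attestation(CORPUS, CORPUS + "#char=27,52", "The bank raised its fees.", [0.1, 0.8]),
]
snippets = [attestation_to_turtle("ex:bank_entry", a) for a in attestations]
print("\n\n".join(snippets))
```

The same lexical entry thus receives a different vector per context, which is
the point of keeping contextualized embeddings on the attestation rather than
on the observable.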

The current model lacks machine-readable metadata about embeddings because
we felt this would be beyond scope (OntoLex is about lexical data and
ontologies, not about machine learning), and because I remember there were
some initial efforts in this direction presented at an ESWC(?) a few years
back. However, if any such vocabulary or typology emerges, it could be
easily integrated with FrAC resources, and I'd be very interested to learn
about it.

As far as metadata is concerned, the direct annotation of *every* embedding
with metadata about the underlying corpus, etc. seems like an extremely verbose
encoding. Thus, the recommended practice is to create a custom subclass of
frac:Embedding with static values (OWL2 value axioms) for dc:extent,
frac:corpus, etc. (say, my:Embedding rdfs:subClassOf frac:FixedSizeVector,
owl:Restriction [ ... ]), and then not to repeat this information with
every embedding, but instead just to provide rdf:value (the vector) and
rdf:type (my:Embedding). Similarly for attestations (and all other
frac:Observations).
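
The subclass pattern could look roughly as follows. This Python sketch emits
one class definition with owl:hasValue restrictions (one possible reading of
"value axioms") and then slim instances that carry only rdf:type and
rdf:value. The helper names, the example URIs and the description string are
assumptions, and prefix declarations are omitted.

```python
import json


def class_definition(cls, corpus, extent, description):
    """Declare a subclass whose members all share the same metadata (sketch)."""
    def has_value(prop, value):
        return f"[ a owl:Restriction ; owl:onProperty {prop} ; owl:hasValue {value} ]"

    parts = [
        "frac:FixedSizeVector",
        has_value("frac:corpus", f"<{corpus}>"),
        has_value("dc:extent", str(extent)),
        has_value("dc:description", json.dumps(description)),
    ]
    return f"{cls} rdfs:subClassOf\n    " + " ,\n    ".join(parts) + " ."


def slim_embedding(observable, cls, vector):
    """State only the type and the vector; metadata comes from the class."""
    value = json.dumps(json.dumps(vector))  # JSON array as a quoted literal
    return f"{observable} frac:embedding [ a {cls} ; rdf:value {value} ] ."


definition = class_definition(
    "my:Embedding",
    "http://example.org/corpus/demo",  # hypothetical corpus URI
    3,
    "toy 3-dimensional vectors, for illustration only",
)
instances = [
    slim_embedding("ex:cat_form", "my:Embedding", [0.1, -0.2, 0.3]),
    slim_embedding("ex:dog_form", "my:Embedding", [0.0, 0.4, -0.1]),
]
print(definition)
print("\n".join(instances))
```

With this layout the corpus, extent and description are stated once per class,
and each additional embedding costs a single short statement.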

FrAC is not tied to one particular use case; we aim to cover (and have
modelled) a broad range of possible use cases. With respect to
embeddings, this includes sharing and publishing concept embeddings (in a
bundle together with the defining knowledge graph, in our case, a specific
WordNet edition) and the modelling of responses and inputs of web services
that consume or provide embeddings, either contextualized or
non-contextualized. One aspect of its genericity is that frac:Embedding
also includes things other than "NLP embeddings", i.e., traditional
(weighted) bags of words that model the usage context of words, etc. (roughly
comparable and similar in function to embeddings, but sparse and not
size-limited; more like a Hashtable/Dict than a vector) and time series
information (a potentially infinite-size sequence of fixed-size vectors,
which could be the input layer of an ANN but also a sequence of sensor
readings).
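
The difference between these three shapes can be sketched in plain Python: a
dense fixed-size vector as a list, a weighted bag of words as a dict, and a
time series as a sequence of equally sized vectors. The function names and toy
values below are assumptions for illustration and are not part of FrAC.

```python
import math
from typing import Dict, List


def cosine_dense(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two fixed-size vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("fixed-size vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cosine_sparse(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two weighted bags of words (sparse, no fixed size)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def is_time_series(frames: List[List[float]], size: int) -> bool:
    """A time series here is a sequence whose frames all have the same length."""
    return all(len(frame) == size for frame in frames)


dense_sim = cosine_dense([1.0, 0.0], [1.0, 1.0])
sparse_sim = cosine_sparse(
    {"river": 2.0, "water": 1.0},  # context words of one usage
    {"river": 1.0, "money": 3.0},  # context words of another usage
)
readings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # e.g. sensor readings
print(dense_sim, sparse_sim, is_time_series(readings, 2))
```

Both similarity functions play the same role, which is why a bag of words can
stand in for an embedding even though its keys are open-ended.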

FrAC is discussed in a series of bi-weekly telcos in the W3C CG
Ontology-Lexica (
https://www.w3.org/community/ontolex/wiki/Frequency,_Attestation_and_Corpus_Information).
Please feel free to reach out to me personally or to join the group if
you're interested.

Best regards,
Christian

On Sun, 9 Oct 2022 at 09:13, Adam Sobieski <
adamsobieski@hotmail.com> wrote:

> Semantic Web Interest Group,
>
> Embedding vectors can represent many things: words [1], sentences [2],
> paragraphs, documents, percepts, concepts, multimedia data, users, and so
> forth.
>
> A few months ago, I started a discussion on GitHub about formal ontologies
> for describing these vectors and their models [3]. There, I also indicated
> that MIME types for these vectors could be created, e.g., “embedding/gpt-3”
> or “vector/gpt-3”.
>
> For discussion and brainstorming, I would like to share some ideas with
> the group.
>
> Firstly, we can envision machine-utilizable lexicons which, for each sense
> of each lexeme, include, refer to, or hyperlink to embedding vectors.
>
> Secondly, we can envision that metadata for scholarly and scientific
> publications might, one day, include sets of embedding vectors, e.g., each
> representing a topic or a category from a scholarly or scientific domain,
> or that these publications might include sets of URIs or text strings from
> controlled vocabularies, each URI or term related elsewhere to embedding
> vectors.
>
> Is there any interest, here, in formal ontologies which describe embedding
> vectors and their models? Do any such ontologies already exist? Any
> thoughts on these topics?
>
> Best regards,
>
> Adam Sobieski
>
> [1] https://en.wikipedia.org/wiki/Word_embedding
> [2] https://en.wikipedia.org/wiki/Sentence_embedding
> [3] https://github.com/onnx/onnx/discussions/4318
>
Received on Monday, 10 October 2022 10:59:57 UTC