- From: Christian Chiarcos <christian.chiarcos@web.de>
- Date: Mon, 10 Oct 2022 12:59:32 +0200
- To: Adam Sobieski <adamsobieski@hotmail.com>
- Cc: "semantic-web@w3.org" <semantic-web@w3.org>
- Message-ID: <CAC1YGdjno=h0Bz3nFjsUo5t8LZmOkNzadQQQ+9xQ65H5SkfzCA@mail.gmail.com>
Dear Adam, dear all,

In the context of the W3C Ontology-Lexica Community Group, we are currently working on the development of an OntoLex module for "Frequency, Attestation and Corpus-Based Information" (OntoLex-FrAC). OntoLex is a vocabulary for publishing lexical data and/or lexical annotations of an existing ontology in RDF and as Linked Data, and it has become widely used in digital lexicography.

One aspect of corpus-based information is embeddings (ontolex:Forms ~ word embeddings, ontolex:LexicalEntries ~ lemma embeddings, ontolex:Senses ~ sense embeddings, ontolex:LexicalConcepts ~ embeddings for SKOS concepts, frac:Attestations ~ contextualized [word/sense/...] embeddings à la BERT, etc.); you might want to take a look at https://aclanthology.org/2021.semdeep-1.3.pdf. In case you plan to attend COLING 2022, we'll present an overview of FrAC there this week.

FrAC is close to finalization and is planned to be published as a W3C community report and as a companion vocabulary to OntoLex. It is designed to be broadly applicable and slim, so the data structures it provides for embeddings are relatively generic. For embeddings, these are:

property frac:embedding
- domain: frac:Observable (anything that a corpus observation can be made about; can be any URI)
- range: frac:Embedding

class frac:Embedding, esp. its subclass frac:FixedSizeVector
- property rdf:value (literal representation of the embedding, e.g., as a string or JSON array)
- property dc:description (*human*-readable metadata about how the embedding was created)
- property frac:corpus (URI of the underlying corpus/source data)
- property dc:extent (for frac:FixedSizeVector: length of the vectors)

class frac:Attestation: for any observable, an attestation is (a pointer to) an example drawn from a frac:Corpus
- property frac:corpus (URI of the underlying corpus/source data)
- property frac:locus (URI of the exact occurrence in the corpus; can be modelled with Web Annotation or the NLP Interchange Format)
- property rdf:value (string value of the context for the observable)
- property frac:attestationEmbedding (contextualized embedding of the observable *in the context represented by the attestation*; can be used for BERT-style embeddings)

property frac:attestation: assigns an observable one (or multiple) contexts in which it occurs and/or information about these contexts
- domain: frac:Observable
- range: frac:Attestation

The current model lacks machine-readable metadata about embeddings, both because we felt this would be beyond scope (OntoLex is about lexical data and ontologies, not about machine learning) and because I remember some initial efforts in this direction presented at ESWC(?) a few years back. However, if any such vocabulary or typology emerges, it could easily be integrated with FrAC resources, and I'd be very interested to learn about it.

As far as metadata is concerned, the direct annotation of *every* embedding with metadata about the underlying corpus, etc. would be an extremely verbose encoding. The recommended practice is therefore to create a custom subclass of frac:Embedding with static values (OWL2 value axioms) for dc:extent, frac:corpus, etc. (say, my:Embedding rdfs:subClassOf frac:FixedSizeVector, owl:Restriction [ ... ]), and then not to repeat this information with every embedding, but instead to provide just rdf:value (the vector) and rdf:type (my:Embedding). The same holds for attestations (and all other frac:Observations).

FrAC is not tied to one particular use case; we aim to cover (and have modelled) a broad bandwidth of possible use cases. With respect to embeddings, this includes sharing and publishing concept embeddings (bundled together with the defining knowledge graph, in our case a specific WordNet edition) and modelling the responses and inputs of web services that consume or provide embeddings, whether contextualized or non-contextualized. One aspect of this genericity is that frac:Embedding also includes things other than "NLP embeddings", e.g., traditional (weighted) bags of words that model the usage context of words (roughly comparable and similar in function to embeddings, but sparse and not size-limited; more like a hashtable/dict than a vector) and time-series information (a potentially infinite sequence of fixed-size vectors; this could be the input layer of an ANN, but also a sequence of sensor readings).

FrAC is discussed in a series of bi-weekly telcos in the W3C Ontology-Lexica Community Group (https://www.w3.org/community/ontolex/wiki/Frequency,_Attestation_and_Corpus_Information); please feel free to reach out to me personally or to join the group if you're interested.

Best regards,
Christian

On Sun, 9 Oct 2022 at 09:13, Adam Sobieski <adamsobieski@hotmail.com> wrote:

> Semantic Web Interest Group,
>
> Embedding vectors can represent many things: words [1], sentences [2],
> paragraphs, documents, percepts, concepts, multimedia data, users, and so
> forth.
>
> A few months ago, I started a discussion on GitHub about formal ontologies
> for describing these vectors and their models [3]. There, I also indicated
> that MIME types for these vectors could be created, e.g., “embedding/gpt-3”
> or “vector/gpt-3”.
> For discussion and brainstorming, I would like to share some ideas with
> the group.
>
> Firstly, we can envision machine-utilizable lexicons which, for each sense
> of each lexeme, include, refer to, or hyperlink to embedding vectors.
>
> Secondly, we can envision that metadata for scholarly and scientific
> publications might, one day, include sets of embedding vectors, e.g., each
> representing a topic or a category from a scholarly or scientific domain,
> or that these publications might include sets of URIs or text strings from
> controlled vocabularies, each URI or term related elsewhere to embedding
> vectors.
>
> Is there any interest, here, in formal ontologies which describe embedding
> vectors and their models? Do any such ontologies already exist? Any
> thoughts on these topics?
>
> Best regards,
>
> Adam Sobieski
>
> [1] https://en.wikipedia.org/wiki/Word_embedding
> [2] https://en.wikipedia.org/wiki/Sentence_embedding
> [3] https://github.com/onnx/onnx/discussions/4318
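For concreteness, the FrAC structures described above could be instantiated roughly as in the following Turtle sketch. This is unofficial and illustrative only: the frac: namespace URI is an assumption (check the published FrAC draft), and all my:/example.org names, corpus URIs, and vector values are invented.

```turtle
# Illustrative sketch only; the frac: namespace and all my:/example.org
# names are assumptions, not normative FrAC identifiers.
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dc:      <http://purl.org/dc/terms/> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix frac:    <http://www.w3.org/ns/lemon/frac#> .      # assumed namespace
@prefix my:      <http://example.org/my#> .

# Custom subclass carrying static metadata (OWL2 value axioms), so that
# individual embeddings need only rdf:type and rdf:value:
my:Embedding rdfs:subClassOf frac:FixedSizeVector ,
    [ a owl:Restriction ;
      owl:onProperty frac:corpus ;
      owl:hasValue <http://example.org/corpus/example-corpus> ] ,
    [ a owl:Restriction ;
      owl:onProperty dc:extent ;
      owl:hasValue 300 ] ;
  dc:description "300-dimensional embeddings trained on an example corpus" .

# A lexical entry with a compact (lemma) embedding and one attestation:
my:entry a ontolex:LexicalEntry ;
  frac:embedding [ a my:Embedding ;
      rdf:value "[0.123, -0.456, 0.789]" ] ;               # toy vector
  frac:attestation [ a frac:Attestation ;
      frac:corpus <http://example.org/corpus/example-corpus> ;
      frac:locus <http://example.org/corpus/example-corpus/doc42#char=100,110> ;
      rdf:value "an example sentence containing the entry" ;
      frac:attestationEmbedding [ a my:Embedding ;         # contextualized,
          rdf:value "[0.321, -0.654, 0.987]" ] ] .         # BERT-style
```

Note how the embedding instances repeat neither frac:corpus nor dc:extent; that metadata is recovered through the rdf:type my:Embedding, as recommended above.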
Received on Monday, 10 October 2022 10:59:57 UTC