
TermExtractor is online! (It's a tool for terminology extraction from many types of files: doc, pdf, xls, ppt, xml, html, chm, etc.)

From: Francesco Sclano <francesco_sclano@yahoo.it>
Date: Sat, 21 Oct 2006 23:29:03 +0200 (CEST)
Message-ID: <20061021212903.82098.qmail@web86805.mail.ukl.yahoo.com>
To: semantic-web@w3.org

TermExtractor, my master's thesis, is online at
http://lcl2.di.uniroma1.it

TermExtractor is a software package for the automatic
building, validation and maintenance of glossaries in
English.

TermExtractor extracts the terminology that is
consensually referred to in a specific application
domain. The package takes as input a corpus of domain
documents, parses the documents, and extracts a list of
"syntactically plausible" terms (e.g. compounds,
adjective-noun pairs, etc.). Document parsing assigns
greater importance to terms with distinctive text
layout (title, bold, italic, underlined, etc.). Two
entropy-based measures, called Domain Relevance and
Domain Consensus, are then used. Domain Consensus
selects only the terms that are consensually referred
to throughout the corpus documents. Domain Relevance
selects only the terms that are relevant to the domain
of interest; it is computed with reference to a set of
contrastive terminologies from different domains.
Finally, the extracted terms are further filtered using
Lexical Cohesion, which measures the degree of
association of all the words in a terminological
string. Accepted file formats are: txt, pdf, ps, dvi,
tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
also zip archives.
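
To make the three filters more concrete, here is a minimal
Python sketch of how such scores might be computed. The
message above gives no formulas, so everything here is an
assumption: Domain Relevance is sketched as a normalized
frequency ratio against the contrastive domains, Domain
Consensus as the entropy of a term's distribution over the
corpus documents, and Lexical Cohesion as a frequency-based
association score. The function names and input layouts are
hypothetical and are not TermExtractor's actual interface.

```python
import math


def domain_relevance(term, domain_freqs, target):
    """Assumed form: relative frequency of the term in the target domain,
    divided by its highest relative frequency over all domains.

    domain_freqs maps a domain name to a {term: count} dictionary.
    """
    def rel_freq(domain):
        counts = domain_freqs[domain]
        total = sum(counts.values())
        return counts.get(term, 0) / total if total else 0.0

    best = max(rel_freq(d) for d in domain_freqs)
    return rel_freq(target) / best if best else 0.0


def domain_consensus(term, doc_freqs):
    """Assumed form: entropy (in bits) of the term's distribution over the
    documents of the domain corpus. Terms spread evenly score high.

    doc_freqs is a list with one {term: count} dictionary per document.
    """
    counts = [doc.get(term, 0) for doc in doc_freqs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def lexical_cohesion(term, term_freq, word_freqs):
    """Assumed form: n * tf(term) * log10(tf(term)) / sum of tf(word),
    where n is the number of words in the term.

    word_freqs maps each single word to its count in the corpus.
    """
    words = term.split()
    denom = sum(word_freqs.get(w, 0) for w in words)
    if term_freq <= 0 or denom == 0:
        return 0.0
    return len(words) * term_freq * math.log10(term_freq) / denom
```

A candidate term would then be kept only if each score (or
some weighted combination of them) exceeds a threshold; the
choice of thresholds and weights is left open here.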

Francesco Sclano
home page: http://lcl2.di.uniroma1.it/~sclano
msn:       francesco_sclano@yahoo.it
skype:     francesco978

Received on Sunday, 22 October 2006 15:59:23 UTC
