issue clustering / issue tracking -- status report from Dailey, David P. on 2007-05-15 (www-archive@w3.org from May 2007)

From: Dailey, David P. <david.dailey@sru.edu>
Date: Mon, 14 May 2007 21:04:28 -0400
To: <www-archive@w3.org>, <connolly@w3.org>, <cwilso@microsoft.com>
Cc: <ian@hixie.ch>, <hyatt@apple.com>, <chasen@chasenlehara.com>, <doug.schepers@vectoreal.com>, <bhopgood@brookes.ac.uk>, <mjs@apple.com>
Message-ID: <1835D662B263BC4E864A7CFAB2FEEB3D258C0A@msfexch01.srunet.sruad.edu>

Dear group*

If the editors and chairs would find it to be of use, I think issue tracking could be done using two stages:

1. Clustering: an approach using topic clustering (or content analysis) could help to sort through numerous allied issues as represented by email messages from possibly divergent subject threads.

2. Tracking: a more traditional approach to issue tracking such as offered by Bugzilla or Trac. (see previous messages**)

Stage 1 could prove useful for sorting through and organizing large amounts of textual data, as a sort of pre-processing for stage 2.
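
As a rough illustration of what stage 1 might look like, here is a minimal sketch in Python. It is not the method of any particular clustering product; it assumes a plain bag-of-words representation, cosine similarity, and a single greedy pass. The names cluster_messages and threshold, and the default value 0.3, are hypothetical choices made for this example only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_messages(messages, threshold=0.3):
    """Greedy single-pass clustering (an assumption of this sketch).

    Each message joins the most similar existing cluster if the
    similarity reaches the threshold; otherwise it starts a new one.
    Returns a list of clusters, each a list of message indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [int]}
    for i, text in enumerate(messages):
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # accumulate word counts
            best["members"].append(i)
    return [c["members"] for c in clusters]
```

Each resulting cluster of message indices could then be reviewed by hand and, where it holds up, entered as a single issue in stage 2.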

The most promising approaches, in order of promise:

1. Vivisimo/Clusty web search and text clustering engine. They have developed a clustering system for analysis of email pertaining to the Enron affair, and have provided us a view of their approach at
http://vivisimo.com/w3c-email . I do think it could save the editors (and the chairs) some considerable effort.

2. As I mentioned earlier [5] (quoting from Robert Kosara): "The InfoViz group <http://infoviz.pnl.gov/> at PNNL/NVAC has developed a tool called IN-SPIRE <http://in-spire.pnl.gov/> for visually analyzing large text corpora. It takes some training, but is very powerful. For emails, you would probably want to do some pre-processing to get rid of quoted text, signatures, and such, as that would fool the similarity metrics.... The cool thing about IN-SPIRE is that it's actually quite dumb, but that means you have a chance to know why it put certain documents close to each other."

3. Yoshikoder - http://www.yoshikoder.org/
The Yoshikoder is a cross-platform multilingual content analysis program developed as part of the Identity Project at Harvard's Weatherhead Center for International Affairs.
"The Yoshikoder is licensed under the Gnu Public License. This means you can do essentially anything you like with the software, except sell it."

4. Cypher - http://www.monrai.com/products/cypher
Cypher - "Alpha Release Natural Language Processing for the Semantic Web.
Free license for non-commercial use only (includes personal use, academic use, product testing and internal development); includes Hello World example dataset; open architecture exposed through morphology, phrase grammar, lexicon, framenet, and the W3C recommended RDF(S) specifications."

Others

5. S-EM (Spy-EM), a text classification system that learns from positive and unlabeled examples. Bayesian. Free: http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html Primarily Windows; a Unix version may be available.

6. The Semantic Indexing Project, offering open source tools, including Semantic Engine - a standalone indexer/search application. Mac only, with Windows and Linux versions on the way. http://www.semantic-web.at/10.11.413.link.the-semantic-indexing-project.htm -- the project's links seem to be unavailable (5-20-2007)
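
The pre-processing mentioned in the IN-SPIRE quote above (getting rid of quoted text and signatures before measuring similarity) might be sketched as follows in Python. This is only an illustration, not what IN-SPIRE or any other tool listed here does. It assumes the common conventions that quoted lines begin with ">", that reply attributions look like "On ... wrote:", and that a signature begins at a line containing only "--" (optionally followed by a space). The function name strip_quotes_and_signature is hypothetical.

```python
import re

# Assumed conventions (not taken from any specific mail client):
QUOTE_LINE = re.compile(r"^\s*>")                # "> quoted text"
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")   # "On <date>, <name> wrote:"


def strip_quotes_and_signature(body):
    """Return the message body without quoted lines or the signature."""
    kept = []
    for line in body.splitlines():
        # Everything from the signature delimiter onward is dropped.
        if line.strip() == "--":
            break
        # Skip quoted lines and the attribution line that introduces them.
        if QUOTE_LINE.match(line) or ATTRIBUTION.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()
```

Real mail varies a great deal (top-posting, HTML mail, forwarded headers), so the result would need spot checks before being handed to any similarity metric.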

David

References: (Most of the above references are not reviewed, but the following provides a nice overview of methodologies nevertheless):
Software for Content Analysis - A Review, by Will Lowe: 21 programs compared in detail, grouped into dictionary-based analysis, development environments, and annotation aids (PDF file, 18 pages).

* In addition to those previously included I have also added Maciej, consistent with my recollection of his interest expressed recently on IRC.

** In inverse chronological order we have:

6. http://lists.w3.org/Archives/Public/www-archive/2007May/0046.html
5. http://lists.w3.org/Archives/Public/www-archive/2007May/0028.html
4. http://lists.w3.org/Archives/Public/www-archive/2007May/0011.html
3. http://lists.w3.org/Archives/Public/www-archive/2007Apr/0075.html
2. http://lists.w3.org/Archives/Public/public-html/2007Apr/1401.html
1. http://lists.w3.org/Archives/Public/public-html/2007Apr/1389.html

Received on Tuesday, 15 May 2007 01:04:47 UTC