Re: FW: Semantic Suggestions please ...

Neil,
I'm a bit surprised that no one has mentioned Aduna AutoFocus.
It is not quite everything you are looking for but I think it goes a long
way.
From looking at it over the past couple of weeks, I understand the
following, which I hope is relevant.
It is open source.
You can modify it to better suit your requirements.
AutoFocus comes as a desktop version and a server (in the open source
version the server is deployed as two apps, one to administer and one just
to serve the repository).

AutoFocus desktop is really a document search tool using indexing and
faceted search. It therefore concentrates on document retrieval, where the
search over the document store is refined step by step as candidate
documents are discovered.
Documents can be web pages.

Aside from the way that different types of document are recognised by the
system, the central feature of AutoFocus is the repository, where the data is
held in RDF in combination with a Lucene index.

This repository can be built by one component and consumed by another,
e.g. built by the desktop application and then referenced by the server
application, or vice versa.

Suggested uses seem to be to
1. Just use the generated RDF
1.1 Having generated it there would be work to do on where it appears.
2. Use the server application and make it available to users.
3. Both?

I am not sure if this addresses your question, or whether the question is
more of a challenge to this group than an actual need, but these are my
thoughts. It does seem to me to fall in with the suggestion of using Calais,
and I think it is a better solution: I believe Calais is more about
identifying named entities, whereas I think you simply want to convert
existing data that already contains all or most of the descriptions you need
into RDF so that it can be consumed more easily.
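
To make that last point concrete, here is a minimal sketch in Python of
turning the people index quoted below into RDF (Turtle) with FOAF. It is not
how AutoFocus or Calais work, just an illustration of converting data you
already have. The link markup and base URL are taken from Dan's message; the
function names, the "#person" fragment used to mint an identifier, and the
choice of foaf:name and foaf:page are my own assumptions.

```python
import re
from urllib.parse import urljoin

# Base URL of the people index, as quoted in the thread.
INDEX_URL = "http://www.oilit.com/2journal/2index/2peo.htm"

# Matches index entries such as: <BR><A HREF = "2peo/21.htm">Aamodt, Finn</A>
ENTRY = re.compile(r'<A\s+HREF\s*=\s*"([^"]+)"\s*>([^<]+)</A>', re.IGNORECASE)


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def people_to_turtle(html, base_url=INDEX_URL):
    """Turn 'Family, Given' index links into FOAF person descriptions."""
    lines = ["@prefix foaf: <http://xmlns.com/foaf/0.1/> .", ""]
    for href, label in ENTRY.findall(html):
        page = urljoin(base_url, href)  # resolve the relative link
        family, _, given = (part.strip() for part in label.partition(","))
        name = f"{given} {family}".strip()
        # Assumption: mint a person identifier from the page URL plus "#person".
        lines.append(f"<{page}#person> a foaf:Person ;")
        lines.append(f"    foaf:name {turtle_literal(name)} ;")
        lines.append(f"    foaf:page <{page}> .")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        '<BR><A HREF = "2peo/21.htm">Aamodt, Finn</A>\n'
        '<BR><A HREF = "2peo/210.htm">Abou-Sayed, Ahmed</A>\n'
    )
    print(people_to_turtle(sample))
```

The same approach should carry over to the company index, with a different
class in place of foaf:Person. Since the indexes are generated from your
Access database anyway, emitting the RDF directly from there would avoid
scraping your own HTML.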

Best,

Adam Saltiel

2008/10/2 Neil McNaughton <neil@oilit.com>

>  Sorry – should have copied this to the group...
>
>
>
> Dan,
>
>
>
> *Subject: Re: Semantic Suggestions please ...*
>
> * *
>
> *Can you say a bit more about what structures you do have behind the *
>
> *scene? Are there perhaps subsets of an SQL database that could be *
>
> *shared? How is the site built / maintained?*
>
>
>
> Site is rebuilt monthly with a lot of clunky VB code generating PHP, HTML -
> no database just index files - but you already seem to have discovered
> that...
>
>
>
> *Looking eg at http://www.oilit.com/2journal/2index/2peo.htm*
>
> *I see first of all,*
>
> *<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">*
>
> *... which suggests you don't want machines to do anything with this data.
> *
>
>
>
> No not really. This is just so that things don't get indexed multiple times
> as all the index files (by company, by person and by calendar month) all
> point to the article files.
>
>
>
> *Then each author/person (are persons topics too, or just authors?) gets *
>
> *a link,*
>
> * *
>
> *<p><b>Select item</b>*
>
> *<BR><A HREF = "2peo/21.htm">Aamodt, Finn</A>*
>
> * *
>
> *<BR><A HREF = "2peo/22.htm">Aasheim, Hilda</A>*
>
> *<BR><A HREF = "2peo/23.htm">Abbot, Dave</A>*
>
> *<BR><A HREF = "2peo/24.htm">Abbott, David</A>*
>
> *<BR><A HREF = "2peo/25.htm">Abdalla, Ab</A>*
>
> *<BR><A HREF = "2peo/26.htm">Abel, Roger</A>*
>
> *<BR><A HREF = "2peo/27.htm">Abernathy, Steve</A>*
>
> *<BR><A HREF = "2peo/28.htm">Aberson, John</A>*
>
> *<BR><A HREF = "2peo/29.htm">Abougoush, Mickey</A>*
>
> *<BR><A HREF = "2peo/210.htm">Abou-Sayed, Ahmed</A>*
>
> * *
>
> * *
>
> *If I go to one of these, eg.*
>
> *http://www.oilit.com/2journal/2index/2peo/210.htm*
>
> * *
>
> *I see a page listing article(s) by that person, so for Abou-Sayed, Ahmed*
>
> *we get <BR><A HREF = "../../2article/0603_11.htm" > Sixth Middle East IM
> *
>
> *Forum**, Kuwait** (March 2006)</A> ie.*
>
> * *
>
> *http://www.oilit.com/2journal/2article/0603_11.htm*
>
> * *
>
> *We get basic metadata here,*
>
> * *
>
> *     <meta name="document-date" content="28 Mar 2006 00:00:00 GMT">*
>
> *     <TITLE>Sixth Middle East IM Forum, Kuwait (March 2006)</TITLE>*
>
> * *
>
> *And what looks like an abstract/intro paragraph,*
>
> * *
>
> *<Font Face="Arial" Size=2><b>*
>
> *Data management and information management (DM/IM) in the Middle East *
>
> *countries  is different. First because it has much more of a production *
>
> *focus that in Europe or the USA. Second, because Middle East National *
>
> *Oil Companies have taken the long term view. If building a corporate *
>
> *data store for fields with hundreds of wells and decades of production *
>
> *history means a five year plan, with allocation of people, training and *
>
> *finance, then that is what happens. Kuwait Oil Co. (KOC) has over 1,000 *
>
> *users of its Finder database with projects ongoing for data quality, *
>
> *SCADA integration, data mining, decision support and automated data *
>
> *capture. Finder database has cornered the data store market for Middle *
>
> *East NOCs. This is both a great achievement and a potential *
>
> *embarrassment for Schlumberger which is in the process of trying to wean
> *
>
> *its clients off Finder and onto Seabed. An animated debate at the close *
>
> *of the conference showed that this will not be easy.*
>
> *</b></font>*
>
> * *
>
> * *
>
> *The match against Ahmed Abou Sayed seems to be based on his being *
>
> *quoted.  Is the matching/indexing done by hand or machine?*
>
>
>
> It's based on his being referred to in the article and is done by hand
> monthly – and kept in an Access database which generates the indexes.
>
>
>
> *     "For Ahmed Abou-Sayed (Informateks), 'data mining is set to become a
> *
>
> *tough competitor for simulation.'"*
>
> * *
>
> *You're right, there's a lot here to work with. But the current structure
> *
>
> *of the site (markup, frames etc) is a little daunting for the *
>
> *uninitiated. Could you give some suggestions on how semweb folk might *
>
> *explore it? eg. is it OK to crawl the entire site? Can you make some *
>
> *data dumps available, or suggest key URLs to explore from?*
>
>
>
> The site has two versions of the same text – a 'monthly' edition which you
> see upfront on login with PHP and CSS which looks reasonably OK. But of more
> interest to semantic stuff perhaps is the same information in individual
> article files. These have the structure
>
> <H1>An article title
>
> <H2>A subtitle
>
> The text of the article. They are all located in
> www.oilit.com/2journal/2article/YYMM_NN.htm (Year/month/article number).
>
> The index structure you have basically figured out. For instance the
> 'people' index contains all the people we have ever mentioned in an article
> and points to a list of such articles – which points to the articles
> themselves.
>
>
>
> I have thought about – and probably will – moving this over to a MySQL
> database, but not had the time to do so. What I would like to understand
> from you folks is how this information – say the list of companies and
> people – can be presented in a semantic way that would give them more
> usefulness (discovery, reuse?) to other sites and robots.
>
>
>
> Others on the list have suggested OpenCalais – which looks interesting for
> marking up the text. But it would be good to add back in my own lists of
> companies and people – maybe I can do this with Calais?
>
>
>
> Regards – and thanks a lot for having spent time with my frames already ;-)
>
>
>
> Neil McNaughton
>
> --
>
> http://danbri.org/
>
>

Received on Thursday, 2 October 2008 09:48:51 UTC