Announce: MKSearch beta 1

MKDoc Ltd. would like to announce the first beta release of 
MKSearch, under the GNU General Public Licence. Source and 
pre-compiled binary downloads are available from the project Web 
site.

http://www.mksearch.mkdoc.org/downloads/


MKSearch is a metadata search engine that indexes structured 
metadata in Web documents, not free text in the document body. 
The data acquisition system:

* Conforms to the Dublin Core metadata in HTML 
recommendations [1]

* Supports other application profiles, such as the UK e-Government 
Metadata Standard [2]

* Indexes native RDF formats, including RSS 1.0



The MKSearch system has five major components:

1. A Web crawler based on JSpider [3]

    * Multi-threaded processing
    * Per-site throttle, user agent, depth and linking rules
    * Respects the robots.txt exclusion policy
    * Extensible plug-in based content handling

2. An HTML document validator and formatter based on JTidy [4]

    * Cleans up and corrects HTML syntax errors
    * Converts HTML to XHTML

3. A set of custom indexers based on the Simple API for XML (SAX)

    * Extracts metadata from HTML meta and link elements
    * Converts metadata to RDF triple statements
    * Configurable application profiles

4. An RDF storage and query system based on Sesame [5]

    * RDF/XML file-based storage
    * Database storage using PostgreSQL or MySQL
    * Sophisticated Sesame RDF Query Language (SeRQL) queries
    * Scope for more semantically rich queries with inferencing

5. A public query interface, provided through a standard servlet 
container

    * Simple, expandable query builder form
    * Configurable application profile-based presentation
    * Wildcard query handling
    * Phrase searches
    * Paged HTML results
    * Standing RSS results
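
As an illustration of component 3 above, the following is a minimal
sketch, not MKSearch code, of how a SAX handler might turn Dublin
Core meta and link elements into N-Triples statements. The class
name MetaTripleSketch, the extract method and the example page URI
are hypothetical. It assumes well-formed XHTML input (as JTidy would
produce) in which each schema link element appears before the meta
elements that use its prefix.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects DC-style meta elements as N-Triples lines. */
public class MetaTripleSketch extends DefaultHandler {
    private final String pageUri;
    // Maps a prefix such as "DC" to the namespace declared by <link rel="schema.DC">.
    private final Map<String, String> schemas = new HashMap<String, String>();
    private final List<String> triples = new ArrayList<String>();

    public MetaTripleSketch(String pageUri) {
        this.pageUri = pageUri;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("link".equals(qName)) {
            String rel = atts.getValue("rel");
            String href = atts.getValue("href");
            if (rel != null && href != null && rel.startsWith("schema.")) {
                schemas.put(rel.substring("schema.".length()), href);
            }
        } else if ("meta".equals(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name == null || content == null) {
                return;
            }
            int dot = name.indexOf('.');
            // Skip names without a prefix or with an undeclared prefix.
            String ns = dot < 0 ? null : schemas.get(name.substring(0, dot));
            if (ns == null) {
                return;
            }
            triples.add("<" + pageUri + "> <" + ns + name.substring(dot + 1)
                    + "> \"" + escape(content) + "\" .");
        }
    }

    // Simplified literal escaping: backslash, double quote and newline only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Parses an XHTML string and returns one N-Triples line per recognised meta element. */
    public static List<String> extract(String xhtml, String pageUri) throws Exception {
        MetaTripleSketch handler = new MetaTripleSketch(pageUri);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.triples;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head>"
                + "<link rel=\"schema.DC\" href=\"http://purl.org/dc/elements/1.1/\" />"
                + "<meta name=\"DC.title\" content=\"MKSearch beta 1\" />"
                + "<meta name=\"generator\" content=\"ignored\" />"
                + "</head><body/></html>";
        for (String triple : extract(xhtml, "http://example.org/page")) {
            System.out.println(triple);
        }
    }
}
```

Run against the sample string in main, the sketch emits a single
statement for DC.title and skips the generator element, because that
name carries no declared schema prefix. A real indexer would also
need to handle link elements that carry metadata, language and
scheme attributes, and configurable application profiles.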


The two main elements of the MKSearch system, data acquisition 
and query, can be used independently. The data acquisition system 
can gather large quantities of metadata from the Web and store it 
as RDF. The query system can provide a typical search engine-style 
interface to existing RDF content.

The MKSearch beta 1 distribution includes sample configurations 
that crawl a Web site and create:

* A mirror of the site on the local file system in valid XHTML
* An RDF N-Triples record for each page on the local file system
* UK e-Government metadata in a Sesame file-based repository 
(RDF/XML)


This distribution also includes a demonstration of the MKSearch 
query interface, in the form of a Web Application Archive (WAR) 
that can be deployed directly to an existing servlet container. The 
sample search content is from an index of the MKSearch project 
Web site on 2 November 2005. See the site documentation below:

http://www.mksearch.mkdoc.org/documentation/tomcat-on-fc4/

http://www.mksearch.mkdoc.org/howto/

http://www.mksearch.mkdoc.org/plans/beta-1-release-tasks/mksearch-beta-1-release-notes/


System requirements and licence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MKSearch is written in the Java programming language and is 
designed to run on any platform that supports a Java environment 
equivalent to the Sun Java 2 language specification.

The system has specifically been designed, developed and tested 
to run on GNU/Linux systems using the GNU Compiler for Java 
(GCJ) [6] and Apache Tomcat 5 servlet container, as available on 
Fedora Core 4 [7]. This provision means that MKSearch can be 
built and run on software systems that are entirely open source 
and free from proprietary licensing.

The system has been tested extensively using the Sun Java SDK 
1.5 on Microsoft Windows 2000. JUnit test suites for the 
MKSearch code base cover 99% of all code branches.

If you have any comments or questions about the MKSearch 
system, please join us on the project mailing list.

http://www.email-lists.org/mailman/listinfo/mksearch-dev





References
~~~~~~~~~~

[1] http://dublincore.org/documents/2003/11/30/dcq-html/

[2] http://www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=805

[3] http://j-spider.sourceforge.net/

[4] http://jtidy.sourceforge.net/

[5] http://www.openrdf.org/

[6] http://gcc.gnu.org/java/

[7] http://fedora.redhat.com/

--
MKSearch (beta)

http://www.mksearch.mkdoc.org/

Free, open source metadata search engine with RDF storage and query.

Received on Friday, 4 November 2005 04:37:04 UTC