Re: question about Z39.50 from Edward C. Zimmermann on 2006-02-14 (www-zig@w3.org from February 2006)

From: Edward C. Zimmermann <edz@bsn.com>
Date: Tue, 14 Feb 2006 08:06:18 +0100
To: www-zig@w3.org
Cc: to.takasu@infocom.co.jp
Message-ID: <1139900778.43f1816a4cfd1@mail.bsn.com>
Quoting to.takasu@infocom.co.jp:


> 
> I am from Japan and we have several styles of character(kanji, hiragana,
> katakana) in Japanese language

And which encoding: JIS, Unicode, or something else?

> as you might have known. Therefore when we use Z39.50 or whatever to build
> a search system,
> it comes to a critical issue that a book could be stored with a title
> consisted of different styles of character in
> each Z39.50 database. Once it happened, the search system reads the title
> of the same book from each database,
> and recognize them as different books because their titles don't match.

Correct. Say you were using Unicode as the encoding: since Unicode carries
only the characters and not the style information, we would be unable to
tell that they are the same, and the Kanji characters are indistinguishable
from their Chinese or Korean counterparts (this is the basis for the
opposition to Unicode in Japan).
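
For the kana side of the problem only, here is a minimal sketch of a comparison
key, assuming the titles are already Unicode strings. The function name
fold_kana is hypothetical and nothing here is part of Z39.50. It does not
address a title written in kanji in one database and in kana in another; that
would need a reading dictionary.

```python
import unicodedata

# Katakana U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts.
_KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def fold_kana(text: str) -> str:
    """Return a comparison key (hypothetical helper, for matching only).

    NFKC maps half-width katakana and full-width Latin letters to their
    standard forms; katakana is then folded to hiragana and case is folded.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(_KATAKANA_TO_HIRAGANA).casefold()
```

The key is meant for matching only; the original title should still be the one
that is stored and displayed.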


> I assume that the same problem could arise in English as well.
> The search system has to handle the capital letters, lower case letters,
> space, /, -, commas, periods, so on so forth,
> and a title of a book could be represented in different way. (Title
> metadata could be slightly different depending on the database, right?)

The problem is more heinous as the same "object" can have very different
titles. In books we have ISBN and other library identification numbers to
tell things apart (or to identify them as the same) but in other areas such
as film we don't.

In film, for example, there are not always such "clearinghouse" numbers.
Different film distributors time and again carry the "same film" under
different names. Either one considers them different films (they may indeed
be slightly different due to editing and printing) or one needs to try to
create a database mapping the "same" objects and, given that the material is
not known, heuristics to reduce the magnitude of the effort demanded...

We tackled this about a dozen years ago in a large film catalogue (database)
for the German Ministry for Political Education by searching the records'
content against each other and looking for "large" correlations.
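
As an illustration of that kind of heuristic (not the code we used then), here
is a minimal sketch that compares every pair of records by word overlap. The
Jaccard measure, the 0.6 threshold and the function names are all assumptions
chosen for the example.

```python
import re
from itertools import combinations


def tokens(text):
    """Lower-cased set of word tokens in a record's text."""
    return set(re.findall(r"\w+", text.casefold()))


def jaccard(a, b):
    """Share of tokens two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_duplicates(records, threshold=0.6):
    """records maps record id -> text; return (id, id, score) candidates.

    Every pair is compared, so this is quadratic in the number of records
    and only suitable as a sketch or for small collections.
    """
    token_sets = {rid: tokens(text) for rid, text in records.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(token_sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    # Strongest correlations first, for a human to review.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

The output is a list of candidates for a person to check, not a decision that
two records describe the same film.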

In a German national pilot to create a unified infrastructure for the various
German state media clearinghouses we took yet another approach: we worked to
convert each state's metadata into a common interchange format with common
attribute semantics, including a common identification, and created tools for
record authors to search the collection to confirm that no record already
existed. There was motivation to find any existing record of an item, since
only the first person to register a media item was responsible for creating
its record!
(And we, of course, also had tools using some of the above heuristics to
try to identify "errors".)

> The thing is that a search system with ability to recognize a same books as
> a same book even though the
> titles don't perfectly match, is ideal system because it can output exactly
> one result per a book.

That has nothing to do with Z39.50/ISO 23950 but with other systems and
agreements such as the ISBN, which are exactly about uniquely identifying
books and book-like products published internationally (journals etc. are
identified by ISSNs, or International Standard Serial Numbers).
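
To use the ISBN as a merge key across databases, the hyphenation and the
10-digit versus 13-digit forms have to be reduced to one canonical value
first. A minimal sketch, assuming the standard ISBN check-digit rules; the
function names are hypothetical.

```python
def _clean(isbn):
    """Drop hyphens and spaces; upper-case a trailing 'x' check digit."""
    return isbn.replace("-", "").replace(" ", "").upper()


def _weighted_sum13(digits):
    """ISBN-13 weighting: 1, 3, 1, 3, ... from the left."""
    return sum((3 if i % 2 else 1) * int(c) for i, c in enumerate(digits))


def is_valid_isbn10(isbn):
    s = _clean(isbn)
    if len(s) != 10 or not s[:9].isdecimal():
        return False
    if not (s[9].isdecimal() or s[9] == "X"):
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    # Weights 10 down to 1; a valid number sums to a multiple of 11.
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(isbn):
    s = _clean(isbn)
    return len(s) == 13 and s.isdecimal() and _weighted_sum13(s) % 10 == 0


def isbn_key(isbn):
    """Return the 13-digit form as a merge key, or None if invalid."""
    s = _clean(isbn)
    if is_valid_isbn13(s):
        return s
    if is_valid_isbn10(s):
        body = "978" + s[:9]
        check = (10 - _weighted_sum13(body) % 10) % 10
        return body + str(check)
    return None
```

Records from different databases whose isbn_key values are equal can then be
reported as one result, whatever their titles look like.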

> 
> So how does your system handle this problem?
> 
> I apologize that I wrote more than enough amount, but if you could have
> some time to
> answer my question, that will be greatly appreciated.

The ISBN is encoded in the various MARC and other traditional exchange formats
and is in our basic (Bib-1) attribute set as Use attribute 7. The ISSN is
Use attribute 8 in Bib-1.
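
As a sketch of what such a search looks like, the helper below builds a query
in PQF, the prefix query notation used by the YAZ toolkit, where attribute
type 1 is the Bib-1 Use attribute. The function and constant names are
hypothetical, and it assumes the client accepts PQF strings.

```python
import re

BIB1_USE_ISBN = 7  # Bib-1 Use attribute for ISBN
BIB1_USE_ISSN = 8  # Bib-1 Use attribute for ISSN


def pqf_identifier_query(identifier, use_attribute):
    """Build a PQF query string that searches one Bib-1 Use attribute."""
    term = identifier.strip()
    # Accept only digits, hyphens and the 'X' check character, so the term
    # never needs quoting or escaping inside the query string.
    if not re.fullmatch(r"[0-9Xx-]+", term):
        raise ValueError(f"not a plausible ISBN/ISSN: {identifier!r}")
    return f"@attr 1={use_attribute} {term}"
```

For example, pqf_identifier_query("0306406152", BIB1_USE_ISBN) gives the
string "@attr 1=7 0306406152", which can be sent to each target in turn.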

-- 
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
Telephone:   Voice:=  +49 (89) 385-47074  Corp.Fax:= +49 (89)  692-8150
 Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55  Alt.Mobile:= +49 (179) 205-0539
http://www.nonmonotonic.net
Received on Tuesday, 14 February 2006 07:06:38 UTC