- From: Edward C. Zimmermann <edz@bsn.com>
- Date: Tue, 14 Feb 2006 08:06:18 +0100
- To: www-zig@w3.org
- Cc: to.takasu@infocom.co.jp
Quoting to.takasu@infocom.co.jp:

> I am from Japan and we have several styles of character(kanji, hiragana,
> katakana) in Japanese language

And encoding paradigm JIS or Unicode or .. ?

> as you might have known. Therefore when we use Z39.50 or whatever to build
> a search system,
> it comes to a critical issue that a book could be stored with a title
> consisted of different styles of character in
> each Z39.50 database. Once it happened, the search system reads the title
> of the same book from each database,
> and recognize them as different books because their titles don't match.

Correct. Say you were using Unicode as the encoding: since Unicode carries
only the characters and not the style information, we would be unable to
tell that the titles are the same, and the Kanji characters are
indistinguishable from Chinese or Korean ones (this is the basis for the
opposition to Unicode in Japan).

> I assume that the same problem could arise in English as well.
> The search system has to handle the capital letters, lower case letters,
> space, /, -, commas, periods, so on so forth,
> and a title of a book could be represented in different way. (Title
> metadata could be slightly different depending on the database, right?)

The problem is more heinous, as the same "object" can have very different
titles. In books we have the ISBN and other library identification numbers
to tell things apart (or to identify them as the same), but in other areas
such as film we don't. In film, for example, there are not always such
"clearinghouse" numbers. Different film distributors time and again carry
the "same film" under different names. Either one considers them different
films (they may indeed be slightly different due to editing and printing),
or one needs to try to create a database mapping the "same" objects and,
given that the material is not known, use heuristics to try to reduce the
magnitude of the effort demanded...
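One simple heuristic of the kind discussed above is to compare titles by a
normalized key rather than by their raw strings. The following is only a
minimal sketch and not the method used in any of the systems mentioned in
this message; the function names `katakana_to_hiragana` and `title_key` are
made up for illustration. It assumes Unicode input, folds case, drops
spaces and punctuation, and maps katakana onto hiragana. It cannot match a
kanji title against its kana spelling, since that would need a reading
dictionary.

```python
import unicodedata


def katakana_to_hiragana(text: str) -> str:
    """Map katakana letters onto the corresponding hiragana letters."""
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana U+3041..U+3096.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def title_key(title: str) -> str:
    """Return a comparison key for a title (hypothetical helper)."""
    # NFKC folds full-width Latin letters and half-width katakana
    # into their ordinary forms.
    text = unicodedata.normalize("NFKC", title)
    text = katakana_to_hiragana(text).casefold()
    # Keep letters and digits only: spaces, "/", "-", commas and
    # periods are dropped.
    return "".join(ch for ch in text if ch.isalnum())


if __name__ == "__main__":
    print(title_key("The Tale of Genji, Vol. 1"))
    print(title_key("the tale-of-genji vol 1"))
```

Two records whose keys are equal are only candidates for being the same
item; an identifier such as the ISBN, where one exists, remains the more
reliable test.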
We tackled this about a dozen years ago in a large film catalogue
(database) for the German Ministerium for Political Education by searching
the records' content against each other and looking for "large"
correlations. In a German national pilot to create a unified
infrastructure for the various German state media clearinghouses we took
yet another approach: we worked to convert each state's metadata into a
common interchange format with common attribute semantics, including a
common identification, and created tools for record authors to search the
collection to confirm that no record already existed. There was motivation
to find any existing record of an item, since only the first person to
register a media item was responsible for creating its record! (And we, of
course, had tools as well, using some of the above heuristics, to try to
identify "errors".)

> The thing is that a search system with ability to recognize a same books as
> a same book even though the
> titles don't perfectly match, is ideal system because it can output exactly
> one result per a book.

That has nothing to do with Z39.50/ISO 23950 but with other systems and
agreements such as ISBN, which are exactly about uniquely identifying
books and book-like products published internationally (journals etc. are
identified by ISSNs, or International Standard Serial Numbers).

> So how does your system handle this problem?
>
> I apologize that I wrote more than enough amount, but if you could have
> some time to
> answer my question, that will be greatly appreciated.

The ISBN is encoded in the various MARC and other traditional exchange
formats and is in our basis (Bib-1) attribute set under the UID (Unique
ID) 7. The ISSN is encoded in Bib-1 under 8.

--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
  Leopoldstrasse 53-55, D-80802 Munich,
  Federal Republic of Germany
Telephone:   Voice:= +49 (89) 385-47074  Corp.Fax:= +49 (89) 692-8150
Nomadic (SMS/MMS/Fax):= +49 (176) 100-360-55
Alt.Mobile:= +49 (179) 205-0539
http://www.nonmonotonic.net
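As an illustration of the Bib-1 numbers given in the reply above (7 for
ISBN, 8 for ISSN), the sketch below builds a search string for an ISBN
lookup. It rests on assumptions that are not in the original message: the
query is written in PQF (the prefix query notation used by the YAZ
toolkit), where `@attr 1=7` selects Bib-1 Use attribute 7, and the helper
names `normalize_isbn` and `isbn_query` are hypothetical. ISBNs are
stripped of hyphens, checked against their check digit, and converted to
the 13-digit form so that both spellings of one ISBN yield the same term.
Whether a given target indexes the 10-digit or the 13-digit form is
something to verify per database.

```python
BIB1_USE_ISBN = 7  # Bib-1 Use attribute for ISBN
BIB1_USE_ISSN = 8  # Bib-1 Use attribute for ISSN


def _isbn13_check_digit(first12: str) -> str:
    """Compute the ISBN-13 check digit from the first 12 digits."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def normalize_isbn(raw: str) -> str:
    """Return the 13-digit ISBN without hyphens (hypothetical helper).

    Raises ValueError if the input is malformed or its check digit is wrong.
    """
    s = raw.replace("-", "").replace(" ", "").upper()
    if not s.isascii():
        raise ValueError(f"not an ISBN: {raw!r}")
    if len(s) == 10 and s[:9].isdigit() and (s[9].isdigit() or s[9] == "X"):
        # ISBN-10: weights 10..1, the total must be divisible by 11.
        values = [10 if ch == "X" else int(ch) for ch in s]
        if sum((10 - i) * v for i, v in enumerate(values)) % 11 != 0:
            raise ValueError(f"bad ISBN-10 check digit: {raw!r}")
        first12 = "978" + s[:9]
        return first12 + _isbn13_check_digit(first12)
    if len(s) == 13 and s.isdigit():
        if _isbn13_check_digit(s[:12]) != s[12]:
            raise ValueError(f"bad ISBN-13 check digit: {raw!r}")
        return s
    raise ValueError(f"not an ISBN: {raw!r}")


def isbn_query(raw: str) -> str:
    """Build a PQF query on Bib-1 Use attribute 7 (ISBN)."""
    return f"@attr 1={BIB1_USE_ISBN} {normalize_isbn(raw)}"


if __name__ == "__main__":
    print(isbn_query("0-306-40615-2"))
```

Records from different databases that return the same normalized ISBN can
then be merged into one result, regardless of how their titles are
written.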
Received on Tuesday, 14 February 2006 07:06:38 UTC