- From: Alan Kent <ajk@mds.rmit.edu.au>
- Date: Thu, 3 Jul 2003 13:30:40 +1000
- To: ZIG <www-zig@w3.org>
I am home sick today, so sending this may be dangerous (inaccurate), but it makes sense to me at the moment! *-)

One thing that struck me recently was the imprecision of Z39.50 compared to other standards. This is probably no shock to anyone who has been involved with Z39.50, but it seems a pity. For example, if I load the same data into two systems claiming to be SQL conformant, then I know the same SQL query on both systems will return the same results. This is not true with Z39.50.

Often this is acceptable. In many places a request lets the server pick the default it wishes to use. Record syntax is a good example. I can omit the record syntax from my request, in which case the server can choose whatever it likes. But I can also request a particular syntax (that is, I can use the default, or say what I want). So profiles like Bath give guidance on how to express unambiguous (or at least less ambiguous) requests for people conforming to Bath.

But there seems to me to be one central place in Z39.50 where things are messy, and that is where terms and attributes meet. Two concrete examples:

(1) I was involved in the Bath interoperability testing by William Moen. I was supplied with data, then my server was hit with lots of queries. My server passed, but the results were inconsistent with the results of the base server because we treated '-' differently when it came to extracting words from text. Is "book-case" one word or two? An observation is that in a query, I can specify all sorts of things, but not what my interpretation is of how to extract terms from text.

(2) I was reading the new attribute architecture and Bath profile draft.
If you read the truncation attributes (left truncation on character boundary, to be precise) it says:

    Matching is considered successful when the value of the term matches any substring of the value of the access point where that substring is obtained by removing zero or more characters at the beginning of the string

This implies to me that an access point identifies a string from a record. For example, the "title" of a record may be "The Art of War". I can send a query term such as "rt of W" with left and right truncation at character boundaries to match the access point. I can also say whether to normalize whitespace or not using another attribute. So my query term in this case is not a series of words (or is it?)...

What do I get if I scan the 'title' access point in the new attribute architecture? Words extracted from the title, or complete title strings? In Bib-1, the Bath profile used the 'completeness' attribute to differentiate. That is, I could ask for one or the other. Nothing like this seems to exist in the new attribute architecture.

I think the new attribute architecture has a flaw. I was not really involved in its development, but my second-guessing is that the model used was to try to make it abstract by avoiding dependence on how it might be implemented. That is, let's define lots of query operators that people have needed in the past but leave it up to developers how to implement them. On the surface this sounds good.

My problem is, Z39.50 as a whole exposes the concept of 'terms' being extracted from the string values of access points. This is what scans return. Bib-1 (via Bath) supported the concept of different term extraction rules (complete value versus words). This seems to have been lost as an explicit concept in the new attribute architecture (it is sort of implicit in some of the query operators, but this creates problems when you specify multiple query operators that imply different rules).
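
To make the distinction above concrete, here is a minimal Python sketch of character-boundary truncation applied to terms extracted in two different ways. It is only an illustration: the function names are hypothetical, and the word rule (runs of ASCII letters and digits) is an assumption, not something taken from Bib-1, Bath, or the attribute architecture.

```python
import re

def complete_value_terms(value):
    # One term: the whole access-point value (the "complete value" reading).
    return [value]

def word_terms(value):
    # Assumed word rule, for illustration only: runs of ASCII letters and digits.
    return re.findall(r"[A-Za-z0-9]+", value)

def truncated_match(query, term, left=False, right=False):
    # Character-boundary truncation of a query string against one extracted term.
    if left and right:
        return query in term           # characters may be dropped at both ends
    if left:
        return term.endswith(query)    # drop zero or more leading characters
    if right:
        return term.startswith(query)  # drop zero or more trailing characters
    return query == term

def search(query, terms, **truncation):
    # A record matches if any of its extracted terms matches the query.
    return any(truncated_match(query, t, **truncation) for t in terms)

title = "The Art of War"
print(word_terms(title))  # ['The', 'Art', 'of', 'War']
print(search("rt of W", complete_value_terms(title), left=True, right=True))  # True
print(search("rt of W", word_terms(title), left=True, right=True))            # False
print(search("rt", word_terms(title), left=True))                             # True
```

The same query and the same truncation attributes give different answers depending on which extraction rule the server silently applied, and a scan of the same access point would list either four words or one complete title.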
This led me to wonder if there should be support for an attribute that can be used to specify the rule for extracting 'terms' (where a term could be a word or the complete value etc) from the query string supplied and the string of the access point being searched.

I am not saying the new attribute set architecture should be thrown out. It has lots of good stuff in it. But it seems sad to me that years after the new attribute architecture has been released, with a PhD in text indexing and around 10 years of practical Z39.50 experience, I still find it impossible to implement. But I guess I am sick... ;-)

Rather than only be critical (I am certainly not trying to offend those who have put lots of time and effort into it), I guess I should present something constructive. If I were to design the ideal solution for myself, ignoring all other people's requirements, I would start from a definition of the basics of how I (and I think most vendors) implement Z39.50.

To me, given a record, a set of terms is extracted. These terms may be stored in indexes to improve query performance. All terms have metadata including an access point (and semantic qualifiers etc) and term position information. Note that I am carefully using the word 'term', as the term could be a string that is a word or a complete value, or an integer or float etc.

Terms are also extracted from queries. A query string may for example contain three word-terms. A Z39.50 query-term is not a term to me - it's the input to a process that extracts terms from it. Attributes supplied with the query identify how to extract terms from the query.

I will use the name 'term-point' to refer to a combination of an access point, semantic qualifiers etc, plus the term extraction rules used to extract terms. Just to clarify, 'author' might be an access point, but if I want to support searching on access=author+semantic-qualifier=person and access=author+semantic-qualifier=institution, then I would have two term-points.
A query that specified only access=author I would treat as an OR query on both term-points.

Next, terms extracted from records and queries can be transformed. Example transformations include mapping to upper case and stemming. This is not mandatory, but is an optional step (expanded upon below).

Comparison of records to queries then proceeds by comparing the terms extracted from records and from queries. The term-points of the terms must be the same for a match (eg: both must be 'title' and a 'word'). Other query attributes can specify the rules for comparing the terms (case sensitivity, truncation, greater-than, adjacency/all/any).

Scanning returns terms extracted from records. Display terms are terms before any transformations; the actual term is the term after any transformations have been done. A term-list is therefore all the terms extracted from a term-point. The attribute list for a scan identifies which term-point to use.

I would leave it up to an implementation to choose which attributes bind to term extraction rules, transformation rules, and comparison rules. For example, term extraction may return terms mapped to upper case, or a transformation rule may map the term to upper case, or the comparison function may be case insensitive. The three choices affect what you see in the record versus what you see as the display and actual term when scanning. A flexible system would allow you to change these rules per term-point.

Once this model was in place, I would then go through the various attributes and define them with respect to terms (not words or complete values). For example, space normalization is defined on terms - if the terms were words, there would not be any spaces in the terms so it can be ignored. AllWords, AnyWords, AdjacentWords would become AllTerms, AnyTerms, AdjacentTerms. Truncation at character and word boundaries would also be on terms.

I would also introduce new attribute values to identify the term extraction rules.
Examples may include:

* Server choice of how to extract words from input string.
* Server choice of how to extract the complete value of the input string.
* Input string after normalizing all punctuation and whitespace into spaces, trimming leading and trailing whitespace, and compressing internal whitespace into a single space.
* Break into words where a word is a continuous sequence of letters and digits.

The last two are possible examples of precise term extraction rules that could be defined. This would allow a query to specify *exactly* how to extract terms. I can imagine lots of vendors then wanting their variations included. The first two would be the more common cases of letting a server choose, but still being able to choose between words or complete values.

Alan
--
Alan Kent (mailto:Alan.Kent@teratext.com.au, http://www.mds.rmit.edu.au/~ajk/)
Project: TeraText Technical Director (http://teratext.com.au) InQuirion Pty Ltd
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC Australia.
Phone: +61 3 9925 4114 Reception: +61 3 9925 4099 Fax: +61 3 9925 4098
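
For illustration, the two precise extraction rules listed in the message could be sketched in Python roughly as follows. The function names are hypothetical, and "punctuation" is assumed here to mean any character that is neither a letter nor a digit; the message itself does not pin that down.

```python
def normalized_complete_value(text):
    # Third rule: punctuation and whitespace become spaces, leading and
    # trailing whitespace is trimmed, and internal runs of whitespace are
    # compressed to one space. The result is a single term.
    # Assumption: anything that is not a letter or digit counts as punctuation.
    spaced = "".join(ch if ch.isalnum() else " " for ch in text)
    value = " ".join(spaced.split())
    return [value] if value else []

def alnum_words(text):
    # Fourth rule: a word is a continuous sequence of letters and digits.
    spaced = "".join(ch if ch.isalnum() else " " for ch in text)
    return spaced.split()

print(normalized_complete_value("  The Art   of War! "))  # ['The Art of War']
print(alnum_words("The Art of War"))                      # ['The', 'Art', 'of', 'War']
print(normalized_complete_value("book-case"))             # ['book case']
print(alnum_words("book-case"))                           # ['book', 'case']
```

Under either rule the hyphen in "book-case" is always a separator, so two servers that both honoured the requested rule would extract the same terms from the same data.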
Received on Wednesday, 2 July 2003 23:30:49 UTC