The imprecision of Z39.50 from Alan Kent on 2003-07-03 (www-zig@w3.org from July 2003)

From: Alan Kent <ajk@mds.rmit.edu.au>
Date: Thu, 3 Jul 2003 13:30:40 +1000
To: ZIG <www-zig@w3.org>
Message-ID: <20030703133040.A15146@io.mds.rmit.edu.au>
I am home sick today, so sending this may be dangerous (inaccurate),
but it makes sense to me at the moment! *-)

One thing that struck me recently was the imprecision of Z39.50 compared
to other standards. This is probably no shock to anyone who has been
involved with Z39.50, but it seems a pity.

For example, if I load the same data into two systems claiming to be SQL
conformant, then I know the same SQL query on both systems will return
the same results.

This is not true with Z39.50.

Often this is acceptable. In many places a request lets the server
pick the default it wishes to use. Record syntax is a good example.
I can omit the record syntax from my request in which case the server
can choose whatever it likes. But I can also request a particular syntax
(that is, I can use the default, or say what I want).

So profiles like Bath give guidance on how to express unambiguous
(or at least less ambiguous) requests for people conforming to Bath.

But there seems to me to be one central place in Z39.50 where things
are messy, and that is where terms and attributes meet.

Two concrete examples:


(1) I was involved in the Bath interoperability testing by William Moen.
I was supplied with data, then my server was hit for lots of queries.
My server passed, but the results were inconsistent with the results
of the base server because we treated '-' differently when it came to
extracting words from text. Is "book-case" one word or two?

An observation is that in a query, I can specify all sorts of things,
but not what my interpretation is of how to extract terms from text.
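
To make the ambiguity concrete, here is a minimal Python sketch of two
word extraction rules that disagree on the hyphen. Both function names
and both rules are hypothetical illustrations; neither is taken from
Z39.50, the Bath profile, or any particular server.

```python
import re


def words_split_on_hyphen(text):
    """Hypothetical rule A: a word is a run of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def words_keep_hyphen(text):
    """Hypothetical rule B: a hyphen between two runs is part of the word."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())


title = "An oak book-case"
print(words_split_on_hyphen(title))  # ['an', 'oak', 'book', 'case']
print(words_keep_hyphen(title))      # ['an', 'oak', 'book-case']
```

A word query for "case" finds this record on a server using rule A and
misses it on a server using rule B, even though both hold the same data.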


(2) I was reading the new attribute architecture and Bath profile draft.
If you read the truncation attributes (left truncation on character
boundary to be precise) it says:

    Matching is considered successful when the value of the term
    matches any substring of the value of the access point where that
    substring is obtained by removing zero or more characters at the
    beginning of the string

This implies to me an access point identifies a string from a record.
For example "title" of a record may be "The Art of War". I can say
queries such as "rt of W" with left and right truncation at character
boundaries to match the access point. I can also say whether to normalize
whitespace or not using another attribute. So my query term in this
case is not a series of words (or is it?)...
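
A small Python sketch of one reading of the quoted definition, assuming
the access point yields one complete string per record. The function
name and parameters are made up for illustration, and right truncation
is assumed to be the mirror image of the quoted left truncation rule.

```python
def matches(term, value, left_trunc=False, right_trunc=False):
    """Compare a query term with the complete string value of an access point.

    left_trunc:  zero or more characters may be removed from the start of value.
    right_trunc: zero or more characters may be removed from the end of value.
    """
    if left_trunc and right_trunc:
        return term in value
    if left_trunc:
        return value.endswith(term)
    if right_trunc:
        return value.startswith(term)
    return term == value


title = "The Art of War"
print(matches("rt of W", title, left_trunc=True, right_trunc=True))  # True
print(matches("of War", title, left_trunc=True))                     # True
print(matches("rt of W", title))                                     # False

# If the access point yields words instead, the same query matches nothing.
print(any(matches("rt of W", w, True, True) for w in title.split()))  # False
```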

What do I get if I scan the 'title' access point in the new attribute
architecture? Words extracted from the title, or complete title strings?

In Bib-1, the Bath profile used the 'completeness' attribute to
differentiate. That is, I could ask for one or the other.
Nothing like this seems to exist in the new attribute architecture.


I think the new attribute architecture has a flaw. I was not
really involved in its development, but my second-guessing is that
the model used was to try and make it abstract by avoiding dependence on
how it might be implemented. That is, let's define lots of query operators
that people have needed in the past but leave it up to developers
how to implement them. On the surface this sounds good.

My problem is, Z39.50 as a whole exposes the concept of 'terms' being
extracted from the string values of access points. This is what scans
return. Bib-1 (via Bath) supported the concept of different term extraction
rules (complete value versus words). This seems to have been lost as
an explicit concept in the new attribute architecture (it is sort of
implicit in some of the query operators, but this creates problems
when you specify multiple query operators that imply different rules).

This led me to wonder if there should be support for an attribute
that can be used to specify the rule for extracting 'terms' (where a
term could be a word or the complete value etc) from the query string
supplied and the string of the access point being searched.


I am not saying the new attribute set architecture should be
thrown out. It has lots of good stuff in it. But it seems sad to me
that years after the new attribute architecture has been released, with
a PhD in text indexing, around 10 years of practical Z39.50 experience,
I still find it impossible to implement. But I guess I am sick... ;-)



Rather than only be critical (I am certainly not trying to offend those
who have put lots of time and effort into it), I guess I should present
something constructive. If I was to design the ideal solution for myself,
ignoring all other people's requirements, I would start from a definition
of the basics of how I (and I think most vendors) implement Z39.50.

To me, given a record, a set of terms is extracted. These terms may
be stored in indexes to improve query performance. All terms have
metadata including an access point (and semantic qualifiers etc) and
term position information. Note that I am carefully using the word
'term', as the term could be strings that are words or complete values,
or an integer or float etc.

Terms are also extracted from queries. A query string may for example
contain three word-terms. A Z39.50 query-term is not a term to me -
it's the input to a process that extracts terms from it. Attributes
supplied with the query identify how to extract terms from the query.

I will use the name 'term-point' to refer to a combination of an
access point, semantic qualifiers etc, plus the term extraction rules
used to extract terms.

Just to clarify, 'author' might be an access point, but if I want to
support searching on access=author+semantic-qualifier=person and
access=author+semantic-qualifier=institution, then I would have
two term-points. A query that specified only access=author I would
treat as an OR query on both term-points.
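
As a sketch of the author example above, in Python: the class, the
field names, and the small TERM_POINTS table are all hypothetical, chosen
only to show how a query that leaves out the semantic qualifier fans out
to several term-points.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermPoint:
    access_point: str
    semantic_qualifier: Optional[str]
    extraction_rule: str  # e.g. "word" or "complete-value"


# Hypothetical configuration of one server.
TERM_POINTS = [
    TermPoint("author", "person", "word"),
    TermPoint("author", "institution", "word"),
    TermPoint("title", None, "word"),
]


def select_term_points(access_point, semantic_qualifier=None):
    """Return every term-point the query applies to; the caller ORs them."""
    return [
        tp for tp in TERM_POINTS
        if tp.access_point == access_point
        and (semantic_qualifier is None
             or tp.semantic_qualifier == semantic_qualifier)
    ]


print(len(select_term_points("author")))            # 2
print(len(select_term_points("author", "person")))  # 1
```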

Next, terms extracted from records and queries can be transformed.
Example transformations include mapping to upper case and stemming.
This is not mandatory, but is an optional step (expanded upon below).

Comparison of records to queries then proceeds by comparing the terms
extracted from records and from queries. The term-points of the terms
must be the same for a match (eg: both must be 'title' and a 'word').
Other query attributes can specify the rules for comparing the terms
(case sensitivity, truncation, greater-than, adjacency/all/any).

Scanning returns terms extracted from records. Display terms are terms
before any transformations, the actual term is after any transformations
have been done. A term-list is therefore all the terms extracted from
a term-point. The attribute list for a scan identifies which term-point
to use.
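
The scan behaviour described here could be sketched in Python as below.
This assumes one term-point ('title' as words), an upper-casing
transformation, and a plain dictionary as the term list; all names are
illustrative, not part of any Z39.50 toolkit.

```python
import re


def extract_words(value):
    """Assumed extraction rule: runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", value)


def transform(term):
    """Assumed transformation: map to upper case."""
    return term.upper()


def build_term_list(values):
    """Map each actual (transformed) term to the display terms it came from."""
    term_list = {}
    for value in values:
        for display in extract_words(value):
            term_list.setdefault(transform(display), set()).add(display)
    return term_list


def scan(term_list, start, count):
    """Return up to `count` (actual term, display terms) pairs from `start`."""
    actual = sorted(t for t in term_list if t >= transform(start))
    return [(t, sorted(term_list[t])) for t in actual[:count]]


titles = ["The Art of War", "War and Peace", "the art of computer programming"]
print(scan(build_term_list(titles), "the", 2))
# [('THE', ['The', 'the']), ('WAR', ['War'])]
```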

I would leave it up to an implementation to choose what attributes bind
to term extraction rules, transformation rules, and comparison rules.
For example, term extraction may return terms mapped to upper case,
or a transformation rule may map the term to upper case, or the comparison
function may be case insensitive. The three choices affect what you see
in the record versus what you see as the display and actual term
when scanning. A flexible system would allow you to change these rules
per term-point.

Once this model was in place, I would then go through the various attributes
and define them with respect to terms (not words or complete values).
For example, space normalization is defined on terms - if the terms
were words, there would not be any spaces in the terms so it can be
ignored. AllWords, AnyWords, AdjacentWords would become AllTerms, AnyTerms,
AdjacentTerms. Truncation at character and word boundaries would also
be on terms.
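
A rough Python sketch of operators defined on terms rather than words.
It assumes a record's terms for one term-point are held as an ordered
list, so that list position stands in for term position; the function
names simply follow the AllTerms/AnyTerms/AdjacentTerms naming above.

```python
def all_terms(record_terms, query_terms):
    """Every query term occurs somewhere among the record's terms."""
    return all(t in record_terms for t in query_terms)


def any_terms(record_terms, query_terms):
    """At least one query term occurs among the record's terms."""
    return any(t in record_terms for t in query_terms)


def adjacent_terms(record_terms, query_terms):
    """The query terms occur consecutively and in order."""
    n = len(query_terms)
    return n > 0 and any(
        record_terms[i:i + n] == query_terms
        for i in range(len(record_terms) - n + 1)
    )


words = ["the", "art", "of", "war"]   # word extraction
complete = ["the art of war"]         # complete-value extraction

print(adjacent_terms(words, ["art", "of"]))     # True
print(adjacent_terms(words, ["of", "art"]))     # False
print(all_terms(words, ["of", "art"]))          # True
print(any_terms(complete, ["art"]))             # False
print(all_terms(complete, ["the art of war"]))  # True
```

The same operators work unchanged whichever extraction rule produced
the terms; only the term lists differ.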

I would also introduce new attribute values to identify the term extraction
rules. Examples may include
* Server choice of how to extract words from input string.
* Server choice of how to extract the complete value of the input string.
* Input string after normalizing all punctuation and whitespace into spaces,
  trimming leading and trailing whitespace, and compressing internal white
  space into a single space.
* Break into words where a word is a continuous sequence of letters and
  digits.

The last two are possible examples of precise term extraction rules
that could be defined. This would allow a query to specify *exactly*
how to extract terms. I can imagine lots of vendors then wanting their
variations included. The first two would be the more common cases
of letting a server choose, but still being able to choose between
words or complete values.
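
The two precise rules could look something like this in Python. This is
only a sketch: it treats "punctuation" as ASCII punctuation and "letters
and digits" as ASCII letters and digits, which is an assumption a real
definition would have to settle, and the function names are invented.

```python
import re
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def normalized_complete_value(text):
    """Punctuation and whitespace become spaces; trim; collapse runs to one."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())


def letter_digit_words(text):
    """A word is a continuous sequence of (ASCII) letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


s = "  The Art of War:  book-case! "
print(normalized_complete_value(s))  # The Art of War book case
print(letter_digit_words(s))         # ['The', 'Art', 'of', 'War', 'book', 'case']
```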



Alan

-- 
Alan Kent (mailto:Alan.Kent@teratext.com.au, http://www.mds.rmit.edu.au/~ajk/)
Project: TeraText Technical Director (http://teratext.com.au) InQuirion Pty Ltd
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC Australia.
Phone: +61 3 9925 4114  Reception: +61 3 9925 4099  Fax: +61 3 9925 4098
Received on Wednesday, 2 July 2003 23:30:49 UTC