Re: Attribute Architecture -- new type?

On Mon, Aug 18, 2003 at 12:16:14PM +0100, Robert Sanderson wrote:
> >     access=title
> >     comparison=any
> >     format=string
> >     Term=child's book-case 
> > Does the second mean the title must equal 'child's' or 'book-case'?
> 
> Exactly my point! :)  We don't know if the client meant two strings or 
> one. Thus it needs to say what it meant somehow.

Yes, I thought that was exactly the problem we were trying to address
with a precise spec. We know there is a problem - do you have a
concrete proposed solution we can put into the spec?

What I am trying to work out in my mind is what are the rules and
when are they invoked. I believe this has to be clearly expressed
in the spec.

Being picky to highlight the point (not purposely trying to be obnoxious),
examples I have seen in recent mail are:

* If date/time then 1/2/3 12:34:56 2/3/4 12:34:56 is two values
* If there are quotes around strings, treat each quoted string as
  a separate value (implying you have to release quotes in strings)
* If you specify multiple terms for format string, then the system
  should "work out" what the terms are (Does this mean child's book-case
  is 2 terms ("child's" and "book-case") if '2' is specified and 3 terms
  ("child's" "book" "case") if '3' is specified?)

(I will admit the last point I have purposely pushed what you said
beyond what I suspect you intended. My point is simply that I would
like to know the *precise* rules to use, whatever they are, otherwise
people can bend and twist them in ways not intended.)

The above seems to imply each 'format' value has different rules for
extracting multiple terms from the Term=... value (which is fine).
But further, the parsing rules change depending on what occurrence value
is specified? (format=string + occurrence=single means grab whole
string, but format=string + occurrence=multiple means split on white
space, but don't split quoted strings and release quotes in strings?
Or is it the presence of any/all/adj that indicates multiplicity?)

I like trying to keep things orthogonal in the attribute types as
much as possible. It seems like the concept of pulling out multiple
terms from a single query string is a query-only concept - it is
not relevant when scanning an index, for example. Should the new
attribute therefore specify not only that there are multiple terms,
but also how to pull them out of the supplied query string?
(The default value for different formats would be different.)

    access=title
    format=string
    parse-query-term=single-value    (the default for format=string)
    comparison=equal                 (actually, it's irrelevant)
    Term=The Fall of the Roman Empire

    access=title
    format=string
    parse-query-term=space-separated-quoted-strings
    comparison=any
    Term="The Fall of the Roman Empire" "Batman forever" Jaws

    access=title
    format=word
    parse-query-term=word-boundaries   (the default for format=word)
    comparison=adj
    Term=XML Schema

    access=title
    format=date/time
    parse-query-term=space-separated-date/time-values    (the default)
    comparison=all
    Term=1/2/3 12:34:56 2/3/4 12:34:56

The idea is that the parse-query-term attribute type alone identifies
how to parse the supplied Term=... into multiple terms. If this attribute
is not specified, its default is determined by the format=... value.
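
To make the defaulting rule concrete, here is a rough sketch in Python.
It is illustrative only: the function and table names are hypothetical,
and the table contains just the three defaults shown in the examples
above, not a complete list.

```python
# Hypothetical defaults, taken only from the examples in this message.
DEFAULT_PARSE_RULE = {
    "string": "single-value",
    "word": "word-boundaries",
    "date/time": "space-separated-date/time-values",
}


def resolve_parse_rule(attrs):
    """Return the parse-query-term value for a set of attributes.

    An explicit parse-query-term always wins; otherwise the default
    is looked up from the format=... value.
    """
    explicit = attrs.get("parse-query-term")
    if explicit is not None:
        return explicit
    return DEFAULT_PARSE_RULE[attrs["format"]]
```

So format=string with no parse-query-term resolves to single-value,
while an explicit parse-query-term overrides the format default.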

Further, comparisons of any/all/adj should only be used with parsing
rules that can return more than one value. Comparisons of equal/greater/...
should only be used with parsing rules that return exactly one value.
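
The consistency rule between comparison and parsing rule could be
checked mechanically. A minimal sketch in Python, with hypothetical
names; it assumes that every comparison other than any/all/adj
(equal, greater, and so on) expects exactly one value:

```python
# Parsing rules that can return more than one value (from the examples).
MULTI_VALUE_RULES = {
    "space-separated-quoted-strings",
    "word-boundaries",
    "space-separated-date/time-values",
}
# Parsing rules that return exactly one value.
SINGLE_VALUE_RULES = {"single-value"}
# Comparisons that operate on a list of terms.
MULTI_TERM_COMPARISONS = {"any", "all", "adj"}


def is_consistent(comparison, parse_rule):
    """Check that the comparison matches the arity of the parsing rule.

    Assumption: any comparison not in MULTI_TERM_COMPARISONS is treated
    as a single-value comparison (equal, greater, ...).
    """
    if comparison in MULTI_TERM_COMPARISONS:
        return parse_rule in MULTI_VALUE_RULES
    return parse_rule in SINGLE_VALUE_RULES
```

A server could reject (or diagnose) a query for which this check fails
before attempting to parse the term at all.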

Candidate values for the parse-query-term attribute type:

    single-value
        Takes input string verbatim.
        Returns exactly one term.

    space-separated-quoted-strings
        Finds all the quoted strings. Any non-quoted text is separated
        on whitespace boundaries.
        Can return multiple terms.

    word-boundaries
        Parses as words using the same word parsing rules as the access
        point uses. Can return multiple terms.
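
To show how precise these rules would need to be, here is one possible
reading of the three candidates as Python. This is a sketch, not a
proposal for the spec text: the function names are hypothetical, quotes
inside quoted strings are deliberately not handled (that is still an
open question), and word-boundaries takes the access point's word
splitter as a parameter, defaulting to plain whitespace splitting as a
stand-in.

```python
import re


def single_value(term):
    """Take the input string verbatim; always exactly one term."""
    return [term]


# A double-quoted run, or else a run of non-whitespace characters.
_QUOTED_OR_BARE = re.compile(r'"([^"]*)"|(\S+)')


def space_separated_quoted_strings(term):
    """Return each quoted string as one term; split the rest on whitespace.

    Assumption: no escaping of quotes inside quoted strings.
    """
    terms = []
    for match in _QUOTED_OR_BARE.finditer(term):
        quoted, bare = match.group(1), match.group(2)
        terms.append(quoted if quoted is not None else bare)
    return terms


def word_boundaries(term, split_words=str.split):
    """Split into words using the access point's own word rules.

    split_words is a placeholder for those rules; whitespace splitting
    is only a default for illustration.
    """
    return split_words(term)
```

With these definitions, Term=child's book-case is one term under
single-value and two terms ("child's" and "book-case") under
space-separated-quoted-strings, which is exactly the distinction the
client needs to be able to state.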

I am not saying we should do the above, but it is an option. Previously
I had been suggesting the format=... value alone should specify how
to identify multiple terms from the query string. If that is the case,
then I would not allow multiple quoted strings with format=string.
format=string means grab the whole value. If people think there is
an advantage in being able to have multiple string values in a single
query term, then we could come up with a syntax - but I don't see
clearly how this would fit in with CQL etc. Quoted strings in quoted
strings?


But I am strongly of the opinion that the rules for breaking the query
string into multiple search terms should be clear in the spec. I don't
mind the system working out the terms from an occurrence count, if the
algorithm for doing so is included in the spec. If you still prefer
a null/single/multiple style attribute type, did you have a specific
algorithm in mind for extracting terms from query strings? Could you write
it down? There is a danger of wandering around discussing options too long.
My personal goal is to get an acceptable approach signed off on. Looking
at different options can help find a better approach, but only if the
proposal is pretty concrete (in my personal opinion).


Thanks!
Alan

Received on Monday, 18 August 2003 19:15:20 UTC