- From: Alan Kent <ajk@mds.rmit.edu.au>
- Date: Mon, 21 Jul 2003 10:06:35 +1000
- To: www-zig@w3.org
Hi Ray,

I have replied to your email below. Sorry to be a broken record, but I think it's *critical* to get scanning to work. Once scanning is fixed, the query stuff falls out in the wash. The attribute architecture is not only for querying - it's for everything that uses attributes. This includes scanning.

If you step away from doing searches for a moment, and just look at scanning indexes, you immediately and clearly hit the problem (in my opinion anyway! ;-). If I have term-lists that contain words from titles and the complete values of titles, then how do I express an attribute list for scanning?

If I read the textual descriptions of the various attribute types, then format/structure sounds ideal. The Bib-2 attribute values make complete sense for scanning. It's the Util attribute values that are strange. Why should I specify 'any of these words' in a scan just to say that I want to scan titles as words? It's semantically wrong. You want to identify the fact that you want words independently of the search-oriented operator that says how to handle multiple terms in a search.

Once you reach this point, you realise the descriptions for the attribute types are good. The overall architecture is good. It's just that the utility attribute set has not defined words vs strings, and that some comparison operators (any/all/adj) have slipped into format/structure by mistake.

So I strongly recommend forgetting searches for a moment, and thinking about SCAN requests. What are the attribute lists for scanning title as keywords and title as complete values?

On Fri, Jul 18, 2003 at 05:46:57PM -0400, Ray Denenberg wrote:
> There's consensus (among those who have participated in this
> discussion) that allTheseWords, anyOfTheseWords, adjacentWords should
> be changed from Structure/format to Comparison attributes.
>
> There's less consensus about adding two new Structure/format
> attributes, (1) word(s), and (2) string (or 'completeValue').
> Mike feels strongly that they should be added, and I don't feel
> strongly but am somewhat uncomfortable about adding them (without
> clarifying certain other parts of the proposal). I don't know how
> strongly Alan feels. And I'd like to get other opinions.

Actually, I think the discussion has been the opposite. I think there is strong consensus that word and string should be in format/structure. This is because they talk about the format or interpretation of the structure of the value supplied. This is ideal for index scanning too, as the attribute also applies to what is returned by a SCAN request - it describes the format/structure of the returned scan terms. It is not purely a query attribute.

As a *result* of this consensus, it was realised and agreed that the all/any/adj words attributes should move out - they are in the wrong spot. Comparison is a more correct place. It makes sense with scanning too. Comparisons are query operators, not scanning stuff (this is a little hand-wavy, which is always dangerous as I know I can come up with example applications where this is not true).

But I am happy to get other people's opinions. I think I have finally managed to express what I meant clearly so that Mike and Rob understand and agree with what I am saying (they may have actually reached where I am at before me as a result of the CQL work).

> This is how I see it: if the query term is a set of words, and the
> comparison attribute is one of the above three, then clearly a
> structure/attribute to indicate "words" is not necessary.

To keep arguments simpler, I have tried to avoid the different term extraction rules side of things. But I want to support different definitions of what a 'word' is.

To me, allWords, anyWords etc really should be allTerms, anyTerms, etc. They define what to do if there are multiple terms extracted from the query. The terms can be words.
But the terms could also be floating point numbers defining a line segment, or coordinate pairs, or special things in chemical formulas etc.

I think it's better if comparison operators, whenever possible, define how to *compare* values, not how to extract the values to be compared. Orthogonality is good. It allows new term extraction rules to be added orthogonally. For example, Bib-2 already defines additional format/structure attributes. So it's not a possibility, it's a current reality. It's not just words and strings we are talking about - it's the ability to define multiple ways to structure terms extracted from records (and queries), and then keeping this independent of comparison operators.

I would love to change any/all/adj from 'words' to 'terms' in general. I think they will make sense when people define other concepts of how to extract terms from record content. However, this seemed too hard for people to swallow, so I backed off to trying to get at least the major problem fixed.

> Conversely, if the desire is to search for words (as opposed to a
> complete string) then can the comparison attribute be anything but one
> of these three?

Probably not. However, I believe a goal of the AA is to be extensible, and I can see cases where different projects may want different concepts of what a 'word' is. I am thinking more of chemical formulas, geographic coordinates, other rich and complex data types etc. This would be done by defining a new 'chemical' attribute set with a set of chemistry-specific access point names and new format/structure attributes related to chemical formulas. (Note: I know almost nothing about chemistry. I am using it as an example only.)

> However, what if the term is a single word?
> If the intent is to search for it as a word (not a string), I don't
> think Alan's proposal addresses whether this should fit within the
> three attributes proposed - all three would mean the same thing, and
> so there may be sentiment for separating out the single-word case.
> If so, then I can see a stronger argument for having 'word' and
> 'string' format/structure values.

I think what you are saying is a good example of why comparison operators should not be used to define what terms are. It's a good example of one of the many nasty little side effects that come up. That is why I strongly believe any/all/adj words should not imply term structure.

> So I see two possibilities:
>
> 1. A single-word search would be handled by one of the
> word-comparison attributes (one of these would be "singled-out" for
> this use), no format/structure attribute included. If the term is a
> single-word but is to be searched as a string, then another comparison
> would be used. [aside: I'm not sure which one though. "Equal" seems to
> be precluded, since the Utility set prose says that it cannot be used
> with expansion/interpretation. On the other hand, Bath uses it. This
> may be another defect that we should address.]
>
> 2. When the term is a single-word, the comparison attribute may not
> be one of the above three (they can only be used for multiple words)
> and the format/structure 'word' or 'string' is supplied.
>
> I think we need to nail down one of these two, and I don't really
> care which.

I think the above assumes there is only one definition of what a 'word' is, and I think the goal of the AA is to be a framework for expansion, not a restrictive one. So I don't think you can ever preclude using a format/structure attribute in a query. I (personally) think it makes sense to leave format/structure open to identify different ways to extract multiple terms from a record (even different term extraction rules).
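
To make the orthogonality argument concrete, here is a minimal sketch in Python. It is not Z39.50 code, and none of these names come from the attribute architecture or any attribute set: the extraction functions stand in for format/structure values, the comparison functions stand in for all/any/adj treated as 'terms' operators, and the word rules themselves are assumptions chosen only for illustration.

```python
import re
from typing import Callable, List

Extractor = Callable[[str], List[str]]
Comparison = Callable[[List[str], List[str]], bool]

# Hypothetical format/structure rules: how terms are extracted from a value.
def extract_string(value: str) -> List[str]:
    """'string' / complete value: the whole value is a single term."""
    return [value.strip().lower()]

def extract_words(value: str) -> List[str]:
    """'word' under one assumed rule: split on anything non-alphanumeric."""
    return re.findall(r"[a-z0-9]+", value.lower())

def extract_words_keep_hyphen(value: str) -> List[str]:
    """'word' under another assumed rule: hyphenated words stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", value.lower())

# Hypothetical comparison operators: what to do with multiple terms.
# They never look at how the terms were extracted.
def all_terms(query: List[str], record: List[str]) -> bool:
    return all(t in record for t in query)

def any_terms(query: List[str], record: List[str]) -> bool:
    return any(t in record for t in query)

def adjacent_terms(query: List[str], record: List[str]) -> bool:
    """True if the query terms occur in the record as a contiguous run."""
    n = len(query)
    return n > 0 and any(
        record[i:i + n] == query for i in range(len(record) - n + 1)
    )

def search(query: str, value: str,
           extract: Extractor, compare: Comparison) -> bool:
    """A search needs both an extraction rule and a comparison."""
    return compare(extract(query), extract(value))

def scan(values: List[str], extract: Extractor) -> List[str]:
    """A scan needs only the extraction rule; no comparison is involved."""
    return sorted({term for v in values for term in extract(v)})
```

With titles "The Book-Case" and "Case Studies", scan(titles, extract_words) lists the title keywords and scan(titles, extract_string) lists the complete titles, and neither call mentions any/all/adj. Swapping extract_words for extract_words_keep_hyphen changes whether 'book-case' is one term or two without touching the comparison functions.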
I understand where you are coming from, but I don't think either (1) or (2) above should be mandated. The problem is that both options assume the client *knows* whether the search term contains one or more words. But is 'book-case' one word or two? Clients do not know the word extraction rules used by a server (there is no formally agreed interpretation of what a word is anywhere), so clients cannot know whether a search string entered by a user is a single word or not. So I think:

* Clients must be allowed to send all/any/adj for single or multiple word queries.

* Clients must always be allowed to send a format/structure attribute. If omitted, it's the server's choice what to do.

I don't think it's necessary to define the preferred way to do single-word queries as distinct from multi-word queries - as it implies the client has to understand how to extract words from strings using the same rules as the server.

If this is considered important, then I would look at adding a new comparison operator, 'exactly one', to the any/all/adj list, which aborts the query with an error if there is not exactly one word (term) supplied. The responsibility is then given to the server rather than being on the client.

Alan
Received on Sunday, 20 July 2003 20:06:42 UTC