Re: more on attribute proposal from Mike Taylor on 2003-07-21 (www-zig@w3.org from July 2003)

From: Mike Taylor <mike@indexdata.com>
Date: Mon, 21 Jul 2003 10:08:49 +0100
To: www-zig@w3.org
Message-Id: <E19eWf3-0005C8-00@auntie.miketaylor.org.uk>
Just want to say I agree 100% with all Alan's said here.  We need the
AA to provide total orthogonality.

 _/|_	 _______________________________________________________________
/o ) \/  Mike Taylor  <mike@indexdata.com>  http://www.miketaylor.org.uk
)_v__/\  "No pearl grows without a grain of irritation at its heart.
	 The trick is to grow a pearl and not an ulcer" -- Neil Peart.

--
Listen to my wife's new CD of kids' music, _Child's Play_, at
	http://www.pipedreaming.org.uk/childsplay/



> Date: Mon, 21 Jul 2003 10:06:35 +1000
> From: Alan Kent <ajk@mds.rmit.edu.au>
> X-Archived-At: http://www.w3.org/mid/20030721100635.B10849@io.mds.rmit.edu.au
> 
> 
> Hi Ray,
> 
> I have replied to your email below.
> 
> Sorry to be a broken record, but I think it's *critical* to get scanning
> to work. Once scanning is fixed, the query stuff falls out in the wash.
> The attribute architecture is not only for querying - it's for everything
> that uses attributes. This includes scanning.
> 
> If you step away from doing searches for a moment, and just look at
> scanning indexes, you immediately and clearly hit the problem (in my
> opinion anyway! ;-). If I have term-lists that contain words from titles
> and the complete values of titles, then how do I express an attribute
> list for scanning?
> 
> If I read the textual descriptions of the various attribute types, then
> format/structure sounds ideal. The Bib-2 attribute values make complete
> sense for scanning. It's the Util attribute values that are strange.
> Why specify 'any of these words' when doing a scan to identify that
> I want the title as a scan? It's semantically wrong. You want to identify
> the fact that you want words independently of the search-oriented operator
> that says how to handle multiple terms in a search.
> 
> Once you reach this point, you realise the descriptions for the attribute
> types are good. The overall architecture is good. It's just that the utility
> attribute set has not defined words vs strings, and that some comparison
> operators (any/all/adj) have slipped into format/structure by mistake.
> 
> So I strongly recommend forgetting searches for a moment, and thinking
> about SCAN requests. What are the attribute lists for scanning title
> as keywords and title as complete values?
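
One way to picture the question is a toy model in Python. This is only a
sketch under loud assumptions: ScanAttributes, build_term_list and the
values "word" and "string" are hypothetical stand-ins for the proposed
format/structure values, not the real Z39.50 encoding, and whitespace
splitting stands in for whatever word rules a server actually uses.

```python
from typing import NamedTuple


class ScanAttributes(NamedTuple):
    """Hypothetical, simplified attribute list for a scan request."""
    access_point: str      # e.g. "title"
    format_structure: str  # "word" or "string" (assumed names)


def build_term_list(titles, attrs):
    """Return the sorted term list that a scan with these attributes walks."""
    if attrs.format_structure == "word":
        # One entry per word found in any title.
        terms = {word.lower() for title in titles for word in title.split()}
    elif attrs.format_structure == "string":
        # One entry per complete title value.
        terms = {title.lower() for title in titles}
    else:
        raise ValueError(f"unknown format/structure: {attrs.format_structure}")
    return sorted(terms)


titles = ["Moby Dick", "Dick Tracy"]
words = build_term_list(titles, ScanAttributes("title", "word"))
strings = build_term_list(titles, ScanAttributes("title", "string"))
```

The two scans differ only in the format/structure value. No any/all/adj
operator is needed to say which term list is wanted.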
> 
> 
> 
> On Fri, Jul 18, 2003 at 05:46:57PM -0400, Ray Denenberg wrote:
> > There's consensus (among those who have participated in this
> > discussion) that allTheseWords, anyOfTheseWords, adjacentWords should
> > be changed from Structure/format  to Comparison attributes.
> > 
> > There's less consensus about adding two new Structure/format
> > attributes, (1) word(s), and (2) string (or 'completeValue').  Mike
> > feels strongly that they should be added, and I don't feel strongly but
> > am somewhat uncomfortable about adding them (without clarifying certain
> > other parts of the proposal). I don't know how strongly Alan feels.  And
> > I'd like to get other opinions.
> 
> Actually, I think the discussion has been the opposite. I think there is
> strong consensus that word and string should be in format/structure.
> This is because they should be talking about the format or interpretation
> of the structure of the value supplied. This is ideal for doing index
> scanning too as the attribute is also for what is returned by a SCAN
> request - it describes the format/structure of the returned scan terms.
> It is not purely a query attribute.
> 
> As a *result* of this consensus, it was realised and agreed that the
> all/any/adj words should move out - they are in the wrong spot. Comparison
> is a more correct place. It makes sense with scanning too. Comparisons are
> query operators, not scanning stuff (this is a little hand-wavy here, which
> is always dangerous as I know I can come up with example applications where
> this is not true).
> 
> But I am happy to get other people's opinions. I think I have finally
> managed to express what I meant clearly so that Mike and Rob understand
> and agree with what I am saying (they may have actually reached where
> I am at before me as a result of the CQL work).
> 
> > This is how I see it: if the query term is a set of words, and the
> > comparison attribute is one of the above three, then clearly a
> > structure/format attribute to indicate  "words" is not necessary.
> 
> To keep arguments simpler, I have tried to avoid the different
> term extraction rules side of things. But I want to support different
> definitions of what a 'word' is. To me, allWords, anyWords etc really
> should be allTerms, anyTerms, etc. They define what to do if there
> are multiple terms extracted from the query. The terms can be words.
> But the terms could also be floating point numbers defining a line
> segment, or coordinate pairs, or special things in chemical formulas etc.
> I think it's better if comparison operators, whenever possible,
> define how to *compare* values, not how to extract values to be
> compared. Orthogonality is good. It allows new term extraction rules
> to be added orthogonally.
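
The separation argued for above can be sketched in Python. Everything
here is an assumption made for illustration: the extract_* functions
play the role of format/structure, the *_terms functions play the role
of comparison, and none of the names come from the attribute sets
themselves.

```python
import re


# Term extraction rules (the format/structure side).
def extract_words(value):
    return re.findall(r"\w+", value.lower())


def extract_string(value):
    return [value.lower()]


def extract_numbers(value):
    return [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", value)]


# Comparison operators: they only combine terms that are already extracted.
def all_terms(query_terms, record_terms):
    return all(term in record_terms for term in query_terms)


def any_terms(query_terms, record_terms):
    return any(term in record_terms for term in query_terms)


def adjacent_terms(query_terms, record_terms):
    # True if the query terms occur as one contiguous run in the record.
    n = len(query_terms)
    return any(record_terms[i:i + n] == query_terms
               for i in range(len(record_terms) - n + 1))


def matches(query, record, extract, compare):
    """Apply one extraction rule to both sides, then one comparison."""
    return compare(extract(query), extract(record))
```

Any extraction rule can be paired with any comparison. For instance,
matches("1.5 2", "0 1.5 2.0 9", extract_numbers, adjacent_terms)
applies the adjacency test to numbers rather than words, and a new
extraction rule can be added without touching the comparisons.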
> 
> For example, Bib-2 already defines additional format/structure attributes.
> So it's not a possibility, it's a current reality. It's not just words
> and strings we are talking about - it's the ability to define multiple
> ways to structure terms extracted from records (and queries), and
> then keep this independent of comparison operators.
> 
> I would love to change any/all/adj from 'words' to 'terms' in
> general. I think they will make sense when people define other
> concepts of how to extract terms from record content. However, this
> seemed too hard for people to swallow, so I backed off to try to
> get at least the major problem fixed.
> 
> > Conversely, if the desire is to search for words (as opposed to a
> > complete string) then can the comparison attribute be anything but one
> > of these three? 
> 
> Probably not. However, I believe a goal of the AA is to be extensible,
> and I can see cases where different projects may want different concepts
> of what a 'word' is. I am thinking more of chemical formulas, geographic
> coordinates, other rich and complex data types etc.
> 
> This would be done by defining a new 'chemical' attribute set with a set
> of chemistry-specific access point names and new format/structure attributes
> related to chemical formulas. (Note: I know almost nothing about chemistry.
> I am using it as an example only.)
> 
> > However, what if the term is a single word? If the intent is to
> > search for it as a word (not a string), I don't think Alan's proposal
> > addresses whether this should fit within the three attributes proposed
> > - all three would mean the same thing, and so there may be  sentiment
> > for separating out the single-word case. If so, then I can see a
> > stronger argument for having  'word' and 'string' format/structure
> > values.
> 
> I think what you are saying is a good example of why comparison operators
> should not be used to define what terms are. It's a good example of one
> of the many little nasty side effects that come up. That is why I
> strongly believe any/all/adj words should not imply term structure.
> 
> > So I see two possibilities:
> >
> > 1.  A single-word search would be handled by one of the
> > word-comparison attributes (one of these would be "singled-out" for
> > this use),  no format/structure attribute included. If the term is a
> > single-word but is to be searched as a string, then another comparison
> > would be used. [aside: I'm not sure which one though. "Equal" seems to
> > be precluded, since the Utility set prose says that it cannot be used
> > with expansion/interpretation. On the other hand, Bath uses it.  This
> > may be another defect that we should address.]
> >
> > 2. When the term is a single-word, the comparison attribute may not
> > be one of the above three (they can only be used for multiple words)
> > and the format/structure 'word' or 'string' is supplied.
> > 
> > I think we need to nail down one of these two, and I don't really care which. 
> 
> I think the above assumes there is only one definition of what a 'word'
> is, and I think the goal of the AA is to be a framework for expansion,
> not restrictive. So I don't think you can ever preclude using a
> format/structure attribute in a query. I (personally) think it makes
> sense leaving format/structure open to identify different ways to
> extract multiple terms from a record (even different term extraction
> rules).
> 
> I understand where you are coming from, but I don't think either (1) or (2)
> above should be mandated. The problem is both options assume the client
> *knows* whether the search term contains one or more words. But is
> 'book-case' one word or two? Clients do not know the word extraction
> rules used by a server (there is no formal agreed interpretation of
> what a word is anywhere), so clients cannot know if a search string
> entered by a user is a single word or not.
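
The 'book-case' problem is easy to reproduce. The sketch below, in
Python, assumes two made-up word extraction rules. Neither is taken from
any real server, which is exactly why a client cannot predict the
answer.

```python
import re


def split_on_whitespace(text):
    # Assumed rule A: only whitespace separates words.
    return text.split()


def split_on_non_alphanumerics(text):
    # Assumed rule B: any character outside A-Z, a-z, 0-9 separates words.
    return re.findall(r"[A-Za-z0-9]+", text)


rule_a = split_on_whitespace("book-case")
rule_b = split_on_non_alphanumerics("book-case")
```

Under rule A the query is one word, under rule B it is two, so a client
that must pick attributes based on the word count has to guess.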
> 
> So I think
> * Clients must be allowed to send all/any/adj for single or multiple
>   word queries.
> * Clients must always be allowed to send a format/structure attribute.
>   If omitted, it's the server's choice what to do.
> 
> I don't think it's necessary to define the preferred way to do single
> word queries as distinct from multi-word queries - as it implies the
> client has to understand how to extract words from strings using
> the same rules as a server. If this is considered important, then
> I would look at adding a new comparison operator to the any/all/adj list
> of 'exactly one', which aborts the query with an error if there is not
> exactly one word (term) supplied. The responsibility is then given to
> the server rather than being on the client.
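
A minimal Python sketch of how such an 'exactly one' operator might
behave on the server side. The names exactly_one, NotExactlyOneTerm and
server_words are hypothetical, and the hyphen-splitting rule is just an
assumed example of a server's own word definition.

```python
import re


class NotExactlyOneTerm(Exception):
    """Raised when a query does not yield exactly one term."""


def exactly_one(query, extract):
    """Return the single extracted term, or abort the query with an error."""
    terms = extract(query)
    if len(terms) != 1:
        raise NotExactlyOneTerm(
            f"expected exactly one term, got {len(terms)}")
    return terms[0]


def server_words(text):
    # Assumed server rule: anything other than a-z or 0-9 separates words.
    return re.findall(r"[a-z0-9]+", text.lower())


single = exactly_one("bookcase", server_words)
try:
    exactly_one("book-case", server_words)
    rejected = False
except NotExactlyOneTerm:
    rejected = True
```

The client sends the string unchanged in both cases. Only the server,
applying its own extraction rule, decides whether the query is
acceptable.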
> 
> Alan
>
Received on Monday, 21 July 2003 05:38:34 UTC