Re: Limiting a Search by URL

On Fri, 24 Jan 2003, Chris Peterson/Amigos wrote:

> First of all, I apologize for starting a discussion and then not getting
> back to any of you.  Unfortunately, another project came up and it took the
> higher priority.  But I wanted to thank all of you for your thoughts on how
> to limit a search by part or all of a URL.  I've gone through all the email
> I received and I'll try to pull together what I think I know and what I
> know I don't -- I'm hoping ya'll will let me know where I'm off base.
> 
> First of all, a recap.  Here's what we want to do:  create a limiter so
> that the search term might be "nasa.gov" and we would receive hits that
> include both "http://jpl.nasa.gov" and "http://www.nasa.gov/missions.html."
> 
> I had a couple of people tell me that their ILS already does this by using
> an unanchored phrase search.  I believe the attribute combination for that
> would be:
> Use: 1032 (Doc-id)
> Relation: 3 (equal)
> Position: 3 (any position in field)
> Structure: 1 (phrase)
> Truncation: 100 (Do not truncate)
> Completeness: 1 (Incomplete subfield)
> 
> So, assuming the search term "nasa.gov," wouldn't you have to use
> truncation in order to get either of the results above?  

No, see below.

>                                                       I'm assuming that
> "nasa.gov" is a phrase that is being searched.  Would I receive the match
> "www.nasa.gov?"

Yes, see below.


> Others said they use the word structure.  This seems to me to be a good
> alternative.  As Ralph pointed out, if you searched "nasa.gov," chances
> would be pretty high for getting a NASA web site.  Even if I had gotten
> some of the sub-domains in the wrong order, I probably still would bring it
> up.
> 
> I spent a lot of time reading the message from Alan Kent.  I agree with
> everything he said, but there is one thing that I didn't make clear.
> Although I think most search terms used will be part of the domain name, I
> don't want to limit it to that.  Using "http://www.nasa.gov/missions.html,"
> I'd want to obtain that page using a search such as "nasa.gov/missions."
> Using the word structure above, I could do this.
> 
> Please let me know if I've understood the conversation and if I'm missing
> anything.  Again, thanks for your help!
> 
> Christine Peterson
> Library Liaison Officer, Amigos Library Services
> 14400 Midway Road, Dallas, TX  75244-3509
> 800/843-8482 x191 (message only)
> 512/671-1580 (phone and fax)
> EMAIL:  peterson@amigos.org

Christine,
If an unanchored phrase search is used, it's clear to me how the search
term would be transmitted by the Z-client and processed by the Z-server
(i.e., "nasa.gov/missions" would be transmitted, and it would be
processed as ordered, adjacent words).  How would a Z-client that behaves
properly with regard to the Bath Profile transmit a "word" search term
like "nasa.gov/missions"?  Isn't that really three words to most systems?
Does the client transmit the search as three operands (i.e., "nasa AND gov
AND missions"), or does it send one search term and let the target
server decide how to normalize it?  In either case, a word search
would be processed as "nasa AND gov AND missions".  Order and adjacency
would not be required.  Is that okay?
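
To make the three transmission options concrete, here is a rough sketch
that writes each one out as a query string.  The notation is PQF (the
prefix query format used by the YAZ toolkit), which is my choice for
illustration and not something the Bath Profile prescribes.  The function
names are hypothetical, the word-splitting rule (break on anything that is
not a letter or digit) is an assumption about how a client might
normalize, and the non-Use attribute values simply repeat the combination
quoted above with Structure switched between 1 (phrase) and 2 (word).

```python
import re

# Bib-1 attribute types: 1=Use, 2=Relation, 3=Position,
# 4=Structure, 5=Truncation, 6=Completeness.
def attrs(use, structure):
    """Attribute list: equal, any position, no truncation, incomplete subfield."""
    return (f"@attr 1={use} @attr 2=3 @attr 3=3 "
            f"@attr 4={structure} @attr 5=100 @attr 6=1")

def phrase_query(term, use=1032):
    """Option 1: one operand, Structure 1 (phrase)."""
    return f'{attrs(use, 1)} "{term}"'

def word_query_single(term, use=1032):
    """Option 2: one operand, Structure 2 (word); the server normalizes it."""
    return f'{attrs(use, 2)} "{term}"'

def word_query_split(term, use=1032):
    """Option 3: the client splits the term and ANDs the words itself."""
    # Assumed normalization: split on any run of non-alphanumeric characters.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", term) if w]
    operands = [f"{attrs(use, 2)} {w}" for w in words]
    # PQF is prefix notation: n operands need n-1 leading @and operators.
    return " ".join(["@and"] * (len(operands) - 1) + operands)

if __name__ == "__main__":
    for build in (phrase_query, word_query_single, word_query_split):
        print(build("nasa.gov/missions"))
```

Passing use=1209 to any of the three functions produces the same query
against the other Use attribute.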

Substituting "nasa.gov/library" for your search term example, our 
system returns the same record if either an unanchored phrase search
is used or if a word search of "nasa AND gov AND library" is used.
FYI, the URL in the record happens to be  
"www.aero-space.nasa.gov/library/enterprise.htm" 
Both searches retrieve the same record in *this* case.

(FYI, an unanchored phrase search of "nasa.gov" or a word search of
"nasa AND gov" also matches URLs in the same four records in our system:
       grin.hq.nasa.gov/
       www-sisn.jpl.nasa.gov/
       grin.hq.nasa.gov/
       www.aero-space.nasa.gov/library/enterprise.htm )

So, to recap, I think searches using either structure attribute would be
workable, but I prefer the precision of the unanchored phrase search.
With a phrase search, a term like "nasa.gov/library" would only match
URLs in which those "words" appear together and in that order.
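
The difference in precision can be sketched in a few lines of Python.
This is only an illustration of the two matching rules, not how our
server (or any particular server) is implemented.  The function names and
the normalization (lowercase, split on non-alphanumeric characters) are
assumptions, and the last URL in the usage example is a made-up one chosen
to show a case where the two searches disagree.

```python
import re

def words(text):
    """Assumed normalization: lowercase, split on non-alphanumeric runs."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]

def word_match(term, url):
    """Word search: every word of the term occurs somewhere in the URL."""
    url_words = set(words(url))
    return all(w in url_words for w in words(term))

def phrase_match(term, url):
    """Unanchored phrase: the term's words occur adjacent and in order,
    starting at any position in the URL."""
    t, u = words(term), words(url)
    return bool(t) and any(
        u[i:i + len(t)] == t for i in range(len(u) - len(t) + 1)
    )

if __name__ == "__main__":
    term = "nasa.gov/library"
    urls = [
        "www.aero-space.nasa.gov/library/enterprise.htm",  # from our records
        "gov.example.org/library/nasa.html",               # hypothetical
    ]
    for url in urls:
        print(url, word_match(term, url), phrase_match(term, url))
```

Both rules accept the first URL.  The second contains all three words but
not together or in order, so only the word search accepts it.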


Are we all okay with using Use attribute "1032"?  When I responded 
to your earlier email, I didn't point out that there is another Use
attribute which may be a better choice (i.e., 1209).

    1032  (Doc-ID: An identifier or Doc-ID, assigned by a server,
           that uniquely identifies a document on that server. May 
           or may not be persistent.  May be, for example, a URL.)

    1209  (Identifier - URN:  Uniform Resource Name)

        (Our server happens to support both attributes, but we map to
         different internal searches: the first only looks at 856 fields,
         the second includes other fields which might also contain a URN.)  

I don't care which Use attribute the library community wants to use, but
the semantics for "1209" might be a better "fit".

Larry

------------------------------------------------------------
Larry E. Dixson                    Internet:    ldix@loc.gov
Network Development and MARC
   Standards Office, LM639
Library of Congress                Telephone: (202) 707-5807
Washington, D.C.  20540-4402       Fax:       (202) 707-0115

Received on Monday, 27 January 2003 11:39:51 UTC