Re: What are the problems with IDML?

Joseph Lazio asked:
>  I'll echo Dan C.'s comments.  Thanks for defending this in public.

It's our pleasure, and thank you for your input.  We are truly interested
in getting the buy-in of this community.  Perhaps our goals (explained
below) will shed some light on why we offer IDML as a solution.

>  Let me ask a simple question:  What are you trying to do?

That is a great question!  I'm glad someone finally asked.

We are trying to solve the problem of resource discovery from the viewpoint
of both the user and the publisher.  It's hard (and getting harder) to find
things on the internet.  Big surprise.  ;-)  It's also hard to control how
you get found.

By "things" I have in mind two distinct types of objects: informational
objects (web pages, or roughly the domain of Library Science) and
commercial objects (products and services for sale, or roughly the domain
of E-Commerce).

Today's search engines help solve this problem in two ways: Automated
processing of unstructured text at the page level (e.g., AltaVista,
Excite), or Human categorization of content at the site level (e.g., Yahoo).

Yet searching for information using these methods is often frustrating.
Precision and recall are often ridiculously low, and there is little
recourse for refining a query other than trying different keywords.
Some search engines do offer advanced keyword queries, but A) even those
don't always work, and B) Boolean logic is hard to explain to the typical,
i.e., unsophisticated, user.

Searching for commercial products using these methods is even worse.  Try
shopping for "khakis" at AltaVista.  You do get some vendors that sell
khakis.  But you also get images from CK and stuff like "Subculturalism:
why the Revolution will always wear Khakis".  Bad enough for shoppers, but
if I'm a merchant trying to sell my stuff over the Net, what recourse do I
have?

The popular press has even picked up on the frustration with searching the
net [1-3].

So, if we agree it is hard to find things on the internet, then
what possible solutions could we come up with?

META tags can help here.  Yet as Dan points out, META is simply a container
that is not intended to provide any guidelines for what contents to use,
i.e., a data model.  So for example, Joseph said "he'd use" the following
to specify location:

> <META NAME="location"
>  CONTENT="Cornell University, Ithaca, NY, 14853-6801, USA">
>

However, I could also use this:

  <META NAME="location"
   CONTENT="Cornell U., Ithaca, New York, 14853, United States">

  or

  <META NAME="location"
   CONTENT="Cornell, Ithaca, N. York, 14853, United States of America">

  or even

  <META NAME="location" CONTENT="This is located at Cornell">

Until someone states a standard data model for specifying location, a
search engine could not possibly answer the question "Show me all web pages
about X and published in Ithaca, New York, USA".  IDML states a data model.
As Joseph correctly stated:

> Of course, there's the real question of whether user agents know what
> to do with META.

This is why we proposed IDML.  Yes, you could use META tags to instantiate
portions of the IDML data model, but the instantiation isn't as important
as having a data model geared to the requirements of searching and indexing
in a specific domain.  IDML simply bundles the two together.
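
To make the location example concrete, here is a minimal Python sketch. It
is not IDML syntax and not how any particular search engine works. The
matching function, the Location fields, and the page names are all
hypothetical, chosen only to show why free-form strings defeat a location
query while coded values do not.

```python
from typing import NamedTuple

# Free-form location strings, copied from the META examples above.
free_form = [
    "Cornell University, Ithaca, NY, 14853-6801, USA",
    "Cornell U., Ithaca, New York, 14853, United States",
    "Cornell, Ithaca, N. York, 14853, United States of America",
    "This is located at Cornell",
]

def naive_match(content: str, query: str) -> bool:
    """True only if every comma-separated query term appears verbatim."""
    terms = [t.strip().lower() for t in query.split(",")]
    return all(t in content.lower() for t in terms)

# All four strings describe the same place, yet none matches the query.
free_form_hits = [c for c in free_form
                  if naive_match(c, "Ithaca, New York, USA")]

class Location(NamedTuple):
    """Hypothetical structured location (field names are assumptions)."""
    country: str  # ISO 3166 alpha-2 country code
    region: str   # state/province code from an agreed list
    city: str

# Hypothetical pages whose publishers used the agreed codes.
pages = {
    "page-a": Location("US", "NY", "Ithaca"),
    "page-b": Location("US", "NY", "Ithaca"),
    "page-c": Location("US", "CA", "Palo Alto"),
}

wanted = Location("US", "NY", "Ithaca")
structured_hits = sorted(name for name, loc in pages.items() if loc == wanted)

print(free_form_hits)    # []
print(structured_hits)   # ['page-a', 'page-b']
```

The point is not the code itself: once publishers and indexers agree on the
values, the location query reduces to a simple equality test.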

So, IDML defines values for the following for web pages (ID-INFO):

        Language - using ISO 639
        Location - using ISO 3166 and the CIA World Factbook, since there
                   is no standard (yet) on states/provinces/sub-divisions
        Subject -  Our stab at a simple, compact general subject ontology
        Keywords - same as META keywords
        Description - same as META description

and the following for products (ID-PRODUCT):

        Department - our stab at a taxonomy of consumer products
        Currency - using ISO 4217
        Price    - floating point numbers accepted ;-)
        Location - as above (e.g., where is this apartment?)
        Language - as above (e.g., what language is this book in?)
        Description - what is this product, in plain English (or
                      whatever language you choose :-)

and the following for the publisher of web-content (ID-PUBLISHER):

        Publisher type - again, a small taxonomy (that needs improvement).
                         Its purpose is to differentiate sites published
                         by companies from those published by
                         governments, individuals, organizations, etc.
        Location - as above (where is this publisher?)

and finally, the following for the robots looking for IDML (ID-SYSTEM):

        crawl - specifies whether the robot should crawl beyond this
                page. This addresses some of the spidering BOF
                questions raised at the indexing workshop [4]. We
                have plans on our drawing board to extend the
                ID-SYSTEM tag to include re-crawling intervals (e.g.,
                when should the robot revisit the site?).

Of course version 1.0 isn't perfect.  But, even if all you had to work with
was v1.0, and all web pages specified this, and all search engines utilized
it, don't you think it would make the process of finding both informational
and commercial objects easier?  And the process of controlling how you get
found?
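
As a rough illustration of what fielded search buys you, here is a small
Python sketch of the "khakis" shopping problem. Everything in it is an
assumption made for the example: the Product class, the department values,
and the sample catalog are invented and are not taken from the IDML
specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Product:
    """Hypothetical record with ID-PRODUCT-like fields (names are assumed)."""
    department: str   # value from an agreed product taxonomy (invented here)
    currency: str     # ISO 4217 currency code
    price: float
    country: str      # ISO 3166 alpha-2 country code
    language: str     # ISO 639 language code
    description: str

# Made-up sample data, for illustration only.
catalog = [
    Product("apparel/pants", "USD", 34.50, "US", "en", "Men's cotton khakis"),
    Product("apparel/pants", "USD", 72.00, "US", "en", "Pleated khakis"),
    Product("books", "USD", 12.95, "US", "en",
            "Subculturalism: why the Revolution will always wear Khakis"),
    Product("apparel/pants", "GBP", 25.00, "GB", "en", "Khaki chinos"),
]

def keyword_search(items, keyword):
    """Keyword-only search: matches any description containing the word."""
    kw = keyword.lower()
    return [p for p in items if kw in p.description.lower()]

def fielded_search(items, keyword, department, currency, max_price):
    """Keyword search narrowed by department, currency, and a price ceiling."""
    return [p for p in keyword_search(items, keyword)
            if p.department == department
            and p.currency == currency
            and p.price <= max_price]

keyword_only = keyword_search(catalog, "khakis")
results = fielded_search(catalog, "khakis", "apparel/pants", "USD", 50.0)

print(len(keyword_only))                  # 3
print([p.description for p in results])   # ["Men's cotton khakis"]
```

The keyword-only query returns the essay title along with the trousers; the
fielded query can exclude it because department, currency, and price are
drawn from agreed value sets.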

We created Identify (http://www.identify.com) to showcase what might be
possible with IDML.  If you haven't tried it yet, please do so.  Put some
IDML in a web page and try it out!

So, Joseph, that's my long answer to "what are you trying to do?"  I hope
this explanation helps clarify our goals and our solution.  Please keep the
ideas coming.

Thanks,

-Doug

[1] "Caught in the Web", The Wall Street Journal, 17 June 1996, page R14.
[2] "After This, the Mall is Way Cool", US News On-Line,
    http://www.usnews.com/usnews/nycu/webbiz2.htm
[3] "Has The Net Finally Reached The Wall?", Business Week,
    26 August 1996, page 62.
[4] http://www.w3.org/pub/WWW/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt



-- 


J. Douglas Donohoe
-------------------------------------------------------------------
Emerge Consulting			   Chief Technology Officer
415.328.6700			              http://www.emerge.com
donohoe@emerge.com			    http://www.identify.com

Received on Tuesday, 20 August 1996 19:32:43 UTC