- From: Doug Donohoe <donohoe@emerge.com>
- Date: Tue, 20 Aug 1996 16:27:05 +0800
- To: lazio@spacenet.tn.cornell.edu
- CC: www-html@w3.org
Joseph Lazio asked:

> I'll echo Dan C.'s comments. Thanks for defending this in public.

It's our pleasure, and thank you for your input. We are truly
interested in getting the buy-in of this community. Perhaps our goals
(explained below) will shed some light on why we offer IDML as a
solution.

> Let me ask a simple question: What are you trying to do?

That is a great question! I'm glad someone finally asked.

We are trying to solve the problem of resource discovery from the
viewpoint of both the user and the publisher. It's hard (and getting
harder) to find things on the internet. Big surprise. ;-) It's also
hard to control how you get found.

By "things" I have in mind two distinct types of objects:
informational objects (web pages, or roughly the domain of Library
Science) and commercial objects (products and services for sale, or
roughly the domain of E-Commerce).

Today's search engines help solve this problem in two ways:

   Automated processing of unstructured text at the page level
   (e.g., AltaVista, Excite), or

   Human categorization of content at the site level (e.g., Yahoo).

Yet searching for information using these methods is often
frustrating. Precision and recall are often ridiculously low, and
there is little recourse for refining queries in any way besides
trying different keywords. Some search engines do offer advanced
keyword queries, but A) even they don't always work and B) try
explaining Boolean logic to the typical, i.e., unsophisticated user.

Searching for commercial products using these methods is even worse.
Try shopping for "khakis" at AltaVista. You do get some vendors that
sell khakis. But you also get images from CK and stuff like
"Subculturalism: why the Revolution will always wear Khakis". Bad
enough for shoppers, but if I'm a merchant trying to sell my stuff
over the Net, what recourse do I have?

The popular press has even picked up on the frustration with
searching the net [1-3].

So, if we agree it is hard to find things on the internet, then what
possible solutions could we come up with?

META tags can help here. Yet as Dan points out, META is simply a
container that is not intended to provide any guidelines for what
contents to use, i.e., a data model. So for example, Joseph said
"he'd use" the following to specify location:

> <META NAME="location"
>       CONTENT="Cornell University, Ithaca, NY, 14853-6801, USA">
>

However, I could also use this

   <META NAME="location"
         CONTENT="Cornell U., Ithaca, New York, 14853, United States">

or

   <META NAME="location"
         CONTENT="Cornell, Ithaca, N. York, 14853, United States of America">

or even

   <META NAME="location"
         CONTENT="This is located at Cornell">

Until someone states a standard data model for specifying location, a
search engine could not possibly answer the question "Show me all web
pages about X and published in Ithaca, New York, USA". IDML states a
data model.

As Joseph correctly stated:

> Of course, there's the real question of whether user agents know what
> to do with META.

This is why we proposed IDML. Yes, you could use META tags to
instantiate portions of the IDML data model, but the instantiation
isn't as important as having a data model geared to the requirements
of searching and indexing in a specific domain. IDML simply bundles
the two together.

So, IDML defines values for the following for web-pages (ID-INFO):

   Language    - using ISO-639
   Location    - using ISO 3166 and CIA world factbook, since there
                 is no standard (yet) on states/provinces/sub-divisions
   Subject     - Our stab at a simple, compact general subject ontology
   Keywords    - = META keywords
   Description - = META description

and the following for products (ID-PRODUCT):

   Department  - our stab at a taxonomy of consumer products
   Currency    - using ISO-4217
   Price       - floating point numbers accepted ;-)
   Location    - as above (e.g., where is this apartment?)
   Language    - as above (e.g., what language is this book in?)
   Description - what is this product, in plain English (or whatever
                 language you choose :-)

and the following for the publisher of web-content (ID-PUBLISHER):

   Publisher type - again, a small taxonomy (that needs improvement)
                    the purpose of this is to differentiate sites
                    that are by companies from those by governments,
                    individuals, organizations, etc.
   Location       - as above (where is this publisher?)

and finally, the following for the robots looking for IDML
(ID-SYSTEM):

   crawl - specifies if the robot should crawl beyond this page --
           this addresses some of the spidering BOF questions raised
           at the indexing workshop [4].

We have "on our drawing board" plans to extend the ID-SYSTEM tag to
include re-crawling intervals (e.g., when should the robot revisit
the site?).

Of course version 1.0 isn't perfect. But, even if all you had to work
with was v1.0, and all web pages specified this, and all search
engines utilized it, don't you think it would make the process of
finding both informational and commercial objects easier? And the
process of controlling how you get found?

We created Identify (http://www.identify.com) to showcase what might
be possible with IDML. If you haven't tried it yet, please do so. Put
some IDML in a web page and try it out!

So, Joseph, that's my long answer to "what are you trying to do?" I
hope this explanation helps clarify our goals and our solution.
Please keep the ideas coming.

Thanks,

-Doug

[1] "Caught in the Web", The Wall Street Journal, 17 June 1996,
    Page R14
[2] "After This, the Mall is Way Cool", US News On-Line,
    http://www.usnews.com/usnews/nycu/webbiz2.htm
[3] "Has The Net Finally Reached The Wall?", Business Week,
    August 26, 1996, Page 62.
[4] http://www.w3.org/pub/WWW/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt

--
J. Douglas Donohoe
-------------------------------------------------------------------
Emerge Consulting          Chief Technology Officer
415.328.6700               http://www.emerge.com
donohoe@emerge.com         http://www.identify.com

Received on Tuesday, 20 August 1996 19:32:43 UTC