- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 5 May 2009 03:01:51 +0000 (UTC)
One of the use cases I collected from the e-mails sent in over the past few months was the following: USE CASE: Getting data out of poorly written Web pages, so that the user can find more information about the page's contents. SCENARIOS: * Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate its quality. (In this instance, Alfred and Bob are assumed to be uncooperative, since creating a static mashup would be an example of a poorly-written page.) * TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. (In this instance, it is assumed that the TV guide provider is uncooperative, since it isn't providing the links the user wants.) * Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. (In this instance, it is assumed that the teachers and students aren't cooperative, since they would otherwise be able to find each other by listing their blogs in a common directory.) * Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. (In this instance, it is assumed that the people writing the statements aren't cooperative, since if they were they could just add the data straight into the knowledge base.) REQUIREMENTS: * Does not need cooperation of the author (if the page author was cooperative, the page would be well-written). * Shouldn't require the consumer to write XSLT or server-side code to derive this information from the page. One class of the solutions that was proposed to address this is the idea of getting the author to mark up microdata (small bits of data) in the page, annotating the information that is needed to complete the scenarios. Such formats could be RDFa, Microformats, n3, a custom format for HTML5, or any number of other syntaxes. However, it's not clear that this would help in this case, since the underlying assumption with these particular problems is that the author isn't actively cooperating with the user (likely due to ignorance, of course, not malice). Let's examine these use cases with a microdata solution in mind: * Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate its quality. Here, Julie is two steps removed from the original data. Since we are assuming here that Alfred and Bob are not cooperating with Julie, we must also assume that they haven't included this information on the page. If they haven't included it, then microdata doesn't help, as there is nothing to mark up. * TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. (In this instance, it is assumed that the TV guide provider is uncooperative, since it isn't providing the links the user wants.) If the TV guide listing page was cooperative, it would just provide the links to the IMDB that the user wants. It isn't; we cannot, therefore, assume that it would be ready to include microdata that would let the user find the relevant page on the IMDB using a tool that consumes Microdata. * Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. The obvious solution here is for the students and teachers to simply register their blogs in a common directory. However, assuming that they are not even doing that, it is unlikely that they _would_ include some kind of microdata in their pages to solve the problem. * Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. If the people writing down their thoughts were to be "hip" enough to write their thoughts using microdata annotations, they'd also be able to just add them to the knowledge base directly. So again, we must assume that this is a case where we can't rely on microdata. Thus, we have our first requirement: * Does not need cooperation of the author. If we take the author out of the equation, there are three other parties that could help solve this problem: 1. The user. 2. The user's client tool provider (e.g. browser vendor). 3. Third party tool providers (e.g. web sites, search engines). Relying on the user to solve these problems is somewhat missing the point of solving the problems, so let's focus on the browser and on other tools. The other requirement listed above, from someone who presumably wishes to avoid the user having to do any extra work, is: * Shouldn't require the consumer to write XSLT or server-side code to derive this information from the page. This is worth bearing in mind as we look at how browsers and other tools might help solve the problem. First let's look at the scenarios again, from the perspective of the client software: * Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate its quality. >From the browser's point of view, Julie is viewing a Web page with some data, say, some HTML <table>s, and requests the browser's help in identifying the source of the data. It's not clear to me that the browser could do _anything_ at this point that would solve the problem. Without help from the page, finding the origin of data is a search problem, and the browser doesn't really have anywhere to begin from. * TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. >From the browser's point of view, the user is visiting a page with various bits of text on it. There has been some work in the area of having browsers give implicit links, but that has historically not been successful at all: http://en.wikipedia.org/wiki/Smart_tag_(Microsoft) However, as that Wikipedia page points out, what _has_ been moderately successful (so far) is the idea of having the browser offer links when the user selects some text. Thus, if the user is an IMDB user, he could select the TV show title, and select "IMDB" from the resulting menu. This solution does solve the problem without XSLT or server-side consumer code. Thus, this appears to be a solution to this particular scenario. * Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. Form the browser's point of view, the user is browsing one page, and desires other pages that are similar in a particular way. Again, this is fundamentally a search problem, so it's not clear that there's anything that could be done to address it from the browser. * Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. Here the client is not a browser, but some other tool, whose job it is to populate a knowledge base from statements in Spanish and English. Almost by definition then, it seems like this tool should, as part of its operation, be able to convert English and Spanish into the knowledge base's format. Such tools currently are not widely available to the general public. That's probably ok, though, since to be honest, the general public is unlikely to make direct use of knowledge bases at this point anyway. Whether this requires some code from the user (as opposed to being automatic) depends on the software, but software that can interpret such statements (i.e. AI or NLP software) would presumably do so without help from the user. Unfortunately, such solutions are somewhat hypothetical at this point. Thus client software is a possible solution, but not a great one. Let's look at the scenarios again from the point of view of a third-party software provider, e.g. a search engine: * Alfred merges data from various sources in a static manner, generating a new set of data. Bob later uses this static data in conjunction with other data sets to generate yet another set of static data. Julie then visits Bob's page later, and wants to know where and when the various sources of data Bob used come from, so that she can evaluate its quality. This is a problem that can be solved by a provider with a large amount of data. For prose, e.g. to search for the original source of a quote or syndicated blog post, this problem is in fact mostly solved -- search for a representative string from the document, and use your search engine's features to focus on the document that first used that phrase. This is a harder problem for numbers, though, and is even harder if the mashup doesn't include the original data. * TV guide listings - If the TV guide provider does not render a link to IMDB, the browser should recognise TV shows and give implicit links. A search engine can be used to search for the show, which will then provide links that the user wants relating to that show. This solution has the advantage of already being part of a user's daily routine. * Students and teachers should be able to discover each other -- both within an institution and across institutions -- via their blogging. Search engines are in a unique position to find pages similar to each other, and indeed many search engines have had a "similar pages" feature for some time. It seems plausible that tools could be developed that specifically search for blogs that appear to be from students or teachers, if there is a demand for this. * Tim wants to make a knowledge base seeded from statements made in Spanish and English, e.g. from people writing down their thoughts about George W. Bush and George H.W. Bush. Third party tools do not seem useful for solving this problem (except insofar as they could be used to augment the solution described for this problem earlier with client-side tools). In conclusion: Some of these scenarios are easy to solve with existing technology. Others will require advances in natural language processing or other large-corpus technologies. None, it seems, of this particular set of use cases are particularly well addressed by microdata markup. For finding information about a page, such as what data an analysis is based on, or finding other pages from people with similar interests, when the page author isn't really interested in participating in a system that would aid this, tools such as search engines seem like the most promising solution. These solutions are still immature, though, and much work in this area is to be expected going forward. For using information on a page, such as a TV Show's title, to navigate to other sites, such as IMDB, the browser's own UI is probably the best starting point. Features such as IE8's Accelerators and Mac OS X's "Search in Google" address this use case adequately and in an extensible and simple fashion that the user can tune to his or her preferences. Finally, turning text in multiple languages into machine-processable data is a problem that will likely be the continued target of focused research both in corporate R&D and in academia, and efforts like Wolfram Alpha indicate that this is in fact an area with growing interest. Thus, I haven't added anything to HTML5 to address the above use cases. A number of further use cases remain to be examined. I will send further e-mail hopefully this week as I address them. -- Ian Hickson U+1047E )\._.,--....,'``. fL http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 4 May 2009 20:01:51 UTC