- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 19 May 2009 23:07:15 +0000 (UTC)
Some of the use cases I collected from the e-mails sent in over the past few months were the following:

USE CASE: Exposing contact details so that users can add people to their address books or social networking sites.

SCENARIOS:
* Instead of giving a colleague a business card, someone gives their colleague a URL, and that colleague's user agent extracts basic profile information such as the person's name along with references to other people that person knows and adds the information into an address book.
* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about who he is to add it to their contact databases.
* Fred copies the name of one of his Facebook friends and pastes it into his OS address book; the contact information is imported automatically.
* Fred copies the name of one of his Facebook friends and pastes it into his Webmail's address book feature; the contact information is imported automatically.
* David can use the data in a web page to generate a custom browser UI for including a person in our address book without using brittle screen-scraping.

REQUIREMENTS:
* A user joining a new social network should be able to identify himself to the new social network in a way that enables the new social network to bootstrap his account from existing published data (e.g. from another social network) rather than having to re-enter it, without the new site having to coordinate (or know about) the pre-existing site, without the user having to give either site's credentials to the other, and without the new site finding out about relationships that the user has intentionally kept secret. (http://w2spconf.com/2008/papers/s3p2.pdf)
* Data should not need to be duplicated between machine-readable and human-readable forms (i.e. the human-readable form should be machine-readable).
* Shouldn't require the consumer to write XSLT or server-side code to read the contact information.
* Machine-readable contact information shouldn't be on a separate page than human-readable contact information.
* The information should be convertible into a dedicated form (RDF, JSON, XML, vCard) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Should be possible for different parts of a contact to be given in different parts of the page. For example, a page with contact details for people in columns (with each row giving the name, telephone number, etc) should still have unambiguous grouped contact details parseable from it.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

USE CASE: Exposing calendar events so that users can add those events to their calendaring systems.

SCENARIOS:
* A user visits the Avenue Q site and wants to make a note of when tickets go on sale for the tour's stop in his home town. The site says "October 3rd", so the user clicks this and selects "add to calendar", which causes an entry to be added to his calendar.
* A student is making a timeline of important events in Apple's history. As he reads Wikipedia entries on the topic, he clicks on dates and selects "add to timeline", which causes an entry to be added to his timeline.
* TV guide listings - browsers should be able to expose to the user's tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.
* Paul sometimes gives talks on various topics, and announces them on his blog. He would like to mark up these announcements with proper scheduling information, so that his readers' software can automatically obtain the scheduling information and add it to their calendar. Importantly, some of the rendered data might be more informal than the machine-readable data required to produce a calendar event.
* David can use the data in a web page to generate a custom browser UI for adding an event to our calendaring software without using brittle screen-scraping.
* http://livebrum.co.uk/: the author would like people to be able to grab events and event listings from his site and put them on their site with as much information as possible retained. "The fantasy would be that I could provide code that could be cut and pasted into someone else's HTML so the average blogger could re-use and re-share my data."
* User should be able to subscribe to http://livebrum.co.uk/ then sort by date and see the items sorted by event date, not publication date.

REQUIREMENTS:
* Should be discoverable.
* Should be compatible with existing calendar systems.
* Should be unlikely to get out of sync with prose on the page.
* Shouldn't require the consumer to write XSLT or server-side code to read the calendar information.
* Machine-readable event data shouldn't be on a separate page than human-readable dates.
* The information should be convertible into a dedicated form (RDF, JSON, XML, iCalendar) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Should be possible for different parts of an event to be given in different parts of the page. For example, a page with calendar events in columns (with each row giving the time, date, place, etc) should still have unambiguous calendar events parseable from it.
* Should be possible for authors to find out if people are reusing the information on their site.
* Code should not be ugly (e.g. should not be mixed in with markup used mostly for styling).
* There should be "obvious parsing tools for people to actually do anything with the data (other than add an event to a calendar)".
* Solution should not feel "disconnected" from the Web the way that calendar file downloads do.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

USE CASE: Allow users to maintain bibliographies or otherwise keep track of sources of quotes or references.

SCENARIOS:
* Frank copies a sentence from Wikipedia and pastes it in some word processor: it would be great if the word processor offered to automatically create a bibliographic entry.
* Patrick keeps a list of his scientific publications on his web site. He would like to provide structure within this publications page so that Frank can automatically extract this information and use it to cite Patrick's papers without having to transcribe the bibliographic information.
* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he has published to add it to their bibliographic applications.
* A scholar and teacher wants to publish scholarly documents or content that includes extensive citations that readers can then automatically extract so that they can find them in their local university library. These citations may be for a wide range of different sources: an interview posted on YouTube, a legal opinion posted on the Supreme Court web site, a press release from the White House.

REQUIREMENTS:
* Machine-readable bibliographic information shouldn't be on a separate page than human-readable bibliographic information.
* The information should be convertible into a dedicated form (RDF, JSON, XML, BibTeX) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

The first two use cases can basically be done today using the hCard and hCalendar Microformats, but the parsing rules for these Microformats are somewhat vague, and they aren't easily extensible without hardcoding extensions into parsers.
I propose, therefore, to take the hCard and hCalendar vocabularies, and recast them onto the new microdata model.

   http://www.whatwg.org/specs/web-apps/current-work/#vcard
   http://www.whatwg.org/specs/web-apps/current-work/#vevent

I have used the knowledge and experience collected and carefully documented by the Microformats team on their wiki, and written a direct mapping of those vocabularies to microdata, along with very explicit definitions for how to convert this data to vCard and iCalendar files, something which was lacking in the hCard and hCalendar definitions:

   http://www.whatwg.org/specs/web-apps/current-work/#vcard-0
   http://www.whatwg.org/specs/web-apps/current-work/#icalendar

The third use case requires a vocabulary for citations, which isn't something for which a widely deployed solution exists in text/html yet. There are a large number of options:

 - Refer
 - RIS
 - BibTeX
 - Metadata Object Description Schema
 - Z39.80
 - Dublin Core and variants thereof
 - part of Journal Publishing Tag Set Tag Library
 - part of XML Resume
 - part of OOXML
 - part of ODF
 - part of DocBook
 - the Ann Arbor District Library XML format
 - SRU
 - My alma mater's format (University of Bath reference type)
 - Bibliontology
 - The Citation Oriented Bibliographic Vocabulary
 - ISBD
 - OpenURL COinS

...and many more. A case could probably be made for any one of these.

Based on availability of tools, simplicity of the format (just name-value pairs vs deeply nested trees of typed data), actual use in citation-happy fields, extensibility, use of an understandable vocabulary (e.g. "author" vs "%A"), etc, I ended up picking the BibTeX vocabulary. It isn't perfect; for example, it's not going to be a great solution for citing YouTube clips yet. But since it is relatively easy to extend (and indeed, it has historically been extended by several groups), it seems like if this feature gets good adoption, we will be able to extend it to support more types.
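To illustrate the "just name-value pairs" point, here is a minimal Python sketch that writes a flat list of pairs out as a single BibTeX entry. It is not the export algorithm from the spec; the function name to_bibtex, the citation key, and the sample field values are all made up for the example, and the values are assumed to contain no unbalanced braces.

```python
def to_bibtex(entry_type, key, fields):
    """Serialize flat (name, value) pairs as one BibTeX entry.

    Hypothetical helper for illustration only; values are assumed to
    contain no unbalanced braces, so no escaping is attempted.
    """
    lines = ["@%s{%s," % (entry_type, key)]
    for name, value in fields:
        # Each field is a plain name-value pair with a readable name
        # such as "author" (compare Refer's "%A").
        lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


# Invented sample data, not taken from any real publication.
entry = to_bibtex(
    "article",
    "doe2009",
    [("author", "Jane Doe"), ("title", "An Example Title"), ("year", "2009")],
)
print(entry)
```

Adding support for a new kind of source in a format like this amounts to agreeing on a new entry type or field name; the serialization itself does not change.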
Thus, BibTeX vocabulary for microdata:

   http://www.whatwg.org/specs/web-apps/current-work/#bibtex

Exporting microdata to BibTeX:

   http://www.whatwg.org/specs/web-apps/current-work/#bibtex-0

The vocabularies and exports are pretty much useless on their own, though. Two things make this actually useful:

 - There's a scripting API that exposes the microdata and so people can write generic client-side scripts to expose data on the page, and
 - User agents are now required to export vCard, iCalendar, and BibTeX when someone drags a selection that includes data marked up with those vocabularies.

The latter in particular is IMHO very important. Both of these features require browser implementation support, which IMHO is important to making anything like this work widely (and has been a sore point with previous solutions in this space).

I shall now go through the scenarios and requirements to show how they can now be addressed.

USE CASE: Exposing contact details so that users can add people to their address books or social networking sites.

SCENARIOS:

* Instead of giving a colleague a business card, someone gives their colleague a URL, and that colleague's user agent extracts basic profile information such as the person's name along with references to other people that person knows and adds the information into an address book.

This is possible today without using HTML: just make the URL point to a vCard text/directory resource.

* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about who he is to add it to their contact databases.

This is now easy -- given microdata with a vCard, the scholars need but drag that information to their contact databases, and assuming those contact databases support vCard, they can import the information directly. Alternatively, a script can be written in less than 200 lines of code to convert the microdata to vCard (or other formats) for direct download.

(I wrote proof-of-concept scripts using the APIs in the spec to export vCard, vEvent, and BibTeX data. The vCard one was about 140 lines; the BibTeX one was about 60 lines. The vEvent one is in the spec as an example -- search for getCalendar() -- and is less than 40 lines.)

* Fred copies the name of one of his Facebook friends and pastes it into his OS address book; the contact information is imported automatically.

Assuming the OS address book supports vCard, this is now supported natively -- all Facebook has to do is encode the data as vCard microdata.

* Fred copies the name of one of his Facebook friends and pastes it into his Webmail's address book feature; the contact information is imported automatically.

If his Webmail supports HTML5 drag and drop (copy-and-paste is defined in terms of drag-and-drop), then an HTML5 user agent will include all the microdata of the copied selection in a JSON blob, including the vCard data. (Actual vCard will also be included.) This is thus now automatically supported, assuming that both sites use the same vocabulary and implement the drag-and-drop API, and that the user has an HTML5 browser.

* David can use the data in a web page to generate a custom browser UI for including a person in our address book without using brittle screen-scraping.

The spec defines exactly how to get a vCard out of a random HTML page, so screen-scraping should no longer be necessary.

REQUIREMENTS:

* A user joining a new social network should be able to identify himself to the new social network in a way that enables the new social network to bootstrap his account from existing published data (e.g. from another social network) rather than having to re-enter it, without the new site having to coordinate (or know about) the pre-existing site, without the user having to give either site's credentials to the other, and without the new site finding out about relationships that the user has intentionally kept secret.
(http://w2spconf.com/2008/papers/s3p2.pdf)

Assuming both sites support the same vocabulary and can identify people uniquely somehow, this is now possible using microdata (just as it has been possible using custom microformat-like vocabularies before, or RDFa and other embedded data formats before). Whether sites will support this is up to the sites in question; I see no way to force the issue.

As far as I can tell the privacy problem listed above is not intrinsically solved by the microdata solution. I cannot find a solution to those problems at the HTML level; they seem inherently application-bound.

* Data should not need to be duplicated between machine-readable and human-readable forms (i.e. the human-readable form should be machine-readable).

By and large, this is met. For some of the more esoteric vEvent features (like repeating rules) I have opted for not really supporting them natively, but just allowing authors to use the vEvent rules directly. This is not really an issue as far as I can tell because those features aren't widely used (and even seem to be getting dropped in the newer version of iCalendar).

* Shouldn't require the consumer to write XSLT or server-side code to read the contact information.

While it's possible for people to write custom code to process this data, the spec requires browsers to support this natively, making this unnecessary for these vocabularies.

* Machine-readable contact information shouldn't be on a separate page than human-readable contact information.

This requirement is met.

* The information should be convertible into a dedicated form (RDF, JSON, XML, vCard) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.

I haven't defined a way to convert this data to XML, but I have provided explicit ways to convert to JSON, RDF, and vCard.
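As a rough idea of what converting extracted name-value pairs to vCard involves, here is a small Python sketch. It is not the conversion defined in the spec: the names escape_text and to_vcard and the sample contact are invented, only simple text-valued properties are handled, and structured properties such as N (which a complete vCard 3.0 card would also carry) are left out.

```python
def escape_text(value):
    """Escape a text value the way vCard 3.0 text values are escaped.

    The backslash must be handled first so that the backslashes added
    for the other characters are not escaped a second time.
    """
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def to_vcard(pairs):
    """Turn flat (name, value) pairs into vCard text.

    Hypothetical helper: every property is treated as plain text, and
    lines are joined with CRLF as vCard content lines are.
    """
    lines = ["BEGIN:VCARD", "VERSION:3.0"]
    for name, value in pairs:
        lines.append("%s:%s" % (name.upper(), escape_text(value)))
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"


# Invented sample contact.
card = to_vcard([
    ("fn", "Jane Doe"),
    ("email", "jane@example.com"),
    ("note", "Met at the conference; follow up"),
])
print(card)
```

The same list of pairs could equally be dumped as JSON, which is what makes a consistent extraction step useful to tools that never see the original page.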

* Should be possible for different parts of a contact to be given in different parts of the page. For example, a page with contact details for people in columns (with each row giving the name, telephone number, etc) should still have unambiguous grouped contact details parseable from it.

Using subject="", this is possible.

* Parsing rules should be unambiguous.

I hope the parsing rules described in the spec are clear enough. Please let me know if there are any problems.

* Should not require changes to HTML5 parsing rules.

The HTML5 parsing rules did not change.

USE CASE: Exposing calendar events so that users can add those events to their calendaring systems.

SCENARIOS:

* A user visits the Avenue Q site and wants to make a note of when tickets go on sale for the tour's stop in his home town. The site says "October 3rd", so the user clicks this and selects "add to calendar", which causes an entry to be added to his calendar.

As demonstrated in the spec, it is now relatively easy to expose this data, and it requires little code to convert this data into a form supported by most calendars. In addition, this can also be supported using copy-and-paste or drag-and-drop if the source, destination, and browser all cooperate according to the spec.

* A student is making a timeline of important events in Apple's history. As he reads Wikipedia entries on the topic, he clicks on dates and selects "add to timeline", which causes an entry to be added to his timeline.

I couldn't find a way to address this as described unless Wikipedia and the timeline utility cooperated directly. (Drag-and-drop and copy-and-paste cases can be easily supported, though.)

* TV guide listings - browsers should be able to expose to the user's tools (e.g. calendar, DVR, TV tuner) the times that a TV show is on.

Assuming TV guide listings can be described in vEvent form, this is now possible using drag-and-drop and copy-and-paste.
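For a sense of what "described in vEvent form" amounts to once the data has been extracted, here is a minimal Python sketch that wraps one event in an iCalendar object. It is not the getCalendar() example from the spec: the function name to_vevent, the PRODID string, and the sample listing are invented, the datetimes are assumed to already be in UTC, and a real exporter would add further properties (for instance UID and DTSTAMP) and escape the text values.

```python
from datetime import datetime


def to_vevent(summary, start, end):
    """Wrap one event in a minimal iCalendar object.

    Hypothetical helper: 'start' and 'end' are naive datetimes assumed
    to be UTC, and 'summary' is assumed to need no escaping.
    """
    stamp = "%Y%m%dT%H%M%SZ"  # iCalendar UTC date-time form
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//sketch//EN",  # made-up product identifier
        "BEGIN:VEVENT",
        "SUMMARY:" + summary,
        "DTSTART:" + start.strftime(stamp),
        "DTEND:" + end.strftime(stamp),
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"


# Invented sample listing.
cal = to_vevent(
    "Example Show",
    datetime(2009, 10, 3, 20, 0, 0),
    datetime(2009, 10, 3, 21, 0, 0),
)
print(cal)
```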

* Paul sometimes gives talks on various topics, and announces them on his blog. He would like to mark up these announcements with proper scheduling information, so that his readers' software can automatically obtain the scheduling information and add it to their calendar. Importantly, some of the rendered data might be more informal than the machine-readable data required to produce a calendar event.

This seems easily handled now.

* David can use the data in a web page to generate a custom browser UI for adding an event to our calendaring software without using brittle screen-scraping.

The example in the spec demonstrates that this is now possible with relatively little code.

* http://livebrum.co.uk/: the author would like people to be able to grab events and event listings from his site and put them on their site with as much information as possible retained. "The fantasy would be that I could provide code that could be cut and pasted into someone else's HTML so the average blogger could re-use and re-share my data."

I have included an example in the spec from livebrum.co.uk showing how this is possible.

* User should be able to subscribe to http://livebrum.co.uk/ then sort by date and see the items sorted by event date, not publication date.

This isn't directly possible, but if a tool exists that can sort event data by date, then given the event data it seems possible to do this easily. For example, a Web Calendar product could support parsing microdata vEvents out of a Web page and then could offer to subscribe to such a page as a feed.

REQUIREMENTS:

* Should be discoverable.

This isn't met by the microdata vEvent vocabulary intrinsically. I expect that a convention will arise where people put little icons near their microdata saying "look, we have vEvent data you can drag to your calendar!" or some such.

* Should be compatible with existing calendar systems.

The vEvent part of iCalendar is well established, so this seems met, at least in principle.
The details (e.g. drag and drop support) probably need some work.

* Should be unlikely to get out of sync with prose on the page.

By making the prose on the page the source for the microdata, this seems resolved.

* Shouldn't require the consumer to write XSLT or server-side code to read the calendar information.

This is mostly met in the same way as for contact data.

* Machine-readable event data shouldn't be on a separate page than human-readable dates.

This is achieved using inline microdata.

* The information should be convertible into a dedicated form (RDF, JSON, XML, iCalendar) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.

Output in all those formats except raw XML is explicitly supported in the spec.

* Should be possible for different parts of an event to be given in different parts of the page. For example, a page with calendar events in columns (with each row giving the time, date, place, etc) should still have unambiguous calendar events parseable from it.

subject="" supports this.

* Should be possible for authors to find out if people are reusing the information on their site.

This isn't met. I couldn't find a good way to do this. When JavaScript is enabled, drag-and-drop, copy-and-paste, and other mechanisms can be detected and logged via script, but really there's no good way to detect all uses of microdata. (Providing a ping=""-like feature for this seems like overkill and wouldn't help with non-end-user use anyway.)

* Code should not be ugly (e.g. should not be mixed in with markup used mostly for styling).

This appears to be met.

* There should be "obvious parsing tools for people to actually do anything with the data (other than add an event to a calendar)".

There aren't any obvious tools yet, but since two separate implementations arose in less than 24 hours from the point where the microdata stuff was released, it seems like this will prove easy enough to do.

* Solution should not feel "disconnected" from the Web the way that calendar file downloads do.

This seems met.

* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

The same applies here as with vCard.

USE CASE: Allow users to maintain bibliographies or otherwise keep track of sources of quotes or references.

SCENARIOS:

* Frank copies a sentence from Wikipedia and pastes it in some word processor: it would be great if the word processor offered to automatically create a bibliographic entry.

This will require new code in the word processor, but the data exposed by an HTML5-compliant browser according to this proposal would include the information required to do this.

* Patrick keeps a list of his scientific publications on his web site. He would like to provide structure within this publications page so that Frank can automatically extract this information and use it to cite Patrick's papers without having to transcribe the bibliographic information.

This seems to be handled directly now if the page is written using the BibTeX vocabulary.

* A scholar and teacher wants other scholars (and potentially students) to be able to easily extract information about what he has published to add it to their bibliographic applications.

This seems met in the same way.

* A scholar and teacher wants to publish scholarly documents or content that includes extensive citations that readers can then automatically extract so that they can find them in their local university library. These citations may be for a wide range of different sources: an interview posted on YouTube, a legal opinion posted on the Supreme Court web site, a press release from the White House.

Not all of these types are immediately supported by the BibTeX vocabulary.
I recommend that we extend the BibTeX set over time if this feature gains a critical mass.

REQUIREMENTS:

* Machine-readable bibliographic information shouldn't be on a separate page than human-readable bibliographic information.

This is met.

* The information should be convertible into a dedicated form (RDF, JSON, XML, BibTeX) in a consistent manner, so that tools that use this information separate from the pages on which it is found have a standard way of conveying the information.

This is met explicitly for three of those types; for the remaining one it can be done easily enough, though it is not defined in the spec.

* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

These are met in the same way as with vCard and vEvent microdata.

In conclusion, to address these use cases and scenarios I've introduced three vocabularies based on past practices -- vCard, vEvent, and BibTeX -- to the HTML5 specification, and I've defined how these vocabularies work in the context of the drag-and-drop model, which I believe is the core part of this proposal that has been lacking in other proposals previously.

A number of further use cases remain to be examined, including one with scenarios regarding validating custom vocabularies and allowing editors to provide help with custom vocabularies. I will send further e-mail next week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 19 May 2009 16:07:15 UTC