- From: HTML Weekly Issue Tracker <sysbot+tracker@w3.org>
- Date: Wed, 28 Apr 2010 21:49:17 +0000 (GMT)
- To: public-html-wg-issue-tracking@w3.org
ISSUE-115 (rel-canonical-human): link tag: rel: associate pages about the same person across many sites for searches without a canonical page and despite a confusingly indistinct name [HTML 5 spec]

http://www.w3.org/html/wg/tracker/issues/115

Raised by: Maciej Stachowiak
On product: HTML 5 spec

Escalated from: http://www.w3.org/Bugs/Public/show_bug.cgi?id=7681
Requested by: Nick Levinson

Different websites may have pages about the same person. Several people may have the same name and all may be written about on multiple sites. Search engines have difficulty associating pages that are about the same person without erroneously intermixing other people with the same name, especially when none of the people are extraordinarily famous and popular in searches (when they are, search engines may have algorithms for more sophisticated associational analysis).

Libraries solve this for authors by distinguishing among them with birth years and death years. Other biographical sources offer vague dates for when someone flourished and, to distinguish someone, commonly provide nationalities or birth places.

Without a standard method, one search engine, A9, tried grouping people in results and, in my observation, failed abysmally. They no longer offer the service.

This proposal would provide page authors with a tool that search engines could read for much better grouping of results. A link element naming the person and providing, in the element, data that is standardized could help search engines organize their listings to reduce accidental intermixing. It wouldn't be perfect; e.g., a person may have reported multiple ages from which different birth years are calculated; a website owner may erroneously enter the wrong data; nationality may vary with a citizenship change; or historians may disagree. But, in general, listings with this element could be more successfully separated.
Writing and parsing the link element would be a bit more complex than with other link elements, but I think this is manageable and the method I propose has been applied elsewhere.

I propose that the rel value be "canonical-human" and that its title attribute be reserved for a special meaning and syntax. The title attribute's syntax would be in the form of title="name: Asashi T. Fung; born: 1723; died: 1799; flourished: 1740s-1750s; nationality: FR; birthplace: Honolulu, Hawaii, US; ident-scheme: ; ident: ;". No href attribute is needed.

Each subattribute (e.g., "name") would be optional. For example, "flourished" would likely be used only when birth and death years are unknown. For the subattribute birthplace, if a subvalue is supplied, a nation would be required. The nation of the birthplace would be represented by one of the same codes used for nationality.

For the subattribute ident-scheme, a list of schemes could be developed later, perhaps each to be prefixed by a code for the scheme's nation and a hyphen. Schemes could include privately-owned but widely available databases of moderately-well-known people. Subvalues for ident-scheme and ident must not be entered until a list of schemes and the style of ident values for each scheme are centralized; after that, the scheme must be in that list and ident's subvalue must conform to the specified style.

If only whitespace or a null is between the colon and the semicolon, that is equivalent to the subattribute not appearing. A final semicolon before the closing quote mark is optional and may be imputed. More subattributes might be added in the future, so page authors must not invent new ones in the meantime. No subvalue (e.g., "1723") could contain a colon or a semicolon. If one is needed or wanted, a character entity must represent the colon or the semicolon.

The nationality and the birthplace would include a nation using standard two-letter codes.
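As an illustration of the title syntax described above, here is a minimal Python sketch of how a consumer might split the value into subattributes. The function name parse_canonical_human is hypothetical and not part of the proposal. The sketch works on the attribute string as written, treats an empty or whitespace-only subvalue as absent, accepts a missing final semicolon, and leaves character-entity handling aside.

```python
def parse_canonical_human(title: str) -> dict[str, str]:
    """Split a canonical-human title value into a subattribute dict.

    Hypothetical helper: a sketch of the proposed syntax, not a
    reference implementation.
    """
    fields: dict[str, str] = {}
    for part in title.split(";"):
        # partition() splits on the first colon only, so a subvalue
        # such as "Honolulu, Hawaii, US" is kept whole.
        key, sep, value = part.partition(":")
        key = key.strip()
        value = value.strip()
        if not sep or not key or not value:
            # No colon (e.g. the piece after a trailing semicolon) or an
            # empty subvalue: treated as if the subattribute were absent.
            continue
        fields[key] = value
    return fields


example = (
    "name: Asashi T. Fung; born: 1723; died: 1799; "
    "flourished: 1740s-1750s; nationality: FR; "
    "birthplace: Honolulu, Hawaii, US; ident-scheme: ; ident: ;"
)
print(parse_canonical_human(example))
```

With the example value from the proposal, ident-scheme and ident are dropped because their subvalues are empty, and the remaining six subattributes are returned as strings.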
For nations that no longer exist and do not have two-letter codes, e.g., Roman Empire and Van Lang, longer codes must be used, since about 200 2-letter codes are already in use and only 676 exist, and longer codes would prevent future conflict or exhaustion. A list of deceased nations and their longer codes would have to be established, possibly based on a standard gazetteer.

The rel value of "canonical-human" avoids the legal meaning of _person_ in the U.S., and probably in other nations that rely on U.K. common law traditions, where it includes corporations and other legally-recognized entities. A value of "canonical-individual" may be too confusing if misunderstood as being about, say, pages and not people at all.

This is already in the RelExtensions wiki, albeit without details.

No rev value would be meaningful.

Multiple link elements with this rel value would be permitted, and UAs should apply all of them. That permits multiple names (e.g., spellings), ident-schemes, and idents to identify the person more certainly.

Several other technologies fall short for the purpose:

--- Rel="me" is not adequate. The "me" points to a canonical page, not a person. And which page is canonical for Attila the Hun might not be subject to general agreement.

--- The Google Social Graph API is limited to URLs and name (http://code.google.com/apis/socialgraph/docs/attributes.html) (and I'm unclear how you use the API for HTML markup). The person in whom you're interested has to have URLs you consider authoritative. They may not exist. To use it to describe specific data about a person other than URLs requires believing any URLs you cite are stable. You'd often have to limit the URLs to those you control. That makes otherme not very useful for many famous and semifamous people, including those in history. Many Web pages are about historical figures and many more are about modern people who are likely to be significant in history, like heads of state.
--- FOAF is for XML and therefore is compatible with XHTML, but is a bit more complicated to use with HTML, because some of its requirements don't apply elsewhere in HTML. FOAF has many good features but, of 8 I proposed here, it lacks 6: death date, when flourished, nationality, birth place, and a way to refer to authoritative sources if they're not openly online (e.g., subscription databases and Who's Who books) (http://xmlns.com/foaf/spec/). In addition, despite having read probably dozens of books on Web matters (among hundreds on computers generally), I didn't recall FOAF. It deserves publicity, but HTML already has that and already has a mechanism to do what I'm proposing, a mechanism described in books on the language.

--- hCard and the closely-related RDFa grammar Google supports are too limited, because they don't have enough fields available. Parsers are to ignore anything not understood. A proposal for a date-of-death field is pending for hCard, but not for other fields, and accepting the one proposal may require abandoning the nearly 1:1 relationship with the vCard RFC. Multiple birth dates are required when we know, say, a person was born October 16 but not whether that was in 1919 or 1918, often the case with entertainers and older women, but hCard limits to a single date of birth or requires more vagueness than the known facts may justify. Birth and death dates may come from different calendars for people whose lives straddle a calendar change (one occurred about two and a half centuries ago in the U.S.) and hCard doesn't accommodate those changes. While fn is flexible enough, n isn't for some name methods found internationally, and n can be implied from fn, so the inflexibility creates erroneous results not attributable to the content author. An ident-scheme might refer to a large collection of biographies that may be in book form or in an access-limited website and thus not have a URL or a full URL, and hCard doesn't offer compatible properties.
Example:

--- Let's say Prof. X writes about Attila the Hun. So does Prof. Y. The two professors don't trust each other, but they agree on their subject and when he flourished. They don't want to link to each other's pages because they don't want to trust their rivals' work or stability. At the same time, search engines' content analyses are more geared to popular writing. One way scholarly writing may differ is by using key terms less often per thousand words of total text, because it's presumed readers already know what they're reading about, and that lowers ranking, which may increase the spread between their papers, making finding same-subject people results harder for searchers. And requiring search engines to analyze free-form text like "he was brought by the stork as the most beautiful baby you ever saw on April 16, 1963" to extract an identifying birthdate is too much to ask of an algorithm, so any technology we use for this general purpose is likely to need hand-coding, making the page author's time a factor.

--- Solution: If both professors place rel canonical-human "Attila the Hun" and what happen to be the same dates for flourishing or birth and death in their pages, once per page head, search engines can recognize that Prof. X and Prof. Y are almost certainly talking about the same person. The certainty will go up when using standard biographical identifiers. This becomes even more important when the name in question is coincidentally shared by multiple people, say, a Panamanian judge and an Indian moviemaker, and searchers aren't sure which nationality or occupation makes their subject important. The searchers want the search engines to separate the results by subject person. And the rel being essentially a line or two saves authoring time.

This proposal is for describing the human.
Multiple pages across many websites that have the same link element, because they contain the same personally identifying information, can be associated as representing the same person. That helps search engines. An enhancement request for "canonical-organization" is separate.
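To make the association step concrete, here is a rough Python sketch of how a search engine might group pages whose canonical-human data match. Everything in it is an assumption for illustration: the names identity_key, group_pages and KEY_FIELDS, the choice of which subattributes must agree, the case-insensitive comparison, the example URLs, and the sample dates. A real engine would presumably use looser matching and the ident subattributes as well.

```python
from collections import defaultdict

# Hypothetical choice of subattributes that must agree for a match.
KEY_FIELDS = ("name", "born", "died", "flourished", "nationality")


def identity_key(title: str) -> tuple[str, ...]:
    """Reduce a canonical-human title value to a comparable key."""
    fields: dict[str, str] = {}
    for part in title.split(";"):
        key, sep, value = part.partition(":")
        if sep and value.strip():
            # Collapse runs of whitespace and ignore case (an assumption).
            fields[key.strip()] = " ".join(value.split()).casefold()
    return tuple(fields.get(k, "") for k in KEY_FIELDS)


def group_pages(pages: list[tuple[str, str]]) -> dict[tuple[str, ...], list[str]]:
    """Group (url, title value) pairs that carry identical identifying data."""
    groups: defaultdict[tuple[str, ...], list[str]] = defaultdict(list)
    for url, title in pages:
        groups[identity_key(title)].append(url)
    return dict(groups)


# Illustrative data only: two rival pages on one subject, plus a page
# about a different person who happens to share the name.
pages = [
    ("http://prof-x.example/attila", "name: Attila the Hun; flourished: 430s-450s"),
    ("http://prof-y.example/attila", "name: Attila the Hun;  flourished: 430s-450s;"),
    ("http://other.example/judge", "name: Attila the Hun; born: 1950; nationality: PA"),
]
for key, urls in group_pages(pages).items():
    print(key, urls)
```

In this sketch the two professors' pages fall into one group without linking to each other, while the page about the namesake stays separate because its birth year and nationality differ.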
Received on Wednesday, 28 April 2010 21:49:19 UTC