ISSUE-115 (rel-canonical-human): link tag: rel: associate pages about the same person across many sites for searches without a canonical page and despite a confusingly indistinct name [HTML 5 spec]

http://www.w3.org/html/wg/tracker/issues/115

Raised by: Maciej Stachowiak
On product: HTML 5 spec

Escalated from: http://www.w3.org/Bugs/Public/show_bug.cgi?id=7681
Requested by: Nick Levinson

Different websites may have pages about the same person. Several people may
have the same name and all may be written about on multiple sites. Search
engines have difficulty associating pages that are about the same person
without erroneously intermixing other people with the same name, especially
when none of the people are extraordinarily famous and popular in searches
(when they are, search engines may have algorithms for more sophisticated
associational analysis).

Libraries solve this for authors by distinguishing among them with birth years
and death years. Other biographical sources offer vague dates for when someone
flourished and, to distinguish someone, commonly provide nationalities or birth
places.

Without a standard method, one search engine, A9, tried grouping people in
results and, in my observation, failed abysmally. They no longer offer the
service. This proposal would provide page authors with a tool that search
engines could read for much better grouping of results.

A link element naming the person and providing, in the element, data that is
standardized could help search engines organize their listings to reduce
accidental intermixing. It wouldn't be perfect; e.g., a person may have
reported multiple ages from which different birth years are calculated; a
website owner may erroneously enter the wrong data; nationality may vary with a
citizenship change; or historians may disagree. But, in general, listings with
this element could be more successfully separated.

Writing and parsing the link element would be a bit more complex than with
other link elements, but I think this is manageable and the method I propose
has been applied elsewhere.

I propose that the rel value be "canonical-human" and that its title attribute
be reserved for a special meaning and syntax. The title attribute's syntax
would be in the form of title="name: Asashi T. Fung; born: 1723; died: 1799;
flourished: 1740s-1750s; nationality: FR; birthplace: Honolulu, Hawaii, US;
ident-scheme: ; ident: ;". No href attribute is needed.

Each subattribute (e.g., "name") would be optional. For example, "flourished"
would likely be used only when birth and death years are unknown.
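
As a rough illustration of the syntax above, here is a minimal Python sketch that assembles such a link element from a dictionary. The function name build_link and the choice to escape "&" and the double quote alongside the colon and semicolon are assumptions made for the sketch, not part of the proposal.

```python
# Subattribute names taken from the example title value in the proposal.
SUBATTRIBUTES = ("name", "born", "died", "flourished",
                 "nationality", "birthplace", "ident-scheme", "ident")

# Characters replaced by character references inside a subvalue. A single
# translate() pass avoids re-escaping the semicolon that ends "&#58;".
_ESCAPES = str.maketrans({
    "&": "&amp;",
    '"': "&quot;",
    ":": "&#58;",
    ";": "&#59;",
})


def build_link(person):
    """Return a link element for a dict of subattribute -> subvalue.

    Hypothetical helper: subattributes that are missing or empty are
    simply left out, since every subattribute is optional.
    """
    parts = []
    for key in SUBATTRIBUTES:
        value = str(person.get(key) or "").strip()
        if value:
            parts.append(f"{key}: {value.translate(_ESCAPES)}")
    return '<link rel="canonical-human" title="{}">'.format("; ".join(parts))


print(build_link({"name": "Asashi T. Fung", "born": 1723,
                  "died": 1799, "nationality": "FR"}))
# prints: <link rel="canonical-human" title="name: Asashi T. Fung; born: 1723; died: 1799; nationality: FR">
```

The sketch omits the optional final semicolon and writes no href attribute, matching the form described above.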

For the subattribute birthplace, if a subvalue is supplied, a nation would be
required. The nation of the birthplace would be represented by one of the same
codes used for nationality.

For the subattribute ident-scheme, a list of schemes could be developed later,
perhaps each to be prefixed by a code for the scheme's nation and a hyphen.
Schemes could include privately-owned but widely available databases of
moderately-well-known people. Subvalues for ident-scheme and ident must not be
entered until a list of schemes, and the style of ident values for each scheme,
is centralized. After that, the scheme must be in that list and ident's
subvalue must conform to the specified style.

If only whitespace or a null is between the colon and the semicolon, that is
equivalent to the subattribute not appearing.

A final semicolon before the closing quote mark is optional and may be imputed.

More subattributes might be added in the future, so page authors must not
invent new ones in the meantime.

No subvalue (e.g., "1723") could contain a literal colon or semicolon. If one
is needed or wanted, a character entity must represent the colon or the
semicolon.
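
The parsing rules above might be sketched in Python as follows. This is only an illustration under stated assumptions: parse_title is a hypothetical name, the input is assumed to be the raw attribute text before any character references are decoded, subattribute names are compared case-insensitively, and unrecognized subattributes are silently ignored. A bare "&" followed by letters and a semicolon would be misread as a character reference, so the sketch expects well-formed markup.

```python
import html
import re

# Subattribute names taken from the example title value in the proposal.
KNOWN = frozenset(("name", "born", "died", "flourished",
                   "nationality", "birthplace", "ident-scheme", "ident"))

# One field is a run of characters up to the next separating semicolon.
# A character reference such as &#58; ends in its own semicolon, so it is
# matched as a unit first and is never treated as a separator.
_FIELD = re.compile(r"(?:&#?\w+;|[^;])+")


def parse_title(raw_title):
    """Return a dict of subattribute -> subvalue from a raw title value."""
    fields = {}
    for chunk in _FIELD.findall(raw_title):
        key, colon, value = chunk.partition(":")
        key = key.strip().lower()
        # Decode character references only after the field has been split.
        value = html.unescape(value.strip())
        if not colon or key not in KNOWN or not value:
            # No colon, an unknown name, or an empty subvalue: treated as
            # if the subattribute did not appear.
            continue
        fields[key] = value
    return fields
```

With this sketch, a title with and without the final semicolon parses to the same result, and "ident-scheme: ; ident: ;" contributes nothing.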

The nationality and the birthplace would include a nation using standard
two-letter codes. For nations that no longer exist and do not have two-letter
codes, e.g., Roman Empire and Van Lang, longer codes must be used, since about
200 2-letter codes are already in use and only 676 exist, and longer codes
would prevent future conflict or exhaustion. A list of deceased nations and
their longer codes would have to be established, possibly based on a standard
gazetteer.

The rel value of "canonical-human" avoids the legal meaning of _person_ in the
U.S., and probably in other nations that rely on U.K. common law traditions,
where it includes corporations and other legally-recognized entities. A value
of "canonical-individual" may be too confusing if misunderstood as being about,
say, pages and not people at all.

This is already in the RelExtensions wiki, albeit without details.

No rev value would be meaningful.

Multiple link elements with this rel value would be permitted, and UAs should
apply all of them. That permits multiple names (e.g., spellings),
ident-schemes, and idents to identify the person more certainly.
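
To illustrate how a consumer might gather every such link element from a page, here is a small sketch using Python's standard html.parser module. The class and function names are hypothetical. Note that HTMLParser hands over attribute values with character references already decoded, so an escaped colon or semicolon inside a subvalue would arrive as a literal character; how that interacts with the escaping rule is left open here.

```python
from html.parser import HTMLParser


class CanonicalHumanCollector(HTMLParser):
    """Collect the title of every link element with rel canonical-human."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated list of tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        title = attributes.get("title")
        if "canonical-human" in rel_tokens and title:
            self.titles.append(title)


def collect_titles(page_source):
    """Return all canonical-human title values found in the page source."""
    collector = CanonicalHumanCollector()
    collector.feed(page_source)
    collector.close()
    return collector.titles
```

A page carrying two such elements, say with two spellings of a name, would yield a list of two title strings, each of which can then be examined separately.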

Several other technologies fall short for the purpose:

--- Rel="me" is not adequate. The "me" points to a canonical page, not a
person. And which page is canonical for Attila the Hun might not be subject to
general agreement.

--- The Google Social Graph API is limited to URLs and name
(http://code.google.com/apis/socialgraph/docs/attributes.html) (and I'm unclear
how you use the API for HTML markup). The person in whom you're interested has
to have URLs you consider authoritative. They may not exist. To use it to
describe specific data about a person other than URLs requires believing any
URLs you cite are stable. You'd often have to limit the URLs to those you
control. That makes otherme not very useful for many famous and semifamous
people, including those in history. Many Web pages are about historical figures
and many more are about modern people who are likely to be significant in
history, like heads of state.

--- FOAF is for XML and therefore is compatible with XHTML, but is a bit more
complicated to use with HTML, because some of its requirements don't apply
elsewhere in HTML. FOAF has many good features but, of the 8 subattributes I
proposed here, it lacks 6: death date, when flourished, nationality, birth
place, and the pair (ident-scheme and ident) that refers to authoritative
sources if they're not openly online (e.g., subscription databases and Who's
Who books) (http://xmlns.com/foaf/spec/). In addition,
despite having read probably dozens of books on Web matters (among hundreds on
computers generally), I didn't recall FOAF. It deserves publicity, but HTML
already has that and already has a mechanism to do what I'm proposing, a
mechanism described in books on the language.

--- hCard and the closely-related RDFa grammar Google supports are too limited,
because they don't have enough fields available. Parsers are to ignore anything
not understood. A proposal for a date-of-death field is pending for hCard, but
not for other fields, and accepting the one proposal may require abandoning the
nearly 1:1 relationship with the vCard RFC. Multiple birth dates are required
when we know, say, a person was born October 16 but not whether that was in
1919 or 1918, often the case with entertainers and older women, but hCard
limits to a single date of birth or requires more vagueness than the known
facts may justify. Birth and death dates may come from different calendars for
people whose lives straddle a calendar change (one occurred about two and a
half centuries ago in the U.S.) and hCard doesn't accommodate those changes.
While fn is flexible enough, n isn't flexible enough for some naming
conventions found internationally, and because n may be implied from fn, that
inflexibility can create erroneous results not attributable to the content
author. An ident-scheme might refer to a large
collection of biographies that may be in book form or in an access-limited
website and thus not have a URL or a full URL, and hCard doesn't offer
compatible properties.

Example:

--- Let's say Prof. X writes about Attila the Hun. So does Prof. Y. The two
professors don't trust each other, but they agree on their subject and when he
flourished. They don't want to link to each other's pages because they don't
want to trust their rivals' work or stability. At the same time, search
engines' content analyses are more geared to popular writing. One way scholarly
writing may differ is by using key terms less often per thousand words of total
text, because it's presumed readers already know what they're reading about,
and that lowers ranking, which may increase the spread between their papers,
making finding same-subject people results harder for searchers. And requiring
search engines to analyze free-form text like "he was brought by the stork as
the most beautiful baby you ever saw on April 16, 1963" to extract an
identifying birthdate is too much to ask of an algorithm, so any technology we
use for this general purpose is likely to need hand-coding, making the page
author's time a factor.

--- Solution: If both professors place a link element with rel
"canonical-human", the name "Attila the Hun", and what happen to be the same
dates for flourishing or birth and death in their pages, once per page head,
search engines can recognize that Prof. X and Prof. Y are almost certainly
talking about the same person. The certainty will
go up when using standard biographical identifiers. This becomes even more
important when the name in question is coincidentally shared by multiple
people, say, a Panamanian judge and an Indian moviemaker, and searchers aren't
sure which nationality or occupation makes their subject important. The
searchers want the search engines to separate the results by subject person.
And the rel being essentially a line or two saves authoring time.

This proposal is for describing the human. Multiple pages across many websites
that have the same link element, because they contain the same personally
identifying information, can be associated as representing the same person.
That helps search engines.
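
One way a search engine might perform that association is sketched below in Python. This is an assumption-laden illustration, not a description of any real engine: the function names are hypothetical, the input is assumed to be already-parsed subattributes, and pages are matched only on an exact, whitespace- and case-normalized combination of name, born, died, and flourished.

```python
from collections import defaultdict

# Subattributes compared in this sketch; an assumption, not part of the proposal.
KEY_FIELDS = ("name", "born", "died", "flourished")


def identity_key(fields):
    """Build a comparison key: lowercase, with runs of whitespace collapsed."""
    return tuple(" ".join(fields.get(k, "").lower().split())
                 for k in KEY_FIELDS)


def group_pages(pages):
    """Group URLs by identity key.

    pages is an iterable of (url, fields) pairs, where fields is a dict of
    subattribute -> subvalue for one canonical-human link element.
    """
    groups = defaultdict(list)
    for url, fields in pages:
        groups[identity_key(fields)].append(url)
    return dict(groups)
```

Two pages that both give the name "Asashi T. Fung" with born 1723 and died 1799 would land in one group, while a page about a namesake with a different birth year would land in another. Exact matching is deliberately naive: spelling variants of a name would need the multiple-element or ident mechanisms to be associated.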

An enhancement request for "canonical-organization" is separate.

Received on Wednesday, 28 April 2010 21:49:19 UTC