Canonicalization of schema.org URLs

This issue is relevant to two recently-opened issues and a Google+
discussion, which is why I'm posting to the list rather than adding a
comment to one or the other on Github.

schema.org URLs should resolve the HTTPS form
https://github.com/schemaorg/schemaorg/issues/716

Extension pages should give canonical URL explicitly
https://github.com/schemaorg/schemaorg/issues/717

Ganymede goes live as schema.org v 2.1 [see comments]
https://plus.google.com/106943062990152739506/posts/eT7SjF22rhy

There seems to be a nascent consensus forming around "standard" methods of
canonicalizing schema.org URLs:

   - 301 redirect from the HTTP form of a URL to the HTTPS form
   - 301 redirect from www.schema.org to schema.org

Note that this consensus is around the desired end state, and that - as Dan
Brickley noted in a comment on 716 - there are challenges that need to be
overcome before these measures can be rolled out.
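
To make the intended end state concrete, here is a minimal sketch of where a
client would land after both redirects. The function name canonicalize is
hypothetical, and the rule that subdomains such as bib.schema.org are also
upgraded to HTTPS is my assumption; this is an illustration, not a
description of how schema.org is actually configured.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Return the URL a client would reach after the two 301 redirects.

    Hypothetical helper: it only models the two rules listed above and
    leaves non-schema.org URLs untouched.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # note: any explicit port is dropped

    # Only schema.org and its subdomains are in scope (assumption).
    if host != "schema.org" and not host.endswith(".schema.org"):
        return url

    # Rule 2: www.schema.org -> schema.org
    if host == "www.schema.org":
        host = "schema.org"

    # Rule 1: HTTP form -> HTTPS form
    scheme = "https" if parts.scheme == "http" else parts.scheme

    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))
```

Under these assumptions, canonicalize("http://www.schema.org/Thesis") and
canonicalize("https://schema.org/Thesis") both yield the same HTTPS URL on
the bare schema.org host.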

The third "standard" canonicalization method - in this case specifically
for search engines - is:


   - Declare the preferred URL using rel="canonical"

This protocol has long been supported by all the sponsors [1,2,3], and the
thrust of the declaration is fairly straightforward:  "hey search engine,
thanks for visiting this URL, but the one you really should be indexing and
serving up in response to queries is the value provided by the href
attribute of rel='canonical'."
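
As a rough sketch of what a consumer of that declaration does, the snippet
below pulls the rel="canonical" href out of a page using Python's standard
html.parser. The SAMPLE_PAGE markup is hypothetical (neither page necessarily
carries such a link element today), and the names CanonicalFinder and
find_canonical are mine.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated token list, compared case-insensitively
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the declared canonical URL, or None if the page declares none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page head, for illustration only.
SAMPLE_PAGE = """<html><head>
<link rel="canonical" href="https://bib.schema.org/Thesis">
</head><body></body></html>"""
```

Calling find_canonical(SAMPLE_PAGE) returns the href given in the link
element; a page with no such element gives None, meaning the search engine
is left to choose for itself.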

As per 717 and Richard Wallis' work on that, we now have a statement about
the "Canonical URL" on types and properties that reside on a
reviewed/hosted extension.

E.g. on:
https://bib.schema.org/Thesis
We have:
Canonical URL: http://schema.org/Thesis

On one hand this is problematic only insofar as this use of "canonical" is
at variance with the use of "canonical" in rel="canonical".

That is, for that declaration the canonical would be exactly the opposite
of the one provided in the on-page text.  If one had to direct the search
engines which URL to index it would absolutely be
https://bib.schema.org/Thesis because that's where the information about
https://bib.schema.org/Thesis resides.  If the search engines were directed
to prefer http://schema.org/Thesis over https://bib.schema.org/Thesis and
actually did so, none of the descriptions (indeed no information beyond the
term itself) would ever be indexed for an extension, as the only data on the
schema.org page is a stub.  I.e. the only data http://schema.org/Thesis
provides is a link to the place where that type is fully described.

All fine and well if the variance between the two uses is merely a matter
of semantics (ha) - that is, if no actual rel="canonical" value were
provided for either page, allowing both to be indexed -  although just to
reduce confusion it would be nice if the descriptive and declarative terms
were used in the same way.

On the other hand, I don't actually understand what "Canonical URL" means
on those pages.  This is the URL that ... which human or data consumer
should use?  If I wanted to find out information (either as a human or data
consumer) about bib.schema.org/Thesis I'm not going to find it at
schema.org/Thesis, as the latter only points to the former, which seems
circular.  Color me confused, and I'd appreciate any elucidation on what
"Canonical URL" is meant to convey on these extension pages (sorry I was
unable to come to a clear understanding of its employment despite your
detailed comment on the subject, Dan).

While it might indeed turn out that this is correct from a data handling
point of view, it could readily be misconstrued by webmasters (who are
mostly accustomed to its employment in the rel="canonical" sense) to mean
that they should be using the schema.org URL, rather than the
bib.schema.org URL, in their declarations.  As per the example on the bib
page, maybe this isn't a misstep at all, but exactly how extension URLs are
supposed to work?

<div itemscope itemtype="http://schema.org/Thesis">

Again, all fine and well if that's the case except, again, we have this use
of "canonical" to mean "use this URL if you're marking up code", contrary
to the HTML declaration that would, applied here, mean "use this URL if
you're a search engine", which would be very much wrong for the reasons
described above.  In this scenario it might be worthwhile to come up with
an alternative to the on-page term "Canonical URL" to avoid confusion.

I'll note too that there's the possibility a type or property could be
duplicated by different extensions unless they're kept locked down.

https://furniture.schema.org/chair
https://meetings.schema.org/chair

And even if extension terms were locked down so that schema.org/[term] can
only refer to its use by the core or a single extension, as Dan has noted
this possibility is further complicated by external extensions.

FWIW in my view of a perfect world, bib.schema.org/Thesis would be the
canonical in both the colloquial and declarative sense, and it would be
possible to declare it as such:
<div itemscope itemtype="http://bib.schema.org/Thesis">

However, I suspect it's precisely the impossibility of that itemtype
declaration that's the sticking point, especially viewed in context of a
subsequently declared property that resides in the core vocabulary.
<span itemprop="name">A meandering dissertation on the use of "canonical"
at schema.org</span>
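
To illustrate the suspicion, here is a small sketch that assumes a consumer
expands a short itemprop name against a vocabulary prefix taken from the
itemtype (everything up to and including the last "/" or "#").  That
expansion rule is an assumption on my part, not a statement about how any
particular search engine parses microdata, and the function names are
hypothetical.

```python
def vocabulary_prefix(itemtype: str) -> str:
    """Return the itemtype up to and including its last '/' or '#'."""
    cut = max(itemtype.rfind("/"), itemtype.rfind("#"))
    return itemtype[: cut + 1]


def expand_property(itemtype: str, itemprop: str) -> str:
    """Expand a short property name against the itemtype's vocabulary.

    Property names that are already absolute URLs are returned unchanged.
    """
    if "://" in itemprop:
        return itemprop
    return vocabulary_prefix(itemtype) + itemprop
```

Under that assumption, expand_property("http://schema.org/Thesis", "name")
gives the core property URL, whereas the same call with
"http://bib.schema.org/Thesis" yields a "name" property under
bib.schema.org, which is not where the core property resides.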

Thanks for any feedback.

[1] https://support.google.com/webmasters/answer/139066?hl=en
[2] https://yandex.com/support/webmaster/controlling-robot/html.xml#canonical
[3] https://blogs.bing.com/webmaster/2011/10/06/managing-redirects-301s-302s-and-canonicals/
    - Bing's respect for rel="canonical" currently extends to Yahoo! as Bing
    fuels Yahoo results

(Yes, "canonicalization" and ""canonicalization" *do* generate spelling
error flags.:).

Received on Friday, 7 August 2015 18:23:41 UTC