[whatwg] Helping people searching for content filtered by license from Ian Hickson on 2009-05-08 (public-whatwg-archive@w3.org from May 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 8 May 2009 19:57:31 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0905072258380.18851@hixie.dreamhostps.com>

One of the use cases I collected from the e-mails sent in over the past
few months was the following:

USE CASE: Help people searching for content to find content covered by
licenses that suit their needs.

SCENARIOS:
* If a user is looking for recipes of pies to reproduce on his blog, he
might want to exclude from his results any recipes that are not
available under a license allowing non-commercial reproduction.
* Lucy wants to publish her papers online. She includes an abstract of
each one in a page, but because they are under different copyright
rules, she needs to clarify what the rules are. A harvester such as
the Open Access project can actually collect and index some of them
with no problem, but may not be allowed to index others. Meanwhile, a
human finds it more useful to see the abstracts on a page than have to
guess from a bunch of titles whether to look at each abstract.
* There are mapping organisations and data producers and people who take
photos, and each may apply different policies. Being able to keep that
policy information helps people making further mashups avoid
violating a policy. For example, if GreatMaps.com has a public domain
policy on their maps, CoolFotos.org has a policy that you can use data
other than images for non-commercial purposes, and Johan Ichikawa has
a photo there of my brother's cafe, which he has licensed as "must pay
money", then it would be reasonable for me to copy the map and put it
in a brochure for the cafe, but not to copy the data and photo from
CoolFotos. On the other hand, if I am producing a non-commercial guide
to cafes in Melbourne, I can add the map and the location of the cafe
photo, but not the photo itself.
* Tara runs a video sharing web site for people who want licensing
information to be included with their videos. When Paul wants to blog
about a video, he can paste a fragment of HTML provided by Tara
directly into his blog. The video is then available inline in his
blog, along with any licensing information about the video.
* Fred's browser can tell him what license a particular video on a site
he is reading has been released under, and advise him on what the
associated permissions and restrictions are (can he redistribute this
work for commercial purposes, can he distribute a modified version of
this work, how should he assign credit to the original author, what
jurisdiction the license assumes, whether the license allows the work
to be embedded into a work that uses content under various other
licenses, etc).
* Flickr has images that are CC-licensed, but the pages themselves are
not.
* Blogs may wish to reuse CC-licensed images without licensing the whole
blog as CC, but while still including attribution and license
information (which may be required by the licenses in question).

REQUIREMENTS:
* Content on a page might be covered by a different license than other
content on the same page.
* When licensing a subpart of the page, existing implementations must
not just assume that the license applies to the whole page rather than
just part of it.
* License proliferation should be discouraged.
* License information should be able to survive from one site to another
as the data is transferred.
* Expressing copyright licensing terms should be easy for content
creators, publishers, and redistributors to provide.
* It should be more convenient for the users (and tools) to find and
evaluate copyright statements and licenses than it is today.
* Shouldn't require the consumer to write XSLT or server-side code to
process the license information.
* Machine-readable licensing information shouldn't be on a separate page
than human-readable licensing information.
* There should not be ambiguous legal implications.
* Parsing rules should be unambiguous.
* Should not require changes to HTML5 parsing rules.

The scenarios described above fall into three categories: searching for
content, publishing content, and obtaining legal advice.

First, I will examine the search scenario:

* If a user is looking for recipes of pies to reproduce on his blog, he
might want to exclude from his results any recipes that are not
available under a license allowing non-commercial reproduction.

This is technically possible today. The rel="license" link type allows
authors to specify the license that applies to the main content on a page
(in this case, a recipe); search engines can be programmed with the most
common licenses; and the user can tell the search engine what
characteristics he wants ("compatible with GPLv2", "no advertising
clause", "doesn't have patent implications", "allows redistribution to
countries on the US blacklist").
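
As a rough illustration of how a tool might consume this, the sketch below
collects the rel="license" links from a page using Python's standard
html.parser module. The names LicenseLinkParser and find_licenses, and the
sample recipe page, are assumptions made for this example; nothing here is
defined by the spec.

```python
import re
from html.parser import HTMLParser

# HTML's space characters: space, tab, line feed, form feed, carriage return.
_SPACES = re.compile(r"[ \t\n\f\r]+")


class LicenseLinkParser(HTMLParser):
    """Collects the href of every <a>, <area>, or <link> whose rel includes "license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area", "link"):
            return
        attr = dict(attrs)
        rel = attr.get("rel") or ""
        tokens = [t for t in _SPACES.split(rel.lower()) if t]
        if "license" in tokens and attr.get("href"):
            self.licenses.append(attr["href"])


def find_licenses(html):
    """Return the license URLs a page declares via rel="license"."""
    parser = LicenseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.licenses


# Hypothetical recipe page, one recipe per URL.
page = """
<h1>Apple pie</h1>
<p>Recipe text goes here.</p>
<p><a rel="license"
      href="http://creativecommons.org/licenses/by-nc-sa/2.5/">License</a></p>
"""

print(find_licenses(page))
# → ['http://creativecommons.org/licenses/by-nc-sa/2.5/']
```

A page with no such link simply yields an empty list, which a search engine
would presumably treat as "license unknown".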

This has some implications:

- Each unit of content (recipe in this case) must have its own
independent page at a distinct URL. This is actually good practice
anyway today for making content discoverable from search engines, and
it is compatible with what people already do, so this seems fine.

- New licenses are discouraged, as they would not be automatically
supported by search engines. This is needed by one of the requirements:

* License proliferation should be discouraged.

This solution is already deployed on such sites as Flickr, and already
supported on search engines such as Google.
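
To make the "programmed with the most common licenses" idea concrete, here
is a minimal sketch of a license filter over search results. The table
KNOWN_LICENSES, the trait names, and the example.com result URLs are all
hypothetical; the trait values are a simplified reading of each license for
illustration only.

```python
# Hypothetical table of licenses the search engine has been programmed with.
# The keys are real license URLs; the flags are illustrative simplifications.
KNOWN_LICENSES = {
    "http://creativecommons.org/licenses/by/2.5/": {
        "noncommercial_reuse": True,
        "commercial_reuse": True,
    },
    "http://creativecommons.org/licenses/by-nc-sa/2.5/": {
        "noncommercial_reuse": True,
        "commercial_reuse": False,
    },
}


def filter_results(results, requirement):
    """Keep result URLs whose license is known and satisfies the requirement.

    `results` is a list of (page_url, license_url_or_None) pairs.
    """
    kept = []
    for page_url, license_url in results:
        traits = KNOWN_LICENSES.get(license_url)
        if traits is not None and traits.get(requirement, False):
            kept.append(page_url)
    return kept


results = [
    ("http://recipes.example.com/apple-pie",
     "http://creativecommons.org/licenses/by-nc-sa/2.5/"),
    ("http://recipes.example.com/cherry-pie", None),
    ("http://recipes.example.com/pecan-pie",
     "http://recipes.example.com/my-own-license"),
]

print(filter_results(results, "noncommercial_reuse"))
# → ['http://recipes.example.com/apple-pie']
print(filter_results(results, "commercial_reuse"))
# → []
```

Note that the page with a home-grown license is dropped even though it might
permit reuse; this is the mechanism by which unknown licenses end up
discouraged.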

Next, I will look at the content publishing scenarios:

* Lucy wants to publish her papers online. She includes an abstract of
each one in a page, but because they are under different copyright
rules, she needs to clarify what the rules are. A harvester such as
the Open Access project can actually collect and index some of them
with no problem, but may not be allowed to index others. Meanwhile, a
human finds it more useful to see the abstracts on a page than have to
guess from a bunch of titles whether to look at each abstract.

This really boils down to two points:

- Being able to include the license of various items on a page for humans
to read.

- Being able to control what harvesters (spiders) index.

Being able to include a license on a page is easy: you just include the
license name and a link to the license. Since this is for the user in this
case, there is no need for any special markup.

Controlling harvesters is a separate problem, and a well-understood one
with a number of well-understood solutions. For site-wide control, there is
robots.txt, which can target individual spiders (as in this case). On a
page-by-page basis, there is the "noindex" value of the
<meta name="robots"> element.
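
A harvester honouring both mechanisms could look roughly like the sketch
below, which uses Python's standard urllib.robotparser and html.parser
modules. The robots.txt contents, the spider name ExampleHarvester, the
lucy.example.org URLs, and the helper may_index are assumptions made for
illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for Lucy's site: one named harvester is kept out
# of one directory; every other spider may fetch everything.
ROBOTS_TXT = """\
User-agent: ExampleHarvester
Disallow: /papers/restricted/

User-agent: *
Disallow:
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


class RobotsMetaParser(HTMLParser):
    """Sets self.noindex if the page has <meta name="robots" content="noindex">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "noindex" in directives:
                self.noindex = True


def may_index(agent, url, html):
    """True if robots.txt lets the agent fetch the URL and the page does not opt out."""
    if not rules.can_fetch(agent, url):
        return False
    meta = RobotsMetaParser()
    meta.feed(html)
    meta.close()
    return not meta.noindex


open_page = "<title>Abstract 1</title><p>Abstract text.</p>"
private_page = '<meta name="robots" content="noindex"><title>Abstract 2</title>'

base = "http://lucy.example.org/papers/"
print(may_index("ExampleHarvester", base + "open/1.html", open_page))        # → True
print(may_index("ExampleHarvester", base + "restricted/2.html", open_page))  # → False
print(may_index("OtherBot", base + "open/3.html", private_page))             # → False
```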

Thus this particular scenario doesn't require any new features.

* There are mapping organisations and data producers and people who take
photos, and each may apply different policies. Being able to keep that
policy information helps people making further mashups avoid
violating a policy. For example, if GreatMaps.com has a public domain
policy on their maps, CoolFotos.org has a policy that you can use data
other than images for non-commercial purposes, and Johan Ichikawa has
a photo there of my brother's cafe, which he has licensed as "must pay
money", then it would be reasonable for me to copy the map and put it
in a brochure for the cafe, but not to copy the data and photo from
CoolFotos. On the other hand, if I am producing a non-commercial guide
to cafes in Melbourne, I can add the map and the location of the cafe
photo, but not the photo itself.

This doesn't seem to require any technological solution at all; it seems
to be purely a legal issue. So long as the licenses are clearly stated, as
they presumably must be (for example, the MIT license requires the
copyright notice to accompany the work wherever it is copied, the Creative
Commons licenses require the license or its URL to be published with any
reproductions, etc), there is no need for any markup.

* Tara runs a video sharing web site for people who want licensing
information to be included with their videos. When Paul wants to blog
about a video, he can paste a fragment of HTML provided by Tara
directly into his blog. The video is then available inline in his
blog, along with any licensing information about the video.

(Really? A video sharing site dedicated to people who want licensing
information to be included with their videos? That's a pretty specific
audience, wow.)

This can be done with HTML5 today. For example, here is the markup you
could include to allow someone to embed a video on their site while
including the copyright or license information:

  <figure>
   <video src="http://example.com/videodata/sJf-ulirNRk" controls>
    <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
   </video>
   <legend>
    Pillar post surgery, starting to heal.
    <small>&copy; copyright 2008 Pillar. All Rights Reserved.</small>
   </legend>
  </figure>

* Flickr has images that are CC-licensed, but the pages themselves are
not.

I've clarified the HTML5 spec's definition of rel=license and included an
example showing a page based on what Flickr is doing.

For search, this has the right result (when a search returns an arbitrary
Flickr page, it's because the user is looking for the image, which is what
the rel=license link Flickr gives is for), and it is clear to the user
(they see the license information clearly, and there's no confusion that
it might apply to the rest of the page).

* Blogs may wish to reuse CC-licensed images without licensing the whole
blog as CC, but while still including attribution and license
information (which may be required by the licenses in question).

The example above shows this for a movie, but it works as well for a
photo:

  <figure>
   <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
   <legend>
    Picture by Bob.
    <small><a href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">Creative
    Commons Attribution-Noncommercial-Share Alike 2.5 Generic License</a></small>
   </legend>
  </figure>

Admittedly, if this scenario is taken in the context of the first
scenario, meaning that Bob wants this image to be discoverable through
search, but doesn't want to include it on a page of its own, then extra
syntax to mark this particular image up would be useful.

However, in my research I found very few such cases. In every case where I
found multiple media items on a single page with no dedicated page, either
every item was licensed identically and was the main content of the page,
or each item had its own separate page, or the items were licensed under
the same license as the page. In all three of these cases, rel=license
already solves the problem today. This discourages people from using
multiple licenses, of course, but that's actually a good thing, as it
discourages license proliferation.

* Fred's browser can tell him what license a particular video on a site
he is reading has been released under, and advise him on what the
associated permissions and restrictions are (can he redistribute this
work for commercial purposes, can he distribute a modified version of
this work, how should he assign credit to the original author, what
jurisdiction the license assumes, whether the license allows the work
to be embedded into a work that uses content under various other
licenses, etc).

Advising a user on the legal implications of a license is something that
needs trained professionals, but given a particular license, advice could
be provided in canned form. So it seems like this is already possible: the
user just has to select a license from a list of licenses. A user agent
could pre-select a license based on the value of the page's rel=license
link(s), or based on scanning the page for mention of a license, too.
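
A user agent doing this could be sketched as follows. The table
CANNED_ADVICE and the helpers normalize, preselect, and advise are
hypothetical, and the advice strings are short illustrative summaries, not
legal advice and not text taken from the licenses.

```python
# Hypothetical canned summaries keyed by license URL (illustrative only).
CANNED_ADVICE = {
    "http://creativecommons.org/licenses/by/2.5/":
        "Attribution required; commercial use and adaptations allowed.",
    "http://creativecommons.org/licenses/by-nc-sa/2.5/":
        "Attribution required; non-commercial use only; "
        "adaptations must be shared under the same license.",
}


def normalize(href):
    """Map a link to a license's legal code onto the license's main URL."""
    if href.endswith("legalcode"):
        href = href[: -len("legalcode")]
    return href


def preselect(license_links):
    """Return the first rel=license target that is a known license, else None."""
    for href in license_links:
        key = normalize(href)
        if key in CANNED_ADVICE:
            return key
    return None


def advise(license_links):
    """Return canned advice for the pre-selected license, or ask the user to choose."""
    key = preselect(license_links)
    if key is None:
        return "No known license found; please pick one from the list."
    return CANNED_ADVICE[key]


links = ["http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode"]
print(advise(links))
print(advise(["http://example.com/custom-terms"]))
```

When nothing on the page matches a known license, the fallback keeps the
user in control by asking them to select a license from the list.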

I will now examine how these solutions fit the requirements:

* Content on a page might be covered by a different license than other
content on the same page.

For the search case, this is handled by separating that content so that
each item that is to be discoverable via a license filter is available at
its own URL. This appears to fit the existing practice.

* When licensing a subpart of the page, existing implementations must
not just assume that the license applies to the whole page rather than
just part of it.

This is resolved by not having a mechanism for machine-readably licensing
just part of a page, and instead putting such content on its own page,
which leads to a better experience anyway from a search perspective.

* License proliferation should be discouraged.

I've mentioned several ways in which this is achieved.

* License information should be able to survive from one site to another
as the data is transferred.

This is clearly possible if the license information is mere text, and
indeed some of the licenses already require this anyway.

* Expressing copyright licensing terms should be easy for content
creators, publishers, and redistributors to provide.

It is simple to express license terms, since no special syntax beyond the
legally-required verbiage is necessary.

* It should be more convenient for the users (and tools) to find and
evaluate copyright statements and licenses than it is today.

This requirement is not met. I do not know of any way to improve matters
beyond what is available today, unfortunately. This would probably require
legislative simplifications of copyright law, which, while probably quite
desirable, are somewhat out of scope of HTML5.

* Shouldn't require the consumer to write XSLT or server-side code to
process the license information.

No XSLT is necessary for consuming rel="license". Some code is necessary
to make a search engine index rel="license" data, but that seems
inevitable and, in practice, is a small amount of code relative to the
rest of the code involved in an indexing project.

* Machine-readable licensing information shouldn't be on a separate page
than human-readable licensing information.

Since the only machine-readable licensing information that HTML5 provides
for is rel="license", and that can only be included as part of a link on
the page with the license statement, this requirement is met.

* There should not be ambiguous legal implications.

This requirement is met only insofar as rel="license" doesn't make any
firm legal statements; the page is required to be unambiguous about what
is considered the main content. In practice I think we can see that sites
like Flickr have found that this is possible. Beyond that, since the only
legal implications here are those that would be present without any markup
at all (i.e. copyright statements, etc), the legal implications of HTML5's
features seem minimal.

* Parsing rules should be unambiguous.

The rules for parsing rel="" values are well-understood and clear, I
believe.
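
For instance, a rel="" value is a set of space-separated tokens compared
ASCII case-insensitively, so a consumer can be as small as the sketch
below. The function names rel_tokens and is_license_link are assumptions
made for this example.

```python
import re
import string

# HTML's space characters: space, tab, line feed, form feed, carriage return.
_SPACES = re.compile(r"[ \t\n\f\r]+")

# Lowercase only A-Z, so that the comparison is ASCII case-insensitive.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def rel_tokens(value):
    """Split a rel attribute value into a set of lowercased keywords."""
    return {t for t in _SPACES.split(value.translate(_ASCII_LOWER)) if t}


def is_license_link(rel_value):
    """True if the rel value contains the "license" keyword as a whole token."""
    return "license" in rel_tokens(rel_value)


print(is_license_link("license"))               # → True
print(is_license_link("  NoFollow\tLICENSE "))  # → True
print(is_license_link("licensed"))              # → False
print(is_license_link(""))                      # → False
```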

* Should not require changes to HTML5 parsing rules.

No parsing rules changed during the addressing of these use cases.

In conclusion, it seems most of these use cases are already handled by the
current text in the spec and do not show a need for a more elaborate
scheme. The rel="license" feature in particular handles search adequately,
and is already deployed both in consumers and generators. Two areas where
we could add more syntax-level support would be in licensing subparts
explicitly, and in providing machine-readable licenses. The former seems
like an obvious need but actual deployed content doesn't seem to need it,
since most individually licensed works exist on pages of their own
already, or are covered by the same license as other works on the same
page. The latter is a dangerous area to get into, as licenses have very
specific legal wording. We could never make the machine-readable text have
legal standing, and thus people couldn't rely on it to draw conclusions
anyway. Furthermore, it would encourage license proliferation, which is
already a problem.

A number of further use cases remain to be examined, including some more
specifically looking at attribution rather than licensing. I will send
further e-mail next week as I address them.

--
Ian Hickson U+1047E )\._.,--....,'``. fL
http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 8 May 2009 12:57:31 UTC