- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 8 May 2009 19:57:31 +0000 (UTC)
One of the use cases I collected from the e-mails sent in over the past few months was the following:

USE CASE: Help people searching for content to find content covered by licenses that suit their needs.

SCENARIOS:

   * If a user is looking for recipes of pies to reproduce on his blog, he might want to exclude from his results any recipes that are not available under a license allowing non-commercial reproduction.

   * Lucy wants to publish her papers online. She includes an abstract of each one in a page, but because they are under different copyright rules, she needs to clarify what the rules are. A harvester such as the Open Access project can actually collect and index some of them with no problem, but may not be allowed to index others. Meanwhile, a human finds it more useful to see the abstracts on a page than have to guess from a bunch of titles whether to look at each abstract.

   * There are mapping organisations and data producers and people who take photos, and each may place different policies. Being able to keep that policy information helps people with further mashups avoiding violating a policy. For example, if GreatMaps.com has a public domain policy on their maps, CoolFotos.org has a policy that you can use data other than images for non-commercial purposes, and Johan Ichikawa has a photo there of my brother's cafe, which he has licensed as "must pay money", then it would be reasonable for me to copy the map and put it in a brochure for the cafe, but not to copy the data and photo from CoolFotos. On the other hand, if I am producing a non-commercial guide to cafes in Melbourne, I can add the map and the location of the cafe photo, but not the photo itself.

   * Tara runs a video sharing web site for people who want licensing information to be included with their videos. When Paul wants to blog about a video, he can paste a fragment of HTML provided by Tara directly into his blog. The video is then available inline in his blog, along with any licensing information about the video.

   * Fred's browser can tell him what license a particular video on a site he is reading has been released under, and advise him on what the associated permissions and restrictions are (can he redistribute this work for commercial purposes, can he distribute a modified version of this work, how should he assign credit to the original author, what jurisdiction the license assumes, whether the license allows the work to be embedded into a work that uses content under various other licenses, etc).

   * Flickr has images that are CC-licensed, but the pages themselves are not.

   * Blogs may wish to reuse CC-licensed images without licensing the whole blog as CC, but while still including attribution and license information (which may be required by the licenses in question).

REQUIREMENTS:

   * Content on a page might be covered by a different license than other content on the same page.

   * When licensing a subpart of the page, existing implementations must not just assume that the license applies to the whole page rather than just part of it.

   * License proliferation should be discouraged.

   * License information should be able to survive from one site to another as the data is transferred.

   * Expressing copyright licensing terms should be easy for content creators, publishers, and redistributors to provide.

   * It should be more convenient for the users (and tools) to find and evaluate copyright statements and licenses than it is today.

   * Shouldn't require the consumer to write XSLT or server-side code to process the license information.

   * Machine-readable licensing information shouldn't be on a separate page than human-readable licensing information.

   * There should not be ambiguous legal implications.

   * Parsing rules should be unambiguous.

   * Should not require changes to HTML5 parsing rules.

The scenarios described above fall into three categories: searching for content, publishing content, and obtaining legal advice.

First, I will examine the search scenario:

   * If a user is looking for recipes of pies to reproduce on his blog, he might want to exclude from his results any recipes that are not available under a license allowing non-commercial reproduction.

This is technically possible today. The rel="license" link type allows authors to specify the license that applies to the main content on a page, in this case recipes, search engines can be programmed with the most common licenses, and the user can tell the search engine what characteristics he wants ("compatible with GPLv2", "no advertising clause", "doesn't have patent implications", "allows redistribution to countries on the US blacklist").

This has some implications:

 - Each unit of content (recipe in this case) must have its own independent page at a distinct URL. This is actually good practice anyway today for making content discoverable from search engines, and it is compatible with what people already do, so this seems fine.

 - New licenses are discouraged, as they would not be automatically supported by search engines. This is needed by one of the requirements:

      * License proliferation should be discouraged.

This solution is already deployed on such sites as Flickr, and already supported on search engines such as Google.

Next, I will look at the content publishing scenarios:

   * Lucy wants to publish her papers online. She includes an abstract of each one in a page, but because they are under different copyright rules, she needs to clarify what the rules are. A harvester such as the Open Access project can actually collect and index some of them with no problem, but may not be allowed to index others. Meanwhile, a human finds it more useful to see the abstracts on a page than have to guess from a bunch of titles whether to look at each abstract.

This really boils down to two points:

 - Being able to include the license of various items on a page for humans to read.

 - Being able to control what harvesters (spiders) index.

Being able to include a license on a page is easy: you just include the license name and a link to the license. Since this is for the user in this case, there is no need for any special markup.

Controlling harvesters is a separate problem. This is actually a well-understood problem space with a number of very well-understood solutions. For site-wide control, there is robots.txt, which can target individual spiders (as in this case). On a page-by-page basis, there is the <meta> element's "noindex" value.

Thus this particular scenario doesn't require any new features.

   * There are mapping organisations and data producers and people who take photos, and each may place different policies. Being able to keep that policy information helps people with further mashups avoiding violating a policy. For example, if GreatMaps.com has a public domain policy on their maps, CoolFotos.org has a policy that you can use data other than images for non-commercial purposes, and Johan Ichikawa has a photo there of my brother's cafe, which he has licensed as "must pay money", then it would be reasonable for me to copy the map and put it in a brochure for the cafe, but not to copy the data and photo from CoolFotos. On the other hand, if I am producing a non-commercial guide to cafes in Melbourne, I can add the map and the location of the cafe photo, but not the photo itself.

This doesn't seem to require any technological solution at all; it seems to be purely a legal issue. So long as the licenses are clearly stated, as they presumably must be (for example, the MIT license requires the copyright text to follow the text even as it is copied, the Creative Commons licenses require the license or its URL to be published with any reproductions, etc), there is no need for any markup.
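
As a rough illustration of the harvester control mentioned above for Lucy's papers (robots.txt targeting an individual spider), here is a minimal Python sketch using the standard library's urllib.robotparser. The spider name "OpenAccessBot", the example.org URLs, and the directory layout are invented for the example and do not come from any real site or project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named spider is kept out of one directory,
# while every other spider may fetch everything.
ROBOTS_TXT = """\
User-agent: OpenAccessBot
Disallow: /papers/restricted/

User-agent: *
Disallow:
"""


def may_index(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt above lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("OpenAccessBot", "http://example.org/papers/restricted/a.html"),
        ("OpenAccessBot", "http://example.org/papers/open/b.html"),
        ("OtherBot", "http://example.org/papers/restricted/a.html"),
    ]:
        print(agent, url, may_index(agent, url))
```

This only models the site-wide robots.txt case; the per-page <meta> "noindex" value would be checked separately by the harvester when it parses each page.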

   * Tara runs a video sharing web site for people who want licensing information to be included with their videos. When Paul wants to blog about a video, he can paste a fragment of HTML provided by Tara directly into his blog. The video is then available inline in his blog, along with any licensing information about the video.

(Really? A video sharing site dedicated to people who want licensing information to be included with their videos? That's a pretty specific audience, wow.)

This can be done with HTML5 today. For example, here is the markup you could include to allow someone to embed a video on their site while including the copyright or license information:

   <figure>
    <video src="http://example.com/videodata/sJf-ulirNRk" controls>
     <a href="http://video.example.com/watch?v=sJf-ulirNRk">Watch</a>
    </video>
    <legend>
     Pillar post surgery, starting to heal.
     <small>© copyright 2008 Pillar. All Rights Reserved.</small>
    </legend>
   </figure>

   * Flickr has images that are CC-licensed, but the pages themselves are not.

I've clarified the HTML5 spec's definition of rel=license and included an example showing a page based on what Flickr is doing. For search, this has the right result (when a search page returns an arbitrary Flickr page, it's because they are looking for the image, which is what the rel=license link Flickr gives is for), and it is clear to the user (they see the license information clearly, and there's no confusion that it might apply to the rest of the page).

   * Blogs may wish to reuse CC-licensed images without licensing the whole blog as CC, but while still including attribution and license information (which may be required by the licenses in question).

The example above shows this for a movie, but it works as well for a photo:

   <figure>
    <img src="http://nearimpossible.com/DSCF0070-1-tm.jpg" alt="">
    <legend>
     Picture by Bob.
     <small><a href="http://creativecommons.org/licenses/by-nc-sa/2.5/legalcode">Creative Commons Attribution-Noncommercial-Share Alike 2.5 Generic License</a></small>
    </legend>
   </figure>

Admittedly, if this scenario is taken in the context of the first scenario, meaning that Bob wants this image to be discoverable through search, but doesn't want to include it on a page of its own, then extra syntax to mark this particular image up would be useful.

However, in my research I found very few such cases. In every case where I found multiple media items on a single page with no dedicated page, either every item was licensed identically and was the main content of the page, or each item had its own separate page, or the items were licensed under the same license as the page. In all three of these cases, rel=license already solves the problem today.

This discourages people from using multiple licenses, of course, but that's actually a good thing, as it discourages license proliferation.

   * Fred's browser can tell him what license a particular video on a site he is reading has been released under, and advise him on what the associated permissions and restrictions are (can he redistribute this work for commercial purposes, can he distribute a modified version of this work, how should he assign credit to the original author, what jurisdiction the license assumes, whether the license allows the work to be embedded into a work that uses content under various other licenses, etc).

Advising a user on the legal implications of a license is something that needs trained professionals, but given a particular license, advice could be provided in canned form. So it seems like this is already possible, the user just has to select a license from a list of licenses. A user agent could pre-select a license based on the value of the page's rel=license link(s), or based on scanning the page for mention of a license, too.

I will now examine how these solutions fit the requirements:

   * Content on a page might be covered by a different license than other content on the same page.

For the search case, this is handled by separating that content so that each page that is to be discoverable via a license filter is discoverable at independent URLs. This appears to fit the existing practice.

   * When licensing a subpart of the page, existing implementations must not just assume that the license applies to the whole page rather than just part of it.

This is resolved by not having a mechanism for machine-readably licensing just part of a page, and instead putting such content on its own page, which leads to a better experience anyway from a search perspective.

   * License proliferation should be discouraged.

I've mentioned several ways in which this is achieved.

   * License information should be able to survive from one site to another as the data is transferred.

This is clearly possible if the license information is mere text, and indeed some of the licenses already require this anyway.

   * Expressing copyright licensing terms should be easy for content creators, publishers, and redistributors to provide.

It is simple to express license terms, since no special syntax beyond the legally-required verbiage is necessary.

   * It should be more convenient for the users (and tools) to find and evaluate copyright statements and licenses than it is today.

This requirement is not met. I do not know of any way to improve matters beyond what is available today, unfortunately. This would probably require legislative simplifications of copyright law, which, while probably quite desirable, are somewhat out of scope of HTML5.

   * Shouldn't require the consumer to write XSLT or server-side code to process the license information.

No XSLT is necessary for consuming rel="license".
Some code is necessary to make a search engine index rel="license" data, but that seems inevitable and, in practice, is a small amount of code relative to the rest of the code involved in an indexing project.

   * Machine-readable licensing information shouldn't be on a separate page than human-readable licensing information.

Since the only machine-readable licensing information that HTML5 provides for is rel="license", and that can only be included as part of a link on the page with the license statement, this requirement is met.

   * There should not be ambiguous legal implications.

This requirement is met only insofar as rel="license" doesn't make any firm legal statements; the page is required to be unambiguous about what is considered the main content. In practice I think we can see that sites like Flickr have found that this is possible. Beyond that, since the only legal implications here are those that would be present without any markup at all (i.e. copyright statements, etc), the legal implications of HTML5's features seem minimal.

   * Parsing rules should be unambiguous.

The rules for parsing rel="" values are well-understood and clear, I believe.

   * Should not require changes to HTML5 parsing rules.

No parsing rules changed during the addressing of these use cases.

In conclusion, it seems most of these use cases are already handled by the current text in the spec and do not show a need for a more elaborate scheme. The rel="license" feature in particular handles search adequately, and is already deployed both in consumers and generators.

Two areas where we could add more syntax-level support would be in licensing subparts explicitly, and in providing machine-readable licenses. The former seems like an obvious need but actual deployed content doesn't seem to need it, since most individually licensed works exist on pages of their own already, or are covered by the same license as other works on the same page.
The latter is a dangerous area to get into, as licenses have very specific legal wording. We could never make the machine-readable text have legal standing, and thus people couldn't rely on it to draw conclusions anyway. Furthermore, it would encourage license proliferation which is already a problem.

A number of further use cases remain to be examined, including some more specifically looking at attribution rather than licensing. I will send further e-mail next week as I address them.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 8 May 2009 12:57:31 UTC