Re: There's No Money in Linked Data

Thanks, Denny. Well spoken.

Everybody should read this. I know that I will want to point people to 
this text in the future. Can you make a permanent URL and title that can 
be used to cite this text? No need to make a PDF, but something more 
than mailing list archives would be nice.

-- Markus

On 18/05/13 10:06, Denny Vrandečić wrote:
> *tl;dr - If you publish data, attach the CC0 license to it, but that’s
> basically just advertising - don’t think it means anything.*
>
> *If you use data, you do not have to care much about the data license.*
>
> *If you republish data, it’s a bit more complicated, but not as horrible
> as you might think.*
>
> Imagine a student reading a CC-BY-SA published textbook on compilers.
> Next thing, based on that knowledge, he writes a parser and publishes
> the binary on the Web. Does he have to acknowledge the textbook? Does he
> have to publish his code under the same license?
>
>
> Imagine a designer creating an image with GIMP, a fantastic open source
> image processing tool, published under the GPL. Or a developer writing
> his code in Eclipse. Or a website being served from a Linux box. What
> legal implications does it have for the license of the image? For the
> source code? For the served page?
>
>
> Imagine a search engine that changes its background color depending on
> the type of thing you are searching for. You enter a city - it turns
> gray. You enter a person - red for females, blue for males, and purple
> for others. You enter a company - yellow. And so on. Let us assume that
> the search engine does that by figuring out the thing you are searching
> for and then asking DBpedia for its type. Since DBpedia is licensed
> under CC-BY-SA, does this mean we have to put a link on the search
> result acknowledging DBpedia? Does this mean we have to publish our
> search index under CC-BY-SA as well?
>
>
> Imagine Red Cross publishing pages about the countries they work in, and
> adding the population data to each of them from Freebase, the location
> from OpenStreetMap, the local name of the country from GeoNames, and
> the capital from DBpedia. What amount of legal disclaimer would need to
> be displayed on the page? Maybe some of the data items derive from
> another source? What about their licenses? What about this license
> stacking effect?
>
>
>
> There are some rather vague ideas floating about how the whole
> intellectual property law apparatus works for data. I have mulled over
> this for a long time, and read more laws and court cases than I care to
> admit. I want to try to make a few points in the following.
>
>
> Let’s start with the basics. What laws actually apply?
>
>
> Copyright law protects the expression, not the idea - the form, not the
> content. You can watch the newest Iron Man movie, and you are legally
> allowed to annoy your friends with retellings of the movie as often as
> you want. But you are not allowed to film it with your phone camera in
> the theater and display it to your friends. If you learn something from
> a textbook, you are free to write your own textbook, adding other
> knowledge you have acquired, possibly from other textbooks and
> publications. Only if you start copying the original texts too closely
> will you get into legal trouble.
>
>
> Almost all of the above-mentioned licenses - all Creative Commons
> licenses currently available, as well as the GFDL or the GPL - are based
> on copyright laws. The GPL started, as Stallman admits, as a legal
> hack of copyright law. This makes a lot of sense, since these licenses
> were not meant to cover data, but expressions: texts, music, and the
> like. This means these licenses cannot extend beyond that. They only
> cover the expression. They cover the actual RDF/XML file, the string of
> characters. Not the content. Not the graph.
>
>
> (Note that ODBL and the current draft of the upcoming fourth revision of
> CC go beyond copyright and include database right where applicable, i.e.
> within the legislation of the EU. This extension is irrelevant for the US.)
>
>
> This means that such licenses, like GFDL for data, have no restricting
> effect if you want to use the data. Only if you want to republish the
> data files more or less verbatim (in whole or partially, standalone or
> as part of a bigger project), you need to think about the original
> license. Merely including the data (not the files!) has no effect
> stemming from copyright.
>
>
> This also makes intuitive sense: if someone takes Wikipedia and counts
> the distribution of words and letters in Wikipedia, the subsequent
> publication of the results is not restricted by the license under which
> Wikipedia was published. If someone takes the whole Web, and creates a
> graph of all links on the Web, and starts to apply some algorithms on
> this graph, the subsequent usage of the results of these algorithms is
> not subject to any of the licenses of the original texts published on
> the Web. Copyright simply does not extend this far. And that is good.
>
>
>
> So much for copyright. Unfortunately, the European Union went a step
> further. They recognized that copyright does not apply to databases.
> They also recognized that the EU was not doing well in their competition
> against the US, with regard to publishing databases. So they decided to
> level the field by introducing a completely new right, the database
> right. This protects the effort that goes into creating databases -
> basically their schema (which columns should I have) and the coverage
> (which rows do I have in my database). Ten years later the EU made an
> evaluation of the effectiveness of the laws, and came to some
> interesting conclusions: first, technically the new database right
> made things more complicated; second, most publishers obviously do not
> understand it, but are happy with what they think it means (which
> usually contradicts what it actually means); and third, it
> completely failed in its goal to advance the database publishing sector.
> The report offers options to drop the whole database rights thing again,
> but so far nothing has happened.
>
>
> Also, this novel database right took a few major blows from the European
> Court of Justice, which clearly stated that the right does not cover
> the creation of the data itself, merely the effort put into obtaining,
> selecting, and cleaning a database. This means, e.g., that the
> publication of match dates and fixtures by FIFA cannot be protected
> under the database right. On the other hand, if an external website
> keeps statistics of all FIFA players, how much they cost, where they
> currently are, etc., then their database as a whole could be protected.
>
>
> But to make it clear: the database right does not apply to single data
> items in the database. If I keep a database of all cities in the UK
> and their populations, and someone asks for the population of Oxford
> from my database, the database right does not prevent them from
> republishing and using that data item as they like. Eurostat cannot sue
> you if you tell someone the population of France.
>
>
> To summarize on database rights: the EU, and only the EU, introduced
> the so-called database right in 1996. It is independent
> of copyright, and covers a database as a whole in certain circumstances.
> If you are in the EU, and want to use the data, database right does not
> restrict you. It only restricts you from republishing the database as a
> whole or in relevant parts.
>
>
>
> Besides the legal foundations of the data licenses, one also has to
> consider that copyright law refers predominantly to the right to copy the
> data, not to use it: if you want to count how often certain explicit
> words are uttered in a movie like Pulp Fiction, you are free to do so.
> If you want to count and compare the death count in certain books and
> movies (like Rambo, War and Peace, and the Bible - the results might
> surprise you), you are free to do so. You are free to publish the
> results, and you are even more free to use them internally in your
> organization.
>
>
>
> Having said that, I still recommend adding the CC0 license to a dataset
> when you publish it. I grumble every time I do it, but it still makes
> sense. Not because I believe that it means much: as said, the data in it
> is free anyway. But because a lot of other people believe that it means
> a lot. They might believe that if they integrate a point of data from a
> CC-BY-SA licensed dataset in their own dataset, they have to publish it
> under CC-BY-SA as well. They might believe that mixing a CC-BY-SA
> dataset with an ODBL dataset and displaying the results is legally
> impossible. Maybe they don’t even believe it, but they are required to
> ask their lawyers, and their lawyers will prefer to play it safe for
> their clients (it is their job!) and advise them accordingly. And for
> all of these people, the CC0 license is an item of assurance. So if you
> want your dataset to be usable by them, just add a CC0 license to it.
> And grumble about it.
>
>
>
> There is a completely independent aspect of why it could make sense to
> cite your data sources, which is trust and provenance. Even if a dataset
> is not published under a CC-BY-like license, meaning that it requires
> attribution, it often makes sense to keep the provenance and attribution
> intact - simply because the users of your data might ask for the sources
> themselves, and might want to check their credibility. But
> attribution for increasing your credibility is something entirely
> different from attribution because you think you are legally obliged to
> give it for the data you used.
>
>
>
> If I were an organization or individual with sufficient financial
> backing, I would even offer to pick up your legal battles if a data
> publisher ever sues you for using their data (not for republishing it
> verbatim, though). I hope that maybe an organization or individual will
> step up at some point to do so, but I wouldn’t hold my breath for it.
> Both the US Supreme Court and the European Court of Justice have
> repeatedly decided in favour of the freedom of data, be it the results
> of games, be it telephone numbers, be it horse racing fixtures.
>
>
> So, as paradoxical as it sounds: Data is free. Free the data!
>
>
>
> There is a battle over minds going on. The one side fights for the
> establishment and extension of intellectual property rights. In the last
> decades, even years, they have achieved some considerable victories.
> Copyright law, as it was introduced in the United States, was limited to
> 14 years, and had to be explicitly claimed. Today it holds not only for
> the lifetime of the creator, but also an additional 70 years (to
> incentivize the creator to produce more, because an author would be much
> less motivated to write if they knew that half a century after their
> death their highly beloved publisher wouldn’t make profit out of their
> work anymore). Today, copyright applies automatically, without any
> registration or statement. There is no need to put the little c in a
> circle anywhere. It is there, automatically, everywhere.
>
>
> The extension from works to content, from expression to ideas, is
> another dimension, this time in scope instead of time, in the continuous
> struggle to extend and expand intellectual property rights. It is not
> just a battle over the laws, but also, and more importantly, over our
> beliefs and minds, to make us more accepting of the notion that
> ideas and knowledge belong to companies and individuals, and are not
> part of our commons.
>
>
> Every time data is published under a restrictive license, “they” have
> managed to conquer another strategic piece of territory. Restrictive in
> this case includes CC-BY, CC-BY-SA, CC-BY-NC, GFDL, ODBL, and (god
> forbid!) CC-BY-SA-NC-ND, and many other such licenses.
>
>
> Every time you wonder what license some data has that you want to use,
> or whether you need to ask the data publisher if you can use it, “they”
> have won another battle.
>
>
> Every time you integrate two data sources and want to publish the
> results, and start to wonder how to fulfill your legal obligation
> towards the original dataset publishers, “they” laugh and welcome you as
> a member of their fifth column.
>
>
> Let them win, and some day you will be sued for mentioning a number.
>
>
>
> Links:
>
> I am not linking to the obvious texts, which are the actual laws. Read
> them. They are not as impenetrable as you think. I mean, heck, if you
> can make sense of an RDF/XML file, you shouldn’t be scared of some legal
> text.
>
>
> Evaluation of the European Commission on the effect of database rights
>
> http://ec.europa.eu/internal_market/copyright/docs/databases/evaluation_report_en.pdf
>
>
> US Supreme Court, Baker v. Selden - on the extent of copyright with
> regard to the expression, not the content
>
> http://www.justia.us/us/101/99/case.html
>
>
>
> Sorry for the far too long reply. It is not meant as a critical reply to
> Pascal and his colleagues’ text, but rather something that has been
> brooding in me for a while. This text triggered me to write it down, and
> in the framework of their text I would read it as a contribution to
> point 5 of their way forward.
>
>
>
> This text was written by me on a Saturday morning, as a completely
> personal opinion. It does not represent the official point of view of
> any current, former, or future employer, nor of any project I ever was,
> am, or will be affiliated with or am thought to be affiliated with.
>
>


-- 
Dr. Markus Kroetzsch
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529               http://korrekt.org/

Received on Saturday, 18 May 2013 19:48:51 UTC