Re: Initiating contact - OpenWebSearch.eu

Hi Michael, 

I think your synthesis of the situation is very good, except for the "without fearing any legal consequences" part, especially in Europe. 
There are, and there will be, legal actions by publishers, even if proving that a specific bot has scraped opted-out content is difficult (most bots leave traces, even if not via user agents). 

About TDM-Policy, you write: "the way in which website/content owners are compensated: not via website traffic, but via attribution and/or financial compensation according to a license."
This is exactly what TDM-Policy is made for: helping TDM/AI actors get in contact with content providers and strike deals. 
A policy is not a license. A policy is "what can be agreed"; it is a generic document. 
A license is "what has been agreed"; it is a document tailored to one contract.
We didn't want to go as far as creating a way to automate licensing, at least not today. 
An ODRL TDM-Policy can simply be copied from a sample, or generated via a script from any web form.
ODRL is a W3C Recommendation, and we are working in the W3C "world", so it makes sense to use ODRL.
And yes, we can create standard policies with well-known URLs, but this must come later, once the main part, the TDM-reservation flag, has become a standard way to express an opt-out.
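
For illustration, a minimal TDM-Policy could look like the JSON-LD below. This is only a sketch: the uid and assigner URLs are placeholders, and the authoritative vocabulary is the ODRL profile defined in the TDM Rep specification, which a real policy would also reference.

    {
      "@context": "http://www.w3.org/ns/odrl.jsonld",
      "@type": "Offer",
      "uid": "https://example.com/policies/tdm-policy.json",
      "permission": [{
        "action": "use",
        "assigner": "https://example.com/",
        "duty": [
          { "action": "attribute" },
          { "action": "compensate" }
        ]
      }]
    }

A rights holder (or a script behind a web form) hosts such a document at a stable URL and points tdm-policy at it; the actual license is then negotiated from this starting point.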

Best regards
L


> On 19 Sept. 2023, at 13:13, Dinzinger, Michael <michael.dinzinger@uni-passau.de> wrote:
> 
> Dear all,
> 
> thanks again for your messages. I'd also agree that it makes sense to keep a clear distinction between Web Search-related use of crawled web data (for building a Web Index and feeding a Search Engine) and Text & Data Mining-related use.
> 
> Among others, I read the blog post [1] on this topic. The author describes a common agreement between websites and search engines: under this 'business deal', websites are indirectly compensated via the user traffic that the search engine generates. This made me think that a main point for distinguishing between TDM-related crawling and Web Search-related crawling is the way in which website/content owners are compensated: not via website traffic, but via attribution and/or financial compensation according to a license. Would you agree with this?
> So what happens right now is that crawling for the purpose of TDM becomes more and more important, among others because of the massive hunger for training data set off by the boom of Generative AI applications. This boom unfortunately coincides with the lack of widely adopted machine-readable web standards which allow rightsholders to declare usage controls (meaning obligations or conditions, like attribution or financial compensation) over their data. And what makes matters even worse is that it is technically difficult to trace whether a piece of information was used in an ML training process or not, tempting ML practitioners to simply assume the consent of the rightsholders without fearing any legal consequences.
> 
> Please tell me if you disagree with my thoughts or whether I have misunderstood something. Apart from this, I also have a conceptual question regarding the specification of TDM Rep.
> What is the motivation behind using the property 'tdm-policy' instead of, for example, 'tdm-license'? As far as I see, the policy is always a very individual document, tailored for one particular piece of web content, whereas a license can be a very common, standardized document (like the CC licenses). Furthermore, the TDM policy, which is strictly formatted in the ODRL profile, seems very expressive and enables the automatic acquisition of licenses from the rightsholders. But it also adds a lot of complexity, because for every web asset published under a specific license (apart from Public Domain and All rights reserved), there has to be a corresponding ODRL-formatted JSON-LD object on some web server. For some cases, wouldn't it be easier to simply declare a standardized license in "Creative Commons" style?
> For example:
>     <meta name="tdm-reservation" content="1">
>     <meta name="tdm-license" content="https://creativecommons.org/licenses/by-nc/4.0/">
> Is there any particular reason why you decided on TDM Policies instead of licenses? Couldn't both be possible?
> 
> Thank you in advance for your time and kind regards
> Michael
> 
> [1] https://searchengineland.com/crawlers-search-engines-generative-ai-companies-429389
> 
> From: Leonard Rosenthol <lrosenth@adobe.com>
> Sent: Tuesday, 8 August 2023 18:24
> To: Dinzinger, Michael; public-tdmrep@w3.org; Laurent Le Meur
> Cc: Granitzer, Michael; Zerhoudi, Saber
> Subject: Re: Initiating contact - OpenWebSearch.eu
>  Legal issues aside – mostly because they are complex and vary by jurisdiction – I think there is another aspect to this entire conversation.
>  Historically, the site-level control provided by something like Robots.txt (REP) was acceptable both for site owners and for users who would put their content on sites that they didn't necessarily control (incl., but not limited to, social media platforms).
>  However, with what has happened in the context of AI/ML, there is now a desire for authors of content/assets to be able to directly control the re-usability of their content/assets. Currently that focus is on TDM as well as AI/ML training, and so the need for controls at that level is clear.
>  So far, what I have not seen/heard is anyone saying that they wish to control indexing/searching in the same way.  And as long as that search index remains strictly for human-initiated search interaction – that’s probably fine.  But if a machine could potentially utilize that index (and, even more so, the content that is in the search index) – then I think we start down the slippery slope where indexing == TDM.
>  Leonard
>  From: Dinzinger, Michael <michael.dinzinger@uni-passau.de>
> Date: Tuesday, August 8, 2023 at 12:07 PM
> To: public-tdmrep@w3.org <public-tdmrep@w3.org>, Laurent Le Meur <laurent@edrlab.org>
> Cc: Granitzer, Michael <Michael.Granitzer@Uni-Passau.De>, Zerhoudi, Saber <Saber.Zerhoudi@uni-passau.de>
> Subject: Re: Initiating contact - OpenWebSearch.eu
>  Hello Laurent,
>  thank you for your reply!
>  > For you, does "Indexing" encompass "Text & Data Mining", or are they adjacent notions?
>  So the question is whether "TDM" is necessary in the Indexing Pipeline. This pipeline starts with Crawling, i.e. downloading web content from the internet, and ends with the actual Indexing, i.e. turning the cleaned web content into a usable index/inverted file. From my perspective, these two steps do not require any Text & Data Mining.
> In between these two steps, the web content is cleaned, analyzed and enriched with metadata. These tasks consist of boilerplate removal, detection of malicious webpages, language detection, detection of geo-location, etc. All these tasks may contribute to a higher quality of the final Web Index, facilitate downstream Search Applications, facilitate further research, or are technically necessary for the creation of the Web Index.
> I do not think that these preprocessing and enrichment steps lie within the scope of the originally intended meaning of TDM, as given by the EU directive [1]. Therefore, I would see 'Indexing' and 'Text & Data Mining' as two distinct terms; however, I'm not 100% sure, and I will follow up on this question with my colleagues.
> What is your opinion on this question?
>  Ideas and initiatives like [2, 3, 4, 5] and also C2PA [6] show that there is a lack of a widely adopted way to opt out from AI/ML training, which should be different from opting out from appearing in a search engine. So without a doubt, 'Indexing' and 'AI Training' are two distinct crawling purposes and indicate different usages of the crawled web content.
> As I mentioned, I want to investigate further whether one can also make such a clear distinction between the terms 'Indexing' and 'Text & Data Mining'. If yes, it would make things easier and result in clearer guidance for webmasters and content owners:
> - If you do not want your content crawled, indexed and appearing in a search engine, you may use the REP.
> - If crawling and indexing are okay, but no AI/ML Training, you may use the TDM Rep (or some alternative). It will not override the directives of the robots.txt file (see the sketch below).
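> To illustrate (my own sketch, not taken from the spec; example.com and the policy URL are placeholders): a site that welcomes search engines but reserves TDM/AI rights could keep its robots.txt permissive,
> 
>     User-agent: *
>     Allow: /
> 
> and at the same time declare the reservation on each page:
> 
>     <meta name="tdm-reservation" content="1">
>     <meta name="tdm-policy" content="https://example.com/policies/tdm-policy.json">
> 
> The two signals then serve different purposes: the REP governs crawling/indexing, while the TDM Rep governs TDM re-use.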
>  > Therefore opting out from TDM implies opting out from AI training. Other initiatives (C2PA) claim that this is not the case, and that opting out from AI must be done separately from opting out from TDM. Could your organization give some advice on that aspect?
>  At the moment, I cannot give any definite answer. As I said before, I only see that there is a wish for an opportunity to opt out from AI/ML training, which is grounded in a legal basis (now seemingly given in Europe with the TDM opt-out mechanism).
>  > Does it claim to crawl content stored in the EU under the US fair use policy? How would it be legally possible?
>  I might say something wrong here, so for more detailed information you may want to take a look at [7] or another source. For the application of Fair Use, what is taken into consideration is the location where the work is utilized/processed, not the location of its creation or hosting. The latter would be rather difficult to determine in times of global content delivery networks.
>  > So far we don't have the resources to create an "official" OSS codebase (in which language? JS? Typescript? Go?)
>  We use StormCrawler as our crawling software [8]. It is written in Java. Furthermore, there is a Java library called "crawler-commons" for components that are not crawler-specific [9]; it already contains implementations of the REP and the Sitemaps protocol. It could be the right repo for further protocol implementations, which may find application in not one, but several Java crawler projects. A rough sketch of what such a lookup could look like follows below.
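>  For concreteness, here is a rough Java sketch of a TDM Rep lookup (my own draft, not a reference implementation; the "tdm-reservation" header name and the /.well-known/tdmrep.json path follow my reading of the spec, and example.com is a placeholder):
> 
>     import java.net.URI;
>     import java.net.http.HttpClient;
>     import java.net.http.HttpRequest;
>     import java.net.http.HttpResponse;
> 
>     public class TdmRepLookup {
>         public static void main(String[] args) throws Exception {
>             HttpClient client = HttpClient.newHttpClient();
> 
>             // 1) Per-resource signal: check the HTTP response header.
>             HttpRequest head = HttpRequest.newBuilder(URI.create("https://example.com/article.html"))
>                     .method("HEAD", HttpRequest.BodyPublishers.noBody())
>                     .build();
>             HttpResponse<Void> resp = client.send(head, HttpResponse.BodyHandlers.discarding());
>             resp.headers().firstValue("tdm-reservation")
>                     .ifPresent(v -> System.out.println("tdm-reservation header: " + v));
> 
>             // 2) Site-wide fallback: fetch the well-known JSON file.
>             HttpRequest req = HttpRequest.newBuilder(URI.create("https://example.com/.well-known/tdmrep.json")).build();
>             HttpResponse<String> file = client.send(req, HttpResponse.BodyHandlers.ofString());
>             if (file.statusCode() == 200) {
>                 System.out.println("tdmrep.json: " + file.body()); // parse with any JSON library
>             }
>         }
>     }
> 
>  Such a lookup would naturally fit next to the existing robots.txt and sitemap parsers in crawler-commons.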
>  Best regards
> Michael
>  [1] https://w3c.github.io/tdm-reservation-protocol/docs/tdm-meaning.html
> [2] https://genlaw.github.io/CameraReady/42.pdf
> [3] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
> [4] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
> [5] https://site.spawning.ai/spawning-ai-txt
> [6] https://c2pa.org/specifications/specifications/1.3/specs/C2PA_Specification.html#_training_and_data_mining
> [7] https://arxiv.org/abs/2303.15715
> [8] http://stormcrawler.net/
> [9] https://github.com/crawler-commons/crawler-commons
>  From: Laurent Le Meur <laurent@edrlab.org>
> Sent: Thursday, 3 August 2023 17:45
> To: Dinzinger, Michael
> Cc: public-tdmrep@w3.org; Granitzer, Michael; Zerhoudi, Saber
> Subject: Re: Initiating contact - OpenWebSearch.eu
>  Thanks a lot for your message Michael. I didn't know about the ambitious OpenWebSearch project. 
>  You can help us in scoping the notion of a Web Index. 
> For you, does "Indexing" encompass "Text & Data Mining", or are they adjacent notions?
> Our take is that they are adjacent, as indexing does not rely on the simultaneous processing of multiple resources (one can index a single resource).
> This is important, as many publishers are concerned that a "notdm" signal would provoke a "noindex" decision from Google. 
>  You may also be able to help us differentiate TDM and AI. We consider that the first stage of an AI training system is scraping, the second stage is applying TDM techniques, and then there is a third stage relative to pure AI (LLMs ...). Therefore opting out from TDM implies opting out from AI training. Other initiatives (C2PA) claim that this is not the case, and that opting out from AI must be done separately from opting out from TDM. Could your organization give some advice on that aspect?
>  Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy. 
> Interesting. Does it claim to crawl content stored in the EU under the US fair use policy? How would it be legally possible?
> To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and, in Germany, §44b UrhG.
> Agreed.
>  However, the REP without extensions does not provide sufficient granularity, lacks the functionality to express licensing statements, and is in general not yet ready for the emerging AI era.
> Agreed
>  1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web; however, I didn't find one.
> There was a crawler POC made by a French company, but not open-source. Spawning AI has a partial implementation open-sourced in Python. 
> So far we don't have the resources to create an "official" OSS codebase (in which language? JS? Typescript? Go?)
>  2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json via a web platform, e.g. an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.
> Yep sure. 
>  > If I understand right, the protocol was originally intended for the domain of EPUB [9].
> Nope. The domain is ANY Web resource (html page, image, video ...). This is why the recommended solution is acting on HTTP Headers, the second being a tdmrep.json file.
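> For reference, the site-wide file can look roughly like this (simplified sketch; the paths and the policy URL are placeholders):
> 
>     [
>       { "location": "/news/*",
>         "tdm-reservation": 1,
>         "tdm-policy": "https://example.com/policies/tdm-policy.json" },
>       { "location": "/*",
>         "tdm-reservation": 0 }
>     ]
> 
> The HTTP header variant carries the same information per resource, e.g. "tdm-reservation: 1".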
>  Let's continue the conversation. 
>  Best regards
> Laurent
> 
> 
> On 28 July 2023, at 16:11, Dinzinger, Michael <michael.dinzinger@uni-passau.de> wrote:
> 
> Dear members of the TDM Rep Community Group,
> 
> I hope this email finds you well. I am contacting you because we are interested in the details of the TDM Reservation Protocol, and we would like to join your efforts towards a simple and practical web protocol for expressing machine-readable information in the context of Text and Data Mining.
> 
> Personally, I am a PhD student at the University of Passau, employed by the OpenWebSearch.eu project [1]. This is a European research project with 14 participating institutions, which started in September 2022. The project members have set themselves the goal of building an Open Web Index and promoting open access to information through web search, while complying with European laws, norms and ethics. In this way, we want to differentiate ourselves from the existing Web Indices, which are operated by the well-known non-European Big Tech giants. The Open Web Index may tap the web as a resource for European researchers and companies and leverage web-data-driven innovations in Europe. At the University of Passau, we are responsible for crawling, the first step in the pipeline of building an Open Web Index, and I am particularly interested in the topic of legal compliance.
> 
> At the latest since the rise of AI applications, there has been a high demand for web texts and data. Looking at the datasets used, e.g., to train generative language models [2, 3], we see that they were built upon CommonCrawl [4]. Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy. To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and, in Germany, §44b UrhG.
> 
> In our opinion, the SOTA large-scale crawling efforts, such as our current project efforts, CommonCrawl, etc., do not go far enough in considering licensing or usage information that restricts the usage and distribution of web content. Compliance with the latest European directive on copyright is, so to say, part of our project identity, and we want to push forward on it. I share your opinion that a simple technical solution is necessary to fix this shortcoming and provide this kind of information in machine-readable form. At the moment, the SOTA large-scale crawling efforts rely heavily on the robots.txt file, as this seems to be the only commonly adopted solution for implementing a machine-readable opt-out mechanism (see the example below). However, the REP without extensions does not provide sufficient granularity, lacks the functionality to express licensing statements, and is in general not yet ready for the emerging AI era. In this context, I saw that Google recently kicked off a public discussion to explore machine-readable means for web publisher choice and control for emerging AI and research use cases [5]. Besides that, there are more examples of simple solutions which allow content owners to opt out from the usage of their content in AI applications [6, 7].
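> To make the granularity problem concrete (a sketch of my own; the bot token is hypothetical): with plain REP, a rights holder can only shut out individual crawlers by their user-agent token, e.g.
> 
>     User-agent: ExampleAIBot   # hypothetical token of an AI crawler
>     Disallow: /
> 
> There is no way to say "crawling and indexing are fine, but TDM/AI training is not", and no place to attach licensing terms, which is exactly the gap that a protocol like TDM Rep targets.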
> 
> In the OpenWebSearch.eu project, we would like to explore and maybe adopt the TDM Rep, in case it fits our requirements. As I said, for me it looks very promising, not least because of its simplicity. On our end, the adoption of the protocol would make sense in two respects:
> 1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web; however, I didn't find one.
> 2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json via a web platform, e.g. an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.
> 
> To put it in a nutshell, we are looking forward to getting a fruitful discussion going. To begin with, we are interested in the use cases of TDM Rep. If I understand right, the protocol was originally intended for the domain of EPUB [9]. So the question would be whether the TDM Rep can also be applied to a broader use-case scenario in the context of crawling and indexing?
> In case you have remarks or questions, do not hesitate to contact us.
> 
> [1] https://openwebsearch.eu/
> [2] https://arxiv.org/abs/1911.00359
> [3] https://oscar-project.org/publication/2019/clmc7/asynchronous/
> [4] https://commoncrawl.org/
> [5] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
> [6] https://site.spawning.ai/spawning-ai-txt
> [7] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
> [8] https://openwebsearch.eu/owler/
> [9] https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/use-cases.md
> 
> Best regards
> Michael Dinzinger
> 
> Research associate
> at the Chair of Data Science
> 
> University of Passau
> (ITZ) Room 234
> Innstr. 43
> 94032 Passau
> 
> michael.dinzinger@uni-passau.de

Received on Tuesday, 19 September 2023 17:37:45 UTC