- From: Dinzinger, Michael <michael.dinzinger@uni-passau.de>
- Date: Tue, 8 Aug 2023 16:06:06 +0000
- To: "public-tdmrep@w3.org" <public-tdmrep@w3.org>, Laurent Le Meur <laurent@edrlab.org>
- CC: "Granitzer, Michael" <Michael.Granitzer@Uni-Passau.De>, "Zerhoudi, Saber" <Saber.Zerhoudi@uni-passau.de>
- Message-ID: <4e57caf3474f4ea4b5df909f93220949@uni-passau.de>
Hello Laurent,

thank you for your reply!

> For you, does "Indexing" encompass "Text & Data Mining" or are there adjacent notions?

So the question is whether "TDM" is necessary in the indexing pipeline. This pipeline starts with crawling, i.e. downloading web content from the internet, and ends with the actual indexing, i.e. turning the cleaned web content into a usable index/inverted file. From my perspective, these two steps do not require any Text & Data Mining. In between these two steps, the web content is cleaned, analyzed and enriched with metadata. These tasks include boilerplate removal, detection of malicious webpages, language detection, detection of geo-location, etc. All of these tasks may contribute to a higher quality of the final Web Index, facilitate downstream search applications, facilitate further research, or are technically necessary for the creation of the Web Index. I do not think that these preprocessing and enrichment steps lie within the scope of the originally intended meaning of TDM, as given by the EU directive [1]. I would therefore see 'Indexing' and 'Text & Data Mining' as two distinct terms; however, I am not 100% sure and will follow up on this question with my colleagues. What is your opinion on this question?

Ideas and initiatives like [2, 3, 4, 5] and also C2PA [6] show that there is no widely adopted way to opt out of AI/ML training, and that such an opt-out should be different from opting out of appearing in a search engine. So without a doubt, 'Indexing' and 'AI Training' are two distinct crawling purposes and indicate different usages of the crawled web content. As I mentioned, I want to investigate further whether one can draw an equally clear distinction between the terms 'Indexing' and 'Text & Data Mining'. If yes, it would make things easier and result in clearer guidance for webmasters and content owners:

- If you do not want your content crawled, indexed and appearing in a search engine, you may use the REP.
- If crawling and indexing are okay, but no AI/ML training, you may use the TDM Rep (or some alternative). It will not override the decisions in the robots.txt file.

> Therefore opting out from TDM implies opting out from AI training. Other initiatives (C2PA) pretend that this is not the case, and opting-out from AI must be done separately from opting out from TDM. Could your organization give some advice on that aspect?

At the moment, I cannot give a definite answer. As I said before, I only see that there is a wish for an opportunity to opt out of AI/ML training, grounded on some legal basis (which now seemingly exists in Europe with the TDM opt-out mechanism).

> Does it pretend to crawl content stored in the EU under the US fair use policy? How would it be legally possible?

I might say something wrong here, so for more detailed information you may want to take a look at [7] or another source. For the application of fair use, the location where the corresponding piece of work is utilized/processed is taken into consideration, not the location of its creation nor the hosting location. The latter would in any case be difficult to determine in times of global content delivery networks.

> So far we don't have the resources to create an "official" OSS codebase (in which language? JS? TypeScript? Go?)

We use StormCrawler as crawling software [8]. It is written in Java. Furthermore, there is a library called "crawler-commons" with Java codebases that are not crawler-specific [9]; it contains implementations of the REP and the Sitemaps protocol. This could be the right repository for further protocol implementations, which may find application not in one, but in several Java crawler projects.
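As a starting point for such a protocol implementation, here is a minimal sketch in Python (the language of Spawning AI's partial implementation) of what a TDM Rep lookup could look like. The header names ('tdm-reservation', 'tdm-policy'), the rule structure of the tdmrep.json file and the first-match wildcard semantics are my reading of the draft specification and should be verified against the current text:

```python
import fnmatch
import json


def reservation_from_headers(headers):
    """Look for a TDM reservation in HTTP response headers.

    Assumes the 'tdm-reservation' / 'tdm-policy' header names from the
    TDM Rep draft; returns (reservation, policy) or (None, None).
    """
    h = {k.lower(): v for k, v in headers.items()}
    if "tdm-reservation" not in h:
        return None, None
    return int(h["tdm-reservation"]), h.get("tdm-policy")


def reservation_from_tdmrep(tdmrep_text, path):
    """Match `path` against the rules of a /.well-known/tdmrep.json file.

    Assumes a JSON array of rules with a wildcard 'location' pattern and
    first-match-wins semantics -- check the spec for the exact matching
    rules before relying on this.
    """
    for rule in json.loads(tdmrep_text):
        if fnmatch.fnmatchcase(path, rule.get("location", "")):
            return rule.get("tdm-reservation"), rule.get("tdm-policy")
    return None, None
```

A crawler would presumably check the response headers of each fetched resource first and fall back to the host's well-known file, which it can cache per host.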
Best regards
Michael

[1] https://w3c.github.io/tdm-reservation-protocol/docs/tdm-meaning.html
[2] https://genlaw.github.io/CameraReady/42.pdf
[3] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
[4] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
[5] https://site.spawning.ai/spawning-ai-txt
[6] https://c2pa.org/specifications/specifications/1.3/specs/C2PA_Specification.html#_training_and_data_mining
[7] https://arxiv.org/abs/2303.15715
[8] http://stormcrawler.net/
[9] https://github.com/crawler-commons/crawler-commons

________________________________
From: Laurent Le Meur <laurent@edrlab.org>
Sent: Thursday, 3 August 2023 17:45
To: Dinzinger, Michael
Cc: public-tdmrep@w3.org; Granitzer, Michael; Zerhoudi, Saber
Subject: Re: Initiating contact - OpenWebSearch.eu

Thanks a lot for your message, Michael. I didn't know about the ambitious OpenWebSearch project.

You can help us in our search for scoping the notion of Web Index. For you, does "Indexing" encompass "Text & Data Mining" or are there adjacent notions? Our take is that they are adjacent, as indexing does not rely on the simultaneous processing of multiple resources (one can index a single resource). This is important, as many publishers are concerned that a "notdm" signal would provoke a "noindex" decision from Google.

You may also be able to help us differentiate TDM and AI. We consider that the first stage of an AI training system is scraping, the second stage is applying TDM techniques, and then there is a third stage relative to pure AI (LLM ...). Therefore opting out from TDM implies opting out from AI training. Other initiatives (C2PA) pretend that this is not the case, and opting-out from AI must be done separately from opting out from TDM. Could your organization give some advice on that aspect?
> Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy.

Interesting. Does it pretend to crawl content stored in the EU under the US fair use policy? How would it be legally possible?

> To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and in Germany the §44b UrhG.

Agreed.

> However, the REP without extensions does not provide sufficient granularity, lacks the functionality to express licensing statements and is in general not yet ready for the emerging AI era.

Agreed.

> 1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web; however, I didn't find one.

There was a crawler POC made by a French company, but it is not open-source. Spawning AI has a partial implementation open-sourced in Python. So far we don't have the resources to create an "official" OSS codebase (in which language? JS? TypeScript? Go?).

> 2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json on e.g. a web platform, such as an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.

Yep, sure.

> If I understand right, the protocol was originally intended for the domain of EPUB [9].

Nope. The domain is ANY Web resource (html page, image, video ...). This is why the recommended solution is acting on HTTP headers, the second being generating a tdmrep.json file.

Let's continue the conversation.

Best regards
Laurent

On 28 Jul 2023, at 16:11, Dinzinger, Michael <michael.dinzinger@uni-passau.de> wrote:

Dear members of the TDM Rep Community Group,

I hope this email finds you well.
I am contacting you because we are interested in the details of the TDM Reservation Protocol and would like to join your efforts towards a simple and practical web protocol for expressing machine-readable information in the context of Text and Data Mining.

Personally, I am a PhD student at the University of Passau, employed in the OpenWebSearch.eu project [1]. This is a European research project with 14 participating institutions, which started in September 2022. The project members have set themselves the goal of building an Open Web Index and promoting open access to information through web search, while complying with European laws, norms and ethics. In this way, we want to differentiate ourselves from the existing Web Indices, which are operated by the well-known non-European Big-Tech giants. The Open Web Index may tap the web as a resource for European researchers and companies and leverage web-data-driven innovations in Europe. At the University of Passau, we are responsible for crawling, the first step in the pipeline of building an Open Web Index, and I am particularly interested in the topic of legal compliance.

Since the rise of AI applications at the latest, there has been a high demand for web texts and data. Looking at the datasets used, for example, to train generative language models [2, 3], we see that these were built upon CommonCrawl [4]. Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy. To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and in Germany the §44b UrhG. In our opinion, the state-of-the-art large-scale crawling efforts, such as our current project efforts, CommonCrawl, etc., do not go far enough in considering licensing or usage information that restricts the usage and distribution of the web content.
Compliance with the latest European directive on copyright is, so to speak, part of our project identity, and we want to push forward on it. I share your opinion that a simple technical solution is needed to fix this shortcoming and to provide this kind of information in a machine-readable form. At the moment, the state-of-the-art large-scale crawling efforts rely heavily on the robots.txt file, as this seems to be the only commonly adopted way of implementing a machine-readable opt-out mechanism. However, the REP without extensions does not provide sufficient granularity, lacks the functionality to express licensing statements and is in general not yet ready for the emerging AI era. In this context, I saw that Google recently kicked off a public discussion to explore machine-readable means for web publisher choice and control for emerging AI and research use cases [5]. Besides that, there are more examples of simple solutions which allow content owners to opt out of the usage of their content in AI applications [6, 7].

In the OpenWebSearch.eu project, we would like to explore and maybe adopt the TDM Rep, in case it fits our requirements. As I said, it looks very promising to me, not least because of its simplicity. On our end, the adoption of the protocol would make sense in two respects:

1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web; however, I didn't find one.

2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json on e.g. a web platform, such as an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.

To put it in a nutshell, we are looking forward to getting a fruitful discussion going. To begin with, we are interested in the use cases of TDM Rep. If I understand right, the protocol was originally intended for the domain of EPUB [9].
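As an aside on point 2), the output of such a webmaster console could be produced with a few lines. A sketch in Python, assuming the field names ('location', 'tdm-reservation', 'tdm-policy') of the TDM Rep draft:

```python
import json


def build_tdmrep(rules):
    """Serialize a webmaster's choices into a tdmrep.json document.

    `rules` is a list of (location, reservation, policy) tuples; the
    field names follow my reading of the TDM Rep draft and should be
    checked against the current specification.
    """
    doc = []
    for location, reservation, policy in rules:
        entry = {"location": location, "tdm-reservation": reservation}
        if policy is not None:
            entry["tdm-policy"] = policy
        doc.append(entry)
    return json.dumps(doc, indent=2)
```

The console would then only need to publish the returned string at /.well-known/tdmrep.json on the webmaster's host.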
So the question would be whether the TDM Rep can also be applied to a broader use case scenario in the context of crawling and indexing. In case you have remarks or questions, do not hesitate to contact us.

[1] https://openwebsearch.eu/
[2] https://arxiv.org/abs/1911.00359
[3] https://oscar-project.org/publication/2019/clmc7/asynchronous/
[4] https://commoncrawl.org/
[5] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
[6] https://site.spawning.ai/spawning-ai-txt
[7] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
[8] https://openwebsearch.eu/owler/
[9] https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/use-cases.md

Best regards
Michael Dinzinger

Research associate at the Chair of Data Science
University of Passau (ITZ)
Room 234, Innstr. 43
94032 Passau
michael.dinzinger@uni-passau.de
Received on Tuesday, 8 August 2023 16:06:16 UTC