Initiating contact - OpenWebSearch.eu from Dinzinger, Michael on 2023-07-28 (public-tdmrep@w3.org from July 2023)

From: Dinzinger, Michael <michael.dinzinger@uni-passau.de>
Date: Fri, 28 Jul 2023 14:11:41 +0000
To: "public-tdmrep@w3.org" <public-tdmrep@w3.org>
CC: "Granitzer, Michael" <Michael.Granitzer@Uni-Passau.De>, "Zerhoudi, Saber" <Saber.Zerhoudi@uni-passau.de>
Message-ID: <357cd2552f32444295895ee9571d14c2@uni-passau.de>

Dear members of the TDM Rep Community Group,

I hope this email finds you well. I am contacting you because we are interested in the details of the TDM Reservation Protocol and would like to join your efforts towards a simple and practical web protocol for expressing machine-readable information in the context of Text and Data Mining.

Personally, I am a PhD student at the University of Passau and employed by the OpenWebSearch.eu project [1]. This is a European research project with 14 participating institutions, which started in September 2022. The project members have set themselves the goal of building an Open Web Index and promoting open access to information through web search, while complying with European laws, norms and ethics. In this way, we want to differentiate ourselves from the existing web indices, which are operated by the well-known non-European Big Tech companies. The Open Web Index may open up the web as a resource for European researchers and companies and foster web-data-driven innovation in Europe. At the University of Passau, we are responsible for crawling, the first step in the pipeline of building an Open Web Index, and I am particularly interested in the topic of legal compliance.

Since the rise of AI applications at the latest, there has been a high demand for web texts and data. Looking at the datasets that are used, for example, to train generative language models [2, 3], we see that these were built upon CommonCrawl [4]. Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy. To my understanding, the efforts of CommonCrawl could not be carried out in Europe, because there is no fair use policy here but other legal frameworks instead, namely the EU DSM Directive and, in Germany, §44 UrhG.

In our opinion, the state-of-the-art large-scale crawling efforts, such as our current project efforts, CommonCrawl, etc., do not go far enough in considering licensing or usage information that restricts the usage and distribution of web content. Compliance with the latest European directive on copyright is, so to speak, part of our project identity, and we want to push forward on it. I share your opinion that a simple technical solution is needed to fix this shortcoming by providing this kind of information in a machine-readable form. At the moment, the state-of-the-art large-scale crawling efforts rely heavily on the robots.txt file, as this seems to be the only commonly adopted solution for implementing a machine-readable opt-out mechanism. However, the Robots Exclusion Protocol (REP) without extensions does not provide sufficient granularity, lacks the functionality to express licensing statements and is in general not yet ready for the emerging AI era. In this context, I saw that Google recently kicked off a public discussion to explore machine-readable means for web publisher choice and control for emerging AI and research use cases [5]. Besides that, there are further examples of simple solutions that allow content owners to opt out of the usage of their content in AI applications [6, 7].
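
To illustrate the granularity limitation described above, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot user agent and the may_fetch helper are hypothetical and chosen only for illustration. The point is that a plain REP file can only answer whether a given user agent may fetch a given path; it carries no licensing or usage information.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.org/private/a.html"),
        ("ExampleBot", "https://example.org/public/a.html"),
        ("OtherBot", "https://example.org/private/a.html"),
    ]:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

The answer is a yes/no decision per user agent and path; whether fetched content may subsequently be used for text and data mining cannot be expressed this way.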

In the OpenWebSearch.eu project, we would like to explore and possibly adopt the TDM Rep, provided it fits our requirements. As I said, to me it looks very promising, not least because of its simplicity. On our end, adopting the protocol would make sense in two respects:
1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation of a protocol parser on the web; however, I did not find one.
2) For a wider adoption of the protocol, it would make sense to give webmasters the opportunity to generate the tdmrep.json file, for example on a web platform such as an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.
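
As a rough sketch of the two points above, the following Python code parses and generates a tdmrep.json document. It assumes the file is a JSON array of rules, each with a "location" path pattern, an optional "tdm-reservation" value (0 or 1) and an optional "tdm-policy" URL. The helper names (parse_tdmrep, lookup, build_tdmrep), the example rules, the first-match lookup and the simplified wildcard matching are assumptions made for illustration; this is not a reference implementation and the protocol specification remains authoritative.

```python
import json
import re


def parse_tdmrep(text):
    """Parse the body of a tdmrep.json file into a list of rule dicts."""
    rules = json.loads(text)
    if not isinstance(rules, list):
        raise ValueError("expected a JSON array of rules")
    return [
        {
            "location": rule["location"],
            # Assumption: a missing value is treated as "not reserved".
            "reserved": int(rule.get("tdm-reservation", 0)) == 1,
            "policy": rule.get("tdm-policy"),
        }
        for rule in rules
    ]


def _matches(pattern, path):
    # Assumption: '*' matches any run of characters and the pattern is
    # otherwise matched as a prefix of the path (simplified matching).
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def lookup(rules, path):
    """Return the first rule whose location matches path, else None."""
    for rule in rules:
        if _matches(rule["location"], path):
            return rule
    return None


def build_tdmrep(rules):
    """Serialize rule dicts (as returned by parse_tdmrep) back to JSON."""
    out = []
    for rule in rules:
        item = {
            "location": rule["location"],
            "tdm-reservation": 1 if rule["reserved"] else 0,
        }
        if rule.get("policy"):
            item["tdm-policy"] = rule["policy"]
        out.append(item)
    return json.dumps(out, indent=2)


# Hypothetical example content, for illustration only.
EXAMPLE = """[
  {"location": "/articles/*", "tdm-reservation": 1,
   "tdm-policy": "https://example.org/policies/tdm.json"},
  {"location": "/public/*", "tdm-reservation": 0}
]"""

if __name__ == "__main__":
    rules = parse_tdmrep(EXAMPLE)
    print(lookup(rules, "/articles/2023/a.html"))
    print(build_tdmrep(rules))
```

A crawler could call lookup before storing a fetched page, while a webmaster-facing tool could use build_tdmrep to produce the file from a form.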

In a nutshell, we look forward to getting a fruitful discussion going. To begin with, we are interested in the use cases of TDM Rep. If I understand correctly, the protocol was originally intended for the domain of EPUB [9]. So the question would be whether the TDM Rep can also be applied to a broader use case scenario in the context of crawling and indexing.
In case you have remarks or questions, do not hesitate to contact us.

[1] https://openwebsearch.eu/
[2] https://arxiv.org/abs/1911.00359
[3] https://oscar-project.org/publication/2019/clmc7/asynchronous/
[4] https://commoncrawl.org/
[5] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
[6] https://site.spawning.ai/spawning-ai-txt
[7] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
[8] https://openwebsearch.eu/owler/
[9] https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/use-cases.md

Best regards

Michael Dinzinger

Research associate
at the Chair of Data Science

University of Passau
(ITZ) Room 234
Innstr. 43
94032 Passau

michael.dinzinger@uni-passau.de

Received on Friday, 28 July 2023 14:55:50 UTC