- From: Laurent Le Meur <laurent@edrlab.org>
- Date: Thu, 3 Aug 2023 17:45:41 +0200
- To: "Dinzinger, Michael" <michael.dinzinger@uni-passau.de>
- Cc: "public-tdmrep@w3.org" <public-tdmrep@w3.org>, "Granitzer, Michael" <Michael.Granitzer@Uni-Passau.De>, "Zerhoudi, Saber" <Saber.Zerhoudi@uni-passau.de>
- Message-Id: <8D363085-1193-4594-9CAC-66E34A267B79@edrlab.org>
Thanks a lot for your message, Michael. I didn't know about the ambitious OpenWebSearch project.

You can help us scope the notion of a Web Index. For you, does "Indexing" encompass "Text & Data Mining", or are they adjacent notions? Our take is that they are adjacent, as indexing does not rely on the simultaneous processing of multiple resources (one can index a single resource). This is important, as many publishers are concerned that a "notdm" signal would provoke a "noindex" decision from Google.

You may also be able to help us differentiate TDM and AI. We consider that the first stage of an AI training system is scraping, the second stage is applying TDM techniques, and then there is a third stage specific to pure AI (LLM ...). Therefore, opting out of TDM implies opting out of AI training. Other initiatives (C2PA) claim that this is not the case, and that opting out of AI must be done separately from opting out of TDM. Could your organization give some advice on that aspect?

> Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy.

Interesting. Does it claim to crawl content stored in the EU under the US fair use policy? How would that be legally possible?

> To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and in Germany the §44 UrhG

Agreed.

> However, the REP without extensions does not provide a sufficient granularity, lacks the functionality to express licensing statements and is in general not yet ready for the emerging AI era

Agreed.

> 1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web, however, I didn't find one.

There was a crawler POC made by a French company, but it is not open source.
Spawning AI has open-sourced a partial implementation in Python. So far we don't have the resources to create an "official" OSS codebase (in which language? JS? TypeScript? Go?).

> 2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json on e.g. a web platform, such as an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.

Yep, sure.

> If I understand right, the protocol was originally intended for the domain of EPUB [9].

Nope. The domain is ANY Web resource (HTML page, image, video ...). This is why the recommended solution is to act on HTTP headers, the second option being to generate a tdmrep.json file.

Let's continue the conversation.

Best regards
Laurent

> On 28 Jul 2023, at 16:11, Dinzinger, Michael <michael.dinzinger@uni-passau.de> wrote:
>
> Dear members of the TDM Rep Community Group,
>
> I hope this email finds you well. I contact you as we are interested in details concerning the TDM Reservation Protocol and we would like to join your efforts towards a simple and practical web protocol for expressing machine-readable information in the context of Text and Data Mining.
>
> Personally, I am a PhD student at the University of Passau and employed by the OpenWebSearch.eu project [1]. This is a European research project with 14 participating institutions, which started in September 2022. The project members have set themselves the goal of building an Open Web Index and promoting open access to information through web search, while complying with European laws, norms and ethics. In this way, we want to differentiate ourselves from the existing Web Indices, which are operated by the well-known non-European Big-Tech giants. The Open Web Index may tap the web as a resource for European researchers and companies and leverage web-data-driven innovations in Europe.
> At the University of Passau, we are responsible for crawling, the first step in the pipeline of building an Open Web Index, and I am particularly interested in the topic of legal compliance.
>
> At the latest since the rise of AI applications, there has been a high demand for web texts and data. Looking at the datasets that are used, for example, to train generative language models [2, 3], we see that these were built upon CommonCrawl [4]. Even though CommonCrawl is engineered from Germany, it is a US-based organization, which collects and distributes web data under the fair use policy. To my understanding, the efforts of CommonCrawl could not be done in Europe because there is no fair use policy, but other legal frameworks, namely the EU DSM Directive and in Germany the §44 UrhG.
>
> In our opinion, the SOTA large-scale crawling efforts, such as our current project efforts, CommonCrawl, etc., do not go far enough in considering licensing or usage information, which restricts the usage and distribution of the web content. The compliance with the latest European directive on copyright is, so to speak, part of our project identity and we want to push forward on it. I share your opinion that a simple technical solution is necessary to fix this shortcoming by providing this kind of information in a machine-readable form. At the moment, the SOTA large-scale crawling efforts rely heavily on the robots.txt file, as this seems to be the only commonly adopted solution for implementing a machine-readable opt-out mechanism. However, the REP without extensions does not provide a sufficient granularity, lacks the functionality to express licensing statements and is in general not yet ready for the emerging AI era. In this context, I saw that Google recently kicked off a public discussion to explore machine-readable means for web publisher choice and control for emerging AI and research use cases [5].
> Besides that, there are more examples of simple solutions, which allow content owners to opt out from the usage of their content in AI applications [6, 7].
>
> In the OpenWebSearch.eu project, we would like to explore and maybe adopt the TDM Rep, in case it fits our requirements. As I said, for me it looks very promising, not least because of its simplicity. On our end, the adoption of the protocol would make sense in two aspects:
> 1) Our crawler [8] should be able to parse the TDM Rep. I suppose there must be an existing reference implementation for a protocol parser on the web, however, I didn't find one.
> 2) For a wider adoption of the protocol, it would make sense to provide webmasters with the opportunity to generate the tdmrep.json on e.g. a web platform, such as an Open Webmaster Console, which is part of our efforts in the OpenWebSearch.eu project.
>
> To put it in a nutshell, we are looking forward to getting a fruitful discussion going. To begin with, we are interested in the use cases of TDM Rep. If I understand right, the protocol was originally intended for the domain of EPUB [9]. So the question would be whether the TDM Rep can also be applied to a broader use case scenario in the context of crawling and indexing?
> In case you have remarks or questions, do not hesitate to contact us.
>
> [1] https://openwebsearch.eu/
> [2] https://arxiv.org/abs/1911.00359
> [3] https://oscar-project.org/publication/2019/clmc7/asynchronous/
> [4] https://commoncrawl.org/
> [5] https://blog.google/technology/ai/ai-web-publisher-controls-sign-up/
> [6] https://site.spawning.ai/spawning-ai-txt
> [7] https://www.deviantart.com/team/journal/UPDATE-All-Deviations-Are-Opted-Out-of-AI-Datasets-934500371
> [8] https://openwebsearch.eu/owler/
> [9] https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/use-cases.md
>
> Best regards
> Michael Dinzinger
>
> Research associate
> at the Chair of Data Science
>
> University of Passau
> (ITZ) Room 234
> Innstr. 43
> 94032 Passau
>
> michael.dinzinger@uni-passau.de
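
To illustrate the two mechanisms mentioned in the reply (HTTP headers first, then a tdmrep.json file), here is a minimal Python sketch of what a crawler-side check might look like. It is not a reference implementation. The assumptions are: the header and property names `tdm-reservation` and `tdm-policy` and the `location` property are used as I understand them from the TDMRep draft; the file is assumed to be served from `/.well-known/tdmrep.json` and fetched by the caller; the function names are hypothetical; and the path matching via `fnmatch` is a simplification of the protocol's pattern rules. Checking headers before the file simply follows the order given in the reply.

```python
import json
from fnmatch import fnmatchcase
from typing import Mapping, Optional
from urllib.parse import urlsplit


def reservation_from_headers(headers: Mapping[str, str]) -> Optional[dict]:
    """Read the TDMRep fields from HTTP response headers, if present."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    if "tdm-reservation" not in lowered:
        return None  # no signal in the headers
    return {
        "reserved": lowered["tdm-reservation"] == "1",
        "policy": lowered.get("tdm-policy"),
    }


def reservation_from_file(tdmrep_json: str, url: str) -> Optional[dict]:
    """Look up the first rule in a tdmrep.json document matching the URL path."""
    rules = json.loads(tdmrep_json)
    path = urlsplit(url).path or "/"
    for rule in rules:
        # Assumption: shell-style matching stands in for the real pattern rules.
        if fnmatchcase(path, rule.get("location", "")):
            return {
                # Accept both the number 1 and the string "1".
                "reserved": str(rule.get("tdm-reservation")) == "1",
                "policy": rule.get("tdm-policy"),
            }
    return None  # no rule covers this path


def tdm_reservation(url: str, headers: Mapping[str, str],
                    tdmrep_json: Optional[str] = None) -> Optional[dict]:
    """Check headers first, then fall back to the tdmrep.json content."""
    result = reservation_from_headers(headers)
    if result is None and tdmrep_json is not None:
        result = reservation_from_file(tdmrep_json, url)
    return result  # None means no TDMRep signal was found
```

A crawler could call `tdm_reservation(url, response_headers, cached_tdmrep_json)` for each fetched resource and skip TDM processing when `reserved` is true. How to treat a `None` result (no signal) is a policy decision left to the caller.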
Received on Thursday, 3 August 2023 15:45:49 UTC