Re: Signaling opt-out from TDM / AI scrapping in EPUB files from Eric Hellman on 2023-08-03 (public-publishingbg@w3.org from August 2023)

From: Eric Hellman <eric@hellman.net>
Date: Thu, 3 Aug 2023 11:45:31 -0400
To: Leonard Rosenthol <lrosenth@adobe.com>
Cc: Laurent Le Meur <laurent@edrlab.org>, "Johnson, Rick" <Rick.Johnson@vitalsource.com>, "public-publishingbg@w3.org" <public-publishingbg@w3.org>, "public-publishingcg@w3.org" <public-publishingcg@w3.org>, Giulia Marangoni <giulia.marangoni@aie.it>
Message-Id: <123C4143-0E64-42C7-AF4C-D13DAE052333@hellman.net>
I think a better comparison is "do not track", which was soooo successful. Don't forget that we already have lots of "AI" agents on our PCs and in our corporate networks (looking for malicious code, providing autocomplete, smart search, etc.), and these agents won't give a damn about your protocols. Robots.txt works because it helps spidering bots do their jobs, not because of any "honor" among robots.

Eric


> On Aug 3, 2023, at 10:37 AM, Leonard Rosenthol <lrosenth@adobe.com> wrote:
> 
> Rick – the current situation is really no different than how Robots.txt has served the web for decades.  It is entirely an honor system solution which has served us well – until now…
>  
> As Laurent says, the work towards establishment of legislation in various countries – US, UK, EU, China, etc. – will move to make these solutions legally binding…though it is still early days to know what the laws will be, how they will vary around the world and what the actual “penalty process” would even look like.
>  
> But as an industry, we need to start somewhere…and that is where solutions like TDM and C2PA (which complement each other well, as they address different “levels” of control) come in.
>  
> Leonard
>  
> From: Laurent Le Meur <laurent@edrlab.org>
> Date: Thursday, August 3, 2023 at 10:26 AM
> To: Johnson, Rick <Rick.Johnson@vitalsource.com>
> Cc: public-publishingbg@w3.org, public-publishingcg@w3.org, Giulia Marangoni <giulia.marangoni@aie.it>
> Subject: Re: Signaling opt-out from TDM / AI scrapping in EPUB files
> 
> Hi Rick, 
>  
> Most publishers are worried about the available rules (fair use in the US, CDSM in the EU). The opt-out mechanism currently offered by the CDSM can only be considered a baseline; it would be much more effective if TDM / AI actors were also obliged to publish the full list of resources they have used for training their systems. That is a challenge in itself (e.g. I imagine that the featured snippets functionality of Google search, which is without doubt based on some sort of text mining, consumes the full set of pages they index).
>  
> But I do not see how they can "abandon" looking for protections, at least in Europe, because in that case the law will leave their content open to free scraping.
> Middleware solutions (based on reverse proxy) are complementary, as they can intercept rogue bots, but they are costly and complex to generalize. 
>  
> Laurent
>  
> 
> 
> On 3 Aug 2023, at 16:13, Johnson, Rick <Rick.Johnson@vitalsource.com> wrote:
>  
> I would be curious if people believe that publishers will be satisfied with a passive ‘honor’ system where actors on the web (good and bad) are expected to read and follow permissions stated in metadata?
>  
> If (as I suspect) they are not, would the expectation be (for them, or for a standard) to escalate to other protections, or abandon this as ineffective?
>  
>  
> Rick Johnson | Co-Founder and Vice President of Product Strategy and Accessibility
> VitalSource Technologies, LLC
> https://get.vitalsource.com/
>  
>  
>  
> From: Laurent Le Meur <laurent@edrlab.org>
> Date: Thursday, August 3, 2023 at 4:47 AM
> To: public-publishingbg@w3.org, public-publishingcg@w3.org
> Cc: Giulia Marangoni <giulia.marangoni@aie.it>
> Subject: Signaling opt-out from TDM / AI scrapping in EPUB files
> 
> Dear all, 
>  
> There is now pressure from publishers to protect "Web" content from scraping by TDM (Text and Data Mining) and AI (Artificial Intelligence) actors.
>  
> The W3C TDM Reservation Protocol (TDMRep) was created to let publishers opt out of TDM scraping. TDMRep acts at the level of HTTP headers, and can therefore signal a reservation of rights on any Web resource. But many publishers would also like to signal a TDM opt-out inside files, especially inside EPUB files, so that publications can be protected even if the website from which they are downloaded does not contain any opt-out signal.
>  
> At the request of the TDM Rep CG, I'm therefore reaching out to you to discuss the best way to address this need.
> The upcoming TPAC seems to be the perfect time to discuss the matter.
> Could we schedule some time during a session to address this request?
>  
> Note: TDMRep defines two metadata properties: "tdm-reservation", a boolean value that indicates whether TDM rights are reserved for this resource, and "tdm-policy", an optional link to details on how to get a license for using the resource for TDM or AI.
> I can prepare a first proposal for including these properties inside an EPUB package.
>  
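[Illustration, not part of the original message.] A minimal sketch of how a client might read the two properties described in the note above when they arrive as HTTP response headers. The names `TdmSignal` and `read_tdm_headers` and the example URL are hypothetical, and the sketch assumes the boolean is sent as the string "1" when rights are reserved; check the specification before relying on that encoding.

```python
from typing import Mapping, NamedTuple, Optional


class TdmSignal(NamedTuple):
    """Hypothetical container for the two TDMRep properties."""
    reserved: bool
    policy_url: Optional[str]


def read_tdm_headers(headers: Mapping[str, str]) -> TdmSignal:
    """Interpret tdm-reservation / tdm-policy from a header mapping."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    # Assumption: "1" means TDM rights are reserved; anything else means no signal.
    reserved = lowered.get("tdm-reservation") == "1"
    # The policy link is optional and only meaningful when rights are reserved.
    policy = lowered.get("tdm-policy") if reserved else None
    return TdmSignal(reserved, policy or None)


if __name__ == "__main__":
    example = {
        "TDM-Reservation": "1",
        "TDM-Policy": "https://example.com/tdm-policy.json",  # hypothetical URL
    }
    print(read_tdm_headers(example))
```

The same two values would need an equivalent home inside an EPUB package for the in-file signal discussed here; the sketch deliberately does not guess at that syntax.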
> You'll find more information about TDMRep on 
> - https://w3c.github.io/tdm-reservation-protocol/ = introduction to the spec, guidelines, notes ...
> - https://www.w3.org/2022/tdmrep/ = the specification
> - https://www.w3.org/community/tdmrep/ = the CG page, with meeting notes
>  
> Best regards
> Laurent LE MEUR
> EDRLab
> co-chair of the TDM Reservation Protocol CG
>  
> 
> 
> 
> Begin forwarded message:
> 
> From: W3C Community Development Team <team-community-process@w3.org>
> Subject: Notes, July 26th, 2023 [via TDM Reservation Protocol Community Group]
> Date: 3 August 2023 at 13:19:39 UTC+2
> To: public-tdmrep@w3.org
> Resent-From: public-tdmrep@w3.org
> Reply-To: TDM Reservation Protocol Community Group <team-community-process@w3.org>
>  
> Update on meetings and presentations of the TDM protocol 
> 
> On 5th June a webinar on the protocol and how to implement it was organized by FEP and EDRLab; more than 70 publishers attended, and positive feedback was received. 
> 
> On July 11th, in Brussels, the TDM protocol was presented by AIE at the “Seminar on best practices for opting-out of generative ML training”, organized by Open Future. AIE and FEP attended the event, which was an occasion to exchange with organizations representing other rightsholders in the content industry, the European Commission, AI experts, and other projects/initiatives offering solutions for machine-readable opt-out, namely the C2PA coalition and Spawning. The latter integrates different opt-out methods in order to provide a service to AI companies that, given a URL as input, can check whether an opt-out is associated with the resource that AI players intend to use.
> 
> Collaboration with Spawning AI
> 
> After some exchanges, Spawning AI has already partially integrated the opt-out solution developed by the TDM Rep CG into their service, and they are open to collaborating further with the CG.
> 
> Discussion on possible developments of the protocol
> 
> EDRLab presented an overview of the different opt-out initiatives that are in touch with our CG. Some of them are media-specific (like the ones by IPTC and C2PA) and provide solutions at the content metadata level; others, like Spawning AI (and the TDM Rep protocol), are applicable to any content type, at the URL level. Even though the different solutions (content-specific and not content-specific) are complementary and can coexist in line with the different standards and practices in the content industry, there are significant differences between the semantic approach adopted by IPTC and C2PA on the one hand and TDM Rep on the other. In particular, the solutions reflect different views on whether the TDM concept would cover all or only some AI usages, and whether indexing by search engines could be part of the opt-out. Such discrepancies are partly due to the different legal frameworks (US vs. EU) in which these initiatives were developed.
> 
> Considering the rapid evolution of AI applications, and the ongoing discussion in the creative industries on rights reservation and licensing for AI, the CG agreed to continue to monitor the situation and exchange with the other initiatives in this field before taking any decision on the possible refinement of the protocol with new properties or values.
> 
> In the short term, it was agreed that:
> 
> - the CG will check whether the semantics of the protocol can be further clarified at the level of the specifications, to prevent any ambiguity and facilitate interoperability among different solutions;
> 
> - the CG will work on a FAQ for non-techies that will further clarify the meaning of the TDM opt-out in light of the EU legal framework and will give adopters practical insight into how to implement it in the context of AI.
> 
> Implementation in EPUB files
> 
> Given the increasing interest by the publishing sector (including, among CG members, Mondadori, Penguin Random House, and the STM association) in the integration of the TDM protocol in EPUB files, it was agreed that the CG will liaise with the W3C Publishing Community Group and the Publishing Business Group, which follow EPUB-related developments, via EDRLab (which is a member of both groups).
> 
> Particularly, it was agreed that: 
> 
> - On behalf of the CG, EDRLab will send to the W3C Publishing Business Group a proposal to be discussed during their next meeting in September;
> 
> - Should CG members have views or suggestions on the integration of TDM Rep in EPUB, they are requested to share them within the CG mailing list at their earliest convenience, so that they can be taken into account in the framework of the collaboration with the W3C Publishing Business and Community Groups.
> 
> Other activities
> 
> - A FAQ for non-tech users: the group agreed to work on a FAQ; for more details see above;
> 
> - Keeping track of early adopters: group members are invited to share information about new adopters of the protocol on the CG mailing list. The list of early adopters will be publicized on the CG website, in order to give it visibility. Early adopters are also encouraged to publicize the adoption of the protocol on their own websites.
>
Received on Thursday, 3 August 2023 15:45:40 UTC