Re: Call from French orgs from the cultural industry for transparent AI in EU from Leonard Rosenthol on 2023-10-03 (public-tdmrep@w3.org from October 2023)

From: Leonard Rosenthol <lrosenth@adobe.com>
Date: Tue, 3 Oct 2023 14:24:56 +0000
To: Laurent Le Meur <laurent@edrlab.org>, "public-tdmrep@w3.org" <public-tdmrep@w3.org>
Message-ID: <DM8PR02MB81816D1446881CDAA25863D9CDC4A@DM8PR02MB8181.namprd02.prod.outlook.com>

Just an FYI that the AI/ML industry has already been working on this area - how to document the training data set in use - for over a year now.  Their work is based on existing standards such as SPDX, C2PA, and others.

Leonard

From: Laurent Le Meur <laurent@edrlab.org>
Date: Tuesday, October 3, 2023 at 4:14 AM
To: public-tdmrep@w3.org <public-tdmrep@w3.org>
Subject: Call from French orgs from the cultural industry for transparent AI in EU


The article below was printed in Le Monde on Sept 29th. It focuses on generative AI and the EU AI Act.

Tribune : Construisons dès aujourd’hui une Intelligence Artificielle de rang mondial respectueuse de la propriété littéraire et artistique (Op-ed: Let us build, starting today, a world-class artificial intelligence that respects literary and artistic property) - Syndicat national de l'édition<https://www.sne.fr/actu/tribune-construisons-des-aujourdhui-une-intelligence-artificielle-de-rang-mondial-respectueuse-de-la-propriete-litteraire-et-artistique/>

In summary, it calls for the EU to go further than simply asking AI companies to publish summaries of the copyrighted data used for training (the current trend). The request is for total transparency: a detailed list of all works used to train generative AI systems, together with their sources.

This request is shared by many practitioners, in the EU but also in the US.

My personal view: providing URLs would not be sufficient, because many works appear at multiple URLs that are not managed by rights owners, and many URLs are transient. Such repositories of training sources should therefore index, for each training source, an ISCC code<https://iscc.foundation/iscc/>, a date of import, a source URL (if any), and optionally a few other metadata fields (such as a title). They should also be searchable by ISCC (or title).

This would make it easy to check that an opt-out has been respected, even if a work / content has been syndicated through multiple locations / websites.
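
To make the idea concrete, here is a minimal Python sketch of such a repository. Everything in it is an assumption made for illustration: the names TrainingSource, TrainingSourceIndex and opt_out_violations are hypothetical, the index lives in memory, and ISCC codes are treated as opaque strings compared for exact equality (a real system would likely also use ISCC similarity matching, which is not shown here).

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class TrainingSource:
    """One indexed training source (hypothetical record layout)."""
    iscc: str                         # ISCC code, treated as an opaque string
    imported_on: date                 # date of import into the training set
    source_url: Optional[str] = None  # source URL, if any
    title: Optional[str] = None       # optional descriptive metadata


class TrainingSourceIndex:
    """In-memory index of training sources, searchable by ISCC or title."""

    def __init__(self) -> None:
        self._by_iscc: Dict[str, List[TrainingSource]] = {}

    def add(self, record: TrainingSource) -> None:
        # The same work may be imported from several locations,
        # so each ISCC code maps to a list of records.
        self._by_iscc.setdefault(record.iscc, []).append(record)

    def find_by_iscc(self, iscc: str) -> List[TrainingSource]:
        return list(self._by_iscc.get(iscc, []))

    def find_by_title(self, text: str) -> List[TrainingSource]:
        # Case-insensitive substring match on the optional title.
        needle = text.casefold()
        return [
            record
            for records in self._by_iscc.values()
            for record in records
            if record.title and needle in record.title.casefold()
        ]

    def opt_out_violations(
        self, opted_out_isccs: Iterable[str]
    ) -> Dict[str, List[TrainingSource]]:
        # Return every indexed record whose ISCC code is on the opt-out list,
        # regardless of which URL the content was imported from.
        return {
            code: self.find_by_iscc(code)
            for code in opted_out_isccs
            if code in self._by_iscc
        }
```

A rights owner holding a list of opted-out ISCC codes could pass it to opt_out_violations and get back every import of those works, including copies syndicated through URLs the rights owner does not control.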

What is your opinion on this?

Best regards
Laurent

Attachments

image/jpeg attachment: Image2.jpeg

Received on Tuesday, 3 October 2023 14:25:06 UTC