- From: Harshvardhan J. Pandit <me@harshp.com>
- Date: Wed, 18 Jan 2023 10:14:38 +0000
- To: ProjectParadigm-ICT-Program <metadataportals@yahoo.com>, public-lod <public-lod@w3.org>, W3C AIKR CG <public-aikr@w3.org>, Public-cogai <public-cogai@w3.org>, semantic-web <semantic-web@w3.org>, W3c Semweb HCLS <public-semweb-lifesci@w3.org>, "public-philoweb@w3.org" <public-philoweb@w3.org>
Hi. Does (Chat)GPT spit out personal data? Is it accurate or is it nonsense? This is the question I've been looking into ever since GPT-3 was announced. So far, the answer seems to be inconclusive, as also reported elsewhere [1]. AFAIK, we don't know the exact specifics of what sources were used to build GPT. It is also quite likely that the data was 'cleaned' to remove obvious information, such as emails and addresses. Hopefully some authority somewhere has asked OpenAI to answer this question, even if we don't have this information in the public domain.

All this being said, if GPT-4 or something else does build even bigger LLMs that somehow make it possible to query what I said 10 years ago on a forum, e.g. with the prompt "What topics has Harsh posted on forums?", and which return results that are also publicly available via a search, then I don't think this is much different from GPT being used as a search engine.

The line I draw is when GPT or LLMs might start 'summarising' and 'inferring' things about me. For example, the prompt "What topics is Harsh interested in?" doesn't just return a list of websites but also the topics from those websites or the conversations in them. This is where the GDPR would get triggered substantially. We have already seen similar apprehensions about Clearview AI, which also scraped personal data (photos) off the internet and did things with it (facial recognition) [2].

If we start normalising such personal data collection, and put the onus on the individual to find out what is known about them, it will have negative impacts and will be a bad thing for society. It's much better to have the onus on the creators and providers of LLMs and other 'services' to ensure they toe the line and have the necessary risks and impacts assessed and addressed. This is also what the GDPR says (sort of), but we need more specific guidelines and enforcement for AI.

So I disagree that too much is left to chance. There's definitely a frenzy, but that's because this is something 'new and shiny' and so we tend to get distracted a lot. Moving ahead, quite a lot of us are taking the cautious approach, including governments with regulations [3].

[1] https://www.technologyreview.com/2022/08/31/1058800/what-does-gpt-3-know-about-me/
[2] https://edpb.europa.eu/news/national-news/2022/facial-recognition-italian-sa-fines-clearview-ai-eur-20-million_en
[3] https://en.wikipedia.org/wiki/Artificial_Intelligence_Act

Regards,
Harsh

On 16/01/2023 17:32, ProjectParadigm-ICT-Program wrote:
> ChatGPT is the new buzz in town. And according to the news media, millions have already engaged the chat bot. I am curious to find out if anyone has tested the chat bot to find out how much it knows about the person chatting with it.
>
> According to OpenAI, the data set, which is largely unspecified, runs until Dec 31, 2021 in terms of the data used, collected (scraped) from the Internet.
>
> Hundreds of thousands of websites exist which have openly accessible information about people with accounts on them, ranging from the work, academic, professional, trade, industry, corporate and non-profit domains. Quite a large portion of these have Terms and Conditions and User Agreements that state that the websites do not sell the data to third parties.
>
> But most of these websites may also have fora, chat groups and messaging systems, of which some content can be made publicly available on the Internet, if the account holder so desires and chooses such an option.
> Most AI algorithms owned and developed by large Internet companies, and definitely not only Meta, Google, Amazon and Microsoft, but even startups with substantial investor funding, are creating large data sets which remain hidden from scrutiny and oversight.
>
> It is common knowledge that in quite a few countries around the world, combining data sets to collect personal data in public administrations is carefully monitored, regulated and supervised.
>
> The same does not hold for the Internet and Internet companies.
>
> The current standard for data privacy protection is the General Data Protection Regulation from the European Union, but even the GDPR is far from perfect and far from being a "gold standard".
>
> Therefore I think engaging chat bots with the intent of finding out how much they know about our lives, limiting the scope for now to the work, academic, professional, trade, industry, corporate and non-profit domains, should yield some clues as to how far-reaching this scraping of data to include in data sets is.
>
> And it would be nice if this could be done in such a way, e.g. via templates or predetermined question sets, as to make analysis possible in a project setting.
>
> Too much is being left to chance, investors and market frenzy, and too little to scrutiny and informed debate with regard to the potential of AI products like ChatGPT.
>
> Milton Ponson
> GSM: +297 747 8280
> PO Box 1154, Oranjestad
> Aruba, Dutch Caribbean
> Project Paradigm: Bringing the ICT tools for sustainable development to all stakeholders worldwide through collaborative research on applied mathematics, advanced modeling, software and standards development

-- 
---
Harshvardhan J. Pandit, Ph.D
Assistant Professor
ADAPT Centre, Dublin City University
https://harshp.com/
Received on Wednesday, 18 January 2023 10:14:54 UTC