Re: ChatGPT and general personal profile data gathering from Timothy Holborn on 2023-01-18 (public-cogai@w3.org from January 2023)

From: Timothy Holborn <timothy.holborn@gmail.com>
Date: Wed, 18 Jan 2023 21:56:20 +1000
To: "Harshvardhan J. Pandit" <me@harshp.com>
Cc: ProjectParadigm-ICT-Program <metadataportals@yahoo.com>, Public-cogai <public-cogai@w3.org>
Message-ID: <CAM1Sok3j9Q0ooH2jbioZdmzpijRTRscmGtHH8o31mOHi7rCADA@mail.gmail.com>
Heya,

I've found that it can contradict itself and certainly also appears to
'make stuff up'...

e.g. WebID-DNS - https://twitter.com/WebCivics/status/1608453885693235202

noting also, the experiment relating to the term 'holborn'
https://devdocs.webizen.org/SocialFactors/WebScience/ArtificialMinds/AChatGPTExperimentHolborn/


Noting also - it's described as a 'language model', yet it's more of a
'knowledge model'...  meaning, it's a lot more than simply vocabulary or
language, and in some ways, there's a lot about language that it doesn't
appear to know very well...  (I'm doing some work on that general area of
consideration atm - noting the devdocs are an early incomplete draft, etc.).

Whilst noting otherwise - ChatGPT has been INCREDIBLY VALUABLE for me...
Whilst it certainly gets a lot of things wrong, it is also far more helpful
than any other agent I've known for a very long time - to help me get work
I care about done, even though I can't afford to pay others to help me
figure out solutions to problems, or help me learn skills that I'm not very
good at atm.  Noting also: I think it's important to figure out how to
create a labelling methodology to distinguish which agent is responsible
for what...
https://devdocs.webizen.org/SocialFactors/WebScience/SafetyProtocols/AgentLabelling/


Hope that helps....

Timothy Holborn

On Wed, 18 Jan 2023 at 20:15, Harshvardhan J. Pandit <me@harshp.com> wrote:

> Hi.
>
> Does (Chat)GPT spit out personal data? Is it accurate or is it nonsense?
>
> This is the question I've been looking into ever since GPT3 was
> announced. So far, the answer seems to be inconclusive, as also reported
> elsewhere [1]. AFAIK, we don't know the exact specifics of what sources
> were used to build GPT. It is also quite likely that data was 'cleaned'
> to remove obvious information, such as emails and addresses. Hopefully
> some authority somewhere has asked OpenAI to answer this question - even
> if we don't have this info. in the public domain.
>
> All this being said, if GPT4 or something else does build even bigger
> LLMs that somehow make it possible to query what I said 10 years ago on
> a forum, e.g. with the prompt "What topics has Harsh posted on forums?",
> and which return results that are also publicly available via a
> search - I don't think this is much different from GPT being used like a
> search engine.
>
> The line I draw is when GPT or LLMs might start 'summarising' and
> 'inferring' things about me. For example, the prompt "What topics is
> Harsh interested in?" doesn't just return a list of websites but the
> topics from that website or conversations in it. This is where GDPR
> would get triggered substantially. We have already seen similar
> apprehensions for ClearViewAI which also scraped personal data (photos)
> off the internet and did things with it (facial recognition) [2].
>
> If we start normalising such personal data collection, and put the onus
> on the individual to find out what is known about them - it will have
> negative impacts and will be a bad thing for society. It's much better to
> have the onus on the creators and providers of LLMs and other 'services'
> to ensure they toe the line and have the necessary risks and impacts
> assessed and addressed. This is also what the GDPR says (sort of), but
> we need more specific guidelines and enforcements for AI.
>
> So I disagree that too much is left to chance. There's definitely a
> frenzy but that's because this is something 'new and shiny' and so we
> tend to get distracted a lot. Moving ahead, quite a lot of us are taking
> the cautious approach, including governments with regulations [3].
>
> [1] https://www.technologyreview.com/2022/08/31/1058800/what-does-gpt-3-know-about-me/
> [2] https://edpb.europa.eu/news/national-news/2022/facial-recognition-italian-sa-fines-clearview-ai-eur-20-million_en
> [3] https://en.wikipedia.org/wiki/Artificial_Intelligence_Act
>
> Regards,
> Harsh
>
> On 16/01/2023 17:32, ProjectParadigm-ICT-Program wrote:
> > ChatGPT is the new buzz in town. And according to the news media,
> > millions have already engaged the chat bot. I am curious to find out if
> > anyone has tested the chat bot to find out how much it knows about the
> > person chatting with it.
> >
> > According to OpenAI, the data set, which is largely unspecified, runs
> > until Dec 31, 2021 in terms of the data used, collected (scraped) from
> > the Internet.
> >
> > Hundreds of thousands of websites exist which have openly accessible
> > information about people with accounts on them ranging from work,
> > academic, professional, trade, industry, corporate and non-profit
> > domains.
> > Quite a large portion of these have Terms and Conditions and User
> > Agreements that state that the websites do not sell the data to third
> > parties.
> >
> > But most of these websites may also have fora, chat groups and messaging
> > systems, of which some content can be made publicly available on the
> > Internet, if the account holder so desires and chooses such an option.
> >
> > Most AI algorithms owned and developed by large Internet companies, and
> > definitely not only Meta, Google, Amazon, Microsoft, but even startups
> > with substantial investor funding are creating large data sets which
> > remain hidden from scrutiny and oversight.
> >
> > It is common knowledge that in quite a few countries around the world
> > combining data sets to collect personal data in public administrations
> > is carefully monitored, regulated and supervised.
> >
> > The same does not hold for the Internet and Internet companies.
> >
> > The current standard for data privacy protection is the General Data
> > Protection Regulation from the European Union, but even this GDPR is far
> > from perfect and far from being a "gold standard".
> >
> > Therefore I think engaging chat bots with the intent of finding out how
> > much they know about our lives, and to limit the scope for now, to the
> > work, academic, professional, trade, industry, corporate and non-profit
> > domains should yield some clues to how far reaching this scraping for
> > data to include in data sets is going.
> >
> > And it would be nice if this could be done in such a way e.g. via
> > templates or predetermined question sets to make this possible for
> > analysis in a project setting.
> >
> > Too much is being left to chance, investors and market frenzy, and too
> > little to scrutiny and informed debate with regard to the potential of AI
> > products like ChatGPT.
> >
> > Milton Ponson
> > GSM: +297 747 8280
> > PO Box 1154, Oranjestad
> > Aruba, Dutch Caribbean
> > Project Paradigm: Bringing the ICT tools for sustainable development to
> > all stakeholders worldwide through collaborative research on applied
> > mathematics, advanced modeling, software and standards development
>
> --
> ---
> Harshvardhan J. Pandit, Ph.D
> Assistant Professor
> ADAPT Centre, Dublin City University
> https://harshp.com/
>
>
Received on Wednesday, 18 January 2023 11:57:11 UTC