Algorithmic Archive project introduction from Pierre Marshall on 2025-01-30 (public-swicg@w3.org from January 2025)

From: Pierre Marshall <pierre.marshall@bodleian.ox.ac.uk>
Date: Thu, 30 Jan 2025 11:50:43 +0000
To: "public-swicg@w3c.org" <public-swicg@w3c.org>
CC: Beatrice Cannelli <beatrice.cannelli@bodleian.ox.ac.uk>
Message-ID: <LNXP265MB034817ADFB8DC5DCD0DF7191EAE92@LNXP265MB0348.GBRP265.PROD.OUTLOOK.COM>

Hello everyone,

I've joined this group as part of the Algorithmic Archive project at the Bodleian Libraries.
This email is to introduce myself and my colleague, Beatrice Cannelli.

The Bodleian already operates a web archiving programme, and we would like to extend this to include collecting social media. In addition, the Bodleian has received 'digital deposits' of personal archives which include email and social media data, and we know that various researchers at the University of Oxford and other institutions have collected social media datasets for their own projects.

At the moment the project is in an initial scoping phase, where we are surveying what other institutions are doing, and testing the APIs of various social media platforms to see what can be collected.

My current archiving code is here:
https://github.com/extua/bodsky-archiver

I started with Bluesky, then Twitter, and I'll move on to another platform soon. Warning: this is not professional, production-ready code, just a proof of concept! Also, at the moment we are not looking at archiving entire platforms, just ad-hoc collections.

Each platform has its own schema for dents, toots, tweets, statuses, posts, notes, etc.
Ideally we would like to store everything in the same format, so as part of this scoping phase we are looking at standards for social media data.

There is a standardisation effort, not yet finalised, led by the Knight Georgetown Institute in the USA:
https://kgi.georgetown.edu/gold-standard-faq/

The Bridgy https://brid.gy/ and Granary https://granary.io/ projects have also worked on interoperability between platforms, and both use ActivityStreams as a common language.
The ActivityStreams model also allows us to connect social media data to other identifiers, which is useful for the organisation of information in collections.
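
As a rough illustration of what storing everything in the same format could look like, here is a minimal Python sketch that maps a few generic post fields onto an ActivityStreams 2.0 Note. The function name `to_as2_note`, its parameters, and the example.org URLs are hypothetical; they are not taken from bodsky-archiver or from any platform's API, and a real mapping would need to handle many more fields.

```python
from datetime import datetime, timezone

# The ActivityStreams 2.0 JSON-LD context URL.
AS2_CONTEXT = "https://www.w3.org/ns/activitystreams"


def to_as2_note(post_id, author_url, text, created_at):
    """Map generic post fields (hypothetical inputs) onto an AS2 Note."""
    return {
        "@context": AS2_CONTEXT,
        "type": "Note",
        "id": post_id,
        "attributedTo": author_url,
        "content": text,
        # AS2 timestamps are xsd:dateTime; normalise to UTC here.
        "published": created_at.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


# Example with made-up values.
note = to_as2_note(
    "https://example.org/posts/1",
    "https://example.org/users/alice",
    "Hello from the archive",
    datetime(2025, 1, 30, 11, 50, 43, tzinfo=timezone.utc),
)
```

Because `id` and `attributedTo` are plain URLs, records in this shape could in principle be linked to other identifiers held in a collection.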

So, I am here to see how the Algorithmic Archive project can use (and contribute to) social web standards.
In a more practical sense, if any of you have any advice on how to go about archiving Mastodon instances, I would appreciate any tips :)

I'll be at FOSDEM this weekend, and I'm looking forward to meeting some of you at the Brussels Hackspace on Sunday evening.

Kind regards,
Pierre Marshall (Technical Research Officer)
Beatrice Cannelli (Curatorial and Policy Research Officer)
Bodleian Libraries, University of Oxford

Received on Thursday, 30 January 2025 11:52:02 UTC