- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Thu, 17 Aug 2023 15:55:26 +0200
- To: www-tag@w3.org, Marvin Hofer <marvin.hofer@informatik.uni-leipzig.de>
Dear TAG,

I'm reaching out to discuss a specific challenge we're encountering in Linked Data crawling, with the primary goal of data integration. Our objective is to deduplicate the data by identifying links or "sameAs" references and canonicalizing identifiers (equality), e.g. correcting http://dbpedia.org/page/Berlin (the HTML representation) when it is encountered in sameAs links or RDF data.

During our aggregation of numerous resources from the Web via HTTP requests, we've observed a significant number of redirects. For instance, when accessing data from DBpedia and Wikidata, there are as many as four redirects before the desired data is delivered. While we appreciate the necessity of an HTTPS upgrade (resulting in a 301 redirect), the three additional redirects impose a substantial overhead and impact our scalability efforts.

To illustrate: ideally, a request like

curl -IL -H "Accept: application/n-triples" http://dbpedia.org/resource/Berlin

(or the same for http://www.wikidata.org/entity/Q64) would return a 301 redirect to HTTPS and then a direct 200 OK status with the corresponding Content-Type: application/n-triples and the data's location at $DATAURI. Note that DBpedia and Wikidata are just examples here; we did a preliminary test crawl (see https://svn.aksw.org/papers/2023/ld-crawl/public.pdf) and are still assessing how redirects are done in the wild, for debugging and proxying purposes. A small script for tracing such redirect chains is sketched in the postscript below.

My primary query revolves around the architectural implications of such a setup:

* Are there significant architectural or technical hurdles preventing servers from directly returning payload data in the requested format via a 200 OK status?
* Are there established best practices or TAG recommendations regarding the number and nature of redirects for Linked Data access? In particular, we are looking for something that might be called "Apification", i.e. keeping the discovery mechanism of linking while making retrieval as straightforward as with an API.
* Could you provide insights into the potential reasons for content negotiation (and the redirect chains it entails) in such scenarios, especially where they don't serve a direct semantic or technical purpose?

Our intention is to optimize our data integration process, and understanding these aspects will significantly aid our efforts.

Thank you for your time and consideration. I look forward to your insights.

All the best,
Sebastian
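
P.S. For concreteness, here is a minimal sketch (in Python, using the common "requests" library) of how one can trace the redirect chain for a URI with an RDF Accept header, mirroring the curl -IL invocation above. The two URIs are the examples from this mail; the function name and defaults are merely illustrative, and some servers handle HEAD differently from GET, so take the hop counts with a grain of salt.

import requests

def trace_redirects(uri, accept="application/n-triples"):
    """Mirror `curl -IL -H "Accept: ..." $uri`: follow redirects, print each hop."""
    resp = requests.head(uri, headers={"Accept": accept},
                         allow_redirects=True, timeout=30)
    # resp.history holds one Response per redirect hop, in order.
    for hop in resp.history:
        print(hop.status_code, hop.url, "->", hop.headers.get("Location"))
    print(resp.status_code, resp.url, resp.headers.get("Content-Type"))
    return len(resp.history)

for uri in ("http://dbpedia.org/resource/Berlin",
            "http://www.wikidata.org/entity/Q64"):
    print(uri, ":", trace_redirects(uri), "redirect(s)")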
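
And, for the deduplication side, a sketch of what we mean by canonicalizing identifiers: given sameAs pairs, a union-find structure maps every identifier in an equivalence class to one canonical URI. The pairs below are illustrative, not output of our crawl.

same_as_pairs = [
    ("http://dbpedia.org/resource/Berlin", "http://www.wikidata.org/entity/Q64"),
    ("http://dbpedia.org/page/Berlin", "http://dbpedia.org/resource/Berlin"),
]

parent = {}

def find(x):
    """Return the canonical representative of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the classes of a and b, keeping the smaller root as canonical."""
    ra, rb = find(a), find(b)
    if ra != rb:
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

for a, b in same_as_pairs:
    union(a, b)

print(find("http://dbpedia.org/page/Berlin"))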
Received on Thursday, 17 August 2023 13:55:55 UTC