- From: Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>
- Date: Thu, 17 Aug 2023 15:55:26 +0200
- To: www-tag@w3.org, Marvin Hofer <marvin.hofer@informatik.uni-leipzig.de>
Dear TAG,

I'm reaching out to discuss a specific challenge we're encountering in Linked Data crawling, with the primary goal of data integration. Our objective is to deduplicate the data by identifying links or "sameAs" references and canonicalizing identifiers (equality), e.g. correcting http://dbpedia.org/page/Berlin (the HTML representation) when it is encountered in sameAs links or RDF data.

During our aggregation of numerous resources from the Web via HTTP requests, we've observed a significant number of redirects. For instance, when accessing data from DBpedia and Wikidata, there are as many as four redirects before the desired data is delivered. While we appreciate the necessity of an HTTPS upgrade (resulting in a 301 redirect), the three additional redirects impose a substantial overhead and impact our scalability efforts.

To illustrate: ideally, a request like

curl -IL -H "Accept: application/n-triples" http://dbpedia.org/resource/Berlin

(or the same for http://www.wikidata.org/entity/Q64) would return a 301 redirect to HTTPS and then a direct 200 OK status with the corresponding Content-Type: application/n-triples and the data's location at $DATAURI. Note that DBpedia and Wikidata are just examples here; we did a preliminary test crawl (see https://svn.aksw.org/papers/2023/ld-crawl/public.pdf) and are still assessing how redirects are done in the wild, for debugging and proxying purposes. A small script for tracing such redirect chains is sketched in the postscript below.

My primary query revolves around the architectural implications of such a setup:

* Are there significant architectural or technical hurdles preventing servers from directly returning payload data in the requested format via a 200 OK status?
* Are there established best practices or TAG recommendations regarding the number and nature of redirects for Linked Data access? In particular, we are looking for something that might be called "Apification", i.e. keeping the discovery mechanism of linking while making retrieval as straightforward as with an API.
* Could you provide insights into the potential reasons for content negotiation (and the redirect chains it entails) in such scenarios, especially where they don't serve a direct semantic or technical purpose?

Our intention is to optimize our data integration process, and understanding these aspects will significantly aid our efforts.

Thank you for your time and consideration. I look forward to your insights.

All the best,
Sebastian
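
P.S. For concreteness, here is a minimal sketch (in Python, using the common "requests" library) of how one can trace the redirect chain for a URI with an RDF Accept header, mirroring the curl -IL invocation above. The two URIs are the examples from this mail; the function name and defaults are merely illustrative, and some servers handle HEAD differently from GET, so take the hop counts with a grain of salt.

import requests

def trace_redirects(uri, accept="application/n-triples"):
    """Mirror `curl -IL -H "Accept: ..." $uri`: follow redirects, print each hop."""
    resp = requests.head(uri, headers={"Accept": accept},
                         allow_redirects=True, timeout=30)
    # resp.history holds one Response per redirect hop, in order.
    for hop in resp.history:
        print(hop.status_code, hop.url, "->", hop.headers.get("Location"))
    print(resp.status_code, resp.url, resp.headers.get("Content-Type"))
    return len(resp.history)

for uri in ("http://dbpedia.org/resource/Berlin",
            "http://www.wikidata.org/entity/Q64"):
    print(uri, ":", trace_redirects(uri), "redirect(s)")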
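
And, for the deduplication side, a sketch of what we mean by canonicalizing identifiers: given sameAs pairs, a union-find structure maps every identifier in an equivalence class to one canonical URI. The pairs below are illustrative, not output of our crawl.

same_as_pairs = [
    ("http://dbpedia.org/resource/Berlin", "http://www.wikidata.org/entity/Q64"),
    ("http://dbpedia.org/page/Berlin", "http://dbpedia.org/resource/Berlin"),
]

parent = {}

def find(x):
    """Return the canonical representative of x, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the classes of a and b, keeping the smaller root as canonical."""
    ra, rb = find(a), find(b)
    if ra != rb:
        if rb < ra:
            ra, rb = rb, ra
        parent[rb] = ra

for a, b in same_as_pairs:
    union(a, b)

print(find("http://dbpedia.org/page/Berlin"))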
Received on Thursday, 17 August 2023 13:55:55 UTC