Re: An HTTP header to request OpenGraph or schema.org metadata

Thanks. It doesn't look like that thread got anywhere, though.

I think a separate "Accept-Metadata" header makes sense, since it's 
requesting metadata.

However, if all an agent wants is metadata, then perhaps there should be 
a way to request only the metadata. Currently, clients do this for 
OpenGraph by requesting only the first few KB of a file.
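
For reference, a rough sketch of that "first few KB" approach in Python. 
The 8 KB cut-off, the user agent string and the function names are all 
assumptions for illustration, not what any particular client does, and a 
server is free to ignore the Range header and send the whole page.

```python
from html.parser import HTMLParser
from urllib.request import Request

PREFIX_BYTES = 8192  # assumed cut-off; "first few KB" is not a fixed number


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("property") or ""
        # Keep the first value seen for each og: property.
        if name.startswith("og:") and attr.get("content") is not None:
            self.properties.setdefault(name, attr["content"])


def build_prefix_request(url, user_agent="example-preview-bot/0.1"):
    """Build a GET request asking for only the first PREFIX_BYTES bytes."""
    return Request(url, headers={
        "Range": f"bytes=0-{PREFIX_BYTES - 1}",
        "User-Agent": user_agent,  # hypothetical, honest user agent
    })


def parse_opengraph(html_prefix):
    """Extract OpenGraph properties from a possibly truncated HTML prefix."""
    parser = OpenGraphParser()
    # No close(): the prefix may end mid-tag, and that tail is just dropped.
    parser.feed(html_prefix)
    return parser.properties
```

The request would be sent with urllib.request.urlopen() and the decoded 
body passed to parse_opengraph(). Anything cut off at the end of the 
prefix is ignored, which is the weakness of guessing a byte count.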

It might make sense to make a HEAD or OPTIONS request with an 
Accept-Metadata header, with the response including a header that gives 
the URL from which to retrieve the metadata. That could be either the 
same URL (for a GET request, where the client can decide whether to 
request the first X bytes or the entire file depending on the metadata 
type), or a different URL (e.g. for JSON-LD data only, or some other 
format).
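
To make the idea concrete, here is a rough client-side sketch in Python. 
None of these headers exist today: Accept-Metadata is only the proposal 
from this thread, and the Metadata-Location and Metadata-Type response 
headers, the 8 KB prefix size and the function names are invented purely 
for illustration.

```python
from urllib.parse import urljoin
from urllib.request import Request

# All header names here are hypothetical. Accept-Metadata is only proposed
# in this thread; Metadata-Location and Metadata-Type are made up for the
# sketch. The format tokens are the ones used in the original proposal.
PREFIX_BYTES = 8192  # assumed size of "first X bytes"


def build_discovery_request(url, formats=("opengraph", "json+ld")):
    """HEAD request advertising which metadata formats the client wants."""
    return Request(url, method="HEAD",
                   headers={"Accept-Metadata": ", ".join(formats)})


def plan_followup(page_url, response_headers):
    """Decide how to fetch the metadata, given the HEAD response headers.

    Returns (url, extra_request_headers), or None if nothing was advertised.
    """
    headers = {k.lower(): v for k, v in response_headers.items()}
    location = headers.get("metadata-location")
    if location is None:
        return None
    target = urljoin(page_url, location)  # allow relative URLs
    kind = headers.get("metadata-type", "").strip().lower()
    if target == page_url and kind == "opengraph":
        # Metadata is embedded in the page's <head>: a prefix is enough.
        return target, {"Range": f"bytes=0-{PREFIX_BYTES - 1}"}
    # Separate resource (e.g. JSON-LD only) or unknown type: fetch it whole.
    return target, {}
```

A server that doesn't know about the header simply wouldn't send 
Metadata-Location, so plan_followup() returns None and the client can fall 
back to whatever it does now.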


On 23/11/2024 13:47, Soni "It/Its" L. wrote:
> apparently it was on this list, here's the thread: 
> https://lists.w3.org/Archives/Public/ietf-http-wg/2024JanMar/0181.html
>
> On 2024-11-23 10:45, Soni "It/Its" L. wrote:
>> we have asked about this before. don't think it was on this list tho? 
>> give us a sec...
>>
>> On 2024-11-23 09:56, Robert Rothenberg wrote:
>>> If you look at the HTTP logs for a website that's been around for a 
>>> while, you'll notice a lot of weird user agent strings that include 
>>> the text "Facebot Twitterbot" or "facebookexternal" or even 
>>> "Googlebot" when they are clearly not. Many of these are from 
>>> iMessage and various social media/chat applications.
>>>
>>> I've contacted the developers for one of these and was told this was 
>>> necessary because some major websites do not include OpenGraph 
>>> metadata unless the user agent string includes text strings for some 
>>> well-known bots.
>>>
>>> However, a website that I maintain has been bombarded with a lot of 
>>> unidentified web robots that we believe are using our content for AI 
>>> training, and many of these bots will falsely claim to be Googlebot 
>>> or Bingbot etc.  So we've implemented a scheme to verify these bots 
>>> and block the fakers.  A side-effect is that we're blocking a lot of 
>>> these social media/chat bots.
>>>
>>> Ideally, web clients shouldn't have to fake their user agent strings 
>>> just to get metadata.
>>>
>>> I think a better solution is to have an HTTP header, something like
>>>
>>>   Accept-Metadata: opengraph, json+ld
>>>
>>> The server should respond with a normal HTML web page, but can 
>>> optionally include metadata, possibly with a response header to 
>>> indicate what metadata formats are included.
>>>
>>> Is there existing work on this?
>>>
>>>
>>>
>>>
>>
>

Received on Thursday, 28 November 2024 11:06:49 UTC