Re: An HTTP header to request OpenGraph or schema.org metadata from Austin Wright on 2024-12-02 (ietf-http-wg@w3.org from October to December 2024)

From: Austin Wright <aaa@bzfx.net>
Date: Mon, 2 Dec 2024 12:40:33 -0500
To: Robert Rothenberg <robrwo@gmail.com>
Cc: ietf-http-wg@w3.org
Message-Id: <A0A4FD20-0970-4AD3-B773-6B94AB498E8D@bzfx.net>
> On Nov 28, 2024, at 06:06, Robert Rothenberg <robrwo@gmail.com> wrote:
> 
> Thanks. But it doesn't look like it's gotten anywhere.
> 
> I think a separate "Accept-Metadata" header makes sense, since it's requesting metadata.
> 
> However, if all an agent wants is metadata, then perhaps there should be a way to request only the metadata. Currently this is done for OpenGraph by making a request for the first few KB of a file.

This is already possible to some degree. A server can consult the Accept header when choosing which media types to embed within a document. Consider the request header:

Accept: text/html, application/ld+json

You may read this as a request to embed JSON-LD within HTML (as opposed to, say, text/turtle, or nothing at all).
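
As a rough sketch of that reading (not a definitive implementation), a server could parse the Accept header and pick a metadata format to embed in its HTML response. The function names and the EMBEDDABLE tuple are hypothetical, the media types used are the registered ones (application/ld+json for JSON-LD, text/turtle for Turtle), and wildcards such as */* are deliberately ignored to keep the example short:

```python
from typing import List, Optional, Tuple

# Assumption: the metadata formats this hypothetical server can embed in HTML.
EMBEDDABLE = ("application/ld+json", "text/turtle")


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media type, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means q=1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_embedded_metadata(accept_header: str) -> Optional[str]:
    """Return the metadata type to embed in HTML, or None for plain HTML."""
    accepted = [(t, q) for t, q in parse_accept(accept_header) if q > 0]
    if "text/html" not in {t for t, _ in accepted}:
        return None  # the client did not ask for HTML at all
    for media_type, _ in accepted:
        if media_type in EMBEDDABLE:
            return media_type
    return None
```

With the header above, choose_embedded_metadata("text/html, application/ld+json") selects application/ld+json, while a bare "text/html" yields None, i.e. HTML with nothing embedded.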

Now, this may be too ambiguous for some user agents. For example, if a user agent wants a plain Turtle document the most, followed by opaque HTML, and last HTML with embedded Turtle, this may be difficult to convey. I think we should first establish concrete cases where the Accept header would be insufficient by itself, before considering a header like Accept-Metadata.

Second, I agree that you shouldn’t need to fake user agent strings (at most, it should be a last resort to work around bugs in particular user agents). However, I’m not sure how a new header would solve your problem, as the websites in question probably have little incentive to read an Accept header to begin with. Or they may be withholding the metadata out of some business desire to expose it “only” to Facebook, and not to just any client.

> It might make sense to make a HEAD or OPTIONS request with an Accept-Metadata header, with the response including a header that gives the URL from which to retrieve the metadata. That could be either the same URL (for a GET request, the client can decide whether to request the first X bytes or the entire file, depending on the metadata type) or a different URL (e.g. for JSON-LD data only, or some other format).

In addition to content negotiation, there’s also the option of sending a Link header with rel=alternate and type attributes.
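
For illustration only, here is a minimal sketch of how a server might build such a Link header value. The helper names and the /article.jsonld and /article.ttl targets are made up for the example, and the targets are assumed to be already valid URI references that need no escaping:

```python
from typing import Iterable, Tuple


def alternate_link(target: str, media_type: str) -> str:
    """Format one Link header value advertising an alternate representation."""
    return f'<{target}>; rel="alternate"; type="{media_type}"'


def link_header(alternates: Iterable[Tuple[str, str]]) -> str:
    """Join several (target, media type) pairs into a single Link header value."""
    return ", ".join(alternate_link(target, mt) for target, mt in alternates)


# Hypothetical metadata-only representations of the same page.
value = link_header([
    ("/article.jsonld", "application/ld+json"),
    ("/article.ttl", "text/turtle"),
])
print("Link: " + value)
```

A client that only wants metadata could then fetch one of the advertised targets instead of the HTML page.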

Regards,

Austin.

> 
> 
> On 23/11/2024 13:47, Soni "It/Its" L. wrote:
>> apparently it was on this list, here's the thread: https://lists.w3.org/Archives/Public/ietf-http-wg/2024JanMar/0181.html
>> 
>> On 2024-11-23 10:45, Soni "It/Its" L. wrote:
>>> we have asked about this before. don't think it was on this list tho? give us a sec...
>>> 
>>> On 2024-11-23 09:56, Robert Rothenberg wrote:
>>>> If you look at the HTTP logs for a website that's been around for a while, you'll notice a lot of weird user agent strings that include the text "Facebot Twitterbot" or "facebookexternal" or even "Googlebot" when they are clearly not. Many of these are from iMessage and various social media/chat applications.
>>>> 
>>>> I've contacted the developers for one of these and was told this was necessary because some major websites do not include OpenGraph metadata unless the user agent string includes text strings for some well-known bots.
>>>> 
>>>> However, a website that I maintain has been bombarded with a lot of unidentified web robots that we believe are using our content for AI training, and many of these bots will falsely claim to be Googlebot or Bingbot etc.  So we've implemented a scheme to verify these bots and block the fakers.  A side-effect is that we're blocking a lot of these social media/chat bots.
>>>> 
>>>> Ideally, web clients shouldn't have to fake their user agent strings just to get metadata.
>>>> 
>>>> I think a better solution is to have an HTTP header, something like
>>>> 
>>>>   Accept-Metadata: opengraph, json+ld
>>>> 
>>>> The server should respond with a normal HTML web page, but can optionally include metadata, possibly with a response header to indicate what metadata formats are included.
>>>> 
>>>> Is there existing work on this?
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 
>
Received on Monday, 2 December 2024 17:40:49 UTC