Re: page about robots.txt from Leonard Rosenthol on 2021-02-25 (public-tdmrep@w3.org from February 2021)

From: Leonard Rosenthol <lrosenth@adobe.com>
Date: Thu, 25 Feb 2021 16:58:13 +0000
To: Brendan Quinn <brendan@cluefulmedia.com>
CC: "laurent.lemeur@edrlab.org" <laurent.lemeur@edrlab.org>, "public-tdmrep@w3.org" <public-tdmrep@w3.org>
Message-ID: <EF609BC1-49B4-4786-AC96-7D4F82476AA0@adobe.com>
Actually, the web standards community is moving *TOWARDS* supporting embedded metadata in assets to control their handling.  The most recent example is the change to how `image-orientation` is used in CSS (https://drafts.csswg.org/css-images/#the-image-orientation), where the new default is `from-image`.

Taking a similar approach with TDM would seem fully in line with that direction…

Leonard

From: Brendan Quinn <brendan@cluefulmedia.com>
Date: Thursday, February 25, 2021 at 11:20 AM
To: Leonard Rosenthol <lrosenth@adobe.com>
Cc: "laurent.lemeur@edrlab.org" <laurent.lemeur@edrlab.org>, "public-tdmrep@w3.org" <public-tdmrep@w3.org>
Subject: Re: page about robots.txt

Of course Google (and other search engines) also allow robot control to be embedded in web pages, and in fact that is the way that Google recommends that people block pages from Google search: https://developers.google.com/search/reference/robots_meta_tag

That page references <meta name="robots" content="noindex" /> -- which of course only works for HTML pages --  and also the X-Robots-Tag: noindex HTTP header.
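
To make the two mechanisms concrete, here is a minimal Python sketch that collects robots directives from both places. The function and class names are illustrative only, and the header parsing is simplified: it ignores the optional user-agent prefix that X-Robots-Tag values may carry.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def meta_directives(html_text):
    """Directives embedded in an HTML page (HTML only)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


def header_directives(headers):
    """Directives from an X-Robots-Tag response header (any media type)."""
    found = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found.update(p.strip().lower() for p in value.split(",") if p.strip())
    return found
```

A crawler would typically take the union of both sets before deciding whether to index a resource.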

Probably an HTTP header approach is the only thing that will be truly universal across plain text, HTML, PDF, images, audio, video, everything else... though no doubt someone will prove me wrong in the next 5 minutes :-)

... but it doesn't solve the problem Leonard raised of retaining the "do not mine" metadata when an asset is moved.

So it seems that technically the guidance has to be to look in multiple places - "look for this field in an HTML page but look for these embedded properties in an image or PDF" etc. Possibly this could be done at the web server level and the result added as an HTTP header on the fly, but that's a lot to ask of web servers, isn't it?
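
A rough Python sketch of that "look in multiple places" logic follows. Every field name in it (the header, the meta tag) is a hypothetical placeholder, since no TDM field had been agreed at this point; the extractors for images and PDF are deliberately left out.

```python
import re
from typing import Callable, Dict, Mapping, Optional

Extractor = Callable[[bytes], Optional[str]]

# Hypothetical names, used only for illustration.
HYPOTHETICAL_HEADER = "x-example-tdm"
HYPOTHETICAL_META = re.compile(
    rb'<meta\s+name="example-tdm"\s+content="([^"]*)"', re.IGNORECASE
)


def from_html(body: bytes) -> Optional[str]:
    """Look for the hypothetical meta tag inside an HTML page."""
    match = HYPOTHETICAL_META.search(body)
    return match.group(1).decode("utf-8", "replace") if match else None


# One extractor per media type; image or PDF entries would read
# embedded metadata and are not implemented in this sketch.
EXTRACTORS: Dict[str, Extractor] = {
    "text/html": from_html,
}


def find_reservation(headers: Mapping[str, str], body: bytes) -> Optional[str]:
    lowered = {name.lower(): value for name, value in headers.items()}
    # 1. A header works for any media type, so check it first.
    if HYPOTHETICAL_HEADER in lowered:
        return lowered[HYPOTHETICAL_HEADER].strip()
    # 2. Otherwise fall back to a format-specific lookup.
    media_type = lowered.get("content-type", "").split(";")[0].strip().lower()
    extractor = EXTRACTORS.get(media_type)
    return extractor(body) if extractor else None
```

The same function could run inside a web server to compute the header on the fly, which is where the cost Brendan mentions would fall.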

Brendan.

On Thu, 25 Feb 2021 at 16:58, Leonard Rosenthol <lrosenth@adobe.com> wrote:
As I mentioned on the call, the biggest problem with robots.txt (and the others that Brendan mentions) is that they are completely detached from the assets that they refer to.  This means that a user can simply move an asset from one server to another, and all TDM information/rights will no longer apply.   This, IMO, makes it a non-starter as an option.

Leonard

From: Brendan Quinn <brendan@cluefulmedia.com>
Date: Thursday, February 25, 2021 at 9:02 AM
To: "laurent.lemeur@edrlab.org" <laurent.lemeur@edrlab.org>
Cc: "public-tdmrep@w3.org" <public-tdmrep@w3.org>
Subject: Re: page about robots.txt
Resent-From: <public-tdmrep@w3.org>
Resent-Date: Thursday, February 25, 2021 at 9:02 AM

Thanks Laurent, that looks good.

It's probably worth mentioning that there are some provider-specific extensions to robots.txt used in the wild, e.g. sitemap:, which is used by "Google, Bing, and other major search engines".

https://developers.google.com/search/reference/robots_txt#google-supported-non-group-member-lines
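
As an illustration of how such an extension sits alongside the core rules, here is a small Python sketch using the standard library's urllib.robotparser (its site_maps() method requires Python 3.8 or later). The robots.txt content, the example.com URLs, and the "ExampleBot" user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one core rule group plus the sitemap extension.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""


def parse_robots(text):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)

# Core behaviour: per-path allow/disallow decisions for a user agent.
blocked = not parser.can_fetch("ExampleBot", "https://example.com/private/page.html")

# Extension: sitemap lines are not tied to any user-agent group.
sitemaps = parser.site_maps()
```

Note that the sitemap line is a non-group member: it applies to the whole file rather than to one user agent, which is the kind of extension point a TDM field could in principle use.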

I guess we should also document the .well-known folder, with spec here: https://tools.ietf.org/html/rfc8615 and the quite extensive "well-known URI repository" at https://www.iana.org/assignments/well-known-uris/well-known-uris.xhtml
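
For reference, RFC 8615 places well-known resources under the "/.well-known/" path prefix at the root of the origin. A minimal Python sketch of deriving such a location from any resource URL is below; the suffix "example-tdm.json" is a hypothetical name, not a registered well-known URI.

```python
from urllib.parse import urlsplit, urlunsplit


def well_known_url(resource_url, suffix):
    """Return the /.well-known/ URL on the same origin as resource_url.

    The path, query and fragment of the resource are dropped, because
    well-known URIs live at the root of the origin.
    """
    parts = urlsplit(resource_url)
    return urlunsplit((parts.scheme, parts.netloc, "/.well-known/" + suffix, "", ""))


# Hypothetical suffix, for illustration only.
policy_url = well_known_url("https://example.com/books/ch1.html?x=1", "example-tdm.json")
```

Like robots.txt, a file at this location is tied to the server rather than to the asset, so it shares the limitation Leonard describes when an asset is moved.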

Also see IAB's ads.txt initiative: https://iabtechlab.com/ads-txt/

Sorry I missed the call on Tuesday. I hope it was fruitful.

Best regards,

Brendan.

On Thu, 25 Feb 2021 at 15:29, Laurent Le Meur <laurent.lemeur@edrlab.org> wrote:
Dear participants,

I have added a page to the GitHub repo that tries to summarize what robots.txt is and how it is used. Ivan Herman described robots.txt as a possible source of inspiration during our last call.

https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/robots.md

Best regards
Laurent Le Meur
Received on Thursday, 25 February 2021 16:58:30 UTC