Re: page about robots.txt from Brendan Quinn on 2021-02-25 (public-tdmrep@w3.org from February 2021)

From: Brendan Quinn <brendan@cluefulmedia.com>
Date: Thu, 25 Feb 2021 18:20:29 +0200
To: Leonard Rosenthol <lrosenth@adobe.com>
Cc: "laurent.lemeur@edrlab.org" <laurent.lemeur@edrlab.org>, "public-tdmrep@w3.org" <public-tdmrep@w3.org>
Message-ID: <CAMvELkc48L-peKoKBKf0+OH1nUqPfLvfoiMfBNWRDwnLqDS3GA@mail.gmail.com>
Of course Google (and other search engines) also allow robot control to be
embedded in web pages, and in fact that is the way that Google recommends
that people block pages from Google search:
https://developers.google.com/search/reference/robots_meta_tag

That page references <meta name="robots" content="noindex" /> -- which of
course only works for HTML pages -- and also the X-Robots-Tag: noindex
HTTP header.
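
To make the two mechanisms concrete, here is a minimal Python sketch of how a crawler might collect directives from both places. The function and class names are my own (not from the Google page), and it deliberately ignores user-agent-scoped header values such as "googlebot: noindex".

```python
from html.parser import HTMLParser


def _split(value):
    """Split a comma-separated directive list into lowercase tokens."""
    return {d.strip().lower() for d in value.split(",") if d.strip()}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= _split(attr.get("content", ""))


def robots_directives(headers, body=""):
    """Merge directives from the X-Robots-Tag header and any HTML meta tags.

    headers: dict of response headers; body: decoded HTML text (may be empty).
    """
    directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives |= _split(value)
    parser = RobotsMetaParser()
    parser.feed(body)
    return directives | parser.directives
```

A caller would then test for the directive it cares about, for example `"noindex" in robots_directives(headers, body)`.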

An HTTP header is probably the only approach that will be truly universal
across plain text, HTML, PDF, images, audio, video and everything else,
though no doubt someone will prove me wrong in the next 5 minutes :-)

... but it doesn't solve the problem Leonard raised of retaining the "do
not mine" metadata when an asset is moved.

So it seems that, technically, the guidance has to be to look in multiple
places: "look for this field in an HTML page, but look for these embedded
properties in an image or PDF", etc. Possibly this could be done at the web
server level and the result added as an HTTP header on the fly, but that's
a lot to ask of web servers, isn't it?
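
As a rough illustration of the "on the fly" idea, the Python sketch below picks an extractor by media type and, if it finds an embedded signal, adds the matching X-Robots-Tag header. Everything here is an assumption for illustration: the EXTRACTORS registry and function names are hypothetical, only HTML is filled in, and the regex assumes name= comes before content= in the meta tag.

```python
import re

# Simplified pattern: assumes name= precedes content= inside the meta tag.
_META_ROBOTS = re.compile(
    rb'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def html_directives(body):
    """Return the content of the first robots meta tag in body, or None."""
    match = _META_ROBOTS.search(body)
    return match.group(1).decode("ascii", "replace") if match else None


# Hypothetical registry: media type -> function that reads an embedded signal.
# Readers for image or PDF metadata would be registered here as well.
EXTRACTORS = {"text/html": html_directives}


def add_robots_header(headers, body):
    """Return headers, with X-Robots-Tag appended if body carries a signal.

    headers: list of (name, value) tuples; body: raw response bytes.
    An X-Robots-Tag header that is already present is left untouched.
    """
    content_type = next((v for k, v in headers if k.lower() == "content-type"), "")
    media_type = content_type.split(";")[0].strip().lower()
    extractor = EXTRACTORS.get(media_type)
    value = extractor(body) if extractor else None
    if value is None or any(k.lower() == "x-robots-tag" for k, _ in headers):
        return headers
    return headers + [("X-Robots-Tag", value)]
```

The cost is visible even in this toy version: the server has to read the body before it can finish the headers, which is part of why this is a lot to ask.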

Brendan.

On Thu, 25 Feb 2021 at 16:58, Leonard Rosenthol <lrosenth@adobe.com> wrote:

> As I mentioned on the call, the biggest problem with robots.txt (and the
> others that Brendan mentions) is that they are completely detached from the
> assets that they refer to.  This means that a user can simply move an asset
> from one server to another, and all TDM information/rights will no longer
> apply.   This, IMO, makes it a non-starter as an option.
>
>
>
> Leonard
>
>
>
> *From: *Brendan Quinn <brendan@cluefulmedia.com>
> *Date: *Thursday, February 25, 2021 at 9:02 AM
> *To: *"laurent.lemeur@edrlab.org" <laurent.lemeur@edrlab.org>
> *Cc: *"public-tdmrep@w3.org" <public-tdmrep@w3.org>
> *Subject: *Re: page about robots.txt
> *Resent-From: *<public-tdmrep@w3.org>
> *Resent-Date: *Thursday, February 25, 2021 at 9:02 AM
>
>
>
> Thanks Laurent, that looks good.
>
>
>
> It's probably worth mentioning that there are some provider-specific
> extensions to robots.txt used in the wild, e.g. the sitemap: line, which
> is supported by "Google, Bing, and other major search engines".
>
>
>
>
> https://developers.google.com/search/reference/robots_txt#google-supported-non-group-member-lines
>
>
>
> I guess we should also document the .well-known folder, with spec here:
> https://tools.ietf.org/html/rfc8615
> and the quite extensive "well-known URI repository" at
> https://www.iana.org/assignments/well-known-uris/well-known-uris.xhtml
>
>
>
> Also see IAB's ads.txt initiative: https://iabtechlab.com/ads-txt/
>
>
>
> Sorry I missed the call on Tuesday. I hope it was fruitful.
>
>
>
> Best regards,
>
>
>
> Brendan.
>
>
>
> On Thu, 25 Feb 2021 at 15:29, Laurent Le Meur <laurent.lemeur@edrlab.org>
> wrote:
>
> Dear participants,
>
>
>
> I have added a page to the GitHub repo, which tries to summarize what
> robots.txt is and how it is used. Robots.txt was described by Ivan Herman
> as a possible source of inspiration during our last call.
>
>
>
> https://github.com/w3c/tdm-reservation-protocol/blob/main/docs/robots.md
>
>
>
>
> Best regards
>
> Laurent Le Meur
>
>
Received on Thursday, 25 February 2021 16:20:54 UTC