- From: Sergey Shekyan <shekyan@gmail.com>
- Date: Tue, 17 Jan 2017 11:45:27 -0800
- To: Jonathan Garbee <jonathan.garbee@gmail.com>
- Cc: Daniel Veditz <dveditz@mozilla.com>, "public-webappsec@w3.org" <public-webappsec@w3.org>
- Message-ID: <CAPkvmc8L1zt43q8rALQi=OMUxjHQZ8U+AqnVXGe=_0XsVaGhpA@mail.gmail.com>
They should respond exactly the same way they respond now when they detect automation: for example, by recommending the use of an API for scraping rather than loading heavy resources for every page, by immediately sending the request through the failed-CAPTCHA route, or by not showing ads to a headless browser. The only difference would be that they could stop inferring automation from many indirect signals if the UA already sets a flag for it. Sure, if UA automation tools had a built-in way to honor robots.txt, that might solve some of the problems, but they don't. All I am asking is to standardize that mechanism.

On Mon, Jan 16, 2017 at 11:18 PM, Jonathan Garbee <jonathan.garbee@gmail.com> wrote:

> In what way should they respond differently? The site has absolutely no
> context as to why headless is being used. Why mangle the response without
> any context and just hope your users still get benefit from it?
>
> On Mon, Jan 16, 2017, 4:47 PM Sergey Shekyan <shekyan@gmail.com> wrote:
>
>> robots.txt is an on/off switch, while what I propose is more granular,
>> allowing websites to choose how to respond.
>>
>> On Sat, Jan 14, 2017 at 5:52 AM, Jonathan Garbee <jonathan.garbee@gmail.com> wrote:
>>
>> I don't see how having a header or something to help detect automated
>> access would be beneficial. We can already automate browser engines;
>> headless mode is just a native way to do it. So, if someone is already not
>> taking your robots.txt into account, they'll just use another method or
>> strip out whatever we add to say headless mode is in use. Sites don't gain
>> any true benefit from having this kind of detection. If someone wants to
>> automate tasks they do regularly, that's their prerogative. We have
>> robots.txt as a respectful way to ask people automating things to avoid
>> certain areas and actions, and that easily continues into headless mode.
>>
>> On Sat, Jan 14, 2017, 4:28 AM Sergey Shekyan <shekyan@gmail.com> wrote:
>>
>> I am talking about tools that automate user agents, e.g. headless
>> browsers (PhantomJS, SlimerJS, headless Chrome), Selenium, curl, etc.
>> I mentioned navigation requests because I don't see so far how advertising
>> automation on non-navigation requests would help.
>> Another option would be to advertise automation via a property on the
>> navigator object, which would defer any possible actions by authors to
>> the second request.
>>
>> On Sat, Jan 14, 2017 at 12:56 AM, Daniel Veditz <dveditz@mozilla.com> wrote:
>>
>> On Fri, Jan 13, 2017 at 5:11 PM, Sergey Shekyan <shekyan@gmail.com> wrote:
>>
>> I think that attaching an HTTP request header to synthetically initiated
>> navigation requests (https://fetch.spec.whatwg.org/#navigation-request)
>> will help authors build more reliable mechanisms to detect unwanted
>> automation.
>>
>> I don't see anything in that spec about "synthetic" navigation requests.
>> Where would you define that? How would you define it? Is a scripted
>> window.open() in a browser "synthetic"? What about an iframe in a page?
>> Does it matter whether the user expected the iframe to be there or not
>> (such as ads)? What if the page had 100 iframes?
>>
>> Are you trying to solve the same problem robots.txt is trying to solve?
>> If not, what kind of automation are you talking about?
>>
>> -
>> Dan Veditz
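To make the proposal above concrete, here is a minimal server-side sketch in TypeScript (Node). The header name `Sec-Automated` and its `?1` value are hypothetical stand-ins for whatever the standardized flag would be; nothing in this thread or in any spec defines them, and the routes are made up for illustration.

```ts
// Sketch only: "Sec-Automated" is a hypothetical header name standing in for
// the UA-set flag proposed in this thread. The "?1" value borrows HTTP
// structured-field boolean syntax and is likewise an assumption.
import * as http from "http";

const server = http.createServer((req, res) => {
  // With a UA-set flag, the site branches directly instead of inferring
  // automation from indirect signals (timing, JS quirks, etc.).
  const automated = req.headers["sec-automated"] === "?1";

  if (automated && req.url === "/search") {
    // e.g. point scrapers at an API instead of the heavy HTML page
    res.writeHead(307, { Location: "/api/search" });
    res.end();
    return;
  }

  res.writeHead(200, { "Content-Type": "text/html" });
  // e.g. skip ad markup for automated navigations
  res.end(automated ? "<p>content only</p>" : "<p>content</p><p>ad slot</p>");
});

server.listen(8080);
```

The flag only replaces heuristic inference; the responses themselves (API redirect, CAPTCHA route, ad suppression) are the same ones sites already use today.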
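The alternative mentioned above, a property on the navigator object, has a close shipped analogue: `navigator.webdriver`, defined by the W3C WebDriver spec, which is true when the UA is under WebDriver automation control. It does not cover curl or other non-WebDriver tools, and, as noted in the thread, a script-visible property can only affect treatment from the second request onward. A minimal client-side check:

```ts
// navigator.webdriver (W3C WebDriver spec) is the closest shipped analogue
// to the "property on navigator" option; true under WebDriver automation.
if (navigator.webdriver) {
  // The first response has already been served, so the page can only mark
  // itself and defer different treatment to the next request.
  document.documentElement.dataset.automated = "true";
}
```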
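For contrast, the "on/off switch" nature of robots.txt: it can only allow or disallow crawling of paths per user agent, with no way for a compliant client to identify itself per request or for the server to vary its response.

```
# robots.txt is advisory and binary per path:
User-agent: *
Disallow: /search
# There is no directive for "identify yourself on each request"
# or "use the API instead of this page".
```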
Received on Tuesday, 17 January 2017 19:46:20 UTC