Re: Just what *does* robots.txt mean for a LOD site?

On 11 Aug 2014, at 15:49, Sarven Capadisli <info@csarven.ca> wrote:

> I briefly brought up something like this to Henry Story for WebIDs. That is, it'd be cool to encourage the use of WebIDs for crawlers, so that the server logs would show them in place of User-Agents. That URI could also say something like "we are crawling these domains, so, yes, it is really us if you see us in your logs (and not someone pretending)".
> 
> I don't know what the state of that stuff is with WebID. Maybe Kingsley or Henry can comment further.

Yes, that seemed like a good long-term idea.

For WebID see: http://www.w3.org/2005/Incubator/webid/spec/

The WebID Profile could contain information about the type of agent.
WebID-TLS authentication could allow the robot to prove its identity.
(Other WebID-based authentication methods, still to be developed, could be
used too; from WebID-TLS you can easily work out what another such
system would look like.)
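
As a rough sketch, a crawler's WebID Profile might look like the Turtle
below. The profile URI, the robot's name, and the key value are made up
for illustration; only the foaf and cert terms come from the actual
vocabularies.

  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix cert: <http://www.w3.org/ns/auth/cert#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  # Hypothetical profile, published at http://crawler.example.org/profile
  <http://crawler.example.org/profile#bot>
      a foaf:Agent ;
      foaf:name "ExampleBot" ;
      rdfs:comment "We crawl example.org and example.net; if you see this WebID in your logs, it really is us." ;
      # public key the robot presents during WebID-TLS authentication
      cert:key [
          a cert:RSAPublicKey ;
          cert:modulus "00cb24ed85d64d79"^^xsd:hexBinary ;  # shortened placeholder, not a real key
          cert:exponent 65537
      ] .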

This would then allow one to create Web Access Control rules that grant
any robot read-only access to a certain type of resource. One could also
attach usage rules (still to be developed) to the document.
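
A minimal sketch of such a rule in Turtle, using the acl vocabulary from
the Web Access Control design (the resource and crawler-class URIs here
are invented for the example):

  @prefix acl: <http://www.w3.org/ns/auth/acl#> .

  # Hypothetical ACL: any agent authenticated as a member of the
  # crawler class may read the data document, and nothing more.
  [] a acl:Authorization ;
     acl:accessTo <http://example.org/data/stats> ;
     acl:mode acl:Read ;
     acl:agentClass <http://example.org/ns#LinkedDataCrawler> .

The crawler's WebID Profile (or the class document itself) would then
type the robot as a member of that class, and the server can check the
claim via WebID-TLS before granting access.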

Henry


> On 2014-08-11 12:11, Hugh Glaser wrote:
>> So should we have a class of agents for Linked Data access?
>> Can you have a class of agents, or rather can an agent have more than one ID?
>> (In particular, can a spider identify as both the class and the spider instance?)
>> Actually ldspider is quite a good class ID :-)
>> 
>> On 10 Aug 2014, at 21:31, Sarven Capadisli <info@csarven.ca> wrote:
>> 
>>> Hi Hugh, just a side discussion.
>>> 
>>> Currently I let all the bots have a field day, unless they are clearly abusive or some student didn't get the memo about keeping their request rate reasonable.
>>> 
>>> If I were to start blocking the bigger crawlers, Google would be first to go. That's aside from the fact that it is possible to control their crawl rate through Webmaster Tools. The main reason for me is that I simply don't see a "return" from them. They don't mind hammering the site if you let them, but try checking all those resources in Google search results: it is a gamble. I have a lot of resources which are statistical observations that don't really differ much from one document to another (at least to most humans, or to Google). So, anyway, I would give SW/LD crawlers the VIP line if I can, because they tend to hit only sporadically, which is something I can live with.
>>> 
>>> -Sarven
>>> 
>>> On 2014-08-09 14:17, Hugh Glaser wrote:
>>>> Hi Tobias,
>>>> I have also done the same in
>>>> http://sameas.org/robots.txt
>>>> (Well, Kingsley said “Yes”, when I asked if I should :-)
>>>> 
>>>> I know it is past time for the spider, but it will happen next time, I guess.
>>>> And it will also open up all the sub-stores (http://www.sameas.org/store/), such as Sarven’s 270a.
>>>> I’m not sure how the sameas.org URIs will work in practice; it may be that the linkage won’t make it happen, but it will be interesting to see.
>>>> Have at them whenever you like :-)
>>>> 
>>>> Very best
>>>> Hugh
>>>> 
>>>> On 6 Aug 2014, at 00:01, Tobias Käfer <tobias.kaefer@kit.edu> wrote:
>>>> 
>>>>>> :-)
>>>>>> I thought I had done what you suggested:
>>>>>> 
>>>>>> User-agent: ldspider
>>>>>> Disallow:
>>>>>> Allow: /
>>>>>> 
>>>>>> Which should allow ldspider to crawl the site.
>>>>> 
>>>>> OK, then I got your "No, thank you." line wrong.
>>>>> 
>>>>> But the robots.txt is fine then :) and ldspider will not refrain from crawling the site any more.
>>>>> 
>>>>> Btw, one of the two lines ("Allow: /" and "Disallow:") is sufficient. The "Disallow:" line is the older way of putting it, so you might want to remove the "Allow: /" line again.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Tobias
>>>>> 
>>>>>> On 5 Aug 2014, at 18:06, Tobias Käfer <tobias.kaefer@kit.edu> wrote:
>>>>>> 
>>>>>>> Hi Hugh,
>>>>>>> 
>>>>>>> sorry for getting you wrong, but still I do not get what behaviour you want. What you are saying looks different from the robots.txt. If you tell me how you want it, I can help with the robots.txt (hopefully).
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Tobias
>>>>>>> 
>>>>>>> Am 05.08.2014 um 19:01 schrieb Hugh Glaser:
>>>>>>>> Hi Tobias,
>>>>>>>> On 5 Aug 2014, at 17:33, Tobias Käfer <tobias.kaefer@kit.edu> wrote:
>>>>>>>> 
>>>>>>>>> Hi Hugh,
>>>>>>>>> 
>>>>>>>>>> By the way, have I got my robots.txt right?
>>>>>>>>>> In particular, is the
>>>>>>>>>> User-agent: LDSpider
>>>>>>>>>> correct?
>>>>>>>>>> Should I worry about case-sensitivity?
>>>>>>>>> 
>>>>>>>>> The library (norbert) that is employed in LDspider is case-insensitive for the user agent. The user agent that is sent is "ldspider".
>>>>>>>>> 
>>>>>>>>> I suppose you want ldspider to crawl your site (highly appreciated),
>>>>>>>> No, thank you.
>>>>>>>>> so you should change the line in your robots.txt for LDspider to:
>>>>>>>>> a) Disallow:
>>>>>>>>> b) Allow: /
>>>>>>>>> And not leave it with:
>>>>>>>>> c) Allow: *
>>>>>>>>> The star there does not produce the desired behaviour (and I have not found it in the spec for the path either); in fact, it keeps LDspider from crawling the folders you specified for exclusion for the other crawlers.
>>>>>>>> Hopefully it is OK now:
>>>>>>>> http://ibm.rkbexplorer.com/robots.txt
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Tobias
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 

Social Web Architect
http://bblfish.net/

Received on Monday, 11 August 2014 14:21:21 UTC