
Re: Robots.txt and Sitemap files

From: Dan Brickley <danbri@danbri.org>
Date: Wed, 4 Nov 2020 18:40:58 +0000
Message-ID: <CAFfrAFpAsGdh71MEx7ipBa6kd_jFP48a9YBmc4vWAAhOgJ3rhQ@mail.gmail.com>
To: Justin Clark-Casey <justinccdev@gmail.com>
Cc: Carole Goble <carole.goble@manchester.ac.uk>, "LJ.Garcia" <lj.garcia.co@gmail.com>, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk>, Dan Brickley <danbri@google.com>, "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Thanks. I have taken the liberty of making a small edit to
https://github.com/BioSchemas/specifications/wiki/Technical to encourage
consideration of other crawlers beyond Google's.
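
As an illustration of the concern raised in this thread, here is a minimal
sketch of how a site maintainer might check whether a robots.txt file shuts
out crawlers other than Google's, and which URLs a sitemap exposes. It uses
only the Python standard library. The user agent name "BioschemasCrawler",
the example.org URLs, and both robots.txt bodies are hypothetical and are
not taken from any real deployment.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt that admits Googlebot and blocks everyone else.
RESTRICTIVE_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Hypothetical, more open robots.txt that also advertises a sitemap.
OPEN_ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.org/sitemap.xml
"""

# Hypothetical sitemap listing two marked-up pages.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/proteins/P12345</loc></url>
  <url><loc>https://example.org/proteins/Q67890</loc></url>
</urlset>
"""

PAGE = "https://example.org/proteins/P12345"


def parse_robots(text):
    """Build a parser from robots.txt text, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap <urlset> document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


if __name__ == "__main__":
    for label, text in (("restrictive", RESTRICTIVE_ROBOTS), ("open", OPEN_ROBOTS)):
        robots = parse_robots(text)
        for agent in ("Googlebot", "BioschemasCrawler"):
            print(label, agent, robots.can_fetch(agent, PAGE))
    print(sitemap_urls(SITEMAP_XML))
```

For a live site, RobotFileParser can instead be pointed at the real file
with set_url() and read(). On Python 3.8 and later, its site_maps() method
returns any Sitemap entries declared in robots.txt.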

Dan

On Wed, 4 Nov 2020 at 18:15, Justin Clark-Casey <justinccdev@gmail.com>
wrote:

> I just added robots.txt advice to the sitemap advice that I wrote up long
> ago [1]. This technical wiki page is still reachable via the technical link
> in the Bioschemas website menu.
>
> Best,
>
> Justin Clark-Casey
>
> On Tue, 3 Nov 2020 at 11:45, Carole Goble <carole.goble@manchester.ac.uk>
> wrote:
>
>> +1 Leyla
>>
>>
>>
>> Carole
>>
>>
>>
>>
>>
>> *From:* LJ.Garcia [mailto:lj.garcia.co@gmail.com]
>> *Sent:* 03 November 2020 11:34
>> *To:* Gray, Alasdair J G
>> *Cc:* Dan Brickley; public-bioschemas@w3.org
>> *Subject:* Re: Robots.txt and Sitemap files
>>
>>
>>
>> Hi Alasdair,
>>
>>
>>
>> I would say good practices for sitemaps and robots.txt would be a good
>> subject for our next community call.
>>
>>
>>
>> Regards,
>>
>>
>>
>> On Tue, Nov 3, 2020 at 10:05 AM Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>
>> wrote:
>>
>> Hi All
>>
>>
>>
>> Dan, thanks for the prompt on this. I would also encourage the use of
>> sitemaps so that we know which pages are available on your site.
>>
>>
>>
>> I have added a field to the list of live deploys that records the sitemap
>> as well. Although this is currently not shown on the website, it is useful
>> for us to have a list of these. You can find details in the following PR:
>>
>> https://github.com/BioSchemas/bioschemas.github.io/pull/340
>>
>>
>>
>> Best regards
>>
>>
>>
>> Alasdair
>>
>>
>>
>> --
>>
>> Alasdair J G Gray
>>
>> Associate Professor in Computer Science,
>> School of Mathematical and Computer Sciences
>> Heriot-Watt University, Edinburgh, UK.
>>
>> Email: A.J.G.Gray@hw.ac.uk
>> Web: http://www.macs.hw.ac.uk/~ajg33
>> ORCID: http://orcid.org/0000-0002-5711-4872
>> Office: Earl Mountbatten Building 1.39
>> Twitter: @gray_alasdair
>>
>>
>>
>>
>>
>> Heriot-Watt is a global University, as a result my working hours may not
>> be your working hours. Do not feel pressure to reply to this email outside
>> your working hours.
>>
>>
>>
>>
>>
>> To arrange a meeting: https://doodle.com/mm/alasdairgray/book-a-time
>>
>>
>>
>>
>>
>> *From: *"danbri@google.com" <danbri@google.com>
>> *Date: *Monday, 2 November 2020 at 19:12
>> *To: *"public-bioschemas@w3.org" <public-bioschemas@w3.org>
>> *Subject: *Robots.txt and Sitemap files
>> *Resent from: *"public-bioschemas@w3.org" <public-bioschemas@w3.org>
>> *Resent date: *Monday, 2 November 2020 at 19:11
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Just a quick note to encourage discussion of robots.txt
>> <https://en.wikipedia.org/wiki/Robots_exclusion_standard> and sitemap
>> <https://en.wikipedia.org/wiki/Sitemaps> files as something that
>> bioschemas implementers should think about. There are a few cases of
>> bioschemas-publishing sites excluding most crawlers via a very restrictive
>> robots.txt file. Similarly, sitemap files can make large and complex sites
>> easier for crawlers (whether simple code or large/commercial) to collect
>> data from efficiently, including URL discovery. Since the hope has always
>> been that bioschemas will encourage innovative uses of marked up data, it
>> seems worth making sure that sites aren't accidentally excluding
>> bioschema-crawlers...
>>
>>
>>
>> cheers,
>>
>>
>>
>> Dan
>>
>>
Received on Wednesday, 4 November 2020 18:41:24 UTC

This archive was generated by hypermail 2.4.0 : Wednesday, 4 November 2020 18:41:26 UTC