Re: Robots.txt and Sitemap files

From: Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>
Date: Thu, 5 Nov 2020 15:22:13 +0000
To: Dan Brickley <danbri@danbri.org>, Justin Clark-Casey <justinccdev@gmail.com>
CC: Carole Goble <carole.goble@manchester.ac.uk>, LJ.Garcia <lj.garcia.co@gmail.com>, Dan Brickley <danbri@google.com>, "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Message-ID: <F74A585E-D322-4624-BC56-8AA2E69C667D@hw.ac.uk>
Thanks both for these contributions, and Justin particularly for the reminder to the community about the technical wiki.

We’d like to make information like this available in usage guides in the Getting Started section of the website (we plan to move the current content from the GitBook where it currently sits into the website itself, which would then let us embed markup). We may try to make a start on this during the BioHackathon next week.

Best regards

Alasdair

--
Alasdair J G Gray
Associate Professor in Computer Science,
School of Mathematical and Computer Sciences
Heriot-Watt University, Edinburgh, UK.

Email: A.J.G.Gray@hw.ac.uk
Web: http://www.macs.hw.ac.uk/~ajg33

ORCID: http://orcid.org/0000-0002-5711-4872

Office: Earl Mountbatten Building 1.39
Twitter: @gray_alasdair


Heriot-Watt is a global University, as a result my working hours may not be your working hours. Do not feel pressure to reply to this email outside your working hours.


To arrange a meeting: https://doodle.com/mm/alasdairgray/book-a-time



From: "danbri@danbri.org" <danbri@danbri.org>
Date: Wednesday, 4 November 2020 at 18:41
To: "justinccdev@gmail.com" <justinccdev@gmail.com>
Cc: Carole Goble <carole.goble@manchester.ac.uk>, "Leyla J. Garcia" <lj.garcia.co@gmail.com>, Alasdair Gray <A.J.G.Gray@hw.ac.uk>, "danbri@google.com" <danbri@google.com>, "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Subject: Re: Robots.txt and Sitemap files

****************************************************************
Caution: This email originated from a sender outside Heriot-Watt University.
Do not follow links or open attachments if you doubt the authenticity of the sender or the content.
****************************************************************


Thanks. I have taken the liberty of making a small edit to https://github.com/BioSchemas/specifications/wiki/Technical to encourage consideration of other crawlers beyond Google's.

Dan

On Wed, 4 Nov 2020 at 18:15, Justin Clark-Casey <justinccdev@gmail.com> wrote:
I just added robots.txt advice to the sitemap advice that I wrote up long ago [1]. This technical wiki page is still reachable via the technical link in the Bioschemas website menu.

Best,

Justin Clark-Casey

On Tue, 3 Nov 2020 at 11:45, Carole Goble <carole.goble@manchester.ac.uk> wrote:
+1 Leyla

Carole


From: LJ.Garcia [mailto:lj.garcia.co@gmail.com]
Sent: 03 November 2020 11:34
To: Gray, Alasdair J G
Cc: Dan Brickley; public-bioschemas@w3.org
Subject: Re: Robots.txt and Sitemap files

Hi Alasdair,

I would say good practices for sitemaps and robots.txt would fit the subject of our next community call.

Regards,

On Tue, Nov 3, 2020 at 10:05 AM Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk> wrote:
Hi All

Dan, thanks for the prompt on this. I would also encourage the use of sitemaps, which let us know what pages are available on your site.
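
As a rough illustration of how a sitemap tells a consumer which pages exist, here is a minimal Python sketch that lists the page URLs from a sitemap document. The sitemap content, the example.org URLs and the function name page_urls are all made up for the example; only the XML namespace comes from the sitemaps.org protocol. It handles a plain urlset only, not a sitemap index file.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for an imaginary site; the URLs are placeholders.
EXAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc><lastmod>2020-11-01</lastmod></url>
</urlset>
"""


def page_urls(sitemap_xml: str) -> List[str]:
    """Return the <loc> value of every <url> entry in a urlset sitemap."""
    # Parse from bytes so the encoding declaration in the XML is honoured.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in page_urls(EXAMPLE_SITEMAP):
    print(url)
```

A crawler could fetch each listed page and extract the markup from it, without having to discover the URLs by following links.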

I have added a field to the list of live deploys that records the sitemap as well. Although this is not currently shown on the website, it is useful for us to have a list of these. You can find details in the following PR:
https://github.com/BioSchemas/bioschemas.github.io/pull/340


Best regards

Alasdair


From: "danbri@google.com" <danbri@google.com>
Date: Monday, 2 November 2020 at 19:12
To: "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Subject: Robots.txt and Sitemap files
Resent from: "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Resent date: Monday, 2 November 2020 at 19:11


Just a quick note to encourage discussion of robots.txt<https://en.wikipedia.org/wiki/Robots_exclusion_standard> and sitemap<https://en.wikipedia.org/wiki/Sitemaps> files as something that Bioschemas implementers should think about. There are a few cases of Bioschemas-publishing sites excluding most crawlers via a very restrictive robots.txt file. Similarly, sitemap files can make it easier for crawlers (whether simple code or large/commercial) to collect data efficiently from large and complex sites, including URL discovery. Since the hope has always been that Bioschemas will encourage innovative uses of marked-up data, it seems worth making sure that sites aren't accidentally excluding Bioschemas crawlers...
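
One way a site owner might check for this kind of accidental exclusion is sketched below, using Python's standard urllib.robotparser module. Both robots.txt bodies, the example.org URLs and the crawler name "ExampleBioCrawler" are invented for the illustration and do not describe any real site or crawler. The site_maps() call needs Python 3.8 or later.

```python
from typing import List
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that admits one named crawler and blocks all others.
RESTRICTIVE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# Hypothetical alternative that admits every crawler and advertises a sitemap.
PERMISSIVE = """\
User-agent: *
Disallow:

Sitemap: https://example.org/sitemap.xml
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Report whether the given user agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def advertised_sitemaps(robots_txt: str) -> List[str]:
    """Return the Sitemap entries declared in robots_txt (empty if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


page = "https://example.org/dataset/1"
for name, body in (("restrictive", RESTRICTIVE), ("permissive", PERMISSIVE)):
    # "ExampleBioCrawler" stands in for any small, non-commercial crawler.
    print(name, can_crawl(body, "ExampleBioCrawler", page), advertised_sitemaps(body))
```

Under the first file, only the named crawler gets through and every other agent is refused. The second admits all agents and also points them at a sitemap for URL discovery.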

cheers,

Dan
Received on Thursday, 5 November 2020 15:22:32 UTC
