
Re: Robots.txt and Sitemap files

From: Justin Clark-Casey <justinccdev@gmail.com>
Date: Wed, 4 Nov 2020 18:14:19 +0000
Message-ID: <CAME9NR9EjJzcsME0Sn7NkTW0amt6WBtXFM+aZtGBOZ09zCYWhg@mail.gmail.com>
To: Carole Goble <carole.goble@manchester.ac.uk>
Cc: "LJ.Garcia" <lj.garcia.co@gmail.com>, "Gray, Alasdair J G" <A.J.G.Gray@hw.ac.uk>, Dan Brickley <danbri@google.com>, "public-bioschemas@w3.org" <public-bioschemas@w3.org>
I just added robots.txt advice to the sitemap advice that I wrote up long
ago [1]. This technical wiki page is still reachable via the technical link
in the Bioschemas website menu.
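
As a rough illustration of the concern raised in this thread (this is not taken from the wiki page), the sketch below uses Python's standard library to check whether a robots.txt file would let a generic crawler in, and to list the URLs in a sitemap. The robots.txt contents, the sitemap XML, the example.org URLs and the "bioschemas-crawler" user-agent name are all made up for the example.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out every crawler except one named agent.
RESTRICTIVE_ROBOTS = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

# Hypothetical, more permissive alternative that also advertises a sitemap.
OPEN_ROBOTS = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.org/sitemap.xml
"""

# Hypothetical sitemap listing two marked-up pages.
SITEMAP_XML = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/datasets/1</loc></url>
  <url><loc>https://example.org/datasets/2</loc></url>
</urlset>
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def sitemap_urls(sitemap_xml):
    """Return the <loc> values of a sitemaps.org <urlset> document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", ns)]


if __name__ == "__main__":
    page = "https://example.org/datasets/1"
    # A small, unknown crawler is locked out by the restrictive file...
    print(can_fetch(RESTRICTIVE_ROBOTS, "bioschemas-crawler", page))
    # ...but can reach the same page under the permissive one.
    print(can_fetch(OPEN_ROBOTS, "bioschemas-crawler", page))
    print(sitemap_urls(SITEMAP_XML))
```

A site maintainer could run a check like this against their own robots.txt to see whether crawlers other than the large commercial ones are being turned away by accident.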

Best,

Justin Clark-Casey

On Tue, 3 Nov 2020 at 11:45, Carole Goble <carole.goble@manchester.ac.uk>
wrote:

> +1 Leyla
>
>
>
> Carole
>
>
>
>
>
> *From:* LJ.Garcia [mailto:lj.garcia.co@gmail.com]
> *Sent:* 03 November 2020 11:34
> *To:* Gray, Alasdair J G
> *Cc:* Dan Brickley; public-bioschemas@w3.org
> *Subject:* Re: Robots.txt and Sitemap files
>
>
>
> Hi Alasdair,
>
>
>
> I would say good practices about sitemaps and robots.txt would fall into
> the subject for our next community call.
>
>
>
> Regards,
>
>
>
> On Tue, Nov 3, 2020 at 10:05 AM Gray, Alasdair J G <A.J.G.Gray@hw.ac.uk>
> wrote:
>
> Hi All
>
>
>
> Dan, thanks for the prompt on this. I would also encourage the use of
> sitemaps, which let us know what pages are available on your site.
>
>
>
> I have added a field to the list of live deploys that records the sitemap
> as well. Although this is not currently shown on the website, it is useful
> for us to have a list of these. You can find details in the following PR:
>
> https://github.com/BioSchemas/bioschemas.github.io/pull/340
>
>
>
> Best regards
>
>
>
> Alasdair
>
>
>
> --
>
> Alasdair J G Gray
>
> Associate Professor in Computer Science,
> School of Mathematical and Computer Sciences
> Heriot-Watt University, Edinburgh, UK.
>
> Email: A.J.G.Gray@hw.ac.uk
> Web: http://www.macs.hw.ac.uk/~ajg33
> ORCID: http://orcid.org/0000-0002-5711-4872
> Office: Earl Mountbatten Building 1.39
> Twitter: @gray_alasdair
>
>
>
>
>
> Heriot-Watt is a global University, as a result my working hours may not
> be your working hours. Do not feel pressure to reply to this email outside
> your working hours.
>
>
>
>
>
> To arrange a meeting: https://doodle.com/mm/alasdairgray/book-a-time
>
>
>
>
>
> *From: *"danbri@google.com" <danbri@google.com>
> *Date: *Monday, 2 November 2020 at 19:12
> *To: *"public-bioschemas@w3.org" <public-bioschemas@w3.org>
> *Subject: *Robots.txt and Sitemap files
> *Resent from: *"public-bioschemas@w3.org" <public-bioschemas@w3.org>
> *Resent date: *Monday, 2 November 2020 at 19:11
>
>
>
>
>
>
>
>
>
> Just a quick note to encourage discussion of robots.txt
> <https://en.wikipedia.org/wiki/Robots_exclusion_standard> and sitemap
> <https://en.wikipedia.org/wiki/Sitemaps> files as something that
> bioschemas implementers should think about. There are a few cases of
> bioschemas-publishing sites excluding most crawlers via a very restrictive
> robots.txt file. Similarly, sitemap files can make large and complex sites
> easier for crawlers (whether simple code or large/commercial) to collect
> data from efficiently, including URL discovery. Since the hope has always
> been that bioschemas will encourage innovative uses of marked up data, it
> seems worth making sure that sites aren't accidentally excluding
> bioschema-crawlers...
>
>
>
> cheers,
>
>
>
> Dan
>
>
Received on Wednesday, 4 November 2020 18:15:11 UTC
