RE: Lars's comments on the BP document (was: BP document is FROZEN pending vote to release next WD)

Good Morning Clemens,

On Wednesday, February 08, 2017 12:42 PM, Clemens Portele [mailto:portele@interactive-instruments.de] wrote:

> You are correct about the nesting of sitemaps and also that the current "will not work
> for larger datasets" is oversimplifying things. However, while your proposed text is
> correct, I think we should add a bit more context and explanation to guide data
> providers.

Yes, sitemaps look simple but their creation can indeed be quite complex...

> If a dataset contains millions of spatial things (e.g. many building, address
> or cadastral parcel datasets), generating and maintaining the sitemaps is at the very
> least quite complex and typically resource intensive, also considering that the dataset
> will see frequent changes (although most of the spatial things rarely change). Basically
> the sitemaps contain a register of all spatial things, datasets, etc. on a site and using
> standard sitemap builder tools will often not work, i.e. a custom approach is required.
> At least this was our experience when we looked at it.

That is my experience, too. We had a custom sitemap generator for a subset of our data that could only generate one sitemap file, so when the subset grew the search engines simply stopped crawling it... There will be a re-implementation sometime this year, I hope.
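
For illustration only, here is a minimal Python sketch of how a generator could avoid the single-file problem: it splits the URL list into files of at most 50,000 entries and lists those files in one index. This is not our actual generator. The function names and the example.org URLs are made up for the sketch, and it does not check the 50MB per-file size limit.

```python
from itertools import islice
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the sitemap protocol


def chunked(urls, size=MAX_URLS_PER_FILE):
    """Yield successive lists of at most `size` URLs (hypothetical helper)."""
    it = iter(urls)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_sitemap(urls):
    """Serialize one <urlset> document; returns bytes."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_urls):
    """Serialize one <sitemapindex> listing the individual sitemap files."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

A caller would write each `build_sitemap(batch)` result to its own file and pass the resulting file URLs to `build_index`. Keeping the files current as the data changes is the hard part and is not covered here.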

> If others have found a way to make it work for such cases, that would indeed be a
> good example. Also, it would be good to have some practical experience, if such
> sitemap structures with millions of entries (significantly) help getting such larger sites
> indexed.

We'll have two fairly large sitemaps, one with ~10M and one with ~27M URLs, so while I can't provide any experience now, I hope that I can in about six months.
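
As a rough back-of-the-envelope check, the following sketch works out how many sitemap files such URL counts would need under the 50,000-URLs-per-file limit, and whether they fit in a single index file. The figures 10M and 27M are the approximate counts mentioned above, not exact numbers, and the function name is made up.

```python
import math

MAX_URLS_PER_FILE = 50_000      # per-file URL limit from the sitemap protocol
MAX_SITEMAPS_PER_INDEX = 50_000  # per-index sitemap limit from the same spec


def files_needed(url_count):
    """Number of sitemap files needed for `url_count` URLs (hypothetical helper)."""
    return math.ceil(url_count / MAX_URLS_PER_FILE)


# Upper bound on URLs reachable through a single index file.
single_index_capacity = MAX_URLS_PER_FILE * MAX_SITEMAPS_PER_INDEX

for approx_count in (10_000_000, 27_000_000):  # approximate figures only
    n = files_needed(approx_count)
    print(approx_count, n, n <= MAX_SITEMAPS_PER_INDEX)
```

By URL count alone, both cases fit comfortably under one index file. The 50MB size limit per file could still force smaller files if the URLs are long.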

Best,

Lars

> > On 8 Feb 2017, at 11:43, Svensson, Lars <L.Svensson@dnb.de> wrote:
> >
> > All,
> >
> > On Monday, February 06, 2017 12:01 PM, Jeremy Tandy
> [mailto:jeremy.tandy@gmail.com] wrote:
> >
> >> BP document is FROZEN and ready for people to read/review (see emails in this
> thread
> >> [1] for the change-log).
> >
> > First of all: The changes have made the document much easier to read and it's
> much clearer, what is the proposed outcome when someone wants to implement the
> BPs. A large bunch of kudos to the editors and contributors! And +1 from me to
> publish this as a WD.
> >
> > And I have some comments.
> >
> > 1) What has happened to the references? I cannot find them in the github version...
> [1]
> >
> > 2) BP4 [2] says that "sitemaps currently are limited to several thousands of entries
> and will not work for larger datasets". IMHO this is not correct. The sitemap
> specification [3] says that "each Sitemap file that you provide must have no more than
> 50,000 URLs and must be no larger than 50MB (52,428,800 bytes)". It then goes on to
> state that you can provide multiple sitemaps and list them in an index file and that
> "index files may not list more than 50,000 Sitemaps and must be no larger than 50MB
> (52,428,800 bytes)". You can, however, have multiple index files, too. But even using
> just one index file means that you can list 50,000^2 (2.5 billion) URLs in your sitemaps,
> which should be enough for most applications. For the next iteration, I propose the following
> text:
> > [[
> > You may also consider using Sitemaps to direct the Web-crawler; please refer to the
> sitemap protocol specification [https://www.sitemaps.org/protocol.html] for more
> information.
> > ]]
> >
> > 3) BP4 (again) in sec 3 (Decide what spatial relationships to use) says "The
> geographical, topological and social hierarchy should be described with clear semantics
> and registered with IANA Link relations." What exactly should be registered with IANA
> link relations? Is the following meant:
> > [[
> > The geographical, topological and social hierarchy should be described with clear
> semantics and use relations registered in the IANA Link relations registry.
> > ]]
> > or
> > [[
> > The geographical, topological and social hierarchy should be described with clear
> semantics. If you use relations not registered with IANA Link relations registry, please
> register them there.
> > ]]
> > Put differently: Is the BP to use only relations already registered with IANA, or is the
> BP to register new relations with IANA?
> >
> > The rest of my comments are only editorial:
> > 1) In §5 [4] you refer to the Deutsche Nationalbibliothek (yay!). Please don't use the
> URL you see in the browser. Instead use the CMS-independent one [5].
> > 2) There are two places in the document where references start with two square
> brackets "[[". As a result there are no hyperlinks to the (missing) references section.
> > 3) s/converstion/conversion/ (somewhere in sec 8)
> > 4) §8 and BP 17 say "Alternatively you can re-project your coordinates to WGS84
> Long/Lat using many available tools online." Do we want to point to specific tools?
> > 5) §8 says "So we are now at the point where 99.9% of people can stop reading". If
> we really assume that 99.9% of all readers stop at that point, they will never reach the very
> interesting information about the surface of the earth moving and the impact of that
> on self-driving cars that is two paragraphs further down... Maybe we should move the
> final paragraph up to become paragraph three in §8.
> >
> > [1] https://w3c.github.io/sdw/bp/
> > [2] https://w3c.github.io/sdw/bp/#indexable-by-search-engines
> > [3] https://www.sitemaps.org/protocol.html#index
> > [4] https://w3c.github.io/sdw/bp/#spatial-things-features-and-geometry
> > [5] http://www.dnb.de/

> >
> > Talk to you later,
> >
> > Lars
> >
> >
> > *** Lesen. Hören. Wissen. Deutsche Nationalbibliothek ***
> > --
> > Dr. Lars G. Svensson
> > Deutsche Nationalbibliothek
> > Informationsinfrastruktur
> > Adickesallee 1
> > 60322 Frankfurt am Main
> > Telefon: +49 69 1525-1752
> > Telefax: +49 69 1525-1799
> > mailto:l.svensson@dnb.de
> > http://www.dnb.de
> >
> >
> >

Received on Thursday, 9 February 2017 07:59:51 UTC