Fwd: XML Schema validation and https redirects from Norm Tovey-Walsh on 2022-08-18 (xmlschema-dev@w3.org from August 2022)

From: Norm Tovey-Walsh <ndw@nwalsh.com>
Date: Thu, 18 Aug 2022 13:32:28 +0100
To: Michael Sperberg-McQueen <cmsmcq@blackmesatech.com>
CC: Michael Kay <mike@saxonica.com>, xmlschema-dev@w3.org, Gerald Oskoboiny <gerald@w3.org>
Message-ID: <m2k075lmty.fsf@nwalsh.com>
Hi,

Michael Kay forwarded this to me. I’ve tried to re-join xmlschema-dev
but it’s been a few hours and I’ve seen no evidence one way or the
other. So the list post might bounce.

> From: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
> Subject: Re: XML Schema validation and https redirects
> Date: 18 August 2022 at 01:12:54 BST
> To: Gerald Oskoboiny <gerald@w3.org>
> Cc: xmlschema-dev@w3.org
> Resent-From: xmlschema-dev@w3.org
>
> Gerald Oskoboiny <gerald@w3.org> writes:
>
> > W3C's main web site https://www.w3.org/ will soon start to redirect
> > all http requests to https. Will this cause issues for XML
> > Schema-related resources hosted on www.w3.org?
>
> To this top-level question I have no reliable answer.  It SHOULD not
> cause major issues; it probably WILL cause at least some issues, just
> because it's so easy to put things like this off until something
> actually breaks.

Like Michael (Sperberg-McQueen), I’m inclined to hedge my bets. What’s
actually going to happen?

1. Some validator is going to request http://www.w3.org/schema.xsd to
   get a schema file.

2. That’s going to get redirected to https://www.w3.org/schema.xsd.

3a. If the API that the validator is using is not configured to follow
    redirects, then they’re going to get back a 301 and whatever
    resource is sent with that reply.

  4a. If the 301 response is not well-formed (WF) XML, then the process
      may report an error.
  4b. If the 301 response is WF XML, then the process will parse it but
      may still report an error because it won’t be a schema document.
  5. If the validator didn’t report an error because it failed to get an
     XSD file, then it’ll proceed without the schema document. That
     probably won’t work, but it’s a bit hard to predict how it’ll fail.

3b. If the API that the validator is using is configured to follow
    redirects, then it will eventually get a schema document.

  4c. If the API returns the schema document with the original, http:
      URI as the system identifier, then the rest of the validation
      process will be none the wiser and everything should work.

  4d. If the API returns the schema document with the https: URI as the
      system identifier, then…

    5a. If the validator ignores the system identifier and looks only at
        the schema document and its targetNamespace and other internal
        features, everything should work.

    5b. If the validator looks at the system identifier, I suppose some
        part of the validator might decide that https:// doesn’t match
        http:// and conclude that it has the wrong namespace.
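
The branches above can be sketched as a small Python function. This is
only an illustration of the decision tree, not how any particular
validator works: the function name, its parameters, and the idea that a
validator exposes a "checks the system identifier" switch are all
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

REDIRECT_CODES = {301, 302, 303, 307, 308}


def classify_fetch(status, body, requested_uri, reported_uri, checks_system_id):
    """Return the label (4a, 4b, 4c, 5a, 5b) of the branch a fetch ends in.

    All parameters are hypothetical stand-ins for what a validator sees:
    the HTTP status, the response body, the URI it asked for, the URI the
    HTTP API reports as the system identifier, and whether the validator
    compares that identifier against anything.
    """
    if status in REDIRECT_CODES:
        # Branch 3a: the redirect was not followed, so `body` is whatever
        # was sent along with the redirect response itself.
        try:
            ET.fromstring(body)
        except ET.ParseError:
            return "4a"  # not well-formed XML
        return "4b"  # well-formed, but presumably not a schema document
    if status != 200:
        raise ValueError("only redirects and 200 responses are modelled")
    # Branch 3b: the redirect was followed and a schema document arrived.
    if reported_uri == requested_uri:
        return "4c"  # original http: URI kept as the system identifier
    return "5b" if checks_system_id else "5a"
```

For instance, a 301 whose body is an HTML fragment that is not
well-formed lands in 4a, while a followed redirect reported under the
https: URI lands in 5a or 5b depending on what the validator does next.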
  
I won’t be at all surprised if 3a happens to some validators. Libraries,
such as the Apache HTTP library, don’t follow redirects unless you take
special care to make them. (I’ve made this mistake within the last few
years and not noticed it for “a while”.)
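
To make the difference between 3a and 3b concrete, here is a minimal
sketch in Python. Note that Python's urllib, unlike the library
behaviour described above, follows redirects by default, so the sketch
has to switch that off explicitly to show the 3a case. The names
NoRedirect and fetch_schema and the shape of the returned dictionary are
invented for this example.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that declines every redirect (hypothetical helper)."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "not handled"; urllib then raises
        # HTTPError for the 3xx response.
        return None


def fetch_schema(url, follow_redirects=True, timeout=10):
    """Fetch `url` and report roughly what a validator would see."""
    handlers = [] if follow_redirects else [NoRedirect()]
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return {
                "system_id": url,            # keep the URI that was asked for
                "final_url": resp.geturl(),  # where the document came from
                "status": resp.status,
                "location": None,
                "body": resp.read(),
            }
    except urllib.error.HTTPError as err:
        # Case 3a (or any other HTTP error): hand back the status and
        # whatever body was sent with it.
        return {
            "system_id": url,
            "final_url": url,
            "status": err.code,
            "location": err.headers.get("Location"),
            "body": err.read(),
        }
```

Keeping "system_id" equal to the requested URI corresponds to case 4c;
a caller that used "final_url" instead would be in case 4d.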

I have no real intuition about how likely 5b is. My wild guess is “not
very likely” because once you’ve got the schema, you’re probably more
concerned about what targetNamespace it claims to validate than what its
URI was.

> > Is it intended that www.w3.org is in the critical path when performing
> > XML Schema validation?
>
> Yes and no, at least in my reading of the XSD spec and my recollections
> of the WG discussions.

The fact that the W3C has been throttling the response time for requests
for DTD and XSD files has probably pushed most validators to avoid
making requests to w3.org for well-known schemas.

Does it make sense to look at the user agents that have requested XSD
files over the past 6-12 months and see what the distribution is like?
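
As a rough idea of what such a log survey could look like, here is a
short Python sketch. It assumes the server logs are in the common
"combined" access-log format, which may not be what W3C actually uses;
the regular expression and the function name are illustrative only.

```python
import re
from collections import Counter

# Assumed "combined" log format:
#   host - user [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def user_agents_for(lines, suffixes=(".xsd", ".dtd")):
    """Count user agents that requested paths ending in one of `suffixes`."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # not a GET/HEAD line in the assumed format
        path = match.group("path").split("?")[0]  # ignore any query string
        if path.endswith(suffixes):
            counts[match.group("agent")] += 1
    return counts
```

Calling user_agents_for(open(logfile)) and then most_common() on the
result would give the distribution, with logfile standing in for
wherever the logs live.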

Saxonica used to bundle a bunch of schema documents in the jar file so
that they could be obtained without hitting the W3C. That feature is now
bundled in the XML Resolver API.

But at the same time, I’ve been caught out a few times over the years
wondering why some validation process was so mind-numbingly slow, only to
discover that I’d let a request to w3.org slip in. So it will happen.

>    As a special case of this, knowledge of the XSD schema can be (and I
>    expect probably is) built in to most schema validators, so that they
>    don't have any pressing need to fetch a copy of the XSD schema for
>    XSD schemas.

I saw this one in the wild within the last year: some of the XSD
schemas for XSD have a doctype declaration, for example this one:

  http://www.w3.org/2001/XMLSchema.xsd

I discovered some bit of software, I forget the exact details, that had
a cached copy of the XSD but not the DTD so parsing the cached XSD made
a DTD request to www.w3.org every time…
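
One way to catch this when building a local cache is to check each
cached document for a doctype declaration with an external identifier.
The sketch below does that with Python's expat bindings; the function
name is made up for the example, and it only reports the reference, it
does not fetch or cache anything.

```python
import xml.parsers.expat


def external_dtd_reference(xml_bytes):
    """Return (public_id, system_id) if the document's doctype declaration
    names an external DTD, otherwise None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    # expat does not retrieve the external DTD here; it only reports it.
    parser.Parse(xml_bytes, True)
    if found and found[0][1] is not None:
        return found[0]
    return None
```

Any cached document for which this returns a system identifier needs
the referenced DTD cached (or mapped in a catalog) as well, otherwise a
parser that reads external DTDs will go looking for it.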

> > Are there other use cases
> > besides validation that might cause automated requests to www.w3.org?

Validation is the obvious case, but I wouldn’t be surprised if there are
others. I’m sure XSLT stylesheets have been published in some specs. You
wouldn’t think anyone would xsl:import them from www.w3.org every time,
but I wouldn’t be at all surprised if it has happened.

Again, looking at the logs should be informative.

> > What are the most popular software packages that might be making these
> > requests to www.w3.org? In what contexts do they make these requests?
> > Do the latest versions typically have the ability to follow http to
> > https redirects? Would XML catalogs help?

Yes, XML catalogs help. They allow the application author and/or user to
configure local resources that can be returned automatically when
attempts are made to retrieve documents over the web.
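
As a minimal sketch of the idea, the Python below reads an OASIS XML
catalog and maps requested URIs to local copies. It handles only exact
"uri" and "system" entries; real resolvers support much more (rewrite
rules, public identifiers, chained catalogs). The function names and the
file:///opt/schemas/ paths are hypothetical.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Illustrative catalog; the local file paths are made up.
EXAMPLE_CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://www.w3.org/2001/XMLSchema.xsd"
       uri="file:///opt/schemas/XMLSchema.xsd"/>
  <system systemId="http://www.w3.org/2001/XMLSchema.dtd"
          uri="file:///opt/schemas/XMLSchema.dtd"/>
</catalog>
"""


def load_catalog(catalog_xml):
    """Map system identifiers and URI names to their local replacements."""
    root = ET.fromstring(catalog_xml)
    mapping = {}
    for entry in root.iter("{%s}system" % CATALOG_NS):
        mapping[entry.get("systemId")] = entry.get("uri")
    for entry in root.iter("{%s}uri" % CATALOG_NS):
        mapping[entry.get("name")] = entry.get("uri")
    return mapping


def resolve(mapping, requested):
    """Return the local replacement for `requested`, or `requested` itself
    when the catalog has no entry for it."""
    return mapping.get(requested, requested)
```

With a catalog like this in place, a request for the http: URI never
reaches the network, so the redirect question does not arise for the
resources it covers.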

Good luck!

                                        Be seeing you,
                                          norm

--
Norman Tovey-Walsh <ndw@nwalsh.com>
https://nwalsh.com/

> My fate cannot be mastered; it can only be collaborated with and
> thereby, to some extent, directed. Nor am I the captain of my soul; I
> am only its noisiest passenger.--Aldous Huxley
Received on Thursday, 18 August 2022 13:10:21 UTC