- From: Norm Tovey-Walsh <ndw@nwalsh.com>
- Date: Thu, 18 Aug 2022 13:32:28 +0100
- To: Michael Sperberg-McQueen <cmsmcq@blackmesatech.com>
- CC: Michael Kay <mike@saxonica.com>, xmlschema-dev@w3.org, Gerald Oskoboiny <gerald@w3.org>
- Message-ID: <m2k075lmty.fsf@nwalsh.com>
Hi,

Michael Kay forwarded this to me. I’ve tried to re-join xmlschema-dev
but it’s been a few hours and I’ve seen no evidence one way or the
other. So the list post might bounce.

> From: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
> Subject: Re: XML Schema validation and https redirects
> Date: 18 August 2022 at 01:12:54 BST
> To: Gerald Oskoboiny <gerald@w3.org>
> Cc: xmlschema-dev@w3.org
> Resent-From: xmlschema-dev@w3.org
>
> Gerald Oskoboiny <gerald@w3.org> writes:
>
> W3C's main web site https://www.w3.org/ will soon start to redirect
> all http requests to https. Will this cause issues for XML
> Schema-related resources hosted on www.w3.org?
>
> To this top-level question I have no reliable answer. It SHOULD not
> cause major issues; it probably WILL cause at least some issues, just
> because it's so easy to put things like this off until something
> actually breaks.

Like Michael (Sperberg-McQueen), I’m inclined to hedge my bets. What’s
actually going to happen?

1. Some validator is going to request http://www.w3.org/schema.xsd to
   get a schema file.
2. That’s going to get redirected to https://www.w3.org/schema.xsd.
3a. If the API that the validator is using is not configured to follow
    redirects, then they’re going to get back a 301 and whatever
    resource is sent with that reply.
    4a. If the 301 response is not WF XML, then the process may report
        an error.
    4b. If the 301 response is WF XML, then the process will parse it
        but may still report an error because it won’t be a schema
        document.
    5. If the validator didn’t report an error when it failed to get
       an XSD file, then it’ll proceed without the schema document.
       That probably won’t work, but it’s a bit hard to predict how
       it’ll fail.
3b. If the API that the validator is using is configured to follow
    redirects, then it will eventually get a schema document.
    4c. If the API returns the schema document with the original,
        http: URI as the system identifier, then the rest of the
        validation process will be none the wiser and everything
        should work.
    4d. If the API returns the schema document with the https: URI as
        the system identifier, then…
        5a. If the validator ignores the system identifier and looks
            only at the schema document and its targetNamespace and
            other internal features, everything should work.
        5b. If the validator looks at the system identifier, I suppose
            some part of the validator might decide that https://
            doesn’t match http:// and conclude that it has the wrong
            namespace.

I won’t be at all surprised if 3a happens to some validators.
Libraries, such as the Apache HTTP library, don’t follow redirects
unless you take special care to make them. (I’ve made this mistake
within the last few years and not noticed it for “a while”.)

I have no real intuition about how likely 5b is. My wild guess is “not
very likely” because once you’ve got the schema, you’re probably more
concerned about what targetNamespace it claims to validate than what
its URI was.

> Is it intended that www.w3.org is in the critical path when performing
> XML Schema validation?
>
> Yes and no, at least in my reading of the XSD spec and my recollections
> of the WG discussions.

The fact that the W3C has been throttling the response time for
requests for DTD and XSD files has probably pushed most validators to
avoid making requests to w3.org for well-known schemas. Does it make
sense to look at the user agents that have requested XSD files over the
past 6-12 months and see what the distribution is like?

Saxonica used to bundle a bunch of schema documents in the jar file so
that they could be obtained without hitting the W3C. That feature is
now bundled in the XML Resolver API.

But at the same time, I’ve been caught out a few times over the years
wondering why some validation process was so mind-numbingly slow only
to discover that I’d let one slip in. So it will happen.

> As a special case of this, knowledge of the XSD schema can be (and I
> expect probably is) built in to most schema validators, so that they
> don't have any pressing need to fetch a copy of the XSD schema for
> XSD schemas.

I saw this one in the wild within the last year:

(Some of the) XSD for XSD Schemas have a doctype declaration, for
example this one:

  http://www.w3.org/2001/XMLSchema.xsd

I discovered some bit of software, I forget the exact details, that had
a cached copy of the XSD but not the DTD, so parsing the cached XSD made
a DTD request to www.w3.org every time…

> Are there other use cases
> besides validation that might cause automated requests to www.w3.org?

Validation is the obvious case, but I wouldn’t be surprised if there
are others. I’m sure XSLT stylesheets have been published in some
specs. You wouldn’t think anyone would xsl:import them from www.w3.org
every time, but I wouldn’t be at all surprised if it has happened.

Again, looking at the logs should be informative.

> What are the most popular software packages that might be making these
> requests to www.w3.org? In what contexts do they make these requests?
> Do the latest versions typically have the ability to follow http to
> https redirects? Would XML catalogs help?

Yes, XML catalogs help. They allow the application author and/or user
to configure local resources that can be returned automatically when
attempts are made to retrieve documents over the web.

Good luck!

                                        Be seeing you,
                                          norm

--
Norman Tovey-Walsh <ndw@nwalsh.com>
https://nwalsh.com/

> My fate cannot be mastered; it can only be collaborated with and
> thereby, to some extent, directed. Nor am I the captain of my soul; I
> am only its noisiest passenger.--Aldous Huxley
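
The catalog idea in the reply above (return a local copy when something asks for a well-known URI) can be illustrated with a rough sketch. This is not the OASIS XML Catalogs format and not the XML Resolver API; it assumes Python and uses a plain dictionary as a stand-in. The names CATALOG, normalize, and resolve, the local schemas/ paths, and the DTD URI are all hypothetical choices made for the example. Treating the http: and https: forms of a URI as the same key is likewise an assumption of the sketch, aimed at the 4d/5b case where a redirected request comes back with an https: system identifier.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical catalog: well-known URIs mapped to local copies.
# The local paths are invented for illustration only.
CATALOG = {
    "http://www.w3.org/2001/XMLSchema.xsd": "schemas/XMLSchema.xsd",
    "http://www.w3.org/2001/XMLSchema.dtd": "schemas/XMLSchema.dtd",
}


def normalize(uri: str) -> str:
    """Map the http: and https: forms of a URI to one catalog key."""
    parts = urlsplit(uri)
    scheme = "http" if parts.scheme in ("http", "https") else parts.scheme
    # Host names are case-insensitive; paths are left untouched.
    # Any fragment is dropped because it does not affect retrieval.
    return urlunsplit((scheme, parts.netloc.lower(), parts.path, parts.query, ""))


def resolve(uri: str, catalog: dict = CATALOG):
    """Return the local path for uri, or None if there is no entry."""
    return catalog.get(normalize(uri))


if __name__ == "__main__":
    for candidate in (
        "http://www.w3.org/2001/XMLSchema.xsd",
        "https://www.w3.org/2001/XMLSchema.xsd",
        "http://example.com/unknown.xsd",
    ):
        print(candidate, "->", resolve(candidate))
```

Listing the DTD alongside the XSD covers the cached-XSD-but-not-DTD situation described above. A caller would fall back to a network fetch (with redirects enabled) only when resolve returns None.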
Received on Thursday, 18 August 2022 13:10:21 UTC