Re: XML Schema validation and https redirects

Greg Hunt <greg@firmansyah.com> writes:
> From the comments in the W3C blog it sounds like Xerces in Java 11
> does not support this.

I bet it does if you tell Xerces to follow redirects.

(time passes)

Okay. I gave myself 30 minutes to see if I could figure this out. It
took closer to 40, but you know, that’s not bad for programming.

1. I couldn’t find any way to tell the Xerces parser bundled with JDK11
   to follow redirects natively. I’m not saying it can’t be done, I just
   couldn’t figure it out in ~30 minutes.

2. The escape hatch that the parser does give you is the entity
   resolver. If you wrote your own entity resolver that followed
   redirects, that would work.

3. You don’t have to write your own, because XML Resolver exists.
   (Shameless plug because I wrote it.)
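
If you did want to go the route in point 2, a hand-rolled resolver might
look roughly like the sketch below. To be clear about what is assumed:
this is not how XML Resolver works internally, and the class name
RedirectFollowingResolver, the absolutize and open helpers, and the
redirect limit are all made up for illustration. It follows redirects by
hand because HttpURLConnection will not, on its own, follow a redirect
that switches protocol (http: to https:).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

// Hypothetical example class; not part of the JDK or of XML Resolver.
class RedirectFollowingResolver implements LSResourceResolver {
    private final int maxRedirects;

    RedirectFollowingResolver(int maxRedirects) {
        this.maxRedirects = maxRedirects;
    }

    // Resolve a possibly relative system identifier against its base URI.
    static String absolutize(String baseURI, String systemId) {
        return baseURI == null ? systemId : URI.create(baseURI).resolve(systemId).toString();
    }

    // Fetch a URI, following up to maxRedirects redirects by hand.
    InputStream open(String uri) throws IOException {
        String current = uri;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we handle redirects ourselves
            int status = conn.getResponseCode();
            boolean redirect = status == 301 || status == 302 || status == 303
                    || status == 307 || status == 308;
            if (!redirect) {
                return conn.getInputStream();
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + current);
            }
            // Location may be relative, so resolve it against the current URI.
            current = URI.create(current).resolve(location).toString();
        }
        throw new IOException("More than " + maxRedirects + " redirects for " + uri);
    }

    @Override
    public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                   String systemId, String baseURI) {
        if (systemId == null) {
            return null; // nothing to fetch; let the parser do its default thing
        }
        try {
            String absolute = absolutize(baseURI, systemId);
            if (!absolute.startsWith("http://") && !absolute.startsWith("https://")) {
                return null; // only HTTP(S) needs the redirect handling
            }
            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSInput input = ls.createLSInput();
            input.setPublicId(publicId);
            // Simplification: keeps the requested URI, not the final redirected one.
            input.setSystemId(absolute);
            input.setByteStream(open(absolute));
            return input;
        } catch (IOException | ReflectiveOperationException | IllegalArgumentException e) {
            return null; // fall back to the parser's own resolution
        }
    }
}
```

An instance of this would be handed to SchemaFactory.setResourceResolver(),
for example new RedirectFollowingResolver(5). Returning null on failure is
a design choice for the sketch: it lets the parser fall back to whatever
it would have done anyway, at the cost of hiding the underlying error.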

Here is the boilerplate schema validation code that I cut and pasted out
of the Oracle JDK11 docs. I’ve added exactly two lines to it:

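  // imports this fragment relies on
  import java.io.File;
  import javax.xml.XMLConstants;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.transform.Source;
  import javax.xml.transform.dom.DOMSource;
  import javax.xml.transform.sax.SAXSource;
  import javax.xml.validation.Schema;
  import javax.xml.validation.SchemaFactory;
  import javax.xml.validation.Validator;
  import org.w3c.dom.Document;
  import org.xml.sax.InputSource;
  import org.xml.sax.SAXException;

  // THIS LINE CREATES THE RESOLVER
  org.xmlresolver.Resolver resolver = new org.xmlresolver.Resolver();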

  // parse an XML document into a DOM tree
  DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
  Document document = parser.parse(new File("instance.xml"));

  // create a SchemaFactory capable of understanding WXS schemas
  SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

  // THIS LINE USES THE RESOLVER
  factory.setResourceResolver(resolver);

  // load a WXS schema, represented by a Schema instance
  Source schemaFile = new SAXSource(new InputSource("https://www.w3.org/2001/XMLSchema.xsd"));
  Schema schema = factory.newSchema(schemaFile);

  // create a Validator instance, which can be used to validate an instance document
  Validator validator = schema.newValidator();

  // validate the DOM tree
  try {
      validator.validate(new DOMSource(document));
  } catch (SAXException e) {
      // instance document is invalid!
  }

With the addition of those two lines of code, it will follow redirects
and happily validate.

(For the record, I am not asserting that this is a simple and
straightforward thing for every user to do. Lots of folks using JDK11 to
validate have probably never heard of entity resolvers. Changing
software is hard, especially if it’s considered a legacy application. In
some environments, adding a new third party library may be very hard or
impossible. I just wanted to work out what would be required to fix it,
if you wanted to fix it. The answer is “Write or obtain an entity
resolver that will follow redirects for you and use it.” That’s not hard
in principle, even if it is hard in practice.)

> Break the validation, even momentarily, and all you have is a legacy
> technology that is harder to argue for.
>
> I am with Michael on this, publishing stable URIs, (and I am inclined
> to factor in the frankly rather vague statements about dereferencing
> URLs), constituted a promise to not change things, a promise that you
> cannot evade by saying people ought to be reading the W3C blog and
> updating their software.

I think those are very reasonable and valid points. On the other hand,
configuring software so that it dereferences www.w3.org to validate
some local resource was probably not an explicit decision; more likely
it’s an accident. The application is going to fail whenever www.w3.org
falls off the internet, which I’m sure it does periodically when
maintenance is performed, or when someone borks DNS on purpose or by
mistake.
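
One way to take www.w3.org out of the loop entirely is to resolve the
well-known schema URIs to local copies. A rough sketch follows, with
loud caveats: LocalCopyResolver, its lookup helper, and the file path in
the usage note are all hypothetical, and the match is on the system
identifier exactly as it is handed to the resolver, which is a
simplification (relative references would need resolving against
baseURI first).

```java
import java.util.Map;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

// Hypothetical example class: maps remote system identifiers to local URIs.
class LocalCopyResolver implements LSResourceResolver {
    private final Map<String, String> localCopies;

    LocalCopyResolver(Map<String, String> localCopies) {
        this.localCopies = Map.copyOf(localCopies);
    }

    // Return the local URI for a system identifier, or null if none is known.
    String lookup(String systemId) {
        return systemId == null ? null : localCopies.get(systemId);
    }

    @Override
    public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                   String systemId, String baseURI) {
        String local = lookup(systemId);
        if (local == null) {
            return null; // unknown resource; leave it to the parser
        }
        try {
            DOMImplementationLS ls = (DOMImplementationLS)
                    DOMImplementationRegistry.newInstance().getDOMImplementation("LS");
            LSInput input = ls.createLSInput();
            input.setPublicId(publicId);
            // Only the system identifier is set; the parser opens the local URI itself.
            input.setSystemId(local);
            return input;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No DOM Load/Save implementation available", e);
        }
    }
}
```

Usage would be something like new LocalCopyResolver(Map.of(
"http://www.w3.org/2001/xml.xsd", "file:///opt/schemas/xml.xsd")),
passed to the same setResourceResolver() call. A catalog-based resolver
does this job more thoroughly; the sketch is only meant to show the shape
of the idea.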

We know that http: URIs are insecure and subject to various kinds of
attacks. If someone constructs an attack vector that uses a hacked
schema injected into an insecure HTTP stream to get software to accept
an otherwise invalid document with some downstream consequence that the
black hats can exploit, that’s bad too. If a bit…unlikely.

                                        Be seeing you,
                                          norm

--
Norman Tovey-Walsh <ndw@nwalsh.com>
https://nwalsh.com/

> We think in generalities, but we live in detail--Alfred North Whitehead

Received on Friday, 19 August 2022 09:30:58 UTC