Re: XML Schema validation and https redirects

Norm,
I'm somewhat familiar with XML catalogs, but I know very few people these
days who both write code and know what resolvers are for.  The real problem
is that developers' appetite for understanding XML has been limited for a
long time now, so what gets used is the out-of-the-box behaviour, and 40
minutes of research is more than most people will put in.  I have had to
argue with people to get them to use schema validation at all, which is
just good engineering practice, let alone talk about resolvers.  I have had
quite senior people tell me that namespaces are too difficult (as they
remove them from designs).

The accidents you describe, stemming from the lack of an explicit design
decision, are default behaviours in the software.  But with falling Java
skill levels in my part of the world and more focus on other technologies,
people are prepared to simply write off unstable behaviour as intrinsic to
old technology rather than try to fix it.  In big old Java systems people
will just turn validation off rather than work out what has to be done to
fix it, let alone add or even reconfigure a resolver.  Updating java.xml to
use a later version of Xerces is very unlikely.  It looks like OpenJDK 11
is still using Xerces 2.11 from 2010, so the idea that people will update
an XML parser that is baked into their JRE is highly implausible.
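
For anyone who does want to fix it without adding a library, here is a
minimal sketch of the write-your-own option from Norm's point 2, using only
JDK classes.  The class name RedirectFollowingResolver, the open() helper
and the redirect limit of 5 are made up for illustration; none of them come
from the JDK or from XML Resolver.  It follows redirects by hand because
HttpURLConnection does not follow a redirect that changes protocol (http to
https) on its own.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;

/** Hypothetical helper: not part of the JDK or of XML Resolver. */
class RedirectFollowingResolver implements LSResourceResolver {
    private static final int MAX_REDIRECTS = 5; // arbitrary limit (assumption)

    private final DOMImplementationLS ls;

    RedirectFollowingResolver() throws ParserConfigurationException {
        // The DOM Load/Save feature of the JDK's DOM implementation creates LSInput objects.
        ls = (DOMImplementationLS) DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation().getFeature("LS", "3.0");
    }

    /** Opens an http(s) URI, following redirects manually, including http to https. */
    static InputStream open(URI uri) throws IOException {
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
            conn.setInstanceFollowRedirects(false); // we handle 3xx responses ourselves
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            boolean redirect = status == 301 || status == 302 || status == 303
                    || status == 307 || status == 308;
            if (!redirect || location == null) {
                return conn.getInputStream(); // throws IOException on 4xx/5xx
            }
            conn.disconnect();
            uri = uri.resolve(location); // Location may be relative
        }
        throw new IOException("Too many redirects");
    }

    @Override
    public LSInput resolveResource(String type, String namespaceURI, String publicId,
                                   String systemId, String baseURI) {
        if (systemId == null) {
            return null; // nothing to fetch; let the parser use its default behaviour
        }
        URI uri = baseURI == null ? URI.create(systemId)
                                  : URI.create(baseURI).resolve(systemId);
        String scheme = uri.getScheme();
        if (!"http".equals(scheme) && !"https".equals(scheme)) {
            return null; // only web resources are handled here
        }
        try {
            LSInput input = ls.createLSInput();
            input.setPublicId(publicId);
            input.setSystemId(uri.toString());
            input.setBaseURI(baseURI);
            input.setByteStream(open(uri));
            return input;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

It would be plugged in with
factory.setResourceResolver(new RedirectFollowingResolver()) in place of
the XML Resolver line in Norm's code.  This is only a sketch: I have not
checked whether the parser consults the resolver for the top-level schema
document as well as for imports and includes, and it does no caching, so
every validation run still depends on www.w3.org being reachable.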

Greg

On Fri, 19 Aug 2022 at 19:31, Norm Tovey-Walsh <ndw@nwalsh.com> wrote:

> Greg Hunt <greg@firmansyah.com> writes:
> > From the comments in the W3C blog it sounds like Xerces in Java 11
> > does not support this.
>
> I bet it does if you tell Xerces to follow redirects.
>
> (time passes)
>
> Okay. I gave myself 30 minutes to see if I could figure this out. It
> took closer to 40, but you know, that’s not bad for programming.
>
> 1. I couldn’t find any way to tell the Xerces parser bundled with JDK11
>    to follow redirects natively. I’m not saying it can’t be done, I just
>    couldn’t figure it out in ~30 minutes.
>
> 2. The escape hatch that the parser does give you is the entity
>    resolver. If you wrote your own entity resolver that followed
>    redirects, that would work.
>
> 3. You don’t have to write your own, because XML Resolver exists.
>    (Shameless plug because I wrote it.)
>
> Here is the boilerplate schema validation code that I cut and pasted out
> of the Oracle JDK11 docs. I’ve added exactly two lines to it:
>
>   // THIS LINE CREATES THE RESOLVER
>   org.xmlresolver.Resolver resolver = new org.xmlresolver.Resolver();
>
>   // parse an XML document into a DOM tree
>   DocumentBuilder parser =
>       DocumentBuilderFactory.newInstance().newDocumentBuilder();
>   Document document = parser.parse(new File("instance.xml"));
>
>   // create a SchemaFactory capable of understanding WXS schemas
>   SchemaFactory factory =
>       SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
>
>   // THIS LINE USES THE RESOLVER
>   factory.setResourceResolver(resolver);
>
>   // load a WXS schema, represented by a Schema instance
>   Source schemaFile = new SAXSource(
>       new InputSource("https://www.w3.org/2001/XMLSchema.xsd"));
>   Schema schema = factory.newSchema(schemaFile);
>
>   // create a Validator instance, which can be used to validate an
>   // instance document
>   Validator validator = schema.newValidator();
>
>   // validate the DOM tree
>   try {
>       validator.validate(new DOMSource(document));
>   } catch (SAXException e) {
>       // instance document is invalid!
>   }
>
> With the addition of those two lines of code, it will follow redirects
> and happily validate.
>
> (For the record, I am not asserting that this is a simple and
> straightforward thing for every user to do. Lots of folks using JDK11 to
> validate have probably never heard of entity resolvers. Changing
> software is hard, especially if it’s considered a legacy application. In
> some environments, adding a new third party library may be very hard or
> impossible. I just wanted to work out what would be required to fix it,
> if you wanted to fix it. The answer is “Write or obtain an entity
> resolver that will follow redirects for you and use it.” That’s not hard
> in principle, even if it is hard in practice.)
>
> > Break the validation, even momentarily, and all you have is a legacy
> > technology that is harder to argue for.
> >
> > I am with Michael on this, publishing stable URIs, (and I am inclined
> > to factor in the frankly rather vague statements about dereferencing
> > URLs), constituted a promise to not change things, a promise that you
> > cannot evade by saying people ought to be reading the W3C blog and
> > updating their software.
>
> I think those are very reasonable and valid points. On the other hand,
> configuring software so that it dereferences www.w3.org to do validation
> of some local resource was probably not an explicit decision, it’s
> probably an accident. The application is going to fail when www.w3.org
> falls off the internet, which I’m sure it does periodically when
> maintenance is performed, or when someone borks DNS on purpose or by
> mistake.
>
> We know that http: URIs are insecure and subject to various kinds of
> attacks. If someone constructs an attack vector that uses a hacked
> schema injected into an insecure HTTP stream to get software to accept
> an otherwise invalid document with some downstream consequence that the
> black hats can exploit, that’s bad too. If a bit…unlikely.
>
>                                         Be seeing you,
>                                           norm
>
> --
> Norman Tovey-Walsh <ndw@nwalsh.com>
> https://nwalsh.com/
>
> > We think in generalities, but we live in detail--Alfred North Whitehead
>

Received on Friday, 19 August 2022 10:47:11 UTC