Using the new Let's Encrypt intermediate chain with SkyHole
If you somehow missed it, one of the certificates in the Let's Encrypt chain of trust expired this week. As expected, things broke… including my private, filtered DNS over TLS server, SkyHole. Below is a condensed form of my notes, written up as the exact document I wish I had while trying to triage broken DNS on my phone.
TL;DR: Implement solution 3 from this post.
Note: After releasing the initial version of SkyHole, I refactored most of the code to eliminate the dependency on Docker. This was to make the project easier to deploy on resource-constrained hardware. At the time, I was working with SaltStack a lot and took the opportunity to redo the entire thing as a salt state for a bit of practice. The exact steps and commands shown below are unique to my particular instance. Use them as guidance for fixing an issue with the publicly released version of SkyHole.
My current daily driver runs Android 11. When configured to use a private DNS server, Android essentially behaves as if you’ve turned WiFi/Cell data off if there is any issue when talking to the DoT server. While this ‘fail private’ approach is commendable, the lack of debug info in the UI is not; no details are given about the failure other than a generic ‘the private dns server could not be reached’ message.
I have seen this behavior once before when the certificate renewal timer failed to fire off… except I implemented email-based notifications after that incident and had recently received a notification from the renewal script.
Just to be sure, I checked the notAfter in each certificate and they all had plenty of life left:
Borrowed that command from this post.
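As a rough stand-in for that kind of check, here is a minimal Python sketch that reads notAfter from the certificate a server presents and converts it to days remaining. It is not the command referenced above: the host name in the usage note and both helper names are hypothetical, and it only sees the leaf certificate, not every certificate in the chain.

```python
import socket
import ssl


def days_until(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate notAfter string."""
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format
    # used in the dictionary returned by getpeercert().
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def leaf_not_after(host: str, port: int = 853, timeout: float = 5.0) -> str:
    """Return the notAfter string of the leaf certificate served on host:port."""
    # The default context validates the chain and the host name; getpeercert()
    # only returns a populated dictionary for a validated certificate.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `days_until(leaf_not_after("dot.example.net"), time.time())` (after `import time`) would then report how much life is left. Keep in mind that a healthy notAfter on the leaf says nothing about how a given client builds the rest of the chain.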
Looking for more information, adb yielded something:
The No such file or directory message came from here.
Google uses its own fork of OpenSSL (BoringSSL) in Android, so I spent some time trying to figure out what an error code of 1 means in the OpenSSL project. It maps to SSL_ERROR_SSL, but that seems to be a relatively ‘generic’ error.
Furthermore, in the context of No such file or directory … it makes even less sense.
Oh well. So much for that theory. From this point on, I’m treating all of the TLS connection stuff as a black box.
I was able to confirm that the TLS certificates on the skyhole instance had not expired and the intermediate chain was not using any of the deprecated certificates.
I had also not made any changes to the skyhole instance in close to a year, and kdig didn’t throw any warnings when querying against the DoT server.
I could see the manual query from kdig in the DNS query/filter logs … so it seemed like the problem was not in either the TLS portion or the DNS portion.
That left Android as the culprit.
But I was also fairly sure that Google hadn’t changed anything on the phone w/r/t how the DoT client worked…🤔
If in doubt, turn to the wires!
I ran tcpdump on the skyhole instance and did notice traffic from Android that was not showing up in the DNS server logs.
I compared the traffic with a working manual query from kdig and noticed that the traffic from the Android client stopped shortly before where the kdig traffic would have turned into a regular DNS query.
So the problem was happening during the TLS setup. Whatever Android was choking on was happening before any DNS queries were sent.
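To make that ordering concrete, here is a minimal Python sketch of a DNS over TLS client. It is an illustration of the protocol flow, not Android's resolver: all function names are hypothetical, and Python relies on whatever OpenSSL the system ships, so it may happily accept a chain that another client rejects.

```python
import socket
import ssl
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record in wire format."""
    # Header: ID, flags (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is length-prefixed, ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def frame_for_stream(message: bytes) -> bytes:
    """DNS over TCP or TLS prefixes every message with its 2-byte length."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full response")
        data += chunk
    return data


def dot_query(server: str, hostname: str, port: int = 853) -> bytes:
    """Send one query over DoT and return the raw DNS response."""
    context = ssl.create_default_context()  # verifies chain and host name
    with socket.create_connection((server, port), timeout=5.0) as raw:
        # The TLS handshake and certificate validation happen inside
        # wrap_socket. If validation fails, an exception is raised here
        # and no DNS bytes are ever written to the wire.
        with context.wrap_socket(raw, server_hostname=server) as tls:
            tls.sendall(frame_for_stream(build_dns_query(hostname)))
            (length,) = struct.unpack("!H", recv_exact(tls, 2))
            return recv_exact(tls, length)
```

A client that rejects the certificate never gets past `wrap_socket`, which matches a packet capture that ends just before the first query would have appeared.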
I quickly configured the phone to use a known good / working DNS over TLS server and it was immediately accepted. The triumphant logcat output confirmed that everything on the TLS layer was happy:
W resolv : Validation success.
Android worked instantly when configured to use a different server but immediately failed when used with the skyhole instance. This points to a problem on the server.
Nothing about the skyhole instance had changed and the certificates that it was offering up were totally valid; other clients worked w/o issue. This points to a problem on the phone.
Even though the certificates appear fine, the timing with the recent Let's Encrypt certificate expiration is too suspicious 🤨.
Looking for another data point, I moved to a different computer with a different version of OpenSSL installed.
Huh. That sure looks like a problem!
After a bit of googling, I found that different versions of OpenSSL (and their forks…) behave differently when validating certificate chains:
The currently recommended certificate chain as presented to Let’s Encrypt ACME clients when new certificates are issued contains an intermediate certificate (ISRG Root X1) that is signed by an old DST Root CA X3 certificate that expires on 2021-09-30. In some cases the OpenSSL 1.0.2 version will regard the certificates issued by the Let’s Encrypt CA as having an expired trust chain.
Most up-to-date CA cert trusted bundles, as provided by operating systems, contain this soon-to-be-expired certificate. The current CA cert bundles also contain an ISRG Root X1 self-signed certificate. This means that clients verifying certificate chains can find the alternative non-expired path to the ISRG Root X1 self-signed certificate in their trust store.
Unfortunately this does not apply to OpenSSL 1.0.2 which always prefers the untrusted chain and if that chain contains a path that leads to an expired trusted root certificate (DST Root CA X3), it will be selected for the certificate verification and the expiration will be reported.
That would certainly explain the behavior I observed when checking the skyhole certificates on the second computer. I don’t know exactly what version of OpenSSL the BoringSSL in my phone is based off of, but, assuming that it’s got the same bug as OpenSSL 1.0.2, that would explain everything.
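The difference between the two behaviors can be sketched with a toy model in Python. This is a deliberate simplification, not how OpenSSL or BoringSSL is implemented: it ignores signatures and matches certificates by name only, the `Cert` type and `chain_is_valid` function are hypothetical, and every date except the 2021-09-30 expiry of DST Root CA X3 is illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: date


def chain_is_valid(sent_chain, trust_store, today, trusted_first):
    """Toy path validation over the certificates a server sends.

    trusted_first=True stops at the first certificate whose issuer is in
    the local trust store. trusted_first=False follows the server-sent
    chain to its end before consulting the trust store.
    """
    roots = {root.subject: root for root in trust_store}
    for cert in sent_chain:
        if cert.not_after < today:
            return False  # an expired certificate anywhere in the path fails
        anchor = roots.get(cert.issuer)
        if trusted_first and anchor is not None:
            return anchor.not_after >= today
    # Reached the end of the sent chain: its last issuer must be a valid root.
    anchor = roots.get(sent_chain[-1].issuer)
    return anchor is not None and anchor.not_after >= today


today = date(2021, 10, 1)

# What the server sends with the longer chain (dates illustrative).
leaf = Cert("skyhole leaf", "R3", date(2021, 12, 1))
r3 = Cert("R3", "ISRG Root X1", date(2025, 1, 1))
x1_cross_signed = Cert("ISRG Root X1", "DST Root CA X3", date(2024, 1, 1))
long_chain = [leaf, r3, x1_cross_signed]
short_chain = [leaf, r3]

# What the client trusts locally: the expired root and the self-signed X1.
trust_store = [
    Cert("DST Root CA X3", "DST Root CA X3", date(2021, 9, 30)),
    Cert("ISRG Root X1", "ISRG Root X1", date(2035, 1, 1)),
]

print(chain_is_valid(long_chain, trust_store, today, trusted_first=True))
print(chain_is_valid(long_chain, trust_store, today, trusted_first=False))
print(chain_is_valid(short_chain, trust_store, today, trusted_first=False))
```

In this model the longer chain only fails for the client that follows the server-sent path all the way to the expired root, and the same client accepts the shorter chain. That lines up with why changing what the server sends is a workable fix when the client cannot be touched.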
The OpenSSL blog post pointed out three possible fixes, two of which are applied client side. My phone is not rooted, so I just assumed that I would not have access to the portions of the file system needed for a ‘client side’ fix. That left the third solution: use a different intermediate chain.
I was not aware that there was an alternate intermediate chain for Let's Encrypt. I didn’t even know that was a thing, let alone why somebody would do that.
Turns out, it’s a very clever trick meant to prevent Let's Encrypt certificates from breaking on older Android versions. Ironic that a ‘newer’ Android device got screwed in the process 😬.
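The two valid chains of trust for Let's Encrypt certificates look like this:
Default chain: server certificate → Let's Encrypt intermediate → ISRG Root X1 (cross-signed by DST Root CA X3) → DST Root CA X3 (expired 2021-09-30)
Alternate chain: server certificate → Let's Encrypt intermediate → ISRG Root X1 (self-signed)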
With the help of this GitHub issue, the revised CertBot cli.ini file becomes:
After running certbot with the new config, logcat shows success:
Android no longer shows an unhelpful “can’t connect” message and I can see DNS queries being filtered!