The Hardest Production Bug That I Faced During My Software Engineering Career.

Tue Jun 15 2021

Akram MECHERI

Technical Leader | Blogger | Freelancer

As a passionate technical leader, I am skilled in FullStack Java development and have a strong background in DevOps, Cloud, and Kubernetes. I have a track record of delivering high-quality software, and I am always seeking to improve my knowledge and skills through continuing education and hands-on experience.


In this article, I’ll describe the weirdest Internet Explorer bug that I faced in my life, how we diagnosed it, and the way we managed to correct it in my company.

✤ ✤ ✤ ✤ ✤

By the time you read this article, IE might not exist anymore: Microsoft has announced that it will abandon it by June 15, 2022, so most companies and individuals should be moving to another browser in order to stay safe. This is good news for web developers, because Internet Explorer has always been a pain in the neck when it comes to web development, mostly because of its various CSS and JavaScript incompatibilities, for which we can often find polyfills. The issue I’m going to describe in this article isn’t related to JS or CSS. It’s a very weird behavior of IE when it receives an exotic HTTP response, so for once we can’t really say that it’s IE’s fault.

✤ ✤ ✤ ✤ ✤

The context 📃

I used to work on a B2B project in an insurance company. One of our clients deploys IE as the default browser across its whole IT infrastructure, so we needed to make sure that our websites ran on IE as well as they did on other browsers. One day, this client reported a serious bug affecting one of our websites, and the bug was so exotic that I decided to write a blog post about it.

Basically, the bug affects the display of our website by adding multiple lines of the Not Modified string followed by some HTTP headers at the beginning of the DOM of our website.

What makes the bug so exotic 🥝

The bug occurs randomly.
We couldn’t reproduce the bug more than a few times after thousands of tries.
It only happens on IE and on the production environment.
No major changes were pushed to prod before the bug report.

✤ ✤ ✤ ✤ ✤

Reproducing the Bug 🐛

This was a hard task because of the randomness of the bug. Even after thousands of refreshes on an environment identical to our client’s, we didn’t manage to reproduce the bug on our machines. What we ended up doing was writing a Selenium program that refreshes the page for us and checks for the Not Modified string in the DOM of our page.

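from selenium import webdriver
import time

siteUrl = "https://example.com/"
webDriverLocation = "C:\\Users\\FrenchTechLead\\Downloads\\IEDriverServer.exe"

# Start Internet Explorer through IEDriverServer.
# (Selenium 3 style; Selenium 4 expects a Service object instead of executable_path.)
driver = webdriver.Ie(executable_path=webDriverLocation)
driver.maximize_window()
driver.get(siteUrl)

shouldContinue = True

while shouldContinue:
  html = driver.page_source
  if "Not Modified" in html:
    # Bug reproduced: dump the corrupted page for analysis, then stop.
    with open("ModifiedNot.html", "w", encoding="utf-8") as f:
      f.write(html)
    driver.close()
    shouldContinue = False
  else:
    # Not reproduced yet: reload the page and wait a bit before checking again.
    driver.refresh()
    time.sleep(2)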

After running this program for some time, we managed to reproduce the bug and export the HTML containing the Not Modified string.

Now that we had managed to reproduce the bug, we needed to understand how the heck we ended up with this unexpected content at the top of our page.

For this purpose, we had to look at the architecture of the app from an infrastructure perspective:

From this architecture, we identified some potentially guilty components and made some assumptions.

What components can edit the HTML rendered pages?

The Mashup server is responsible for server-side integration and acts directly on the HTML rendering. It might have had a bug when retrieving cached assets from the CMS, so we tried disabling the cache on the CMS server, but the bug was still there.
JavaScript can edit the HTML on the client side after sending some XHR requests. We analyzed the XHR requests one by one, and none of them had a 304 response code. However, we observed that the Not Modified string was not part of the first paint content, so at this point we strongly believed that JavaScript was responsible for the bug (we later found out that we were wrong).

✤ ✤ ✤ ✤ ✤

Getting Deeper into debugging 🧿

At that time, we knew for certain that the issue occurred on the client side, since the “Not Modified” string was not part of the first paint content.

We strongly believed that some JS was responsible for printing these weird headers on the screen, so we analyzed all the libraries used by our legacy JSF app (the application backend). There were many JS libraries to analyze, and Sarissa JS was an easy one to point at because it acts as a wrapper for XHR requests. We reviewed almost all of our JS code, only to realize that not a single line of it injects these undesired headers into the DOM of our app.

At this point we knew that the bug occurred on the client side but that JavaScript was not responsible for it.

Unlike recent web browsers such as Chrome and Firefox, IE doesn’t have decent debugging capabilities, and its network tab misled us in our investigation. Let me explain how:

The misleading thing about the network tab here is that an HTTP 304 response is displayed with a Body. There is no indication that the Body was retrieved from the local cache and not from the actual response (which should have no Body at all), so if you don’t pay close attention, you can easily end up thinking that the Body was returned by the server along with the 304 response.
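
To make this point about the network tab concrete, here is a minimal sketch in Python of how a conditional request is supposed to work. Everything in it is hypothetical and made up for illustration (the `origin` function, the `CachingClient` class, the `"v1"` ETag and the URL are not our real stack): the server answers 304 with an empty body, and the body that the user ends up seeing comes from the client’s local cache.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Response:
    status: int
    etag: str
    body: bytes = b""


def origin(if_none_match: Optional[str]) -> Response:
    """Hypothetical server: one resource whose current ETag is "v1"."""
    current_etag = '"v1"'
    if if_none_match == current_etag:
        # Conditional request matched: 304 and, as HTTP requires, no body.
        return Response(304, current_etag)
    return Response(200, current_etag, b"fake image bytes")


class CachingClient:
    """Hypothetical browser-like client with a tiny in-memory cache."""

    def __init__(self) -> None:
        self.cache: Dict[str, Response] = {}

    def get(self, url: str) -> Tuple[int, bytes, str]:
        cached = self.cache.get(url)
        response = origin(cached.etag if cached else None)
        if response.status == 304 and cached is not None:
            # The body shown to the user comes from the cache, not the wire.
            return response.status, cached.body, "cache"
        self.cache[url] = response
        return response.status, response.body, "network"


client = CachingClient()
print(client.get("/images/trame.png"))  # first visit: 200, body from the network
print(client.get("/images/trame.png"))  # second visit: 304, body from the cache
```

A debugging tool that shows the status code of the second call next to the cached body, without the third field, gives exactly the misleading picture described above.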

We decided to look at how a single 304 resource was actually received from our backend. For that, we used a very interesting program called Fiddler, a debugging proxy server used to log, inspect, and alter HTTP and even HTTPS traffic between a web client and a web server.

✤ ✤ ✤ ✤ ✤

Bingo 🎉

Fiddler gave us a serious lead: we noticed that our 304 response had a Body (12 bytes), which is a violation of the HTTP protocol.

A 304 response cannot contain a message-body; it is always terminated by the first empty line after the header fields.
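
The rule quoted above is easy to check on a captured response. The sketch below is only an illustration: `split_response` and `violates_304_rule` are hypothetical helper names, the sample responses are made up (the 12 stray bytes just mimic the size we saw in Fiddler), and it assumes the capture contains a single response, not several responses on a keep-alive connection.

```python
from typing import Tuple


def split_response(raw: bytes) -> Tuple[int, bytes, bytes]:
    """Split a raw HTTP/1.x response into (status code, header block, trailing bytes)."""
    head, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no empty line found after the header fields")
    status_line = head.split(b"\r\n", 1)[0].decode("ascii")
    # A status line looks like: HTTP/1.1 304 Not Modified
    status_code = int(status_line.split(" ")[1])
    return status_code, head, rest


def violates_304_rule(raw: bytes) -> bool:
    """True when a 304 response carries bytes after the empty line."""
    status_code, _, rest = split_response(raw)
    return status_code == 304 and len(rest) > 0


# Made-up samples: a clean 304 and the same response followed by 12 stray bytes.
clean = b'HTTP/1.1 304 Not Modified\r\nETag: "abc"\r\n\r\n'
broken = clean + b"x" * 12

print(violates_304_rule(clean))   # False
print(violates_304_rule(broken))  # True
```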

✤ ✤ ✤ ✤ ✤

Check out the following article to learn HTTP the right way ✅

✤ ✤ ✤ ✤ ✤

Before Learning Rest / Soap / GraphQL / You Need To Understand HTTP !

✤ ✤ ✤ ✤ ✤

So now we knew that our servers were returning some exotic HTTP responses for cached resources and that, unlike Chrome and Firefox, IE does not tolerate this protocol violation and acts weirdly by printing the Body of these responses directly onto the DOM.

✤ ✤ ✤ ✤ ✤

So where did the 304 HTTP response get its Body from? 🍑

When we GET a resource directly from the Mashup Server, we receive a 304 response with no Body (L.1), but when we go through the RP (reverse proxy), we receive a 304 response with a 12-byte Body (L.2).

~ tail -F /usr/tomcat/logs/localhost_access.log
  [05/03/21:12:23:39] "GET /files/live/modules/eep-templates/1.0/templates/files/themes/eep-theme/images/trame.png HTTP/1.1" 304 -
  [05/03/21:12:23:53] "GET /files/live/modules/eep-templates/1.0/templates/files/themes/eep-theme/images/trame.png HTTP/1.1" 304 12

The RP connects to the Mashup Server through the AJP protocol. We did some research on the AJP connector of our Tomcat server, and it turned out that there was indeed a problem with this connector.

The bug has been fixed from Tomcat version 7.0.43 onwards, so we just had to upgrade our Tomcat server to correct the issue.

✤ ✤ ✤ ✤ ✤

Final thoughts,

I took the time to write about this particular experience for several reasons. Firstly, I hope that people facing similar bugs will find help here. Secondly, for my fellow web developers, I want to point out the importance of knowing your apps at all of their layers: don’t be an exclusively front-end or back-end developer, otherwise you won’t be able to diagnose serious bugs like the one described in this article. Finally, to the product managers out there: please stop using/supporting IE :)

That’s it! If you found this article interesting, don’t hesitate to leave me a comment and/or hit the 👏 button below.
