This was an ASP.Net web service, written in VB.Net and built on either the 1.0 or 1.1 version of the .Net framework.
The company had a dozen different systems, each with its own data format, and they needed to submit work requests to each other.
This web service would receive a work request from System A, convert it to System B’s format, and submit it to System B. When System B finished the work, it would submit its response back to the web service, which would translate it to System A’s expected response format and send it to System A.
During load-testing, the web service started using a lot of RAM. It would grow until it reached the point where it was using around 1.2 gigabytes of RAM, and then crash.
When an error doesn’t point to a specific location in the source code, like the “Out of memory” error we had, the fault could be anywhere. That makes it difficult to track down.
So, I used what I call the “slice-and-dice” method to locate the source of the problem.
Almost every program performs a process that follows specific steps. You can almost think of it as an assembly line – moving data, instead of cars or computers.
If you have a problem at the end of the process, you can check at the middle of the assembly line to see if the problem exists there yet. If it does, the problem was created in the first half of the process, so you split that half again and check at the 25% point. If it doesn’t, the problem is in the second half, so you check at the 75% point, which tells you whether the problem lies in the 50-75% stage or the 75-100% stage.
You just keep repeating this “slicing-and-dicing” until you get to the point where you discover exactly where the problem occurs.
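The repeated halving above is just a binary search over the stages of the pipeline. Here is a minimal Python sketch of the idea; the stage names and the `has_problem` check are invented for the example, not taken from the original service:

```python
# "Slice-and-dice": binary-search over a pipeline's stages to find the
# first stage that introduces a problem. Everything here is a toy model.

def first_bad_stage(stages, run_stages, has_problem):
    """Return the index of the first stage whose output shows the problem
    when only stages[0..k] are run (the rest are 'stubbed out')."""
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Run only the first mid+1 stages, then check for the problem.
        if has_problem(run_stages(stages[:mid + 1])):
            hi = mid          # problem already present: look earlier
        else:
            lo = mid + 1      # problem not yet present: look later
    return lo

# Tiny demo: the fourth stage is the one that "leaks".
stages = ["parse", "validate", "translate", "leaky-step", "submit"]

def run_stages(active_stages):
    # Simulated observation: the problem appears once the bad stage runs.
    return "leaky-step" in active_stages

def has_problem(observation):
    return observation

print(stages[first_bad_stage(stages, run_stages, has_problem)])
# prints "leaky-step"
```

With N stages this takes about log2(N) test runs instead of N, which is why the real search took only an hour or two.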
Since this only happened when the website was being load-tested, I “split” the assembly line by stubbing out the code at a certain point.
I had the load test app continually submit work requests that needed to go from System A to System B.
However, at the point where the web service would submit the converted work request to System B, I commented out that code. The system still crashed. So, I moved to an earlier stage in the process and commented out code there to prevent the work request from moving any further.
Within an hour or two, I had found exactly where the problem was occurring.
It turns out the problem was happening when the web service created the translation objects. There was a factory class where you would pass in the IDs for the source and destination formats. The factory would create the correct translator object to handle those formats – sometimes using VB.Net code, sometimes XSL transforms, since the data was submitted to the web service in XML.
In this particular version of the .Net framework, there was a bug.
When doing an XSL transform – changing XML data from one format to another – the framework created some internal objects. Some of them were never marked as “no longer needed”, so the garbage collector would never get rid of them and free up the memory they had used.
That’s why the web service’s memory usage kept growing until it crashed.
Since we were using a factory method to create the translation objects, it was easy to change it so the factory created one instance of each type of translation object, and returned that same object with each request.
The program no longer created hundreds of Convert_System_A_To_System_B objects. It just created one and kept re-using it each time it needed to convert a work request from System A to System B.
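In Python-flavored pseudocode (the class and method names are hypothetical, but the shape matches the change described), the caching factory looks something like this:

```python
# A factory that creates one translator per (source, destination) format
# pair and returns the same instance on every subsequent request, instead
# of instantiating a new translator each time. Illustrative names only.

class Translator:
    def __init__(self, source, dest):
        self.source, self.dest = source, dest

    def translate(self, request):
        # Stand-in for the real VB.Net/XSLT translation logic.
        return f"{request} ({self.source}->{self.dest})"

class TranslatorFactory:
    def __init__(self):
        self._cache = {}

    def get(self, source, dest):
        # Create the translator once per format pair, then reuse it.
        key = (source, dest)
        if key not in self._cache:
            self._cache[key] = Translator(source, dest)
        return self._cache[key]

factory = TranslatorFactory()
t1 = factory.get("SystemA", "SystemB")
t2 = factory.get("SystemA", "SystemB")
print(t1 is t2)   # prints True: the same instance is returned both times
```

Because every caller already went through the factory, this one-file change was enough to stop the leaky framework objects from piling up.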
However, some of the translators used static variables. It would have been best to rewrite them so they didn’t, but, as usual, the system needed to be fixed “yesterday”. So, I tried the quick and easy method of putting a SyncLock around the call to the translation code.
This meant that the web service could only do one translation of each format pair at a time. If it received two simultaneous requests to send an order from System A to System B, one of them would have to wait until the first one was completed.
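A rough Python sketch of that workaround: one lock per (source, destination) format pair, so only one translation for a given pair runs at a time while different pairs still proceed in parallel. The class and method names are invented for illustration; the original used VB.Net's SyncLock statement.

```python
import threading

class PairLockedTranslation:
    """Serializes translations per format pair, protecting any shared
    (static) state inside a translator."""

    def __init__(self):
        self._locks = {}
        self._master = threading.Lock()

    def _lock_for(self, source, dest):
        # Create each pair's lock once, guarded by a master lock so two
        # threads can't create duplicate locks for the same pair.
        with self._master:
            return self._locks.setdefault((source, dest), threading.Lock())

    def translate(self, translator, request, source, dest):
        # Two simultaneous SystemA->SystemB requests queue up here;
        # a SystemA->SystemC request is unaffected.
        with self._lock_for(source, dest):
            return translator(request)

guard = PairLockedTranslation()
print(guard.translate(str.upper, "order-1", "SystemA", "SystemB"))
# prints "ORDER-1"
```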
I was worried that this change would slow down the web service significantly. On the surface, it looked like a bad idea that would limit the throughput of the system. But, if this worked, it would prove where the problem was located, and I could work on a better solution for it next.
After making the change, I ran the load-test program to see if the problem was gone and how the change affected performance.
It not only fixed the memory problem; surprisingly, the service now processed work requests even faster than before.
Apparently, instantiating and disposing of the individual translation objects took a relatively large amount of time. That was the slowest part of the process. Even though this change allowed only one translation at a time for each format pair, the throughput was now about 12 times higher than before. And the memory usage never went above 150 megabytes, and the web service never crashed.
Besides confirming how well the “slice-and-dice” method works for tracking down strange bugs, this experience gave me a new level of respect for the Factory design pattern.
Since we already had one central place that created the translation objects, it was fast and easy to change how the application created, or re-used, them. I didn’t need to go through the whole application making changes in the dozens of places that instantiated new translation objects, hoping I didn’t miss any.
This also showed that sometimes performance bottlenecks are not where you would expect them to be.
Had the out-of-memory bug not existed, the web service would have met the performance requirements with the original code: it handled roughly twice as many requests as it was expected to need in production. However, if the business grew, the company would quickly have found itself buying more web servers to handle the increased load. After the fix, the service could handle 12 times as much traffic without any additional hardware expense.