Smaller things are easier to understand than large things. This applies to managing your work and writing your code.
As I mentioned in the “Learn Your Tools” section, you’ll need to know how to use your debugger and logging tools. But, once you’ve done that, it’s still important to have a good process for debugging.
This is the general process I use.
First, don’t make assumptions
I’ve lost count of the number of times someone reported a bug, told me what they thought the source of the problem was, and turned out to be wrong. The same goes for error messages. They often report a symptom of a problem, but not its source.
Think like a scientist. The error message or user report is a piece of data, but you’ll need to prove (or disprove) it with your own experiments.
Second, get the steps to repeat the bug
Bugs often happen around “edge cases” – uncommon situations. To find the bug, and to confirm that your fix eliminates it, you generally need repeatable steps that make the bug occur.
If at all possible, get the exact steps to repeat the bug. Find out what buttons the user clicked on, get the exact values they entered in the fields, etc.
Having the steps to repeat the bug can also give you a way to narrow down a problem.
Follow the steps in the “Learn how to read error messages and debug” section, doing the same thing the user did when they encountered the error.
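One way to keep the steps repeatable is to write them down as data and replay them with a small script. The sketch below is only an illustration: `parse_quantity`, the field name, and the entered values are all made up for the example, standing in for whatever code and inputs your real bug report involves.

```python
def parse_quantity(text):
    """Hypothetical function under investigation: converts a form field to an int."""
    return int(text)


# Exact values from a (made-up) bug report, recorded so they can be replayed.
REPORTED_STEPS = [
    ("quantity", "3"),
    ("quantity", " 12 "),
    ("quantity", "1,000"),  # the value the user typed just before the error
]


def replay(steps):
    """Run each recorded step in order and return the first one that fails."""
    for field, value in steps:
        try:
            parse_quantity(value)
        except ValueError as exc:
            return field, value, exc
    return None  # every step passed, so the bug was not reproduced


failure = replay(REPORTED_STEPS)
print(failure)
```

Once a script like this fails reliably, you can rerun it after every change to check whether your fix really eliminates the bug.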
If you try repeating the steps in your development environment, and the bug doesn’t happen, that may be a sign that the source of the bug is related to the environment.
Maybe the database has different values, which are causing the bug. Maybe the problem only happens when multiple people are working on something. Maybe the problem is because of server configuration.
But, if you can’t repeat the problem in your development environment, that doesn’t always mean the problem is because of the environment. There still may be a bug in the code.
Third, slice-and-dice to locate the source of the problem
When searching for bugs, I use a “slice-and-dice” technique.
There are thousands of things that can go wrong in a program. When you encounter a bug, you want to quickly narrow that list from thousands to one.
Let’s say you have to fix a bug that doesn’t provide a clear error message. I had one bug like this where a program would process requests, but eventually run out of memory and crash – without providing an error message.
I could reliably repeat the crash in the development environment by sending a few thousand orders through the program.
This program had a process for handling requests. Let’s say it was 20 steps long.
I went to step 10 and commented out the line that moved to step 11. Then I ran through thousands of orders. It still crashed.
I went to step 5 and commented out the line that moved to step 6 and ran through thousands of orders. This time, it didn’t crash. This let me know the problem was somewhere between steps 6 and 10.
I went to step 8 and commented out the line that moved to step 9 and sent through thousands of orders.
This process let me quickly narrow down the place in the code where the crash was happening. Instead of looking at tens of thousands of lines of code, within an hour or two I was able to pinpoint the problem to one class.
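The same halving idea can be sketched in code. This is not the program from the story, where I commented out lines by hand. It is a minimal simulation that assumes a pipeline of 20 hypothetical steps, one of which always fails, and assumes you can run only the first N steps. The names `make_steps`, `crashes`, and `find_bad_step` are invented for the example.

```python
def make_steps(bad_step, count=20):
    """Build `count` hypothetical steps; step number `bad_step` simulates the crash."""
    def make(number):
        def step(order):
            if number == bad_step:
                raise MemoryError(f"simulated failure in step {number}")
            return order
        return step
    return [make(n) for n in range(1, count + 1)]


def crashes(steps, last_step):
    """Run only steps 1..last_step, like commenting out the call to the next step."""
    try:
        order = {"id": 1}
        for step in steps[:last_step]:
            order = step(order)
    except MemoryError:
        return True
    return False


def find_bad_step(steps):
    """Halve the search range until a single step remains."""
    low, high = 1, len(steps)  # the bad step is somewhere in low..high
    while low < high:
        middle = (low + high) // 2
        if crashes(steps, middle):
            high = middle      # the crash happens within steps low..middle
        else:
            low = middle + 1   # steps up to middle are clean
    return low


steps = make_steps(bad_step=7)
print(find_bad_step(steps))
```

With the failure placed in step 7, the search stops after step 10, then 5, then 8, then 7, then 6. That is five runs instead of twenty.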
Finally, make a change to prevent the bug from recurring, or give better information if it happens again
I once took over maintenance of three services that did some processing every night. These services weren’t written well, and had errors most nights.
Unfortunately, they didn’t have any logging. That was one of the first things I changed.
I modified them all to write messages to a log file. When they started and finished, they wrote log entries. When they retrieved data, they wrote log entries with the number of records. And, most importantly, if they had an error, they wrote a log entry with all the error information.
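Here is a rough sketch of that kind of logging, using Python’s standard `logging` module. The original services are not shown here, so the logger name `nightly_job` and the `fetch_records` and `process` callables are placeholders I made up for the example.

```python
import logging

logger = logging.getLogger("nightly_job")


def configure_logging(path):
    """Send log entries to a file, with a timestamp and severity on each line."""
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def run_job(fetch_records, process):
    """Run one nightly pass; `fetch_records` and `process` are hypothetical callables."""
    logger.info("Job started")
    try:
        records = fetch_records()
        logger.info("Retrieved %d records", len(records))
        for record in records:
            process(record)
    except Exception:
        # logger.exception writes the message plus the full traceback.
        logger.exception("Job failed")
        return False
    logger.info("Job finished")
    return True
```

Call `configure_logging` once at startup with the path of the log file, then call `run_job` for each nightly run. A failed run leaves the traceback in the file instead of disappearing silently.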
I checked those log files every morning. This let me see errors that had been happening, but were previously being ignored. When I saw a new error in the log, I fixed the program and deployed the new version to production.
Over time, the services were able to successfully run every night.
Remember, if an error happens, and the programmer (or user) never knows, the error will continue to happen.