I coined the term Production Time Debugging many years ago to explain the difference between the sterile, abstract world of the developers and the real world of the system operators.
I traveled between those two separate and parallel universes for many years. I did it long before Dev-Ops became a common buzzword. I have a lot to say about the wrong way Dev-Ops is implemented today in most organizations, but I will defer my grievances to a different post.
Let’s discuss debugging software systems outside the standard development environment, focusing on the Windows operating system. The same issues discussed here happen in other operating systems, but I want to focus on Windows in this post.
You have probably encountered the following scenario too many times. The system is crashing or exhibiting severe performance issues in production. Operations restarts the system, and the problem persists. After several restarts, with all the phones ringing with the escalating list of users, customers, managers, CEO, VPs, stakeholders, and finally “He Who Must Not Be Named”, the IT manager has no other choice but to call the development department and ask for help. The development manager picks one of the developers and sends him (or her) to the dark dungeons of the data center to save the world. It must be a junior developer because the seniors know better.
Junior walks through the door of the operations room for the first time in his life. If you wonder why it is his first time there, it is because the nearest place to operations he ever visited was the point outside where they throw the software over the wall to production. Junior is a little puzzled by the strange feeling that everyone in the IT department hates him. Isn’t he the hero who is going to save them? The truth is that they do hate him, because every time they see a developer, it means that something is going wrong and their five-nines SLA is in severe danger. Junior, of course, knows nothing about nines. The software algorithm he used was very elegant, and he got an A+ for originality from his university professor.
Junior takes his hero title seriously, and he wants to save the world. It will be a painful learning process for him to discover he is not the hero but the problem. Don’t worry too much about Junior. He will eventually grow up (or find a new purpose in life).
Junior starts to ask questions and doesn’t understand what went wrong. “It works fine on my machine. I can take you upstairs and show you”. The IT babysitter attached to him has to explain that they don’t ship his machine.
That’s the point where Junior asks for access to a terminal and announces that he plans to stop the application and restart it under the Visual Studio debugger to dig into the problem. Surprisingly enough, Visual Studio is not available on production machines. The request to install any software in production is a process that requires C-level executive approval and a CISO signature. The IT manager explains that stopping production when 10,000 users are connected is not a good idea, and that breaking into the debugger and single-stepping with users connected to the live system is not an acceptable method of operation in production. That is the first time Junior learns about watchdogs. The watchdog monitors the system in production and sets alarms. It will send messages to all operations and C-level executives if the system does not respond for more than 30 seconds (which, by the way, makes single-stepping almost impossible).
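For readers who, like Junior, have never met a watchdog, here is a minimal sketch of the idea in Python. It is an illustration only, not the watchdog of any real product: the class name `Watchdog`, the `heartbeat` and `is_alarming` methods, and the injectable `clock` parameter are all hypothetical names I made up for this example. Only the 30-second threshold comes from the story above.

```python
import time


class Watchdog:
    """Hypothetical watchdog: alarms when no heartbeat arrives in time."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        # timeout_seconds mirrors the 30-second limit from the story.
        # clock is injectable so the behavior can be checked without waiting.
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_heartbeat = clock()

    def heartbeat(self):
        """Called by the monitored system each time it responds."""
        self._last_heartbeat = self._clock()

    def is_alarming(self):
        """True once the system has been silent longer than the timeout."""
        silent_for = self._clock() - self._last_heartbeat
        return silent_for > self.timeout_seconds
```

A process that sits on a breakpoint stops calling `heartbeat()`, so roughly half a minute into a single-stepping session `is_alarming()` turns true and the phones start ringing again. A real watchdog would also have to deliver the alarm somewhere, which this sketch leaves out.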
So, is it a dead-end? What else can Junior do to save the world?
Many developers feel like fish out of water in the sys-ops world. It is a foreign environment. All the debugging and performance tools they use every day to troubleshoot bugs and performance issues are missing.
How can a developer collect error information and performance data in such environments without affecting the operations? What are the tools that are already installed on the system and could be used to collect data? What tools can get the relevant error traces without installation?
Luckily, there are many ways to get a lot of debugging and performance information in the Windows environment. After several decades of solving such problems in the field, I decided to share my knowledge and experience in this domain. I am delivering another iteration of my 3-day virtual workshop: Debugging Windows applications in production environments.
If that sounds interesting to you, all the information is in the following link...