Recently I left Microsoft, where I worked for almost 15 years; about 10 of those years were spent in Escalation Services, where my daily routine was debugging failing or faulting applications. This all began with user- and kernel-mode Windows processes, and then once the .Net Framework shipped I moved to the ASP.Net and CLR teams and began debugging more managed processes. Normally customers would send my team crash dumps or memory dumps of the offending process(es), and we would use tools such as WinDbg or CDB to dig deeper into the process to determine what was happening. There are several challenges when doing this type of work, and one of the most painful is locating and referencing the correct symbol files (*.pdb).
If you don’t know, symbol files are used to help give meaning to memory addresses found within a process and are most useful when building stack traces. When the debugger tools write out a process memory dump, the amount of memory used by that process will equal the size of the memory dump (certain dump options allow you to create mini dumps, but for our work a full memory dump was typically preferred). Over the years, as we moved from x86 to x64, the process memory dumps being sent to us by our customers began to grow very large. It was not uncommon to have customers uploading 1-4 GB dump files (or more), and often it was not a single file being sent. It was sometimes a huge challenge to capture a memory dump at the exact time an issue was reproducing, so multiple memory dumps were often sent, and frequently some of the files did not capture the event being reported. What made this a little better was the fact that these files zip / rar / compress very nicely, but it was still often a logistical challenge to get the correct files uploaded.
About a year ago my friend and co-worker at Microsoft, Aaron Barth, and I were talking about these challenges and started brainstorming ideas on how we could address some, if not all, of them. Over the years we had experience with many customers who had serious issues which only reproduced in production environments; however, these customers did not want to, or could not, install tools or applications on production machines. From our experience, most hang and performance issues could be resolved quickly, or at a minimum we could be given a good line of investigation, if we just knew what each thread within the process was doing at some point while the issue was being experienced. We often scripted the debugger tools to create snapshots of memory dumps and, once available, typically dumped out each thread with a ~*kb400 command, parsed through the results, and frequently were able to pinpoint the cause or develop a working theory as to the issue. But, as mentioned, installing the debugger tools, configuring scripts, and uploading one or more memory dumps, hoping at least one of them was taken at the time of the event, could in some cases take hours.
So, armed with a strong motivation to make things better for our customers, and heck, even for us, I set out to create a tool called SNAP. The design goals of SNAP included:
- Easy to use and deploy
- No need for symbols
- Generated output could be easily shared with a subject matter expert (SME) such as the developer or 3rd party support team.
The SNAP tool is very easy to use: it is simply a command line tool with 3 main tasks or commands, each with a number of options. The 3 commands are SNAP, DUMP, and LIST.
- Snap: Captures a stack trace of the managed stack for each thread within a process and writes the output to an XML log file on disk. Each ‘Snap’ results in an XML log file, and you can configure the snap interval, max snap count, or duration.
- Dump: Captures a full memory dump of the process. It is sometimes still necessary and helpful to have a memory dump to troubleshoot certain issues, and rather than having to install the debugger package we can use this tool to capture that information too. As with the Snap command, this command can be configured to capture dump files on a certain interval, for a period of time, or for a certain number of dump files.
- List: Displays a list of managed processes running on the current system, a bit like tlist.exe.
So you are sold and want to get started using SNAP? Cool, here are the one, two, three…
- Download SNAP from the link below. The download includes .Net 3.5 and .Net 4.0 flavors of the tool, along with x86 and x64 platform versions as well (so four exes altogether). Use the exe which is most appropriate for the target process from which you are attempting to collect information.
- Select a process to ‘snap’. You have several choices here: you can choose the process by Process ID (PID), by process name, e.g. w3wp.exe, or by the Application Pool Name in the case of an IIS process. If you know you want to focus on a single process, you can choose the PID option and use your favorite tool to grab the PID of the target process. I am a little partial to running snap -c list to get the PID, but you should feel free to choose what works best for you. If you choose to use the process name and there is more than one process with this name, all processes will be targeted and you will have snap logs from each; no worries, because the snap log naming format will ensure you can determine from which process each log was sourced.
- Execute the snap command, e.g. snap -c snap -p 1234. With the snaps running, wait until the issue you are troubleshooting reproduces, and once complete hit Ctrl+C to stop the snap process.
- Analyze the log files. I like to start with the largest file first. I open the file in a text editor and go through each stack to try to determine what may be occurring at the time of the event.
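If you end up with many snap logs, the "start with the largest file first" step is easy to script. Here is a minimal Python sketch that picks the biggest log in a directory; note the *.xml naming pattern is my assumption (SNAP writes XML log files), so adjust it to match the actual snap log file names.

```python
import glob
import os

def largest_snap_log(log_dir, pattern="*.xml"):
    """Return the path of the largest snap log in log_dir, or None.

    The "*.xml" pattern is an assumption based on SNAP writing XML
    log files; adjust it to match the actual snap log file names.
    """
    logs = glob.glob(os.path.join(log_dir, pattern))
    if not logs:
        return None
    # The largest log is usually the snap with the most (and deepest)
    # stacks, i.e. the one most likely to have caught the issue.
    return max(logs, key=os.path.getsize)
```

For example, `largest_snap_log(r"C:\snaplogs")` hands you the one file worth opening first.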
To bring this all together, I have put together a demo video which walks you through using the Visual Studio 2010 profiler and SNAP to troubleshoot a SharePoint 2010 performance issue. If you just want to see how SNAP works for this use case, skip ahead to 14:00 in the video. While this video is focused on a SharePoint performance issue, it is important to understand that SNAP was not written for, and has no dependencies on, SharePoint, so you should feel free to use this tool on any managed process.
In the year since I wrote the tool, it has been used across many large and mid-size customer environments, literally around the world, to troubleshoot a whole host of issues. Suffice it to say this tool has been well battle tested, and it has proven its value in a number of scenarios where a debugger would typically be needed to troubleshoot further.
I have used this tool in a number of real-world farm/server-down scenarios:
- I have seen where a proxy was misconfigured and a majority of threads were waiting on an external web service call to return.
- I have seen where a SharePoint delegate control was being used to make a reverse DNS lookup on the incoming client’s IP address and was subsequently timing out; about 60% of the threads in flight were in this routine.
- I have used this when requests seemed to hang and noticed most threads were trying to hit SharePoint’s UPA service; at the time, the only server running that service was unavailable, and again a very large majority of threads were waiting to initialize the UPA.
In all cases, all I did was analyze the largest Snap log file, look at each stack, and make a determination as to what most stacks were doing at the time of the issue (see the video for more details). I have not really needed to look at more than one log or had any need to create some kind of log parser or analyzer, though I am sure if someone were to build one it would be well received.
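If someone did want to build such an analyzer, a minimal sketch might do exactly what the eyeball pass does: group threads by identical call stacks and report the most common ones, since the dominant stack usually points at the bottleneck. Note that the <thread>/<frame> XML schema below is entirely hypothetical; the real snap log layout may differ, so adapt the element names accordingly.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def dominant_stacks(xml_text, top=3):
    """Group threads by call stack and return the `top` most common
    stacks, each paired with the number of threads sharing it.

    The <thread>/<frame> schema assumed here is hypothetical; adapt
    the element names to the actual SNAP log format.
    """
    root = ET.fromstring(xml_text)
    stacks = Counter()
    for thread in root.iter("thread"):
        frames = tuple(frame.text for frame in thread.iter("frame"))
        stacks[frames] += 1
    return stacks.most_common(top)

# Hypothetical sample log: two threads blocked in the same call, one running.
sample = (
    '<snap>'
    '<thread id="1"><frame>Svc.Wait</frame><frame>Page.Render</frame></thread>'
    '<thread id="2"><frame>Svc.Wait</frame><frame>Page.Render</frame></thread>'
    '<thread id="3"><frame>Gc.Run</frame></thread>'
    '</snap>'
)
print(dominant_stacks(sample))
# The shared ("Svc.Wait", "Page.Render") stack is reported first, with count 2.
```

When 60% of threads share one stack, that stack surfaces at the top of this list, which is exactly the signal used in the scenarios above.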
As with any tool offered for free on this site, I don’t offer direct support, warranty, licensing, etc., but I am willing to answer any questions you may have.
Tool Download: Here (681 kb)