Introducing sp_LogHunter: Quickly Search For Important Error Log Messages From SQL Server (Video Edition)

Introducing sp_LogHunter: Quickly Search For Important Error Log Messages From SQL Server


Video Summary

In this video, I introduce you to a stored procedure called `sp_log_hunter` that I developed after dealing with large and cumbersome SQL Server error logs for a client. The motivation behind creating this tool was the difficulty in filtering through extensive log files using the GUI interface in SQL Server Management Studio (SSMS). Error logs were taking too long to load, often timing out, and filtering them manually was extremely slow and inefficient. To address these issues, I crafted a stored procedure that can quickly sift through error logs based on specific criteria, such as severity level, date range, and custom search terms.

The `sp_log_hunter` stored procedure offers several parameters for customization, including the number of days back to look in the log files, the ability to filter by specific messages or languages, and options to limit searches to particular error logs. It’s designed to be user-friendly with a clear help section detailing its functionality and parameters. Additionally, I’ve included debugging features to ensure robustness and flexibility for future enhancements. This tool is available on GitHub under an MIT license, ensuring it remains open-source and accessible to everyone who needs it.

Full Transcript

Erik Darling here with Darling Data. In today’s video, I’m going to introduce you to a store procedure that I wrote dead sober over the weekend and some of the week, also dead sober, that digs through SQL Server’s error logs. The impetus for this was client-related, but really what happened was they had lots of real really big error log files and we had to track down some nasty stuff happening to the server in those error logs, but that was really difficult to do via the GUI, via the interface that you have to work with error logs in SQL Server Management Studio. The logs were taking a long time to load, they were timing out in some cases, filtering them was really slow, there was just a lot going on, and so I made it my duty and responsibility to do that. I don’t want to make it that way. To find a way to find a way to find a way to find the things in the error logs that we cared about. And then, you know, I turned that into a stored procedure. And then, you know, I started looking at other messages that like, so like SQL Server has that sys.messagesView, and you can find like high severity error messages and things that happen when, you know, bad things are going on with SQL Server. And you can use those to sort of figure out like a common set of terms and messages where bad stuff is happening.

Now, there’s always going to be some false hits if the search is too expensive, but that’s okay because I’d rather like over-report a little bit than under-report things. You know, error logs are usually full of a lot of garbage messages that you don’t need, failed logins, successful backups.

Some people for auditing will log successful logins and like you just have like pages and pages and pages and pages, like millions of entries of just stuff that gets in the way of you figuring out when things were going from bad to worse to the server being broken. So I have this stored procedure now, sp underscore log hunter, which will hopefully make your life easier when you’re trying to find all the interesting stuff in SQL servers error logs as well. It’s available over on GitHub, as all my scripts are. It is free, it is open source, it is MIT licensed, and it will remain that way unless I really go broke. Just kidding. Who would buy this? Everyone buys me. So the stored procedure has a few parameters with some default values. Days back is how far back in the error log you want to look.

By default, it goes back seven days. And this should be a negative value in the stored procedure. There’s logic to flip it from a positive value to a negative value. You can’t search into the future of an error log as much as you’d like to. As much as I’d like to. You can’t see the future of SQL Server. It is too unpredictable. It broke the magic eight ball. So you can choose how far back you want to look in the error log for certain things. If there’s a specific message you want to look for. Now this only works for one string at a time. I think maybe long term I’ll let you just comma separate some stuff. But right now, that’s too much for me. If there’s a specific thing in the error log you want to look for. You can look for that. If you only want to look for that specific thing across error logs, you can do that as well. You can also specify a language ID. By default, it is English. But if you use a different language for SQL Server, you can choose a different language ID. Where this falls a little bit short is that if the error log is in a different language, you’re going to have to do the translation for whatever search terms you want into that other language because I just did not have the moxie for all that. If you only want to search for the first error log because there are a lot of other error logs for whatever reason, you can do that as well. There’s also sort of the stop help debug version parameters that I try to stick in all my store procedures. So here’s the help section. It’s pretty much what you’d expect from me. It goes through the parameters, tells you what the license is, stuff like that. It does a couple checks up front. This doesn’t work with Amazon RDS because Amazon RDS gives you access to a sort of a wrapper store procedure for what I use. And it does not take a search string as a parameter, at least that I’ve been able to find in the documentation.

I don’t have access to an RDS instance to like test it and do stuff with it. So if anyone out there in RDS land does, I’d be more than happy to take your feedback on this. So it can’t, doesn’t run in RDS. And of course it does not run in Azure SQL DB because, well, I don’t know. I’m trying not to get shanked it past. So probably won’t say anything too not nice about Azure SQL DB here, at least outside of my error message. Validates that the language ID that you’re looking for exists, right? Because, you know, it’s important, I guess.

I do, here’s where I fix the days back parameter. So if you set it to zero, I just go one day back. So we will actually look for some stuff. And if you have used a positive number, then I will do a little math, mathemagic right there to set it to a negative number. There’s also heaps of debug stuff in here. I’ll show you that after I show you some other stuff. I’ll show you that in action after I show you some other stuff. There are some temp tables as usual. There is a temp table that builds a dynamic string up and a computed column because why not? That should be persisted.

Yeah, I don’t know. I don’t know how I missed that. Debugging live. You’re welcome. Typing and demos debugging live. I don’t, I mean, it doesn’t need to be persisted. I just prefer to be for various reasons.

Um, there, so what I do in this store procedure is, uh, I use this, uh, this other system store procedure called SP enum error logs. And what that does, I mean, if you just run this sort of outside of the normal scope of the store procedure, is it will look, it returns a list of log files and the dates that they contain data up to in them. And it also tells you the size and bytes, which I guess is helpful for some people. Um, but I use this as part of the filtering mechanism. So, uh, if whatever, however many days back you want to look, I’ll delete anything, any logs from the table, past that days back, uh, uh, past that days back map date math, but only for, uh, archives greater than zero. Because SQL Server, uh, starts with these zero numbering counting system here. And I don’t want to delete out the first log file because we still need to look at something. If I said where, if I just left that is any log that’s, you know, uh, with like, you know, that didn’t meet the date requirement, we would might delete the archive, like the first log file. So that’s no good. Uh, if the, if you want to use the first log only, I just delete everything where the archive is over, the archive number is over one.

And then, uh, I insert some canary strings into our search table. So I always want to look for Microsoft SQL Server detected and SQL Server has encountered, um, because those are things that I look for pretty consistently. And these get a special, uh, treatment for how many days back they search. So the Microsoft SQL Server stuff kind of brings up stuff from the beginning of the error log.

That’s also what detected does. Uh, and there’s, I also want, I’ll always want to search for SQL Server has encountered because that will show you the 15 second IO warnings that show up in the error log when your, uh, storage is under duress. Uh, so either storage or, you know, storage networking is under duress. So I always want to show that as much as possible. So those go back 90 days and less days back is greater than 90 days. You can search back almost as many days as you want in there. I don’t really, I don’t really have too many checks on that, but, um, you know, uh, these are the design choices we make when we’re writing code for better or for worse. So, uh, this is where I list off the strings of, um, stuff I found in error messages that I thought looked important enough to always get. Uh, there’s a lot in here. I don’t want to talk through all of it, but a lot of it had to do with corruption, bad things happening, server terminating crashes, things like that. Uh, and it’s a fairly exhaustive list. And what I tried to do was bring it down to the fewest number of search terms possible.

There’s around 106 or so in there right now. What happened, what I found is like, there are like, there was like really specific error log messages that I wanted to look for. And, uh, there were like groups of them. And the, the, the, the thing about sys.messages and like, but my first attempt, I tried to use the data in sys.messages to give me a list of errors that I cared about.

The thing is that sys.messages has a bunch of tokens in there. So like, when you think about using raise error, when you see those like percents and stuff in there, you can tokenize messages to like, you know, like substitute them with parameters or barrier variables or values or whatever. And the list was really long. And what I found in testing was that it was better to make fewer round trips to the error log to search for things than it was to make more very specific searches for things. So this is where I kind of got into like the over-reporting versus under-reporting thing. Some of these search terms are very like just single words. Like, so there are like a ton of messages and sys.messages that say something like contact system administrator, contact database administrator, contact technical support in various ways. And so searching for all of those strings independently was like a nightmare because it was just like more round trips to the error log to find stuff. So just searching for the word contact was a lot faster than searching for those specific phrases. So there’s a lot of, there’s a lot of sort of, you know, fuzzy searching in here for stuff that I thought sounded important while looking through error messages, like groups of error messages that all had things in common. So I have this down to like the right now, I think the fewest number of important error messages that I could, that I can, that I can like put in the list. If there’s anything else you want in here that like you think, Eric, we always got to find this. Let me know if there’s anything, if there’s any noise that comes in here, you can also tell me about that. And I can add some stuff to the deleting mechanism that does some cleaning up already of what gets returned to the error log. So that’s the, that first part. And then down here, this is where I handle the custom message.

Part of the reason why I don’t have, why I don’t have the multiple custom message thing set up yet is because I haven’t had to use that yet. And of course, this thing is brand new. So I don’t know if it’s a feature even anyone would use. And I don’t, quite frankly, I don’t know if anyone’s even going to use a store procedure. So there’s that. Some more feedback for the, for looping over the error logs and stuff here. So this is the fun part of the store. This was, this was writing loops in SQL Server is one of my favorite things because I am naturally not terribly good at it. And so writing and debugging them is always a fun learning process for me. So this, this has a while loop and a cursor in it, breaking all the rules, right? But so if there are, if you’re, if you’re a search, your days back, all that stuff spans multiple error logs, we need to enumerate those for the search, right? So while the lowest log file is less than the highest log file that meets your date criteria is, is, is, is there for searching, we, we keep doing the loop. We also have a cursor in the loop that goes and hits the, well, it does a couple of things up front. This is where I hit the error log to search for stuff. But the first thing I have to do is sort of replace this string token with the log file that we’re searching for. If we scroll back up a little bit to the computed column, this string of at signs in brackets is the log number that we want to look at. So every time we loop through, I need to change this string to, to the log number that we’re going to search for. And the number one there just signifies error log. There’s like SQL Server Agent also has logs. I haven’t worked out searching for those searching through those yet. I’ve never found anything important enough in there to look through. So I just had, I just didn’t design it for that because it was, you know, again, scoping issues with these initial store procedure rights.

So, um, again, more, some more debug info. Uh, this is the part of the store procedure that inserts, data into the table that I returned from the error logs with the search string that we’re looking for. Uh, there’s a little bit of error handling in here. Um, when I was initially writing this and trying to use sys.messages and like fix token, string tokenization stuff, this was like, there was a lot of errors that I was capturing and being like, oh, that, the hell is that string?

There was some, it was, it was a mess. So I, I abandoned that pretty quickly. Uh, and then, you know, if find the next log file that we want to search through, rerun the loop that goes through the table, searches all those strings and goes and goes and goes, uh, some logic here to figure out if it’s time to stop running to get close out the cursor. Uh, there’s a delete statement here that gets rid of some junk, uh, that was getting returned. Uh, so like, I didn’t want to like DBCC trace on and trace off.

Like I want to know when people are turning trace flags on and off in SQL Server, but the error log is just like chock full of these DBCC trace on, trace off 3604 messages. So I want to get rid of those. There’s also a bunch of messages in there that say this instance of SQL Server has been using a process ID for 10 billion years because you never patched SQL Server or restarted. Uh, there’s some data, uh, database administrative connection messages, some login failed messages that I never want to see. We don’t get successful logins already. So we don’t need to worry about those, uh, some stuff about endpoints and stuff that I don’t care about. So I delete these messages out so we don’t see them again. This is where if there’s something you don’t want to see in, uh, the error log, this is some noisy message that always comes back. You’re like, Eric, we don’t need that. Tell me about it. Add it here. Never see it again.

Um, after that, pretty simple. We just return any data that comes back from the error log. Uh, if there are any errors in the error table from trying to search through the error log, any commands, those will get returned if they exist there. Cool. Uh, this is what the results look like from the store procedure.

So I stuck a custom message into the error log that said, he’s dead Jim. And we found that when we searched for a custom message, uh, again, like I tried to get some of the, like some of the stuff that I think is interesting or important from like the startup portion of SQL Server, because there’s always some ghosts and goblins and gremlins in there. Uh, so I try to get some canary values just to show that like there’s stuff in the error log. Like, even if it’s not like the crazy stuff that we need to find and get to at first, we at least know that there’s stuff in there that’s coming back.

Now the debug mechanism, if you run into any issues with this and you want to see what’s going on when, while this is running, I’m, you know, pretty, pretty psyched on this. It’ll tells you what the days back parameter, uh, is it’ll tell you how many, what log files we got, uh, how many were valid for searching after we deleted stuff. Uh, it shows you the, uh, search command table, right? So this is all the stuff that goes into the table and this is the final command that runs to find data in the error logs. All right. And then down here we have the, again, the final results from that. And then, uh, uh, I don’t know, good stuff there. Anyway, this is what the messages tab looks like. And I’m going to zoom in a little bit so that you can, uh, see, uh, we, you know, there’s a bunch of razors in there to, oops, uh, tell us what, what point in the loop we’re at in doing things. Uh, there is mess, there are messages in there that tell us how many log files and how many searches we have to run.

Um, and the, I really, I mean, this is not really a great, you know, static value to have, but, uh, I find it, found it was useful in here to sort of keep track of, you know, how many searches, uh, how many searches I’ve done out of how many total searches there are. Uh, we also print out the string of what we’re running. So if, if you just want to like, see what may, like, if you get an error, while this is executing, you shouldn’t, but if you do, uh, if you get an error while this is executing, you can see exactly which command command was, uh, what it hit a problem with. So it prints out all this stuff. And I’m just going to scroll down a little bit further to where the log number changes, because, uh, I think if you, if you follow me on Twitter, you may have recently seen a tweet of mine that said, uh, there’s nothing more satisfying than seeing debug messages get raised in a, in a loop that’s running correctly. And this is, this is what I was talking about. So if we get down, uh, this is where I get, I guess having 106 search terms is, uh, bite me in the keister a little bit, but, uh, if we get down to, uh, where things change, you can see the message where we get the next log.

We go back into the cursor and then we restart our searches on the next log file. So interesting, practical debugging methodology brought to you by Erik Darling of Erik Darling data. Uh, so this is, again, brand new store procedure. I would still consider it sort of in beta, even though I’m very happy with the beta version of it. Um, I would encourage you to go over to GitHub, download it, uh, keep an eye on GitHub because these new store procedures often get a lot of tweaks and tinkering, even live in videos where you had persisted to a computed column, stuff like that. Uh, so GitHub is the place to go for questions, support, enhancement, requests, stuff like that. Um, that’s, that’s, that’s where you do it since that’s where I do things. Blog comments are not the place for that.

Uh, and I guess that’s that. I guess we’re just about at the 20 minute mark. I’m sick of talking. I’m going to take a break while this thing renders and uploads to YouTube and record another video, uh, about some cool new stuff that got added to SP underscore human events block viewer.

And, uh, we’ll see what else happens anyway. Thank you for watching. Uh, if you liked the video, don’t be afraid to give it the old thumbs up. Uh, if you, uh, like this type of SQL Server content, uh, pretty, pretty, pretty please with, um, more free SQL Server content on top. Uh, give, give me, give me the old subscribe. Um, if, if you have a problem with this video, whether it’s, it’s me, my physicality, my breathing, my audio, my video, uh, let, please leave constructive comments on the video. Uh, if you, if you, if something is wrong, you need to, you need to kind of give, give me a little bit more, a little, a little more specific feedback as to what I need to, what I need to fix for the next one. Anyway, that’s SP underscore log hunter. Go over to GitHub.

Get, uh, I guess happy hunting as they say. Why did you keep popping up? Go away. Anyway, uh, thank you for watching. Okay.

Going Further


If this is the kind of SQL Server stuff you love learning about, you’ll love my training. Blog readers get 25% off the Everything Bundle — over 100 hours of performance tuning content. Need hands-on help? I offer consulting engagements from targeted investigations to ongoing retainers. Want a quick sanity check before committing to a full engagement? Schedule a call — no commitment required.