This week I ran into a problem I’d never experienced in 20 years of working in SQL Server. A query that had run perfectly fine in an old version of SQL Server (2008R2 to be precise) suddenly became the slowest query on earth in a new edition (SQL Server 2016). The two servers were comparable — in fact the new version was running on a box that should have been beefier than the original in almost every way, except that one was on a near-obsolete version of Windows and SQL, and the other wasn’t. And like the last Rocky movie, the old-timer was kicking ass. 5.2 seconds for the original version, 4.5 MINUTES for the new one.
Troubleshooting this involved the usual suspects: I looked at CPUs, RAM and how each of the servers were mounting their SAN storage. Turned out that the new server only had 16 GB RAM, and each SQL Instance (three running on this server) only had 5 GB each to work with — compared to the old server, which had 32 GB RAM and only two instances…allowing for 15 GB per instance. I got our Ops guys to bump up my RAM, and…. Nothing changed. RAM wasn’t the issue.
I broke down the query into its smallest parts and tested each piece, wondering if somehow an index was bad on the new version, or statistics were stale, or something! Part of the proc called an AS400 box using OpenQuery. Was there some sort of latency between servers? Nope. This was just as fast as the original. There was a cursor looping through the data retrieved from the AS400… was the cursor somehow slower? Nope. There was a series of queries that used our partitioned tables’ date range function to grab the right data from the right file in the right filegroup. Could that be causing it?
TURNS OUT: YES.
What we discovered is the the new Cardinality Estimator released in SQL 2014 did not play nice at all with this part of our query. Switching Compatibility before each run proved it beyond a shadow of a doubt. On the same box, in the same database, with only the Compatibility swapping between 100 and 130 the speed of the query jumped from five seconds to five minutes!
I jumped to the Internet to find out what the heck was going on, and what our options were. Hat-tip to Robert Davis (SQLSoldier) and Kenneth Fisher (@SQLStudent144) for helping me figure out what they were. We could either keep the Compatibility at 100 — not feasible, since we’re making this move to stay in service compliance and utilize new features like Temporal Tables. Or we could tell the server to force that database to use the Legacy (read: “old”) Cardinality Estimator. Or, we could set which CE we wanted to use on the queries and procs that showed performance issues.
After a brief discussion with our development team (“WHAT!? You want us to test ALL of our procs and functions?!”) we decided to go with Option B, and just used the FORCE_LEGACY_CARDINALITY_ESTIMATION at the database level (thanks, Derek Hammer (@SQLHammer)!).
I wish I knew exactly WHY this happened in SQL. I mean, I know that it was caused by a change that Microsoft made in how SQL builds query execution plans — SQL Server 2014 had the first such change they’d made there in over a decade. But I don’t know WHAT specifically changed, or how I could have structured that query so that the new estimator didn’t jack it around so much.
And THAT’S why I want to attend Paul Randal’s performance tuning class in Chicago. This guy didn’t just write the book on SQL Server… he helped write SQL Server! No other blogger goes into SQL’s internal mechanics like he does, and word on the street is that his class is phenomenal. The things I could learn here would be crucial to helping us continue to support servers that run from SQL 2005 to 2016 — and Azure as well. I’d be able to better support our development team, know WHY something broke, and the best way to fix it, and continue being the vital asset that the company hired me to be.
See, problems like what I faced this week are why I love being a DBA in the first place. When we have a problem it’s not just a broken wire, or a broken connection, or a badly written piece of code, or bad data from a third party. It could be any of those, or none of them, or all of them!
Performance tuning is detective work. It’s solving a real-time mystery with live data that you desperately need to keep alive, and not just alive, but in better shape than it ever was. Performance tuning is taking all your knowledge, intuition, and experience and pitting it against a problem that is stealing the most important resource from your company: time. Our data needs to be fast, because any time it’s not fast, someone else in the company is sitting there saying “This damn computer. I’m never going to get the information I need on time.”
I have a chance to win an entry into one of Paul’s classes (I’m shooting for the IEPTO1: Performance Tuning and Optimization – Part 1 class in Chicago, October 2-6, 2017 (link to class details)). Paul is a fantastic teacher, a wellspring of knowledge, and Subject Matter Expert #1. I know that if I win, then everyone at my mill wins, and there’s no better prize than that.