Monday, March 30, 2015

Query Performance Tuning Example

As DBAs, we often have application performance issues and complaints thrown our way. Troubleshooting those performance issues is often fun, and when the problem is identified as lying in SQL Server, it gets even more exciting. Here is an example of how I tuned a query a while ago, along with some lessons learned.

I received a request to identify the bottlenecks of an application whose users had reported in a survey that they were experiencing slow application response in their daily operations. Without any information about which specific module or functionality exhibited the 'problem', I did some digging and noticed there were different types of queries and stored procedures developed in this application to serve both operational and reporting purposes.

Using my own query below, I looked up the most frequently run queries (order by COUNT) as well as the ones that take the longest time (order by TIME).

DECLARE @TopNumber smallint = 20;
DECLARE @OrderByField varchar(20) = 'COUNT';

--Top 10 CPU Usage Query
--Declare @TopNumber smallint = 10;
--DECLARE @OrderByField varchar(20) = 'CPU';

-- CPU       - avg_cpu_time
-- TIME      - avg_elapsed_time
-- READ      - avg_physical_read + avg_logical_read
-- WRITE     - avg_logical_write
-- COUNT     - total_exec_count
-- RECOMPILE - avg_ob_recompile
-- PARALLEL  - avg_over_parallel

WITH query_stat AS 
(SELECT TOP(@TopNumber)
  SUM(qt.plan_generation_num) / SUM(cp.usecounts) AS avg_ob_recompile,
  SUM(cp.usecounts) AS total_exec_count,
  SUM(qt.total_worker_time) / SUM(qt.execution_count) / 1000 AS avg_cpu_time,
  SUM(qt.total_elapsed_time) / SUM(qt.execution_count) / 1000 AS avg_elapsed_time,
  SUM(qt.total_logical_writes) / SUM(qt.execution_count) AS avg_logical_write,
  SUM(qt.total_logical_reads) / SUM(qt.execution_count) AS avg_logical_read,
  SUM(qt.total_physical_reads) / SUM(qt.execution_count) AS avg_physical_read,
  (SUM(qt.total_worker_time) - SUM(qt.total_elapsed_time)) / SUM(qt.execution_count) / 1000 AS avg_over_parallel,
  MIN(qt.sql_handle) AS sql_handle,
  qt.query_hash
FROM sys.dm_exec_cached_plans cp WITH (NOLOCK)
INNER JOIN sys.dm_exec_query_stats AS qt WITH (NOLOCK) ON
 cp.plan_handle = qt.plan_handle
CROSS APPLY sys.dm_exec_sql_text(qt.sql_handle) st 
GROUP BY qt.query_hash
ORDER BY 
 (CASE @OrderByField 
  WHEN 'CPU' THEN SUM(qt.total_worker_time) / SUM(qt.execution_count)
  WHEN 'READ' THEN SUM(qt.total_logical_reads + total_physical_reads) / SUM(qt.execution_count)
  WHEN 'WRITE' THEN SUM(qt.total_logical_writes) / SUM(qt.execution_count)
  WHEN 'COUNT' THEN SUM(cp.usecounts)
  WHEN 'TIME' THEN SUM(qt.total_elapsed_time) / SUM(qt.execution_count)
  WHEN 'RECOMPILE' THEN SUM(qt.plan_generation_num) / SUM(cp.usecounts)
  WHEN 'PARALLEL' THEN (SUM(qt.total_worker_time) - SUM(qt.total_elapsed_time)) / SUM(qt.execution_count)
 END) DESC
)

SELECT TOP(@TopNumber)  
 DB_NAME(st.dbid) AS database_name,
 OBJECT_NAME(st.objectid, st.dbid) AS proc_name,
 cp.cacheobjtype,
 cp.objtype,
 qs.avg_ob_recompile,
 total_exec_count,
 avg_cpu_time,
 avg_elapsed_time,
 avg_logical_write,
 avg_logical_read,
 avg_physical_read,
 avg_over_parallel,
 SUBSTRING(st.text, (qt.statement_start_offset/2) + 1,
   ((CASE qt.statement_end_offset 
    WHEN -1 THEN DATALENGTH(st.text)
    ELSE qt.statement_end_offset 
     END - qt.statement_start_offset)/2) + 1) AS sql_text, 
 st.text AS ob_text,
 ph.query_plan
FROM query_stat AS qs
INNER JOIN sys.dm_exec_query_stats AS qt WITH (NOLOCK) ON
 qs.query_hash = qt.query_hash AND
 qs.sql_handle = qt.sql_handle
INNER JOIN sys.dm_exec_cached_plans cp WITH (NOLOCK) ON
 cp.plan_handle = qt.plan_handle
CROSS APPLY sys.dm_exec_sql_text(qt.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qt.plan_handle) AS ph
OPTION (RECOMPILE);

By analyzing the two different sets of data, I was able to find some correlation and identify several frequently called slow queries. I decided to focus on the 'operational' queries, as they were more aligned with the user complaints. With some help from a developer to filter out the reporting queries, I identified one stored procedure that appeared to have a significant impact on users' operational tasks.

Here is the target: dbo.SPBuildFastIndex

The stored procedure inserts a new row into a table whenever a new entry is created; in addition, it can also be executed to delete all rows in the table and reinsert all the data. It is one of the most frequently run stored procedures, with a higher-than-average execution duration. From the cached statistics, the stored procedure executes around 3,000 times a day, certain SQL statements within it take an average of 20 seconds to complete in production, and some executions take between 1 and 2 minutes.

ALTER PROCEDURE dbo.SPBuildFastIndex (@ObjectID int)
AS
...
INSERT INTO FastIndexTbl
(LocID, FastIndexOTID, TransID, NameData, Info, Timestamp)
(
 SELECT 
  t.LocID, 
  100, 
  t.TransID, 
  dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName),
  CONVERT(varchar(50),d.DetailNumber) + ', ' + 
  ISNULL(dbo.NameStr(i.FirstName, i.MiddleName, i.LastName, NULL), 'No Data'), 
  ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, '1/1/1900'))
 FROM Trans t 
 INNER JOIN Detail d ON t.TransID=d.TransID
 INNER JOIN Ind i ON d.DetailInd=i.IndID
 INNER JOIN Acct a ON t.AcctID=a.AcctID
 INNER JOIN Ind b ON b.IndID=a.CustID
 INNER JOIN [Ops] o ON d.DetailID=o.DetailID
 WHERE 
  d.DetailTypeID = 100  
  AND dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName) IS NOT NULL
  AND t.TransID = CASE WHEN @ObjectID = 0 THEN t.TransID ELSE @ObjectID END
);


I picked the slow SQL statement out for more isolated analysis and executed the SELECT statement with the @ObjectID parameter specified (e.g. 100001).

--Turn on statistics to display IO and time information of the execution
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

I executed the commands below to clear the cache after each change, for a fairer comparison.

--Delete cache and data in memory for comparison of each change
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
DBCC FREEPROCCACHE;

Here is the execution plan of the query (viewed in SQL Sentry Plan Explorer).


Here are some of the IO and time statistics from a cold cache.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Acct'. Scan count 1, logical reads 22357, physical reads 20, read-ahead reads 22739, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Ops'. Scan count 1, logical reads 12541, physical reads 27, read-ahead reads 12555, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Detail'. Scan count 1, logical reads 21372, physical reads 24, read-ahead reads 21437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Trans'. Scan count 1, logical reads 53846, physical reads 76, read-ahead reads 53872, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Ind'. Scan count 2, logical reads 282086, physical reads 205, read-ahead reads 141176, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 34203 ms,  elapsed time = 43402 ms

A better view, from Statistics Parser by Richie Rump.

Here are the IO and time statistics from a warm cache (data already in memory).


With the data read from memory, the duration is reduced by around 10 seconds.

The elapsed time with a warm cache is likely closer to what end users are experiencing. However, I will use the cold cache data for a better comparison across improvement options.

Below are a few methods I tried, to identify the level of improvement each offers.

Index implementation
The IO statistics as well as the execution plan show that the dbo.Ind table returns and processes a large number of pages, mainly because this table is wide and there is no useful narrower index to satisfy this query. The other index scan operators were likely chosen due to the number of rows required for processing. Implementing some indexes could help reduce the number of pages required and returned.

I examined the tables for redundant as well as reusable indexes. I found that sp_BlitzIndex from Brent Ozar Unlimited makes index examination easier and faster. I cleaned up some redundant indexes and decided to create a few new ones, e.g. on the dbo.Ind table:

CREATE INDEX [IX_Ind_IndID_Includes]
ON [dbo].[Ind] ([IndID]) INCLUDE ([FirstName], [MiddleName], [LastName], [CompanyName]);
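Before creating new indexes, it is worth checking how the existing ones are actually used. As a rough illustration of the kind of check sp_BlitzIndex automates, a DMV query like the sketch below surfaces indexes that are written to frequently but rarely read. (The usage counters reset on instance restart, so interpret them with care.)

```sql
-- Sketch: indexes in the current database with high write cost but little read use
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups,  -- read activity
       us.user_updates                                 -- maintenance cost
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
  ON us.object_id = i.object_id
 AND us.index_id = i.index_id
 AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY us.user_updates DESC;
```

Indexes near the top with NULL or near-zero seek/scan counts are candidates for cleanup; those with heavy lookups hint at missing included columns.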





With the index implemented, the duration from a cold cache is reduced by around 10 seconds, and total pages read drop from ~392K to ~47K.


Scalar User Defined Function and Inline Table Valued Function

The SQL statement uses a scalar user-defined function (UDF), dbo.NameStr. The function is as innocent as concatenating first, middle, and last name, and sometimes company name, into one string.

This function was implemented to make code maintenance simpler and to avoid rewriting the same script every time there is a need to concatenate the name into one string/column.

CREATE FUNCTION [dbo].[NameStr]
(  
  @FirstName varchar(50), @MiddleName varchar(50), @LastName varchar(50), @CompanyName varchar(100) = ''
)
RETURNS varchar(150)
AS
BEGIN

DECLARE @ret varchar(150)

IF IsNull(@CompanyName,'') = ''
  SET @ret = @LastName + CASE WHEN IsNull(@FirstName, '') <> '' THEN ', ' ELSE '' END + @FirstName + CASE WHEN @MiddleName is NULL THEN '' ELSE ' ' + @MiddleName END
ELSE
  SET @ret = @CompanyName

RETURN @ret 

END

However, a scalar function has some disadvantages:

1) When a scalar UDF is used in a query, SQL Server invokes the function as a separate module for each row in the iterator, which introduces per-row invocation overhead.
2) A scalar UDF's estimated cost is 0, which is not true, and this affects how the optimizer chooses its plan.
3) A scalar UDF prevents parallelism in the plan.
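A quick way to feel the per-row overhead is to run the same concatenation both through the UDF and inlined, and compare the CPU times. This is only a sketch, assuming the dbo.Ind table and dbo.NameStr function from this article:

```sql
SET STATISTICS TIME ON;

-- Scalar UDF: dbo.NameStr is invoked separately for every row returned
SELECT dbo.NameStr(FirstName, MiddleName, LastName, CompanyName)
FROM dbo.Ind;

-- Same logic written inline: evaluated within the plan, no per-row module call
SELECT CASE WHEN ISNULL(CompanyName, '') = ''
            THEN LastName
                 + CASE WHEN ISNULL(FirstName, '') <> '' THEN ', ' ELSE '' END + FirstName
                 + CASE WHEN MiddleName IS NULL THEN '' ELSE ' ' + MiddleName END
            ELSE CompanyName
       END
FROM dbo.Ind;

SET STATISTICS TIME OFF;
```

On a table of any size, the CPU time gap between the two statements makes the invocation overhead visible without needing a Profiler trace.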

A SQL Profiler trace shows that the scalar function is invoked and executed once for each row.


The plan shows that the cost of the function is 0, which is not true.


All the execution plans so far have had no parallelism involved.



There are a few ways to remedy this particular case:

1) Rewrite the query to include the logic in the query itself, or
2) Replace the scalar UDF with an inline table-valued function

If the decision is to use an inline table-valued function (TVF), it can be created with a syntax very similar to the existing scalar UDF, but instead of returning a scalar value, it returns a table with a single value (in this case). Please refer to the references for the exact syntax. Here is an example of an inline TVF for this purpose:

CREATE FUNCTION [dbo].[NameStrTable]
( 
  @FirstName varchar(50), @MiddleName varchar(50), @LastName varchar(50), @CompanyName varchar(100) = ''
)
RETURNS TABLE
AS
RETURN 
(
  -- Mirrors the logic of the scalar dbo.NameStr, returned as a one-row table
  SELECT CASE WHEN ISNULL(@CompanyName, '') = ''
              THEN @LastName
                   + CASE WHEN ISNULL(@FirstName, '') <> '' THEN ', ' ELSE '' END + @FirstName
                   + CASE WHEN @MiddleName IS NULL THEN '' ELSE ' ' + @MiddleName END
              ELSE @CompanyName
         END AS Name
);

Then replace each dbo.NameStr(FirstName, MiddleName, LastName, CompanyName) call with (SELECT Name FROM dbo.NameStrTable(FirstName, MiddleName, LastName, CompanyName)):

SELECT 
 t.LocID, 
 100, 
 t.TransID, 
 (SELECT Name FROM dbo.NameStrTable(b.FirstName, b.MiddleName, b.LastName, b.CompanyName)),
 CONVERT(varchar(50),d.DetailNumber) + ', ' + 
 ISNULL((SELECT Name FROM dbo.NameStrTable(i.FirstName, i.MiddleName, i.LastName, NULL)), 'No Data'), 
 ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, '1/1/1900'))
FROM Trans t 
INNER JOIN Detail d ON t.TransID=d.TransID
INNER JOIN Ind i ON d.DetailInd=i.IndID
INNER JOIN Acct a ON t.AcctID=a.AcctID
INNER JOIN Ind b ON b.IndID=a.CustID
INNER JOIN [Ops] o ON d.DetailID=o.DetailID
WHERE 
 d.DetailTypeID = 100  
 AND (SELECT Name FROM dbo.NameStrTable(b.FirstName, b.MiddleName, b.LastName, b.CompanyName)) IS NOT NULL
 AND t.TransID = CASE WHEN @ObjectID = 0 THEN t.TransID ELSE @ObjectID END;


The execution plan below shows that the optimizer chose a parallel plan for this execution; the degree of parallelism is 4.





The duration is reduced by about 30 seconds from a cold cache. This execution was performed without the new index implemented previously.

From SQL Profiler, it appears that SQL Server does not invoke the inline TVF once per row the way it does a scalar UDF, and hence avoids the overhead the scalar UDF was incurring.




The combination of having the function inlined into the query for better cost estimation, a parallel plan, and the elimination of the per-row invocation overhead allows the query to perform much faster.


Parameter Sniffing CASE Condition and Parameter Embedding Optimization
I will get to the reason why I crossed out 'parameter sniffing' in a minute.

If you remember, at the beginning of the review we discussed the purpose of this stored procedure. To build the fast index that speeds up search lookups, every new entry triggers this stored procedure to insert a record into the search index; in addition, the stored procedure can also be used to rebuild the index from scratch.

The key point is that this stored procedure serves two purposes. To do that, there is a CASE expression in the search condition (WHERE clause):

t.TransID = CASE WHEN @ObjectID = 0 THEN t.TransID ELSE @ObjectID END;

When a zero is passed in as the @ObjectID parameter, the query returns all rows in the Trans table that satisfy the other conditions (joins and filters); if a specific @ObjectID is passed in, it returns only that one transaction, again subject to the other conditions.
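Concretely, the single CASE predicate collapses into two very different predicates depending on the parameter (the non-zero value here is illustrative):

```sql
-- @ObjectID = 0       =>  t.TransID = t.TransID   -- always true: every row qualifies
-- @ObjectID = 100001  =>  t.TransID = 100001      -- single row: ideal for a clustered index seek
```

One shape wants a scan-friendly plan; the other wants a seek. A single cached plan cannot be ideal for both.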

There are about 1.5 million rows in the Trans table. When the @ObjectID parameter is zero, SQL Server needs to read all 1.5 million rows (if this were the only search condition) along with the relevant data from the other tables; it is likely to choose scan operators with merge or hash joins, which are more efficient for large row counts. On the other hand, if a specific (non-zero) @ObjectID is supplied, SQL Server only needs to read that one row plus the related data. Since TransID is the clustered key, SQL Server is likely to choose a seek operator with nested loops joins, which is more efficient for a small row set.

Since the two search conditions are combined in one CASE expression, SQL Server has to account for both and choose a plan that is merely good enough for either. In this case, the plan shown previously was chosen: mostly scan operators with merge and hash joins.



So what is the performance difference between these two plans? Without any index implementation or function replacement, here are the statistics. (The current plan's statistics are the same as shown previously.)

The plan SQL Server chose:



With the optimal plan:




Wow, what a huge performance difference between these two execution plans. The optimal plan takes less than a second. Even without any new indexes, its CPU and IO demands are very low compared to the current plan, because it can seek directly to the particular record through an index seek operation.

Now that we know the current plan is not optimal for a single transaction due to the CASE expression, there are a few ways to remedy it. Before that, I want to address parameter sniffing. When a parameter is passed in for the first time, SQL Server evaluates it to choose and compile a plan, executes with that plan, and caches it for reuse. Parameter sniffing problems occur when the plan that was chosen and cached is not optimal for parameter values passed in subsequently; this usually happens when the data is unevenly distributed in the table. It may seem as if a zero @ObjectID was passed in first and caused SQL Server to generate a suboptimal plan for subsequent non-zero executions. However, that is not the case here: SQL Server must account for both branches of the CASE expression when generating the plan, regardless of whether a zero or a specific (non-zero) value is passed in first. In addition, since TransID is a unique column, cardinality estimation is accurate, and the same compiled plan should be optimal for any non-zero @ObjectID value.
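One way to confirm that classic sniffing is not the issue is to look at the cached plan itself: the showplan XML records the parameter value the plan was compiled with (the ParameterCompiledValue attribute in the ParameterList element). A sketch, filtered on this article's procedure name:

```sql
-- Sketch: pull the cached plan XML to inspect ParameterCompiledValue
SELECT st.text,       -- batch text, to identify the statement
       qp.query_plan  -- showplan XML; look for @ObjectID's compiled value
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%SPBuildFastIndex%';
```

If the plan is equally poor for zero and non-zero compiled values, as it was here, the problem is the predicate's shape rather than which value was sniffed first.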

Now, let’s explore different ways to remedy this.

1) OPTION (RECOMPILE)
2) Rewrite separate static query with IF statement
3) Rewrite as dynamic query with IF statement

OPTION (RECOMPILE)

SELECT 
 t.LocID, 
 100, 
 t.TransID, 
 dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName),
 CONVERT(varchar(50),d.DetailNumber) + ', ' + 
 ISNULL(dbo.NameStr(i.FirstName, i.MiddleName, i.LastName, NULL), 'No Data'), 
 ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, '1/1/1900'))
FROM Trans t 
INNER JOIN Detail d ON t.TransID=d.TransID
INNER JOIN Ind i ON d.DetailInd=i.IndID
INNER JOIN Acct a ON t.AcctID=a.AcctID
INNER JOIN Ind b ON b.IndID=a.CustID
INNER JOIN [Ops] o ON d.DetailID=o.DetailID
WHERE 
 d.DetailTypeID = 100  
 AND dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName) IS NOT NULL
 AND t.TransID = CASE WHEN @ObjectID = 0 THEN t.TransID ELSE @ObjectID END
OPTION (RECOMPILE);


Adding OPTION (RECOMPILE) at the end of the statement forces SQL Server to generate a new plan based on the parameter value for every execution. In addition, it enables the parameter embedding optimization, which allows the parser to fully evaluate the CASE expression, hence generating an optimal plan for both a zero parameter and a specific value. However, recompiling a plan incurs CPU overhead on every execution, which may not be desirable for a query that runs this frequently.



Static query with IF statement,

IF @ObjectID <> 0
BEGIN
 SELECT 
  t.LocID, 
  100, 
  t.TransID, 
  dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName),
  CONVERT(varchar(50),d.DetailNumber) + ', ' + 
  ISNULL(dbo.NameStr(i.FirstName, i.MiddleName, i.LastName, NULL), 'No Data'), 
  ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, '1/1/1900'))
 FROM Trans t 
 INNER JOIN Detail d ON t.TransID=d.TransID
 INNER JOIN Ind i ON d.DetailInd=i.IndID
 INNER JOIN Acct a ON t.AcctID=a.AcctID
 INNER JOIN Ind b ON b.IndID=a.CustID
 INNER JOIN [Ops] o ON d.DetailID=o.DetailID
 WHERE 
  d.DetailTypeID = 100  
  AND dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName) IS NOT NULL
  AND t.TransID = @ObjectID;
END
ELSE
BEGIN
  SELECT 
  t.LocID, 
  100, 
  t.TransID, 
  dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName),
  CONVERT(varchar(50),d.DetailNumber) + ', ' + 
  ISNULL(dbo.NameStr(i.FirstName, i.MiddleName, i.LastName, NULL), 'No Data'), 
  ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, '1/1/1900'))
 FROM Trans t 
 INNER JOIN Detail d ON t.TransID=d.TransID
 INNER JOIN Ind i ON d.DetailInd=i.IndID
 INNER JOIN Acct a ON t.AcctID=a.AcctID
 INNER JOIN Ind b ON b.IndID=a.CustID
 INNER JOIN [Ops] o ON d.DetailID=o.DetailID
 WHERE 
  d.DetailTypeID = 100  
  AND dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName) IS NOT NULL
END

This method separates the two pieces of logic and allows SQL Server to generate, use, and cache a different plan for each search condition. Once a compiled plan is cached, it can be reused on the next execution. The drawback is that the same SELECT and JOIN statements have to be written twice.


Dynamic query with IF statement


DECLARE @sql nvarchar(max),
  @parameter nvarchar(4000);

SELECT @parameter = N'@transid int';

SELECT @sql = 
 N'SELECT 
  t.LocID, 
  100, 
  t.TransID, 
  dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName),
  CONVERT(varchar(50),d.DetailNumber) + '', '' + 
  ISNULL(dbo.NameStr(i.FirstName, i.MiddleName, i.LastName, NULL), ''No Data''), 
  ISNULL(CAST(o.OpsDate AS datetime), ISNULL(t.TransDate, ''1/1/1900''))
 FROM Trans t 
 INNER JOIN Detail d ON t.TransID=d.TransID
 INNER JOIN Ind i ON d.DetailInd=i.IndID
 INNER JOIN Acct a ON t.AcctID=a.AcctID
 INNER JOIN Ind b ON b.IndID=a.CustID
 INNER JOIN [Ops] o ON d.DetailID=o.DetailID
 WHERE 
  d.DetailTypeID = 100  
  AND dbo.NameStr(b.FirstName, b.MiddleName, b.LastName, b.CompanyName) IS NOT NULL';

IF @ObjectID <> 0 SELECT @sql = @sql + N' AND t.TransID = @transid;';
ELSE SELECT @sql = @sql + N';';

EXEC sp_executesql @sql, @parameter, @transid = @ObjectID;


This method also separates the two search-condition logics and allows SQL Server to generate, use, and cache a different plan (via sp_executesql) for each. Once a compiled plan is cached, the statement does not need to be recompiled on the next execution. The dynamic query allows the same SELECT and JOIN statement to be reused during coding. However, dynamic SQL adds another layer of complexity to coding and troubleshooting.

Observation
Index Implementation
As we have seen, some indexes do improve the performance of the current plan (the one compromised by the CASE condition) from a cold cache. They reduce the number of pages required, which translates to lower IO and, with that, indirectly reduces memory demand.

For this particular stored procedure, however, the new indexes do not appear to significantly improve the performance users experience in production, especially once the data is cached in memory. With an optimal query plan, though, some covering indexes could be helpful.

Replacing Scalar Function
The lack of a parallel plan and the significant per-row overhead of a scalar UDF can severely affect performance when a large number of rows are involved, as in this case. Although a parallel plan is not always better than a serial one, with a large row count it can be beneficial to split the work across multiple threads. With a small row set, the scalar UDF may not have a significant impact, as we saw with the optimal plan.

From these observations, replacing the scalar UDF would improve production performance, especially when a large number of rows are involved. The only catch is that the function is used in many different stored procedures. Since not all of them will be changed, replacing it in this stored procedure means there will be two separate functions to maintain. That may not be a big deal, but all developers will need to take note of it.

Replacing CASE statement
The CASE expression in the search condition appears to be the main culprit behind this query's poor performance. The optimizer chose and generated a plan that is suboptimal for a specific (non-zero) parameter value, which is the majority of executions here.

Since the situation is not caused by inaccurate cardinality estimation from uneven data distribution, recompiling on every execution may not be desirable. Only two search conditions are needed: one for a specific value and one for returning all rows. Rewriting the query with two separate branches allows an optimal cached plan for each type of execution and avoids the overhead of recompiling the plan every time.

Lesson Learned
When performing query tuning, the query itself should be examined closely before making any index changes. Indexes do help in many scenarios, but they should not be the first resort. Often, as in this scenario, a query's performance can be significantly improved just by changing the way it is written. A better understanding of the various limitations and constraints (e.g. scalar functions) and of how the SQL Server optimizer works definitely helps.

