Performance Analysis? SQL Server slow or Query slow? PSSDIAG is there for the rescue!
Posted by database-wiki on July 3, 2012
Things you need to have:
1. PSSDIAG configuration tool:
Download link:
http://support.microsoft.com/kb/830232
2. Some points to consider before configuring and running PSSDIAG on a production system.
=> Make sure that you choose the correct version of SQL Server and platform to be captured (i.e. SQL 2000/2005/2008, x86/x64/ia64). If you make modifications to events, etc. and then you change the version, platform, then you will lose any modifications you have made to that point.
=> PSSDIAG should be captured on a local disk from the server that has the SQL instance running on it to make sure that all the data is captured.
=> PSSDIAG should be collected on a fast and dedicated disk volume to help ensure minimum performance impact on the SQL Server and also to ensure that the disk statistics are not skewed because of PSSDIAG data collection.
=> Make sure you puts PSSDiag on a disk where no database files reside.
=> Always provide the server-name and instance name while configuring the PSSDIAG to prevent unnecessary data from being collected.
=> Always trace to a locally attached (or SAN) disk, never to a network share or UNC path.
=> Don’t capture extremely high frequency events. (Refer below)
=> The Show Plan Statistics event can be extremely useful, but be aware that capturing this event has a much larger CPU cost than most other events.
=> Trace filtering can dramatically reduce .trc file size and the I/O cost of tracing, but you should be aware that it can actually increase the CPU burden of tracing.
For Long Term PSSDIAG collection:
- Enable NTFS compression in PSSDIAG
- Eliminate Profiler tracing if a trace is not required
- If a Profiler trace is required, minimize the size of the trace
- Enable the “Delete Old Trace Files” custom task group in PSSDIAG
- Consider using PSSDIAG’s /E and /L command line parameter to restart collection each night
- Consider using PSSDIAG’s built in NT service support.
- Collect only Completed events for Batch and SPs if you know that the you are not experiencing timeouts
- Statement level events can be avoided to reduce the size of data collection but that would prevent us from drilling down to the actual statement
Warning: Do not do any of the following in an attempt to reduce the disk space used by a PSSDIAG capture:
- Do not try to run PSSDIAG on a remote machine in an attempt to redirect the profiler .TRC files to a remote machine with more disk space
- Do not use a UNC path or mapped network drive for PSSDIAG’s output path if you are capturing profiler
Possible customization:
- Cluster (similar to MPS Reports) – For corruption RCA cases
- Database Mail/Service Broker/Database Mirroring/Full Text Search
- Delete Old Trace Files – For long running PSSDIAG
- Merge Repl (Dist)/Merge Repl (Pub-Sub)/Tran Repl (Dist)/Tran Repl (Pub)/Repl (General)
- Missing Perfmon Counters
- OSDrivers – For corruption RCA cases
- PSSLooksAlive – For cluster failover RCA
- SQL 2000 Blocking Script/SQL 2005 Perf Stats/SQL 2008 Perf Stats –Select relevant one when configuring for long term collection to reduce size.
- SQL 2000 Perf Stats
- SQL 2005 tempdb Space and Latching
- SQL 2008 Backup MDW
- SQL Backup Restore
- SQL Blocking – For deadlocks
- SQL BPA
- SQL Dumps – To be enabled when you need all SQL Server dumps
- SQL Memory Error
- SQL Setup – Not to be collect unless specifically required
- SQLAgent/SQLMail
High frequency event categories
Broker
Query Notifications
Performance
Security Audit
Scans
Objects
Locks
Transactions (Except SQLTransaction)
OLEDB
Cursors
High frequency events
Stored Procedures : SP:CacheMiss
Stored Procedures : SP:CacheInsert
Stored Procedures : SP:CacheRemove
Stored Procedures : SP:CacheHit
TSQL : Prepare SQL
TSQL : Exec Prepared SQL
TSQL : Unprepare SQL
3. what counters to capture to troubleshoot a specific issue?
For High CPU:
For Timeouts and Deadlocks:
For Blocking and Disk I/O:
For Memory Issues:
Here I have configured the PSSDIAG for the below problem description:
Problem Description:
We have observed that queries take a very variable amount of time to execute. This is a very important issue for us as many of our applications are of a real-time nature and depend upon getting data in a short time.
One particular query which we have focussed on, selects about 100000 rows and takes 500ms usually, but can sometimes take over 10 seconds. The profiler does not show any activity in this period which explains the delay (no other queries which would lock the data).
We need to try to understand why the performance is so variable, what SQL Server is doing during this period which is holding it up from running the query. There are no obvious bottlenecks at the server level such as CPU.
Here is a description of our DB architecture, which is rather complex with respect to replication. (I suggest you draw a diagram as you read through the below, I’m sorry I don’t have a diagram I can send you.)
To summarise the replication architecture, we have DBs which are kept centrally on a pair of servers which provide us with some fall-back/redundancy. Some of the data is needed (for read only) on over 50 other systems, and so we replicate it down to an MSDE/SQLExpress on these systems. Other data only resides at the central servers and our applications query it directly.
There are two main servers, I’ll call them A and B. A is the server where we have been having the long execution times. (We do not run queries on B except when A is out of service.)
A and B have about 5 databases where A is the publisher to B using transactional replication with updatable subscriptions. One of the these databases, which I will call X, is replicated down to many other places as follows.
B has 5 publications set up for X, each to about 10 MSDE (SQL 2000 generation) instances on other servers using transactional replication (simple one-way).
A also replicates to a similar set of MSDE instances as B, but the setup is a little more complex due to limitations of what setup was possible. Because DB X on A is a publisher using updatable subscriptions, we could not set up publishers with simple transactional replication to these MSDEs. So, they are set up as updatable subscriptions down to the MSDE instances. To further complicate matters, since the installation of SQL 2005 SP3, this configuration could not be done (an issue I believe with it being down to MSDE, a more up to date SQL Express would have been ok). Although existing publications continued to operate as before it was not possible to add new ones. Therefore, a new DB has set up on A to act as a copy of X with transactional replication with updatable subscriptions (let’s call it X2), and any new replications to MSDE instances that we need to add get done as one way replication down from X2.
In addition to A and B, there is also another server C. This is set up in the same way as B but to less MSDE instances.
I hope that this description is clear.
I am also sending, in another email, two profiler traces. They were taken on server A and show activity on all databases. At line 246178 in trace 1 you will see a test query taking about half a second. Then at line 247565 you will see exactly the same query taking over 3 seconds. Although there are blocks of activity relating to replication during this 3 seconds, they do not span the whole period and so do not seem to directly explain the delay. There are no other queries running on the same data during this 3 seconds. In trace 3, the query takes over 13 seconds. Again, the trace does not show a solid block of activity in the previous 13 seconds which might explain the duration.
Then I have save the PSSDIAG configuration to pssd.exe.
To run PSSDIAG I need to first send the pssd.exe to cx and he needs to follow the below steps to collect the same:
Note:
If you have configure PSSDIAG for SQL Server 2000 or SQL Server 2005 you need to run pssdiag.exe from the command prompt after extracting pssd.exe. But for SQL Server 2008 onwards you need to run pssdiag.cmd after extracting pssd.exe.
- On the server PSSQL2012 that contains the SQL Server 2005 instance (SQL2012), create a folder PSSDIAG in a drive that has at least 20 gigs of free space ( Make sure you do not put this in the already stressed disk or the SQL Server data file are present) . Copy the file pssd.exe in to the PSSDIAG folder then run it.
- Navigate to the folder using command prompt.
- When users complain that the issue has occurred, then execute pssdiag.exe
4. Once PSSDIAG has started (you will see a message saying that PSSDIAG has started) in the cmd prompt.
5. After seeing “PSSDIAG Collection started. Press Ctrl+C to stop” message – wait for 10 minutes to capture issue occurs and then press ctrl+c in the above screen only once. Also make sure you monitor the size of this PSSDIAG folder for size growth.
6. Inside the PSSDIAG folder you will find an output folder. That the data you will need to start you analysis using SQL NEXUS.
3. SQL Nexus to analyze the PSSDIAG.
Download link:
Analysis using nexus:
For analysis using nexus you need a test SQL Server instance. You can also use SQL Server express for the same.
Install SQL Server express edition on your test or research server. You can download the same from
http://www.microsoft.com/en-us/download/details.aspx?id=23650
Once you have installed nexus and started it for the first time it will problem you for an instance to connect.
You can create a new database from Current DB drop down for each analysis.
You need to using import to import the data into database inside SQL Server 2008 R2 express edition. For me that edition is sqlnexus.
Once the all the data is loaded with out any error particularly profiler trace, you can use SQL Perf Main to view the reports.
Tips and Tricks:
A. Once you have executed the file pssd.exe your folder will look like,
B. Navigate to the folder using command prompt. If it’s a windows 2008 server you need to open cmd prompt as “run as administrator”
C. Run the command as below,
Here I’m starting the PSSDIAG at 19:05 PM on 30th that’s today and stopping at 19:10 PM.
/B means begin time and /E means end time.
For you the command should look like, ( you can also change the date but we want to collect the data at 2:50 to 3:20 that’s when the process should be complete but if it’s going to extend beyond we should be capturing the bad run. )
Pssdiag.exe /R /B 20111201_02:50:00 /E 20111201_03:20:00
We are going to collect the data on 1st December 2011 between 2:50 to 3:20 AM. Once you run the above command the output will be:
D . After which you need to go to the services console and start PSSDIAG service.
I have tested this procedure and it works. All you need to do is use the output folder located in the PSSDIAG folder and import it in SQL Nexus.
For me it’s,
D:\logshipping\tester\output
Delete the PSSDIAG service,
NOTE: The account under which the PSSDIAG service is created/running should have access to connect SQL Server instance and capture profiler trace.
B. For running PSSDIAG on a different server, please edit the PSSDIAG.XML after extracting pssd.exe as
<Machine name=”SERVER_NAME”>
And
Default instance,
<Instance name=”MSSQLSERVER” windowsauth=”true” ssver=”10″ user=”">
Or
Named instance,
<Instance name=”INSTANCE_NAME” windowsauth=”true” ssver=”10″ user=”">
Detailed Analysis of collected Data:
|
Server Configuraton Values |
| AWE – False |
| Cluster – No |
| Database drives – C:, E:, D: |
| Machine Name – RSxxxx12 |
| Number of Databases – 17 |
| Number of Processors – 8 |
| OS Name – Microsoft(R) Windows(R) Server 2003,EnterpriseEdition |
| SQL Bit Version – Intel X86 |
| SQL Edition -Enterprise Edition |
| SQL Server Name – RSxxxx12 |
| SQL Version – 9.00.4035.00 |
| Total RAM – 4,089.73 MB |
| VM – False |
| Win Bit Version – X86-based PC |
| Win Version – Windows NT 5.2 (Build 3790: Service Pack 2) |
SP_CONFIGURE VALUES:
| name | minimum | maximum | config_value | run_value |
| Ad Hoc Distributed Queries |
0 |
1 |
0 |
0 |
| affinity I/O mask |
-2147483648 |
2147483647 |
0 |
0 |
| affinity mask |
-2147483648 |
2147483647 |
0 |
0 |
| Agent XPs |
0 |
1 |
1 |
1 |
| allow updates |
0 |
1 |
0 |
0 |
| awe enabled |
0 |
1 |
0 |
0 |
| blocked process threshold |
0 |
86400 |
5 |
5 |
| c2 audit mode |
0 |
1 |
0 |
0 |
| clr enabled |
0 |
1 |
0 |
0 |
| common criteria compliance enabled |
0 |
1 |
0 |
0 |
| cost threshold for parallelism |
0 |
32767 |
5 |
5 |
| cross db ownership chaining |
0 |
1 |
0 |
0 |
| cursor threshold |
-1 |
2147483647 |
-1 |
-1 |
| Database Mail XPs |
0 |
1 |
0 |
0 |
| default full-text language |
0 |
2147483647 |
1033 |
1033 |
| default language |
0 |
9999 |
0 |
0 |
| default trace enabled |
0 |
1 |
1 |
1 |
| disallow results from triggers |
0 |
1 |
0 |
0 |
| fill factor (%) |
0 |
100 |
0 |
0 |
| ft crawl bandwidth (max) |
0 |
32767 |
100 |
100 |
| ft crawl bandwidth (min) |
0 |
32767 |
0 |
0 |
| ft notify bandwidth (max) |
0 |
32767 |
100 |
100 |
| ft notify bandwidth (min) |
0 |
32767 |
0 |
0 |
| index create memory (KB) |
704 |
2147483647 |
0 |
0 |
| in-doubt xact resolution |
0 |
2 |
0 |
0 |
| lightweight pooling |
0 |
1 |
0 |
0 |
| locks |
5000 |
2147483647 |
0 |
0 |
| max degree of parallelism |
0 |
64 |
0 |
0 |
| max full-text crawl range |
0 |
256 |
4 |
4 |
| max server memory (MB) |
16 |
2147483647 |
2147483647 |
2147483647 |
| max text repl size (B) |
0 |
2147483647 |
2147483647 |
2147483647 |
| max worker threads |
128 |
32767 |
0 |
0 |
| media retention |
0 |
365 |
0 |
0 |
| min memory per query (KB) |
512 |
2147483647 |
1024 |
1024 |
| min server memory (MB) |
0 |
2147483647 |
0 |
0 |
| nested triggers |
0 |
1 |
1 |
1 |
| network packet size (B) |
512 |
32767 |
4096 |
4096 |
| Ole Automation Procedures |
0 |
1 |
0 |
0 |
| open objects |
0 |
2147483647 |
0 |
0 |
| PH timeout (s) |
1 |
3600 |
60 |
60 |
| precompute rank |
0 |
1 |
0 |
0 |
| priority boost |
0 |
1 |
0 |
0 |
| query governor cost limit |
0 |
2147483647 |
0 |
0 |
| query wait (s) |
-1 |
2147483647 |
-1 |
-1 |
| recovery interval (min) |
0 |
32767 |
0 |
0 |
| remote access |
0 |
1 |
1 |
1 |
| remote admin connections |
0 |
1 |
0 |
0 |
| remote login timeout (s) |
0 |
2147483647 |
20 |
20 |
| remote proc trans |
0 |
1 |
0 |
0 |
| remote query timeout (s) |
0 |
2147483647 |
600 |
600 |
| Replication XPs |
0 |
1 |
0 |
0 |
| scan for startup procs |
0 |
1 |
1 |
1 |
| server trigger recursion |
0 |
1 |
1 |
1 |
| set working set size |
0 |
1 |
0 |
0 |
| show advanced options |
0 |
1 |
1 |
1 |
| SMO and DMO XPs |
0 |
1 |
1 |
1 |
| SQL Mail XPs |
0 |
1 |
0 |
0 |
| transform noise words |
0 |
1 |
0 |
0 |
| two digit year cutoff |
1753 |
9999 |
2049 |
2049 |
| user connections |
0 |
32767 |
0 |
0 |
| user options |
0 |
32767 |
0 |
0 |
| Web Assistant Procedures |
0 |
1 |
0 |
0 |
| xp_cmdshell |
0 |
1 |
1 |
1 |
We definitely see blocking, even though this SPID 1004 which was executing the query which is of your concern never got blocked , but still these kind of blocking can cause performance delays, Queries can suffer delays in execution.
The blocking is caused by LOCK BLOCKING.
| first_rownum | last_rownum | num_snapshots | blocking_start | blocking_end | head_blocker_session_id | blocking_wait_type | max_blocked_task_count | max_total_wait_duration_ms | avg_wait_duration_ms | max_wait_duration_ms | max_blocking_chain_depth | head_blocker_session_id_orig |
|
0 |
0 |
1 |
19:25.1 |
19:37.3 |
1509 |
LOCK BLOCKING |
1332 |
843984 |
633 |
828 |
2 |
1509 |
|
1 |
1 |
1 |
19:37.3 |
19:47.6 |
2064 |
LOCK BLOCKING |
30 |
31090 |
1036 |
1047 |
2 |
2064 |
|
2 |
3 |
2 |
19:47.6 |
20:08.1 |
1509 |
LOCK BLOCKING |
1122 |
224103 |
630 |
1062 |
2 |
1509 |
|
4 |
4 |
1 |
20:28.8 |
20:39.3 |
1509 |
LOCK BLOCKING |
6 |
1842 |
307 |
312 |
2 |
1509 |
|
5 |
6 |
2 |
20:39.3 |
20:59.7 |
1232 |
LOCK BLOCKING |
342 |
89550 |
263 |
1609 |
2 |
1232 |
|
7 |
8 |
2 |
20:49.5 |
21:10.0 |
2064 |
LOCK BLOCKING |
30 |
4200 |
124 |
140 |
2 |
2064 |
|
10 |
10 |
1 |
21:30.3 |
21:41.1 |
499 |
LOCK BLOCKING |
1 |
421 |
421 |
421 |
1 |
499 |
|
9 |
12 |
2 |
21:30.3 |
21:51.3 |
1509 |
LOCK BLOCKING |
6 |
2314 |
352 |
406 |
2 |
1509 |
|
11 |
11 |
1 |
21:41.1 |
21:51.3 |
2064 |
LOCK BLOCKING |
30 |
17820 |
594 |
594 |
2 |
2064 |
|
13 |
13 |
1 |
22:01.5 |
22:11.8 |
1509 |
LOCK BLOCKING |
1260 |
873705 |
693 |
1141 |
2 |
1509 |
|
14 |
14 |
1 |
22:11.8 |
22:22.0 |
1546 |
LOCK BLOCKING |
1 |
1032 |
1032 |
1032 |
1 |
1546 |
|
15 |
15 |
1 |
23:03.0 |
23:13.2 |
499 |
LOCK BLOCKING |
1 |
1063 |
1063 |
1063 |
1 |
499 |
|
16 |
16 |
1 |
23:13.2 |
23:23.4 |
2064 |
LOCK BLOCKING |
30 |
10800 |
360 |
360 |
2 |
2064 |
|
17 |
17 |
1 |
23:54.6 |
24:04.9 |
1509 |
LOCK BLOCKING |
42 |
42282 |
1006 |
1156 |
2 |
1509 |
|
19 |
19 |
1 |
24:15.1 |
24:25.3 |
1509 |
NULL |
6 |
0 |
0 |
0 |
2 |
1509 |
|
18 |
18 |
1 |
24:15.1 |
24:25.3 |
499 |
WRITELOG |
1 |
63 |
63 |
63 |
1 |
499 |
|
20 |
20 |
1 |
24:35.5 |
24:46.0 |
1509 |
ASYNC_NETWORK_IO |
1 |
16 |
16 |
16 |
2 |
1509 |
|
21 |
21 |
1 |
24:35.5 |
24:46.0 |
1509 |
NULL |
33 |
0 |
0 |
0 |
2 |
1509 |
As we see above the Head Blocker is Waiting on a WRITELOG which means that there is a potential delay in Write Operation, this indicates Disk Bottleneck.
WaitType WaitTime WaitTimePerSec(ms)
——————————————— ——————– ——————–
LCK_M_S 1322250 3516
WRITELOG 1223125 3252
ASYNC_NETWORK_IO 589906 1568
PAGEIOLATCH_SH 418985 1114
SOS_SCHEDULER_YIELD 312297 830
MSQL_XP 186985 497
PAGEIOLATCH_EX 133156 354
IO_COMPLETION 41062 109
LCK_M_U 29766 79
SLEEP_BPOOL_FLUSH 28453 75
Please go through the description of these wait type from the below KB article.
http://support.microsoft.com/kb/822101
Lock wait type:
| LCK_M_S | 0×03 | Share |
| WRITELOG | 0×81 | This waittype indicates that the SPID is waiting for a transaction log I/O request to complete. This waittype may also indicate a possible disk bottleneck. |
ASYNC_NETWORK_IO
The waittype is Network I/O (client fetch) and below is the definition. ( ASYNC_NETWORK_IO )
- Network I/O (client fetch) – This wait type indicates that SQL Server is waiting for the client application to fetch rows.
PAGEIOLATCH_SH
Poor IO performance can easily become a bottleneck on a database system. The bottleneck is commonly visible through PAGEIOLATCH_xx or WRITELOG waits.
An IO bottleneck is commonly a result of the IO subsystem being undersized or incorrectly configured to handle the load. If the poor performance coincides with spikes in checkpoint IO you may also be able to throttle the number of checkpoint IOs (
http://support.microsoft.com/?kbid=929240
).
As for external causes for slow IO performance, here are some other common causes:
- Driver/Host bus adapter isn’t configured to allow sufficient async IOs (queue depth) to the underlying hardware. How & where you set this is dependent on your hardware/drivers and so you may need to contact their vendor for details. See
http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx
for general information on this topic, and
http://sqlblog.com/blogs/linchi_shea/archive/2007/09/18/sql-server-and-sans-the-queuedepth-setting-of-a-host-bus-adapter-hba.aspx
for some example charts showing how the IO performance can vary significantly as a result of this configuration.
From the above data, Any wait category that waits for more than 1000 ms per sec is a bottleneck which we need to consider.
We have a considerable amount of missing indexes when we need to create it. Also check with the application team before implementing the same.
To Co-relate our above theory we see the Perfmon counters [NOTE WE REMOVED THE C DRIVE COUNTER]
Counter Name Avg Max Min
—————————————— ——————– ——————– ——————–
LogicalDisk(D:)\% Idle Time 19.194 59.339 0.000
LogicalDisk(D:)\Avg. Disk Queue Length 2.348 4.308 0.687
LogicalDisk(D:)\Avg. Disk sec/Read 0.238 0.925 0.000
LogicalDisk(D:)\Avg. Disk sec/Transfer 0.278 0.900 0.078
LogicalDisk(D:)\Avg. Disk sec/Write 0.282 0.910 0.068
LogicalDisk(D:)\Disk Bytes/sec 88256.383 516716.327 10354.913
LogicalDisk(D:)\Split IO/Sec 0.143 6.574 0.000
LogicalDisk(E:)\% Idle Time 84.256 94.898 16.350
LogicalDisk(E:)\Avg. Disk Queue Length 0.723 11.347 0.111
LogicalDisk(E:)\Avg. Disk sec/Read 0.022 0.079 0.006
LogicalDisk(E:)\Avg. Disk sec/Transfer 0.018 0.077 0.006
LogicalDisk(E:)\Avg. Disk sec/Write 0.009 0.016 0.005
LogicalDisk(E:)\Disk Bytes/sec 1480158.906 7070527.456 570347.259
LogicalDisk(E:)\Split IO/Sec 0.769 5.184 0.000
Looking at the performance of the disk:
| Counter Name | Avg | Max | Min | Server |
| LogicalDisk(_Total)\Avg. Disk Queue Length |
129.256 |
398.375 |
60.16 |
\\RSxxxx12\ |
| LogicalDisk(_Total)\Avg. Disk sec/Transfer |
0.145 |
0.451 |
0.077 |
\\RSxxxx12\ |
| LogicalDisk(_Total)\Disk Bytes/sec |
22140967.98 |
29825877.25 |
16101317.61 |
\\RSxxxx12\ |
| LogicalDisk(_Total)\Disk Transfers/sec |
892.64 |
1058.612 |
699.41 |
\\RSxxxx12\ |
| LogicalDisk(_Total)\Split IO/sec |
15.894 |
28.822 |
5.58 |
\\RSxxxx12\ |
| LogicalDisk(C:)\Avg. Disk Queue Length |
126.185 |
393.851 |
58.876 |
\\RSxxxx12\ |
| LogicalDisk(C:)\Avg. Disk sec/Transfer |
0.148 |
0.467 |
0.08 |
\\RSxxxx12\ |
| LogicalDisk(C:)\Disk Bytes/sec |
20572552.69 |
28850140.58 |
15002983.59 |
\\RSxxxx12\ |
| LogicalDisk(C:)\Disk Transfers/sec |
851.455 |
943.152 |
670.623 |
\\RSxxxx12\ |
| LogicalDisk(C:)\Split IO/sec |
14.982 |
28.027 |
4.384 |
\\RSxxxx12\ |
| LogicalDisk(D:)\Avg. Disk Queue Length |
2.348 |
4.308 |
0.687 |
\\RSxxxx12\ |
| LogicalDisk(D:)\Avg. Disk sec/Transfer | 0.278 (278 milliseconds = 0.278 seconds) |
0.9 |
0.078 |
\\RSxxxx12\ |
| LogicalDisk(D:)\Disk Bytes/sec | 88256.383 (0.08 MB per second = 81.92 KB) |
516716.327 |
10354.913 |
\\RSxxxx12\ |
| LogicalDisk(D:)\Disk Transfers/sec |
9.177 |
21.314 |
4.775 |
\\RSxxxx12\ |
| LogicalDisk(D:)\Split IO/sec |
0.143 |
6.574 |
0 |
\\RSxxxx12\ |
| LogicalDisk(E:)\Avg. Disk Queue Length |
0.723 |
11.347 |
0.111 |
\\RSxxxx12\ |
| LogicalDisk(E:)\Avg. Disk sec/Transfer |
0.018 |
0.077 |
0.006 |
\\RSxxxx12\ |
| LogicalDisk(E:)\Disk Bytes/sec |
1480158.906 |
7070527.456 |
570347.259 |
\\RSxxxx12\ |
| LogicalDisk(E:)\Disk Transfers/sec |
32.007 |
149.33 |
9.369 |
\\RSxxxx12\ |
| LogicalDisk(E:)\Split IO/sec |
0.769 |
5.184 |
0 |
\\RSxxxx12\ |
For 0.08 MB (81.92 KB)of data on an average it take 278 milliseconds = 0.278 seconds which does not look good at all.
Considering For SAN’s – Avg Disk Bytes/sec can be 250 – 300 MB/Sec for single disk – Avg Disk Bytes/Sec > 50-70 MB/Sec
All our log files are located in the D drive that’s performing bad and we see an obvious wait type WRITELOG.
We started to dig into the specific query which was provided by you.
SELECT Service.ID, FileShop.szName, FileShop.szPath FROM FileShop, Service WHERE Service.lPartitionViewID = FileShop.lPartitionViewID AND FileShop.lPartitionViewID <> 0UNIONALL SELECT 0, FileShop.szName, FileShop.szPath FROM FileShop WHERE FileShop.lPartitionViewID = 0
On the RTSinfo DB.
From the profiler see only one instance of this query and it took close to 20 seconds to complete.
SQL:BatchStarting 7 RSxxxx79 4504 Microsoft Open Database Connectivity RTPerformer 1004 2011-12-02 20:19:29.850 RTSinfo 0XE1B7F4B2572D6940834D202EE9196261 0 0 -1823749497 RTformer SELECT Service.ID, FileShop.szName, FileShop.szPath FROM FileShop, Service WHERE Service.lPartitionViewID = FileShop.lPartitionViewID AND FileShop.lPartitionViewID <> 0 UNION ALL SELECT 0, FileShop.szName, FileShop.szPath FROM FileShop WHERE FileShop.lPartitionViewID = 0
From the PSSDIAG snapshot data, all the query was suspended due to ASYNC_NETWORK_IO before it completed.
./RSxxxx12__SQL_2005_Perf_Stats_Startup.OUT:2011-12-02T20:19:37.293 1004 0 0 0 0 SUSPENDED ASYNC_NETWORK_IO 15 ASYNC_NETWORK_IO 0 2-Read Committed 1 94 2375 76 1 2 6879 3627 11 606323 9285 76699 2011-12-02T20:19:29.850 2011-12-02T20:19:29.850 2 0 1 7471 .00 0 SELECT 2011-12-02T20:19:29.867 2-Read only 2-Active 2011-12-02T20:19:29.850 suspended SELECT 0 -1 7 6 0 76 2011-11-25T19:57:47.530 RSxxxx79 Microsoft Open Database Connectivity 4504 4 ODBC RTformer 4096 10.0.16.63 running 6
./RSxxxx12__SQL_2005_Perf_Stats_Startup.OUT:2011-12-02T20:19:37.293 1004 0 NULL NULL NULL NULL NULL NULL RTSinfo NULL NULL SELECT Service.ID, FileShop.szName, FileShop.szPath FROM FileShop, Service WHERE Service.lPartitionViewID = FileShop.lPartitionViewID AND FileShop.lPartitionViewID <> 0 UNION ALL SELECT 0, FileShop.szName, FileShop.szPath FROM FileShop WHERE FileShop.lPartitionViewID = 0 0x060007009F1FE836B8A16354000000000000000000000000
./RSxxxx12__SQL_2005_Perf_Stats_Startup.OUT:2011-12-02T20:19:47.630 1004 0 0 0 0 SUSPENDED ASYNC_NETWORK_IO 0 ASYNC_NETWORK_IO 0 2-Read Committed 1 140 3622 128 1 2 6879 3627 11 606323 9285 76699 2011-12-02T20:19:29.850 2011-12-02T20:19:29.850 2 0 1 17787 .00 0 SELECT 2011-12-02T20:19:29.867 2-Read only 2-Active 2011-12-02T20:19:29.850 suspended SELECT 0 -1 7 6 0 129 2011-11-25T19:57:47.530 RSxxxx79 Microsoft Open Database Connectivity 4504 4 ODBC RTformer 4096 10.0.16.63 running 6
./RSxxxx12__SQL_2005_Perf_Stats_Startup.OUT:2011-12-02T20:19:47.630 1004 0 NULL NULL NULL NULL NULL NULL RTSinfo NULL NULL SELECT Service.ID, FileShop.szName, FileShop.szPath FROM FileShop, Service WHERE Service.lPartitionViewID = FileShop.lPartitionViewID AND FileShop.lPartitionViewID <> 0 UNION ALL SELECT 0, FileShop.szName, FileShop.szPath FROM FileShop WHERE FileShop.lPartitionViewID = 0 0x060007009F1FE836B8A16354000000000000000000000000
Analysis:
The disk contention are usually due to these reasons:
1. There is an underlying H/W issue. This could mean that some driver or firmware needs an upgrade (essentially all your firmware/drivers should always be on the latest builds available) or it could be a problem with the SAN configuration or it could be that this particular SAN is not dedicated to SQL Server.
2. OR it could be that the disk contention is due to the fact that the queries executing on SQL Server are performing excessive amount of reads due to sub-optimal plans. Sub-optimal plans would be generated if your indexes are out of shape or if your statistics are skewed. This would require a statistics update with a full scan. (Please refer my statistics update post.) You should be able to find out which process is driving IO by looking at the Process performance object in Perfmon and the I/O Data Bytes counter. There is no fixed value for good and bad performance for this counter. The amount of I//O being pushed through should be compared with the response time of the disks and the I/O thresholds which the disk can handle.
3. OR it could be that the disk activity on the disks has increased over a period of time due to increase in data or addition of new databases. In such a case, it would be a good idea to provide a separate disk or storage area for some of the databases as the current activity might be too much for the disk to handle. Sometimes, load balancing for physical storage also helps in alleviating a disk contention issue.
4. Ensure that the client application processes all rows it is requesting. If not all rows will be needed or used – consider only querying the rows that are needed (TOP X for example).
5. Look at your NIC configuration on the server and make sure there are no issues (physical card issue, autodetect not picking the fastest speed)
6. Validate the network components between the application/clients and the SQL Server instance (router, for example). This should be done as per
Your application design. Client -> webserver ( if applicable ) -> SQL Server box
- Update the network related drivers on Client , webserver ( if applicable ) , SQL Server box
-Here are a few articles that summarize this:
http://blogs.msdn.com/joesack/archive/2009/01/09/troubleshooting-async-network-io-networkio.aspx
http://support.microsoft.com/kb/822101
-I think it’s pretty safe to say that the network delay is causing the issue and need to elimate this from the equation first.
7. STATISTIC report from perf stats log in the output folder confirms that statistics are outdated. Some cases update stats with full scan can be helpful in other cases update stats by sampling can be helpful, i depends upon environment to environment.
8. Suggest you create indexes on the table(s) for which this particular query belongs to, use database tuning advisor.
SELECT Service.ID, FileShop.szName, FileShop.szPath FROM Fileshop, Service WHERE Service.lPartitionViewID = Fileshop.lPartitionViewID AND FileShop.lPartitionViewID <> 0 UNION ALL SELECT 0, FileShop.szName, FileShop.szPath FROM Fileshop WHERE FileShop.lPartitionViewID = 0
9. Not very related to query just a general recommendation, Change Max Memory Settings in SQL Server to a value less than the Physical Memory on the box. Here is an example to set Max. Server Memory of SQL Server to 2.5 GB:
sp_configure ‘max server memory (MB)’,2560
go
reconfigure with override
go
Additional Info:
LOG_M_S – is this just a normal thing?
Share lock waittype is indicated by LCK_M_S, from our analysis what I can tell you is the physical I/O bottleneck that is prolonging the duration of transactions and locks.
ASYNC_NETWORK_IO – how can the rate at which clients fetch cause a blocking issue?
ASYNC_NETWORK_IO normally means that the server is waiting for the client to fetch the data. Here is some information regarding the waittype:
http://msdn.microsoft.com/en-us/library/ms179984.aspx
Based on our experience, in most of the cases, this problem should be caused by the client application issue (we may see this behavior when we are fetching a very large result set as well, so it’s not necessary means an issue when you see the waittype).
So in our case ASYNC_NETWORK_IO is not causing blocking. What we are saying is, we see Lock Blocking is different and we think its caused by the physical I/O bottleneck but at the same time we also see ASYNC_NETWORK_IO wait type which is often caused by the client not processing results
from the server. This causes the network buffers to fill. The server cannot send more data to the client, so the task executing the batch needs to pause while waiting for the ability to continue sending results and hence the wait type ASYNC_NETWORK_IO.
So LCK_M_S lock blocking should be correlated with physical I/O bottleneck and ASYNC_NETWORK_IO should be correlated with the network bandwidth or the query that’s coming from the client is not fine-tuned. ( for example, Its like querying entire table instead setting condition for the data fetch using where clause. )





















danieladeniji said
Thanks for sharing this. You really went into a lot of depth and I personally found it correlates with some of the problems that I am having at this time.
I have being meaning to get a bit more in pssdiag and this is a good encouragement; for that as well.
Happy SQL.
sqlserver911 said
Thanks for your feedback. Let me know if you need further clarification.
Happy SQL