| Disk Queue Length vs. Disk Latency Times: Which is Best for Measuring Database Performance |
|
|
|
| Monday, 29 September 2008 18:22 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Introduction
There are several common metrics that people use to look for performance bottlenecks. They are usually:
- Disk queue length - Disk latency/transfer times - MB/sec - IO/sec
For the rest of this note, I’ll focus on the top two.
Disk Queue Length
The most common performance monitoring metric people quote is usually disk queue length. While this is important, now days it’s almost impossible to use as an accurate metric because the SAN used in an enterprise database application abstracts the DBA from a lot of the physical characteristics needed to make a reasonable judgment about what is good or bad.
The usual “rule of thumb” for “bad performance” is a queue length greater than 2 for a disk drive. But when you have a SAN with 100 drives in the cabinet, you have no idea how many are being used for your “drive F:” for your data file.
For example, the DBA watches perfmon and sees the disk queue length is 10 on drive F:. While this is a big number, it’s relative. If the F: is one physical disk, yes, you have a problem. If the F: is a logical disk composed of 8 physical disks, then it’s not an issue. Why? Because 8 disks x 2 queue length = 16… and the DBA sees 10… so 10 is less than 16 (our threshold for pain)… hence there is not an issue.
The main problem with queue length is this: very few people can ever provide me with a comprehensive disk map of their SAN showing all the logical volumes. For example, how many disks are participating in an EMC Sym hyper volume.
The following is a very small example of what a mapping would look like. Each of the “73” are 73G disk drives. You can see multiple drives participating in the LUN (Note, this is RAID 0+1 so you see 2 73G drives making up a 59.39G formatted capacity):
Without a map like this, disk queue length is meaningless to me. I have nothing to calculate what the queue length should be less than.
With regards to SQL Server, it will not let the transaction log volume have a queue length of more than 3. Thus, the DBA may never see a “problem”. The reason for this is SQL Server is trying not to swarm the disk drives and will throttle back the work to the log. Since the database follows the Write Ahead Log (WAL) protocol, if it can’t write to the log fast enough, it will not be able to write to the data volumes either… and it will slow down the whole system’s throughput.
Per the previous posting on throughput testing, if you don’t have enough spindles to support the volume of data being pushed through the system, the whole box will slow down. This would be especially important in an enterprise app such as SAP or PeopleSoft that run large batch jobs and are writing to the transaction log constantly.
Disk Transfer Time
As you can read from above, disk queue length can be deceptive. You need a lot of information to make a good judgment.
Disk transfer time is what the disk sub system is reporting to the operating system. It abstracts the DBA from having to know the absolute physical characteristics of the underlying disk.
The key point is that it keeps the DBA from rat holing on a lot of very specific information that most people don’t have… namely, a comprehensive disk mapping of their SAN. It also distills a lot of other complicated measurements down to a few simple ones that mean the most.
For example, if I have a disk queue length of 20 on my F: drive, but my response time is less than 5ms… then everything is working great. What this probably means is that the F: drive has 20 (or more) spindles behind it or that there is massive caching taking place.
In the context of disk latency, the following is an example of modern performance characteristics of a SCSI disk drive:
• Maxtor Atlas 15K II – Ultra320 SCSI – Average 3ms, Max 8ms – Max Sustained 98 MB/sec – Burst 320 MB/sec – 8M Cache – 15K RPM
Source Maxtor: http://www.maxtor.com/_files/maxtor/en_us/documentation/data_sheets/atlas_15kii_data_sheet.pdf
To monitor the times:
Note, when you read the perfmon data and see a number like “.010” this means 10 milliseconds. Additionally, when monitoring a SAN, they often have their own perfmon counter you need to use. For example, EMC and Veritas have their own.
There is plenty of room for intelligent people to argue on the following data, but I use the following table to determine the meaning of the data:
I have a good friend that will credibly argue that anything more than 1 ms response times is a poorly configured SAN. He has customers that get in the 250 to 500 microsecond responds times. Note, many times the performance problems are tied to firmware revisions, HBA configuration, and/or BIOS issues.
Summary
When you start focusing on response time, which is what really matters most, the queue length starts to become irrelevant. If you can get some kind of super drive that can handle everything, terrific! (Think: Solid State RAM Drives.)
What the latency metric does is cut through a lot of the red tape that architects usually have. This metric gives the DBA the ability to just tell the SAN architect when they lay out the LUNs is, “Slice the disk in a way where I get 5 ms disk access times. The rest is up to you.”
If the database can get sustained 5ms disk access times with a heavy stress test (ex. SQLIOSTRESS) then it can probably handle the load of the user base with the app running… which is what you ultimately want for success.
Thanks to Frank McBath
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Comments
Thanks for your reply, I really appreciate it. And now I have to say sorry for replying so late.
The LUN's that you created, are they on the same physical disks? If so then you won't gain any performance improvement.
If you can create a LUN on a new set disks in a RAID1-0 configuration, maybe that canb solve the problem with the write performance.
RAID1-0 gives you two write penalties (write to two disks), RAID5 gives 4 write penalties (read block+parity and then write block+parity). This explains why RAID5 write is slower compared to RAID1-0.
For you I extended the amount of text that can be entered in the comments :)
- MSSQL 2005
- Windows 2003 Server
The SAN itself is somewhat invisible to me (part of the problem), but here's what I do know:
- One HBA in the server
- EMC Symmetrix DMX4
- The devices we're mounting are configured as Raid5
We had a separate LUN for the database storage. When we separated the data into separate filegroups on the same LUN, there was a very small improvement, but we still had the same spikes.
Since it was behaving like a disk bottleneck, we decided to treat it like one: We set up four new LUNs, each mounted as a separate device. We isolated each of the two problem data filegroups on separate LUNs and further isolated the index filegroups onto two more. Each data/index pair is further isolated by separating it into its own database, though still running under the same MSSQL instance. We only load one data/index pair at a time, using the opposite pair as an "online" database and leaving the loading pair "offline" so far as the application is concerned.
Once we implemented this configuration, even though we still saw the same miserable performance (nearly 1 sec/write) on the isolated devices, the original LUN showed nearly no impact. This left the main application unaffected by the large throughput.
So, in addition to having a decent workaround, it seems that this also rules out the HBA, the local cache, the OS, local CPU/memory and LAN bottlenecks. It also seems to rule out the fiberchannel switch, since this is all going over one cable. Unfortunately, we still have no better visibility into what's really going on...
What type of disks are you using? Maybe you need FC 15000rpm RAID1-0 if you are now using SATA. Another idea is adding a few disks to the RAID config to spread the writes.
I encountered these problems on backup servers writing to SATA LUN's. I had to adjust the schedule for the backups during the night.
What kind of process are you talking about?
We've been testing this heavily, trying to tune it for better performance. Since my access to stats from the network and SAN is fairly limited, I've been focusing on symptoms at the OS level, and I've stumbled on some stats that rule out the HBA and the fiberchannel: We isolated the heaviest I/O on a separate LUN, mounted as another drive in Windows. When we did that, writes on the original LUN dropped back to a very reasonable range (milliseconds), even during periods of heavy access on the new LUN. The write speed on the new LUN, however, was still bottlenecking the other process, measuring in the 1+ seconds (yes, seconds) per write range. So that rules out CPU, fiberchannel, the HBA and the fiber switching. Thoughts?
[Roderick]
Hi Bill,
Missed your post to the article on my website, so that explains the radiosilence. It' s an interesting problem you got (or had).
Could be an issue with the type of disks (SATA) os RAID type (RAID 1 is faster then RAID 5), amount of disks in RAID 5 (general idea is more disks give better performance). Could be an issue with other hosts writing to the same set of disks at the same time. Could be an issue with the operating system's memory with W2003 32bit where the kernel gets to less memory if the 3GB switch is used in the boot.ini.
Could be the storage has reached it' s max I/O throughput, VDI projects run into these kind of problems nowadays. The cache is just not big enough for the amount of writes. Some storage devices give you the option to expand the cache mem to the new SSD disks. Costs money though.
You solved this issue? I'm interested in what problem you encountered. Can you be more specific about your issue? (applicaton/OS/storage type/disks type/nr of disks)
Best regards,
Roderick
Roderick
on our USP-v
But in perfmon you cant see anytning less than 1ms. 0.001s
RSS feed for comments to this post