TCQ, RAID, SCSI, and SATA
See also How We Test Drives
See also Western Digital Raptor WD740GD Review
See also Seagate Cheetah 10K.6 Review
Introduction
Over the last few years, Western Digital has maintained a virtual vice-lock on the high-performance, high-capacity desktop and enthusiast markets. The venerable WD Caviar series has combined enviable speed and capacity with reasonable prices. However, aside from a relatively obscure, short-lived SCSI line, when it came to the lucrative enterprise arena the firm simply watched from a distance as titans such as Seagate, Maxtor, and Hitachi battled for market share. A little over a year ago WD tested the enterprise waters with the introduction of the world's first 10,000 RPM ATA drive, the Raptor WD360GD. The Raptor paired SCSI-class mechanics with the new and relatively inexpensive Serial ATA interface in an attempt to undercut the rather hefty premiums that SCSI subsystems demanded. StorageReview's performance results, however, revealed that while the WD360GD delivered world-class single-user results, its multi-user performance remained unimpressive when contrasted with existing 10k RPM SCSI units.
The WD360GD lacked a key element that the SCSI world has enjoyed for years: tagged command queuing (TCQ), a feature that intelligently reorders requests to minimize actuator movement. In September of 2003, Western Digital announced the follow-up Raptor WD740GD, a second-generation unit that brought a host of improvements to the line. Though the doubling of the Raptor's capacity to 74 gigabytes is the most visible improvement, the most intriguing is undoubtedly the implementation of TCQ.
TCQ In Brief
Not to be confused with operating-system reordering and optimization, tagged command queuing is a hardware-level process designed to streamline the delivery of data in highly-random accesses under heavy loads. Without TCQ, a drive can only accept a single command at a time. It thus operates on a first-come, first-served basis, completing requests in the order they are received. This is not always the most effective way to service data requests, especially in an intensive, non-localized environment.
Through the process of tagged command queuing, a host adapter adds special tags to individual commands. The drive itself, privy to its own physical layout of sectors across three dimensions, can take into account rotation and seek distances and reorder commands to serve them more efficiently. Requested data is thus returned to the controller in a more streamlined manner; the controller can then use the tags it added earlier to transparently return the data to the operating system.
Consider the diagram to the left. In a traditional, non-queued paradigm, the drive would accept the request for data piece A, move the actuator and retrieve it, accept the request for B, retrieve it, then move to piece C. A drive that can buffer and queue requests, however, would be able to retrieve A, then opt to retrieve C first, followed by B, resulting in a net savings of time in completing these three requests.
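The A, B, C example above can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model: each request sits at a single position on one axis and the only cost is distance traveled, whereas a real drive also weighs rotational position. The track numbers and function names are invented for illustration and do not reflect any drive's actual firmware.

```python
def travel(start, order):
    """Total head travel when servicing positions in the given order."""
    total, pos = 0, start
    for target in order:
        total += abs(target - pos)
        pos = target
    return total


def reorder_nearest_first(start, pending):
    """Greedy reordering: always service the closest pending request next."""
    remaining = list(pending)
    order, pos = [], start
    while remaining:
        nxt = min(remaining, key=lambda t: abs(t - pos))
        remaining.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order


# Hypothetical track positions for requests A, B and C (illustrative only).
requests = {"A": 10, "B": 90, "C": 20}
head_start = 0

fifo = list(requests.values())                    # A, B, C in arrival order
queued = reorder_nearest_first(head_start, fifo)  # order a queuing drive might pick

fifo_cost = travel(head_start, fifo)
queued_cost = travel(head_start, queued)
print("unqueued travel:", fifo_cost, "queued travel:", queued_cost)
```

With these made-up positions the queued order comes out as A, C, B, mirroring the example in the text, and the head covers less total distance than it does in arrival order.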
TCQ must be supported by both the controller and the hard drive itself. It was introduced to the SCSI world as early as 1990 and was formally codified into the SCSI-2 standard by 1994. The feature rapidly proved itself invaluable in the world of multi-user servers and is today consistently deployed across virtually all host adapters and disks. Likewise, TCQ was formally implemented in the 1998 ATA-4 standard. Unlike SCSI devices, however, ATA drives simply were not used in enterprise applications where features like hot-swappability and low access times were paramount. Further, the traditional ATA stronghold, single-user machines, just did not benefit from TCQ; indeed, in many cases the additional imposed overhead actually reduced rather than enhanced performance in these areas. As a result, the feature went largely ignored by the industry.
Today, however, the advent of Serial ATA, its associated hot swap features, and its promised interoperability with the upcoming Serial Attached SCSI (SAS) standard has resulted in a brightening future for ATA in the enterprise. The forthcoming SATA II standard includes provisions to incorporate tagged command queuing a la ATA-4's standard. Native SATA drive architectures such as the Seagate Barracuda 7200.8 and Maxtor MaXLine III tout the inclusion of "Native SATA" tagged command queuing, or "Native Command Queuing" (NCQ) for short. NCQ's fundamental paradigm is identical to that of tagged command queuing; the NCQ moniker simply differentiates the SATA II standard from the existing ATA-4 model.
With its deep research and development pockets, Seagate was the only manufacturer to avoid a less expensive and faster-to-market PATA-to-SATA bridge for its first SATA products. For financial and temporal reasons, other manufacturers such as Western Digital introduced their first products with bridged operation. The Raptor WD740GD is one of these designs. While the practical ramifications are negligible (it is bottom line performance that counts, after all!), the Raptor's bridge prevents it from using the SATA II NCQ standard. Thus, to implement tagged command queuing into its budding enterprise-oriented line in a timely fashion, Western Digital opted to include ATA-4-style TCQ in the Raptor. Fortunately for WD, the firm has received enthusiastic response from many controller manufacturers. Most firms designing NCQ-enabled SATA host adapters are also incorporating Raptor-style queuing. One such manufacturer is Promise Technology.
The Tests
In this first of what will be several articles examining the effects of SATA's tagged command queuing, we will take a look at how the upcoming Promise FastTrak TX4200 compares to the currently shipping, non-TCQ-enabled FastTrak S150 TX4. The relationship between these two controllers is especially interesting as the TX4200 is simply a FastTrak S150 TX4 with added TCQ code. The S150 TX4, in turn, is simply a RAID-enabled SATA150TX4, the SR Testbed's long-standing reference SATA controller. A direct contrast between the two Promise RAID controllers can thus isolate the effects of TCQ from other variables.
The Promise FastTrak TX4200 features:
- 4 Serial ATA Ports for up to 4 drives
- RAID 0/1/10 and JBOD
- 32-Bit / 33-66 MHz PCI Operation
- NCQ & SATA TCQ Support
TCQ, of course, has been around for some time in the SCSI world; all current host adapters, RAID controllers, and hard drives support a very mature implementation. To discover what disadvantages, if any, SATA TCQ suffers when contrasted with more established SCSI solutions, results from a Mylex AcceleRaid 170 RAID controller paired with up to four 73 GB Seagate Cheetah 10K.6 drives have been included in these tests.
The Mylex AcceleRaid 170 features:
- 1 68-pin LVD Port for up to 15 drives
- RAID levels 0, 1, 0+1, 3, 5, 10, 30, 50, JBOD
- 32 MB ECC SDRAM Cache
- 32-bit / 33 MHz PCI Operation
Though TCQ confers benefits even when a single drive operates under heavy random loads, its true potential shines when there are also multiple actuators to work with. Hence, the tests that follow also take a hard look at the scaling provided by arrays in both multi-user and single-user scenarios, our first formal take on RAID in over two years.
In the following tests, Testbed3's hardware and benchmarks sort out the multiple dimensions of potential performance drivers:
- How does TCQ benefit multi-user and single-user performance?
- How does TCQ affect a RAID array's ability to scale performance upwards as more drives are added?
- How does SATA TCQ stack up against SCSI's implementation?
- How does a RAID array scale under increasingly heavy random I/O?
- What benefits does a RAID array deliver to the highly-localized I/O that dominates non-server (single-user) use?
Since these tests take advantage of the standard SR testbed, let us take a moment to consider a potential limitation of the machine's hardware, the 33 MHz, 32-bit PCI slot.
Limitations of the PCI Bus
The 133 MB/sec limit of the standard 32-bit, 33 MHz PCI bus may be of concern to some, especially those seeking for various reasons to maximize sequential transfer rates (STR). The practical real-world limit remains slightly below that threshold: STR tests associated with the results below top out at 126 MB/sec. A single Raptor in its outer zone can push nearly 72 MB/sec while a Cheetah 10K.6 can do 69 MB/sec, so it takes only two of either to saturate the PCI bus.
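The saturation arithmetic can be checked with a short Python sketch. It assumes an idealized striped array whose sequential transfer rate scales perfectly with drive count until it meets the bus ceiling; real arrays will not scale this cleanly, and the function name is purely illustrative.

```python
PCI_PRACTICAL_LIMIT = 126.0  # MB/sec, ceiling observed in the STR tests
RAPTOR_OUTER_STR = 72.0      # MB/sec, single WD740GD, outer zone
CHEETAH_OUTER_STR = 69.0     # MB/sec, single Cheetah 10K.6, outer zone


def array_str(drives, per_drive, bus_limit=PCI_PRACTICAL_LIMIT):
    """Idealized array STR: perfect scaling (an assumption), capped by the bus."""
    return min(drives * per_drive, bus_limit)


for n in range(1, 5):
    print(n, "drive(s):",
          array_str(n, RAPTOR_OUTER_STR), "MB/sec (Raptor),",
          array_str(n, CHEETAH_OUTER_STR), "MB/sec (Cheetah)")
```

Under this assumption, both drive types reach the 126 MB/sec cap at two drives, and adding a third or fourth drive cannot raise sequential throughput any further.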
Let us take a closer look, however, at just how important STR is in the majority of applications. The StorageReview File Server DriveMark generates an average transfer size of 22 kilobytes. In other words, the average generated I/O operation in the suite consists of repositioning the actuator to the desired location followed by the reading or writing of 22 KB of data. In the same vein, the SR Office DriveMark's average transfer size is 23 KB. The SR High-End DriveMark, based on a suite of applications that includes video and audio editing, is the only test that reaches significantly beyond these sizes, generating a relatively high 69.5 KB transfer per I/O.

A single Raptor WD740GD, with its maximum transfer rate of 72 MB/sec, can transfer 22 KB in:
22 KB ÷ 72 MB/sec ≈ 0.3 ms
Hence, the average I/O request in the SR File Server DriveMark concludes with a read or write to the platter that takes an average of 0.3 ms to complete.
A PCI-throttled RAID0 array can transfer 22 KB of data in:
22 KB ÷ 126 MB/sec ≈ 0.17 ms
In a typical access pattern that features significant localization such as the Office DriveMark, an adept drive such as the WD740GD can achieve about 600 I/Os per second. Inversely stated, each I/O (which, again, consists of positioning plus transfer) takes about 1.7 milliseconds. Of this 1.7 ms, 0.3 ms, or 18%, is the transfer of data to or from the platter. The other 82% of the operation consists of moving the actuator to the desired location or waiting for the platter to spin to it. The situation polarizes further as transfer rates rise. At 126 MB/sec, transfers make up just 11% of the total service time. In effect, sequential transfer rates ranging from 50 MB/sec to 130 MB/sec and higher "write themselves out of the equation" by trivializing the time it takes to read and write data when contrasted with the time it takes to position the read/write heads to the desired location.
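The percentages above can be reproduced with a few lines of Python. The sketch assumes 1 MB = 1024 KB and, for the 126 MB/sec case, that positioning time stays the same while only the transfer gets faster. Both are assumptions made here for illustration rather than details stated in the text, and the variable names are hypothetical.

```python
def transfer_ms(size_kb, rate_mb_per_s):
    """Milliseconds needed to move size_kb at rate_mb_per_s (1 MB = 1024 KB assumed)."""
    return size_kb / (rate_mb_per_s * 1024) * 1000


IOS_PER_SECOND = 600  # approximate single-drive figure quoted above
TRANSFER_KB = 22      # average transfer size used in the text

service_ms = 1000 / IOS_PER_SECOND          # about 1.7 ms per I/O
single_xfer = transfer_ms(TRANSFER_KB, 72)  # about 0.3 ms on one Raptor
positioning_ms = service_ms - single_xfer   # remainder: seek plus rotation

# Assumption: positioning is unchanged; only the transfer speeds up.
bus_xfer = transfer_ms(TRANSFER_KB, 126)
bus_share = bus_xfer / (positioning_ms + bus_xfer)

print("transfer share at 72 MB/sec: ", round(single_xfer / service_ms, 2))
print("transfer share at 126 MB/sec:", round(bus_share, 2))
```

Pushing the transfer rate higher shrinks only the transfer term, so positioning time remains the dominant part of each request.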
The diagram to the right illustrates the relationship between positioning and transfers in typical single- and multi-user scenarios. Observe how the time spent positioning the actuator and platter (red) dominates the relatively small amount of time spent reading/writing the data itself (yellow). Even an asymptotic case of an infinite transfer rate unleashed through an infinitely fast bus would only eliminate the yellow portion of the total time it takes to service one request.
Therefore, while the PCI bus can limit sequential transfer rates, its practical effect in capping real-world speed in typical use is not nearly as significant as one may believe at first blush. As a result, the scaling demonstrated in this article also represents the increases one will gain from arrays operating on higher-speed buses.
Our Third Raptor WD740GD Sample
The evaluation sample provided to SR by Western Digital for our review published last January was manufactured on December 4th, 2003. For this review, WD sent us four more samples, all dated March 4th, 2004. Though much of the focus of this article rests on multi-drive arrays, for control purposes, it was necessary to retest a single drive from this new batch on our reference Promise SATA150TX4 controller. Some differences arise:

Small reductions in performance are evident: around 3% in the Office DriveMark, for example. Most notable is the 8% drop in the Bootup DriveMark. Western Digital representatives attribute the difference to revised firmware. A closer look at the two different samples reveals the following extended model numbers:
[Images: extended model numbers of the December and March samples]
Note the differences in the final digit of the extended MDL designation when comparing our second and third samples. The December unit ends with a zero while the March unit concludes with a 1. Why the change? First, we should point out that all manufacturers quietly and regularly refresh the firmware on all their drives after initial release, either to correct bugs or to tweak performance as piles of configuration experience pour in.
Second, as has been painfully obvious over the past several months, SATA command queuing has proven to be a constantly moving target while drive and controller manufacturers alike continue to tweak their products. As controller manufacturers such as Pacific Digital, Silicon Image, and Promise Technology continue to develop pre-release adapter samples, drive manufacturers such as Western Digital are forced to re-optimize firmware to obtain the best results. Likewise, the same has been true in the opposite direction. Though Western Digital announced the WD740GD in September of 2003 and though the units widely available through the channel since last December feature TCQ functionality, the Raptor team is nonetheless driven to regularly reassess the drive's potential companion host adapters and retune firmware accordingly. A result is the "00FLA1" revision, a unit better suited for the state of today's TCQ-enabled host adapters albeit with a very slight drop in certain performance measures.
Glancing at the figures above reveals that while there are differences, they are for all intents and purposes trivial. One simply will not notice the difference in speed in subjective use. Attempts to specifically procure the earlier 00FLA0 revision would likely prove frustrating and fruitless. We would not sweat over the difference.
A Word on Organization
Presenting the following results can be quite daunting. Many different dimensions of performance emerge when one attempts to form the "big picture." How does performance increase when all other variables save queue depth remain constant? What kind of benefits result from adding more drives to an array? How does choosing mirroring over striping affect performance? The list of questions runs on. As a result, we have avoided use of our standard "HTML-generated" graphs in favor of static graphs. Hopefully, they accurately convey the myriad of information to be gleaned.
Without further ado, let us take a look at some results!

