I’ve been working on improving the performance a SQL Server 2012 database that is at the back end of an enterprise-wide 3rd party application. Coming from a full-stack web development background, and knowing that the database performance is critical to the performance of a web app, I wanted to learn more about optimization. Two of the rules-of-thumb I had lived by when designing databases were:
- Every table should have a clustered primary key (PK).
- Don’t use GUIDs for clustered indexes. Ideally, use an INT or BIGINT IDENTITY column.
In this database, all of the primary keys were GUIDs, and not all of them were clustered. This choice of datatype concerned me, based on my experience. There had to be a reason for this, so I conferred with the app’s vendor about their design. They referred me to a blog post (in German) that showed GUID PK’s could provide better performance under load than integers. Chrome provided enough of a translation for me to follow (what little German I know is from war movies), and I was able to consistently reproduce the author’s results.
In case you’re unable to read the German blog post, here’s a summary: the author’s intention was to demonstrate that in some cases, GUIDs were an acceptable choice as clustered primary keys. He used the ostress utility to run 200 concurrent threads that would each insert 1000 rows into a table with a clustered primary key on a INT IDENTITY column. He repeated that on another table that had a clustered PK on a UNIQUEIDENTIFIER column. The INT PK table inserts caused significant PAGELATCH waits. This is because the key values are sequential, so the rows being inserted are all going to be stored on the last leaf-level page, and therefore we encounter contention for that page. This differs from the UNIQUEIDENTIFIER approach, as the values are random instead of sequential, so there is much less contention over each leaf-level page and the 200,000 inserts complete more quickly – about 5x faster in my testing.
However there were two important differences between the article’s examples and the database I was working with:
- The article’s examples used UNIQUEIDENTIFER columns to store GUIDs. My client’s database stored GUIDs as strings – specifically, NVARCHARs.
- The article did not address the performance of heaps, which are used extensively in my client’s database.
Would the choice of datatypes for the GUIDs make a difference in performance? Could they be inserted faster into a heap than into a clustered table? Using the author’s code examples as templates, I ran my own tests to find the answers. I ran all 4 scenarios in succession and the completion times for inserting 200K rows (via 200 concurrent sessions each inserting 1000 rows) were as follows:
|Scenario #||Configuration||Time To Complete (sec.)|
|1||Clustered PK on BIGINT||26.6|
|2||Clustered PK on UNIQUEIDENTIFIER||5.2|
|3||Clustered PK on NVARCHAR||5.8|
|4||Non-clustered PK on NVARCHAR||9.9|
The heap inserts completed about 3x faster than the BIGINT insert, but they were also almost twice as slow as the inserts on the tables with clustered PKs on GUIDs. This was due to page latch waits, as was the case in Scenario 1, although they were fewer in number. The peak number of transactions per second was also higher for Scenario 3 (~80k) as compared to Scenario 2 (~70k). There was no appreciable difference between the tables with clustered PKs on UNIQUEIDENTIFIER and NVARCHAR (2 versus 3).
To better visualize this, look at the performance graphs for each of the four scenarios:
What conclusion can be drawn from this? Most importantly, heaps with GUID PK’s are not better than clustered tables with GUID PK’s in terms of insert performance. So don’t be afraid to convert your heaps to clustered tables and see if your application(s) perform better than before.