SQL Server Graph Database and Performance
This post was originally published on this site
When Microsoft announced, that their upcoming release would contain an implementation of Graph Database objects, I was intrigued. There are already a couple of good posts around, that explain how to use the feature, e.g. how to create a table structure and query the information, what you can do, and what not. In my opinion, the main purpose of using this technology should be some gain in performance or reduced disk space. Searching around, I only found some claims from Neo4J (one of the main players in the Graph Database space) about their product being orders of magnitudes faster than a relational implementation. But nothing yet about differences within SQL Server.
For my performance test, I decided to use a friends of friends example, that Neo4J also used for their performance comparison. This means, that we have a bunch of people, where we somehow track who is a friend of whom. In the relational world, we do this with two tables. One table to store all the people, and one (link-) table to track the friendships. In a Graph Database, the people are represented by nodes, and edges are used for the relationships, of course.
In the example picture, you see, that Alex considers Brian and Charles friends and Brian and Charles consider David as friends. So David would be a friend of a friend of Alex.
Here’s the scripts to create the necessary tables (both relational and graph).
-- Create graph tables CREATE TABLE [dbo].[g_user]( [id] [int] NOT NULL, [name] [varchar](255) NULL, PRIMARY KEY CLUSTERED ( [id] ASC )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY] ) AS NODE ON [PRIMARY] GO CREATE TABLE [dbo].[g_friendOf] AS EDGE ON [PRIMARY] GO
-- Create relational tables CREATE TABLE [dbo].[t_user]( [id] [bigint] NOT NULL, [name] [varchar](255) NOT NULL PRIMARY KEY CLUSTERED ( [id] ASC )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY] ) GO CREATE TABLE [dbo].[t_friendOf]( [user_1] [bigint] NOT NULL, [user_2] [bigint] NOT NULL) ON [PRIMARY] GO
I‘ve used a small Azure Database for the testing. My test sample consists of 300’000 People, that know 10 others on average.
The first thing I noticed, is that the insert into the edge table takes a long time, if one doesn’t disable the auto-created indexes. Then the graph tables use a bit more space than the relational tables, as shown in the picture. Actually, this is not surprising when you consider the overhead, that is used for those tables.
Now, let’s cut to the chase. To retrieve the friends of friends (with a depth of 5) from one particular person, I’ve used the following code, which retrieved 8980 records.
-- retrieve friends from graph SELECT base = u1.[name] , friend = u2.[name] , friendoffriend = u3.[name] , friendoffriendoffriend = u4.[name] , f4 = u5.[name] FROM [g_user] u1, [g_user] u2, [g_user] u3, [g_friendOf] uf1, [g_friendOf] uf2, [g_user] u4, [g_friendOf] uf3, [g_user] u5, [g_friendOf] uf4 WHERE MATCH(u1-(uf1)->u2-(uf2)->u3-(uf3)->u4-(uf4)->u5) AND u1.id = 301;
-- retrieve friends from relational table SELECT base = u1.[name] , friend = u2.[name] , friendoffriend = u3.[name] , friendoffriendoffriend = u4.[name] , f4 = u5.[name] FROM [dbo].[t_user] u1 INNER JOIN [dbo].[t_friendOf] uf1 ON uf1.[user_1] = u1.[id] INNER JOIN [dbo].[t_user] u2 ON u2.[id] = uf1.[user_2] INNER JOIN [dbo].[t_friendOf] uf2 ON uf2.[user_1] = u2.[id] INNER JOIN [dbo].[t_user] u3 ON u3.[id] = uf2.[user_2] INNER JOIN [dbo].[t_friendOf] uf3 ON uf3.[user_1] = u3.[id] INNER JOIN [dbo].[t_user] u4 ON u4.[id] = uf3.[user_2] INNER JOIN [dbo].[t_friendOf] uf4 ON uf4.[user_1] = u4.[id] INNER JOIN [dbo].[t_user] u5 ON u5.[id] = uf4.[user_2] WHERE u1.id = 301
To get an additional level of friends, is pretty straightforward, and I’m sure you can figure this out yourself in a second. To compare the performance, I’ve run each script multiple times, and calculated the average time needed: the graph query takes 9.5 seconds and the relational query 8.2 seconds to finish. It’s about a 10 – 15 percent performance advantage for the relational query, and this is consistent for other depths as well.
If we look at the query plan – for the better readability, the one for depth 2 is shown – we see that we should be able to improve the performance with a bit of indexing. Alas, when I try to create the index for the graph table, I get the message that “Cannot access internal graph column ‘from_obj_id_AB483464BD2C4E45B7E754A3CA570491′”. Running the query after potimizing the relational table, the results improve dramatically (now the query takes around 1 second). Actually, I allready tried the same scenario in September, and then the response times for the graph queries were much worse (the graph query took around 50% longer to complete), than they are now. So it seems, that Microsoft is working on the graph queries. At PASS Summit, they also showed some cool stuff like heterofeneous associations (allowing derived tables to be used in Match queries) or recursive traversals (which allkow you to find out how one person is connected to another).
Comparing the queries, I find the graph example much more readable. If you have more complex scenarios, this becomes even more obvious. I am very happy, that Microsoft is obviously continuing the work on graphs in SQL Server. Of course, mine has been quite a simple example, and your mileage may vary. But I think that the query performance is good enough, to try and test the feature now. The most obvious use case for graphs in SQL Server at them moment are, at least in my oppinion, those scenarions, where you can reduce a complex relational query to a simple graph. But as I’ve also seen, things might change fast, so that even simple sceanrios might benefit. So lets stay tuned for further improvements to the Microsoft SQL Server graph implementation.