On “Benchmarking RedisGraph 1.0”

Recently RedisGraph published a blog post [1] comparing its performance with TigerGraph’s, following the tests [2] in TigerGraph’s benchmark report [3], which requires solid performance on 3-hop, 6-hop, and even 10-hop queries. Multi-hop queries on large data sets are the future of graph analytics.

After reviewing the report in depth with our engineering team, here are our key findings, including identifying fake news created by RedisGraph in their benchmarking process:

  • TigerGraph remains the fastest on queries with 2 or more hops. This applies to the larger dataset (a Twitter user-follower graph with 41.6M vertices and 1.5B edges). On the small dataset (only 64M edges), RedisGraph is sometimes faster. We included this small dataset (0.9 GB raw data) in our benchmark because many of the other graph databases we tested could barely handle the larger Twitter dataset.
  • RedisGraph is faster on 1-hop queries. This is expected since Redis is specially designed for key-value lookups. In the real world, if you only need to do one hop, a key-value database or RDBMS is sufficient; you don’t need a graph product.
  • No Analytics Graph Query tests were reported. PageRank and Connected Components are tested in TigerGraph’s benchmark report but not in RedisGraph’s report. These types of queries are essential for finding hubs of influence or power, discovering communities, ranking, measuring similarity, and many other analytical tasks which are the reason that users choose a graph in the first place.
  • No Data Loading tests were reported. Loading data into a graph is the first step for graph analytics. Many times our customers and users have told us they rejected other graph databases based on their slow loading speed alone. Twitter’s dataset is only 29 GB, a small dataset to TigerGraph. The larger the dataset, the bigger the data loading and query performance gap between TigerGraph and other graph products, due to TigerGraph’s industry-first Native Parallel Graph design.
  • “Fake” new test: “Parallel requests benchmark”. RedisGraph added a new test. They repeated the 1-hop to 6-hop tests on their platform and allowed the queries to run in parallel by using “22 client threads running on the same test machine, which generated 300 requests in total.” However, they didn’t run parallel request tests on TigerGraph. They just took our average query time for a single query request and multiplied it by 300. We believe in fair tests, and that is why we published the full steps and process in our report on GitHub [2] for other people to follow.
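To see why multiplying a single-request average by 300 is not equivalent to a parallel request test, consider the following minimal simulation. It does not benchmark either database: `run_query` is a hypothetical stand-in that simply sleeps for a fixed latency, and the function names are ours. Only the figures of 22 client threads and 300 requests come from the quoted description.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(seed: int, latency_s: float) -> int:
    """Hypothetical stand-in for one k-hop request; sleeps to mimic query latency."""
    time.sleep(latency_s)
    return seed


def extrapolated_total(avg_single_s: float, requests: int) -> float:
    """Serial extrapolation: assumes requests run strictly one after another."""
    return avg_single_s * requests


def measured_total(requests: int, clients: int, latency_s: float) -> float:
    """Wall-clock time for `requests` queries issued by `clients` concurrent threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        # Consume the iterator so that every request has finished before we stop the clock.
        list(pool.map(lambda seed: run_query(seed, latency_s), range(requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    latency = 0.01  # assumed per-query latency, for illustration only
    print("extrapolated:", extrapolated_total(latency, 300))
    print("measured    :", measured_total(300, 22, latency))
```

In this toy setting the measured wall-clock time is far below the extrapolated figure, because the requests overlap. A real system would land somewhere that only an actual parallel test can reveal, which is the point: the number has to be measured, not multiplied.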

For queries starting from a single point, Redis has done well, and we thank them for trying our benchmark.

If you are in the market for a graph database for analytics, we’d like to share with you some additional insights, with information drawn from Redis’ website and blog.

  • RedisGraph is not a distributed graph system [4]. You cannot use multiple machines to speed up your query performance. And, if your dataset is large, loading and querying the graph is impossible without a distributed graph system.
  • RedisGraph is pure in-memory. For users with terabytes or petabytes of data, the hardware cost can be prohibitively high for a pure in-memory system. TigerGraph still works very fast with some data stored on disk, by giving in-memory priority to the most critical elements for graph traversal and query processing. In fact, users can tell TigerGraph how they want to manage memory usage [5].
  • RedisGraph’s execution is single-threaded. In their own words[1]:

“RedisGraph was built from the ground up for extremely high parallelism, with each query processed by a single thread that utilizes the GraphBLAS library for processing matrix operations with linear algebra.”

Parallel processing is the preferred mode for graph processing, because parallel processing lets you access each of a vertex’s neighbors simultaneously. On large queries, a single thread will not scale well. TigerGraph’s Native Parallel Graph technology scales out to multiple machines for a single query execution AND scales up to multiple CPU cores on a single machine to speed up query performance for all workloads.

  • TigerGraph is designed for parallel computation.  TigerGraph’s capability for parallel computation of single queries is technically superior and well-suited for real-time scenarios, because it makes full use of available resources to complete the current workload. The number of threads for a single query is configurable, to fine tune the performance. If an application has a lot of data, or runs complex queries, TigerGraph’s built-in parallelism lets users either scale up (by using more threads, or running on a more powerful machine with more CPU cores), or scale out by simply adding more machines, to improve query performance.  Thinking about this in another way, Redis gives you only one way to speed up a query: to go to a more powerful machine. TigerGraph gives you two additional ways: increase the number of threads or scale out to more machines.
  • RedisGraph’s read and write operations block one another. From their blog [1]:

“RedisGraph enforces write/readers separation by using a read/write (R/W) lock so that either multiple readers can acquire the lock or just a single writer. As long as a writer is executing, no one can acquire the lock, and as long as there’s reader executing, no writer can obtain the lock.”

So RedisGraph may have parallelism in theory, but not when operations are being blocked.
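The locking behavior described in the quote can be illustrated with a minimal readers-writer lock. This is a sketch of the general technique only, not RedisGraph’s actual implementation; the class and method names are hypothetical, and the non-blocking `try_` methods are used simply to make the blocking conditions visible.

```python
import threading


class ReadWriteLock:
    """Minimal readers-writer lock: either many readers or exactly one writer."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()  # protects the two counters below
        self._readers = 0
        self._writer_active = False

    def try_acquire_read(self) -> bool:
        """Succeeds unless a writer currently holds the lock."""
        with self._mutex:
            if self._writer_active:
                return False
            self._readers += 1
            return True

    def release_read(self) -> None:
        with self._mutex:
            self._readers -= 1

    def try_acquire_write(self) -> bool:
        """Succeeds only when there is no reader and no other writer."""
        with self._mutex:
            if self._writer_active or self._readers > 0:
                return False
            self._writer_active = True
            return True

    def release_write(self) -> None:
        with self._mutex:
            self._writer_active = False
```

With this scheme, a single long-running read query is enough to keep every write waiting, and a single write keeps every read waiting, which is the blocking behavior at issue here.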

In TigerGraph, as an enterprise graph database platform, reads don’t block writes and writes don’t block reads. Reads and writes work on their own ‘snapshots’ of the graph when they come to TigerGraph. See our transaction documentation for details.

In general, taking a shortcut by building a graph API on top of a key-value store has inherent and critical flaws which cause both query performance and data consistency issues. Please check out our blog “Building a Graph Database on a Key-Value Store?”, which is excerpted from our eBook “Native Parallel Graphs: The Next Generation of Graph Database for Real-Time Deep Link Analytics.”

Redis as a key-value DB is a widely adopted product. However, one size doesn’t fit all. As discussed in our blog, designing a graph API on top of a key-value store is an expensive and complex job which will not yield a high-performance result. While key-value stores excel at single key-value transactions, they lack ACID properties and the complex transaction functionality required by a graph update. Thus, building a graph database on top of a key-value store opens the door to data inconsistencies, wrong query results, slow query performance for multi-hop queries, and an expensive and rigid deployment.

Summary

  1. TigerGraph is the fastest on deep link analytics on large datasets. 
  2. RedisGraph is good for small datasets and simple 1-hop queries.
  3. RedisGraph does not support concurrent read and write queries.
  4. RedisGraph declined to show their performance for loading or for analytical queries like PageRank or community detection.
  5. With the inherent and critical limitations in its technical implementation, RedisGraph is neither a native graph nor a distributed graph.

It is great for the user community to see more graph offerings and more public benchmark results. TigerGraph will continue to lead the open and repeatable benchmarking efforts. Everything you need to reproduce our tests is available on GitHub [2].   

A Call to Action for RedisGraph

In the spirit of fair engineering tests that serve the interests of the user community, we ask Peter Cailliau, the author of the RedisGraph benchmark, to do the following:

  1. Make all testing scripts public so potential users and third parties can repeat and verify any benchmark testing.  
  2. Update its benchmarking report with facts, not extrapolations, especially in the Section ‘Parallel Request Benchmark’. Do an apples-to-apples comparison, rather than just multiplying TigerGraph’s performance number from a different test by 300.  
  3. Update the benchmarking report with clear facts, not misrepresentations. Specifically: “It is important to note that TigerGraph applied a timeout of three minutes for the one and two-hop queries and 2.5 hours for the three and six-hop queries for all requests on all databases (see TigerGraph’s benchmark report for details on how many requests timed out for each database). If all requests for a given data set and given database timed out, we marked the result as ‘N/A’. When there is an average time presented, this only applies to successfully executed requests (seeds), meaning that the query didn’t time out or generate an out of memory exception. This sometimes skews the results, since certain databases were not able to respond to the harder queries, resulting in a better average single request time and giving a wrong impression of the database’s performance.”

    Peter Cailliau states that his presentation is “giving a wrong impression of a database’s performance” yet does it anyway.

    By not showing that some graph databases could not complete the harder queries, as TigerGraph’s report showed, they seem to be doing a favor to the competitors, by hiding their flaws and boosting their average query times. TigerGraph receives no such boost because TigerGraph completed all the tests. The effect is to make other graph databases appear to be closer in performance to TigerGraph than they really are. We used timeouts in our tests, a practice used in real-world applications, to abort tests which were running an unreasonably long time. TigerGraph received no benefit because the thresholds were always much longer than TigerGraph needed.   We think that Peter Cailliau should have also mentioned that TigerGraph never timed out.

  4. Correct a counting and computation error in its k-hop neighborhood query. According to [1],

The k-hop neighborhood query is a local type of graph query. It counts the number of nodes a single start node (seed) is connected to at a certain depth, and only counts nodes that are k-hops away.

Below is the Cypher code:

MATCH (n:Node)-[*$k]->(m) where n.id=$root return count(m)  

Here, $root is the ID of the seed node from which we start exploring the graph, and $k represents the depth at which we count neighbors. To speed up execution, we used an index on the root node ID.

However, this query will double-count some neighbors. The correct query should be

MATCH (n:Node)-[*$k]->(m) where n.id=$root return count(distinct m)  

The count(distinct m) query performs additional computation to remove duplicate counts. Whatever the reason or limitation behind RedisGraph, we think the author needs to clarify this issue to the readers.
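The difference between the two counts is easy to see on a tiny example. The sketch below is plain Python, not Cypher, and the graph and function name are made up for illustration. It enumerates the end vertex of every directed path of exactly k edges from the seed, then compares the number of paths (what count(m) returns) with the number of distinct end vertices (what count(distinct m) returns). The example graph is acyclic, so it sidesteps Cypher’s rule that a relationship may not repeat within a single path.

```python
from collections import defaultdict


def k_hop_endpoints(edges, root, k):
    """Return the end vertex of every directed path of exactly k edges from root."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    frontier = [root]
    for _ in range(k):
        # Keep duplicates on purpose: each entry corresponds to one path.
        frontier = [nbr for vertex in frontier for nbr in adjacency[vertex]]
    return frontier


# Vertex 4 is reachable from vertex 1 in two hops via two different paths.
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
endpoints = k_hop_endpoints(edges, root=1, k=2)

path_count = len(endpoints)            # analogous to count(m)
distinct_count = len(set(endpoints))   # analogous to count(distinct m)
print(path_count, distinct_count)      # prints "2 1"
```

Here vertex 4 is the only 2-hop neighbor of vertex 1, yet the path-based count reports 2. On a dense graph such as the Twitter dataset, the gap between the two numbers grows quickly with k.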

Last Words

We’d like to thank RedisGraph for giving our benchmark test a try, and congratulate them on the results.  We also appreciate that they share our belief about the importance of real-time deep link analytics.

The graph market is growing rapidly. We welcome RedisGraph to the Graph World, and we feel honored that newcomers see us as the high-performance bar. We welcome any open and fact-based benchmarking, which we view as healthy and necessary for the whole graph market’s growth and user education.    

 

Yu Xu, CEO TigerGraph

[1] https://redislabs.com/blog/new-redisgraph-1-0-achieves-600x-faster-performance-graph-databases/

[2] https://github.com/tigergraph/ecosys/tree/benchmark/benchmark

[3] https://info.tigergraph.com/benchmark

[4] https://redislabs.com/blog/release-redisgraph-v1-0-preview/

[5] This feature is available in the Enterprise edition.