Introducing TigerGraph, a Native Parallel Graph Database

Sep 19th, 2017 9:00am by Yu Xu

Feature image via Pixabay.

Dr. Yu Xu is the founder and CEO of TigerGraph, the world’s first native parallel graph database. Dr. Xu received his Ph.D. in Computer Science and Engineering from the University of California San Diego. He is an expert in big data and parallel database systems and has over 26 patents in parallel data management and optimization. Prior to founding GraphSQL, Dr. Xu worked on Twitter’s data infrastructure for massive data analytics. Before that, he worked as Teradata’s Hadoop architect where he led the company’s big data initiatives.

Graph databases are the fastest-growing category in all of data management, according to DB-Engines, a database consultancy. They have gained quick adoption for several reasons. First and foremost, they overcome the challenge of storing massive, complex and interconnected data by storing it in a graph format made up of nodes, edges and properties. They also offer advantages over both traditional RDBMSs and newer big data products, including but not limited to the following:

Faster at “joining” related data objects,
Better scalability to larger data sets,
More flexibility for evolving structures.

Indeed, over the past few years, the industry has seen considerable adoption of graph databases by enterprises, mostly in vertical sectors. Leading graph vendors include DataStax and Neo4j.

While graph technology offers clear competitive advantages, current solutions are not without their limitations. The most critical is the inability to support real-time analytics, a sore point given that enterprise data is always growing and becoming more complex. This is because solutions thus far have focused on either storage or computation, offer limited analytics capabilities and are unable to update graphs in real time.

Introducing TigerGraph: The First Native Parallel Graph (NPG)

As the world’s first Native Parallel Graph (NPG) database, TigerGraph sets out to solve these challenges. Unlike other technologies, the TigerGraph NPG focuses on both storage and computation, supporting real-time graph updates and offering built-in parallel computation. NPGs represent what’s next in the graph database evolution: a complete, distributed, parallel graph computing platform supporting web-scale data analytics in real time.

Unifying the MapReduce and parallel graph processing paradigms, TigerGraph’s computing platform is based on the BSP (Bulk Synchronous Parallel) programming model, which enables developers to implement scalable parallel graph algorithms quickly and easily. A SQL-like graph query language (GSQL) provides for ad hoc exploration and interactive analysis of big data.
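To make the BSP idea concrete, here is a minimal single-process sketch in Python. It is not TigerGraph code and does not use GSQL; the function name `bsp_min_label` and the choice of algorithm (labeling connected components by propagating the smallest vertex id) are assumptions made purely for illustration. Each pass of the loop is one superstep: active vertices send messages along their edges, a barrier ensures all messages are delivered, and only then do vertices update their state.

```python
from collections import defaultdict


def bsp_min_label(vertices, edges):
    """Label every vertex with the smallest vertex id in its connected component.

    Illustrative BSP-style loop (hypothetical helper, not a TigerGraph API).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in vertices}
    active = set(vertices)  # vertices that must send messages this superstep
    supersteps = 0

    while active:
        # Communication phase: each active vertex sends its label to neighbors.
        inbox = defaultdict(list)
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(label[v])

        # Barrier: all messages are collected before any vertex updates.
        active = set()
        for v, messages in inbox.items():
            best = min(messages)
            if best < label[v]:
                label[v] = best
                active.add(v)  # changed vertices stay active next superstep
        supersteps += 1

    return label, supersteps


labels, steps = bsp_min_label([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(labels, steps)
```

In a real distributed engine the two inner loops would run in parallel across machines; the structure of send, synchronize, update is what the sketch is meant to show.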

Unique Advantages

The platform was designed for real-time processing of massive data, extreme scalability and the performance demands of modern data-driven organizations. The architecture is modular and supports both scale-up and scale-out deployment models for distributed applications. General advantages of the NPG include:

Fast data loading to build graphs,
Fast execution of parallel graph algorithms,
Real-time capability for streaming updates and inserts using REST,
Ability to unify real-time analytics with large-scale offline data processing.

Specific performance features include:

Ability to traverse hundreds of millions of vertices/edges per second per machine,
Ability to load 50 to 150 GB of data per hour, per machine,
Ability to stream 2B+ daily events in real time.

This last point was proven on a graph with 100 billion vertices and 600 billion edges running on a cluster of only 20 commodity machines, battle-tested by the world’s largest e-payment company over more than two years in production.
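For a sense of scale, the streaming figure above can be converted into a sustained rate. The short Python sketch below does only that arithmetic; the helper name `events_per_second` is hypothetical, and spreading the load evenly over the day and over the 20 machines is a simplifying assumption, since real traffic is rarely uniform.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def events_per_second(events_per_day, machines=1):
    """Average event rate per machine, assuming a perfectly even spread."""
    if machines < 1:
        raise ValueError("machines must be at least 1")
    return events_per_day / SECONDS_PER_DAY / machines


cluster_rate = events_per_second(2_000_000_000)          # whole cluster
machine_rate = events_per_second(2_000_000_000, 20)      # per machine, 20 nodes
print(f"{cluster_rate:,.0f} events/s overall, {machine_rate:,.0f} per machine")
```

Under those assumptions, 2 billion daily events work out to roughly 23,000 events per second across the cluster, or a little under 1,200 per machine.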

A Closer Look Under the Hood

Fresh Development

The NPG’s core system was developed from scratch using C++ and system programming concepts to provide an integrated data technology stack. A native graph storage engine (GSE) was developed to co-locate with the graph processing engine (GPE) for fast and efficient processing of data and algorithms.

The GPE is designed to provide built-in parallelism for a MapReduce-based computing model, available via APIs. The graph is stored both on disk and in memory in a layout that allows the system to take advantage of data locality on disk, in memory and in the CPU cache.
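The MapReduce-style model described here can be pictured as a map step that runs over edges and a reduce step that combines the emitted values per vertex. The Python sketch below illustrates that pattern only; `edge_map_vertex_reduce` and its parameters are hypothetical names, not the GPE API, and the loop runs sequentially where a real engine would run it in parallel.

```python
from collections import defaultdict


def edge_map_vertex_reduce(edges, edge_map, reduce_fn, initial):
    """Apply edge_map to every edge, then fold emitted values per target vertex.

    edge_map(src, dst) returns an iterable of (target_vertex, value) pairs.
    reduce_fn(accumulated, value) combines values for the same vertex.
    """
    accumulators = defaultdict(lambda: initial)
    for src, dst in edges:
        for target, value in edge_map(src, dst):
            accumulators[target] = reduce_fn(accumulators[target], value)
    return dict(accumulators)


edges = [("a", "b"), ("a", "c"), ("b", "c")]

# In-degree: each edge emits a 1 to its destination; the reducer sums them.
in_degree = edge_map_vertex_reduce(
    edges,
    edge_map=lambda src, dst: [(dst, 1)],
    reduce_fn=lambda acc, value: acc + value,
    initial=0,
)
print(in_degree)
```

Because the reducer here is associative and commutative, the edges could be processed in any order or in parallel partitions and merged afterward, which is what makes this shape attractive for a parallel engine.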

High Compression Rates

The TigerGraph system performs efficient data compression to take further advantage of the memory and CPU cache. The compression ratio (input data size divided by output graph size) varies with the data and its structure; however, a 10x compression ratio is very common. For example, 1TB of input data, once transformed and loaded into the graph, requires only about 100GB of system memory.

Such compression reduces not only the memory footprint but also cache misses, speeding up overall query performance.
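As a rough sizing aid, the compression arithmetic above can be written down directly. This Python sketch assumes decimal units (1TB = 1,000GB) and treats the 10x ratio as a typical value rather than a guarantee; `graph_memory_gb` is a hypothetical helper name.

```python
def graph_memory_gb(input_gb, compression_ratio=10.0):
    """Estimate in-memory graph size from raw input size and a compression ratio.

    compression_ratio = input data size / loaded graph size.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return input_gb / compression_ratio


print(graph_memory_gb(1000))       # 1TB of input at the common 10x ratio
print(graph_memory_gb(1000, 5.0))  # the same input if the data compresses less well
```

Since the ratio varies with the data and its structure, it is worth measuring it on a sample of the actual data set before sizing a cluster.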

MPP Computational Model

In the TigerGraph system, the graph is both a storage and a computational model. Each vertex and edge can be associated with a compute function, so each acts as a parallel unit of storage and computation simultaneously. This is unlike existing graph databases such as Neo4j.

With this approach, vertices and edges in the graph are not just static data units anymore; they become active computing elements. These vertices can send messages to each other via edges and respond to the messages.

A vertex or an edge in the graph can store any amount of arbitrary information. The TigerGraph system executes compute functions in parallel on every vertex/edge, taking advantage of multi-core CPU machines and in-memory computing.

Graph Partitioning

The TigerGraph system supports a variety of graph partitioning algorithms to deliver the best performance. In most cases, the partitioning is automatically performed on the input data and it delivers great results without requiring optimization and tuning. But the flexibility in the TigerGraph system allows application-specific and other mixed partitioning strategies to achieve even greater application performance.
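To illustrate why the choice of partitioning strategy matters, the Python sketch below compares a generic hash partitioner with an application-specific one by counting edges that cross partitions (the "edge cut"). Everything in it is an assumption for illustration: the article does not say which algorithms TigerGraph uses, and the vertex ids, the region prefix convention and the function names are hypothetical.

```python
import zlib


def hash_partition(vertex, num_partitions):
    """Generic strategy: a stable hash of the vertex id picks the partition."""
    return zlib.crc32(str(vertex).encode("utf-8")) % num_partitions


def region_partition(vertex):
    """Application-specific strategy: keep each region's vertices together."""
    return 0 if str(vertex).startswith("us") else 1


def edge_cut(edges, assign):
    """Number of edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if assign(u) != assign(v))


# A toy graph in which all edges stay within a region.
edges = [("us1", "us2"), ("us2", "us3"), ("eu1", "eu2")]

hash_cut = edge_cut(edges, lambda v: hash_partition(v, 2))
region_cut = edge_cut(edges, region_partition)
print(f"hash partitioning cuts {hash_cut} edges, region partitioning cuts {region_cut}")
```

A lower edge cut generally means less cross-machine communication during traversal, which is one reason a partitioner that knows the application's access pattern can outperform a generic default.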

Built into the TigerGraph system is the ability to run multiple graph engines as an active-active network. Each graph engine can host identical graphs with different partitioning algorithms tailored for different types of application queries. The front-end server (generally a REST server) can route application queries to different graph engines based on the types of query.
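The routing idea can be sketched as a small lookup from query type to engine. This is a hedged Python illustration, not TigerGraph's front-end server: the class names, engine names, partitioning labels and query types below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphEngine:
    """A graph engine hosting the same graph under one partitioning scheme."""
    name: str
    partitioning: str


class QueryRouter:
    """Chooses an engine by query type, falling back to a default engine."""

    def __init__(self, default):
        self._default = default
        self._routes = {}

    def register(self, query_type, engine):
        self._routes[query_type] = engine

    def route(self, query_type):
        return self._routes.get(query_type, self._default)


# Two engines hold identical graphs, partitioned differently (assumed labels).
by_hash = GraphEngine("engine-a", "hash")
by_region = GraphEngine("engine-b", "region")

router = QueryRouter(default=by_hash)
router.register("regional_report", by_region)

print(router.route("regional_report").name)  # routed to the region-partitioned copy
print(router.route("point_lookup").name)     # unknown types use the default engine
```

Because both engines are active and hold the same graph, the router is free to pick whichever layout suits the query; keeping the copies in sync is the engine's job and is outside the scope of this sketch.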

A Transformational Technology

Loading large quantities of data is typically slow. For data sets of certain sizes, the TigerGraph NPG can complete in one hour a load that takes over a day with other database systems. Naturally, this has positive implications for use cases with massive amounts of data.

Further, by offering parallelism for large-scale graph analytics, the NPG supports parallel graph algorithms for Very Large Graphs (VLGs), a technological advantage that grows as graphs inevitably grow larger. The NPG works for limited, fast queries that touch anywhere from a small portion of the graph to millions of vertices and edges, as well as for more complex analysis that must touch every single vertex in the graph. Additionally, the NPG’s real-time incremental graph updates make it suitable for real-time graph analytics, unlike other solutions.

An advantage of the NPG lies in the fact that it treats the graph as a computational model. As discussed previously, compute functions can be associated with each vertex and edge in the graph, transforming them into active parallel compute-storage elements that behave much as neurons do in the human brain. Vertices in the graph can exchange messages via edges, facilitating massively parallel and fast computation.

To learn more about the TigerGraph Native Parallel Graph, visit the Wiki here.
