TigerGraph: The parallel graph database explained

How TigerGraph achieves fast data ingest, fast graph traversal, and deep link analytics even for large data sets

Graph databases excel at answering complex questions about relationships in large data sets. But they hit a wall—in terms of both performance and analysis capabilities—when the volume of data grows very large, and when the answers must be provided in real time.

That’s because existing graph technologies have trouble loading large quantities of data, or ingesting fast-arriving data, in real time. They also struggle to deliver fast traversal speed. While deeper analytics require deeper traversal of the graph, today’s graph databases typically slow down or time out after two hops of traversal.

TigerGraph is a distributed, native graph computing platform designed to get around these limitations. TigerGraph’s native parallel graph architecture and real-time deep link analytics aim to provide the following advantages:

  • Faster data loading to build graphs quickly
  • Faster execution of parallel graph algorithms
  • Real-time capability for streaming updates and inserts using REST
  • Ability to unify real-time analytics with large-scale offline data processing
  • Ability to scale up and scale out for distributed applications

In the sections that follow, we’ll take a brief look at how graph processing works, explore the benefits of deep link analytics, and lift the hood on TigerGraph to understand how it is able to provide deep link analytics in real time.

Graph traversal: More hops, more insight

Why deep link analytics? Because the more links you can traverse (hop) in a graph, the greater the insight you achieve. Consider a hybrid knowledge and social graph. Each node connects to what you know and who you know. Direct links (one hop) reveal what you know. Two hops reveal everything that your friends and acquaintances know. Three hops? You are on your way to revealing what everyone knows.

The graph advantage is knowing the relationships between the data entities in the data set, which is the heart of knowledge discovery, modeling, and prediction. Each hop can lead to an exponential growth in the number of connections and, accordingly, in the amount of knowledge. But therein lies the technological hurdle. Only a system that performs hops efficiently and in parallel can deliver real-time deep link (multi-hop) analytics.

A simple example like real-time personalized recommendation reveals the value and power of following multiple links in a graph:

“Customers who liked what you liked also bought these items.”

This translates into a three-hop query:

  1. Starting from a person (you), identify items you viewed / liked / bought.
  2. Find other people who have viewed / liked / bought those items.
  3. Identify additional items bought by those people.

Person → product → (other) persons → (other) products

Using previous graph technology, you’d be limited to only two hops in larger data sets. TigerGraph easily extends the query to three or more hops to deliver key insights from within very large data sets.
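
To make this concrete, here is a rough sketch of such a recommendation query in GSQL, TigerGraph’s query language (covered later in this article). The schema is hypothetical: Person and Product vertices connected by Liked and Bought edges, modeled here as undirected so they can be traversed from either end. Treat it as an illustration of the three-hop pattern, not production code.

// Hypothetical schema: Person and Product vertices with Liked and Bought edges.
CREATE QUERY recommend_products(VERTEX<Person> me) FOR GRAPH RetailGraph {
  SumAccum<INT> @score;   // per-product count of "people like you" who bought it

  Start = {me};

  // Hop 1: items you viewed / liked / bought (simplified here to Liked)
  MyItems = SELECT p
            FROM Start:s -(Liked:e)-> Product:p;

  // Hop 2: other people who liked those same items
  // (excluding the starting person is omitted for brevity)
  Others = SELECT o
           FROM MyItems:p -(Liked:e)-> Person:o;

  // Hop 3: additional items those people bought, scored by how many bought each one
  Recs = SELECT p2
         FROM Others:o -(Bought:e)-> Product:p2
         ACCUM p2.@score += 1
         ORDER BY p2.@score DESC
         LIMIT 10;

  PRINT Recs;
}

In practice you would install the query (INSTALL QUERY) and then run it for a given starting person; the point here is simply that each additional hop is one more SELECT statement, not a fundamentally harder query.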

TigerGraph’s real-time deep link analytics

TigerGraph supports three to more than 10 hops of traversal across a big graph, along with fast graph traversal speed and data updates. This combination of speed, deep traversals, and scalability offers huge advantages for several use cases.

One use case is fraud prevention. A common way that businesses detect potential fraud is to find connections to known bad transactions. For example, starting from an incoming credit card transaction, here is one path to bad transactions:

New transaction → credit card → cardholder → (other) credit cards → (other) bad transactions

As a graph query, this pattern uses four hops to find connections only one card away from the incoming transaction. Today’s fraudsters try to disguise their activity through circuitous connections between themselves and known bad activity or bad actors. To detect fraud accurately, you need to explore multiple possible patterns and assemble a more holistic view.
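
Here is a hedged GSQL sketch of that four-hop path. The Transaction, Card, and Person vertex types, the Charged_To and Owned_By edge types, and the is_fraud attribute are all assumptions made for illustration, with edges modeled as undirected so they can be traversed in either direction; a real payments schema would differ.

// Hypothetical schema: Transaction, Card, and Person vertices linked by
// Charged_To and Owned_By edges; is_fraud is an assumed attribute on Transaction.
CREATE QUERY connected_bad_transactions(VERTEX<Transaction> new_txn) FOR GRAPH Payments {
  Start = {new_txn};

  // Hop 1: the card used in the incoming transaction
  Cards = SELECT c
          FROM Start:t -(Charged_To:e)-> Card:c;

  // Hop 2: the cardholder behind that card
  Holders = SELECT h
            FROM Cards:c -(Owned_By:e)-> Person:h;

  // Hop 3: other cards held by the same person
  // (excluding the original card is omitted for brevity)
  OtherCards = SELECT c2
               FROM Holders:h -(Owned_By:e)-> Card:c2;

  // Hop 4: known bad transactions on those cards
  BadTxns = SELECT b
            FROM OtherCards:c -(Charged_To:e)-> Transaction:b
            WHERE b.is_fraud == true;

  PRINT BadTxns;
}

Each SELECT statement adds one hop, so extending the search further out, say to devices or merchants linked to those cards, is a matter of adding statements rather than rethinking the query.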

With the ability to uncover multiple, hidden connections, TigerGraph is able to minimize credit card fraud. This traversal pattern applies to many other use cases—where you can simply replace the credit card transaction with a web click event, a phone call record, or a money transfer.

TigerGraph system overview

The ability to draw deep connections between data entities in real time requires new technology designed for scale and performance. Many design decisions work together to achieve TigerGraph’s breakthrough speed and scalability. Below we will look at these design features and discuss how they fit together.

A native graph

TigerGraph is a pure graph database, from the ground up. Its data store holds nodes, links, and their attributes, period. Some graph database products on the market are really wrappers built on top of a more generic NoSQL data store. This virtual graph strategy has a double penalty when it comes to performance. First, the translation from virtual graph operation to physical storage operation introduces some extra work. Second, the underlying structure is not optimized for graph operations.

Compact storage with fast access

We don’t describe TigerGraph as an in-memory database, because having data in memory is a preference but not a requirement. Users can set parameters that specify how much of the available memory may be used for holding the graph. If the full graph does not fit in memory, then the excess is stored on disk. Best performance is achieved when the full graph fits in memory, of course. 

Data values are stored in encoded formats that effectively compress the data. The compression factor varies with the graph structure and data, but typical compression factors are between 2x and 10x. Compression has two advantages: First, a larger amount of graph data can fit in memory and in cache. Such compression reduces not only the memory footprint, but also CPU cache misses, speeding up overall query performance. Second, for users with very large graphs, hardware costs are reduced. For example, if the compression factor is 4x, then an organization may be able to fit all its data in one machine instead of four.

Decompression/decoding is very fast and transparent to end users, so the benefits of compression outweigh the small time delay for compression/decompression. In general, decompression is needed only for displaying the data. When values are used internally, often they may remain encoded and compressed.

Hash indices are used to reference nodes and links. In Big-O terms, our average access time is O(1) and our average index update time is also O(1). Translation: accessing a particular node or link in the graph is very fast, and stays fast even as the graph grows in size. Moreover, maintaining the index as new nodes and links are added to the graph is also very fast.

Parallelism and shared values

When speed is your goal, you have two basic routes: Do each task faster, or do multiple tasks at once. The latter avenue is parallelism. While striving to do each task quickly, TigerGraph also excels at parallelism. Its graph engine uses multiple execution threads to traverse a graph.

The nature of graph queries is to “follow the links.” Start from one or more nodes. Look at the available connections from those nodes and follow those connections to some or all of the neighboring nodes. We say you have just “traversed” one “hop.” Repeat that process to go to the original node’s neighbors’ neighbors, and you have traversed two hops. Since each node can have many connections, this two-hop traversal involves many paths to get from the start nodes to the destination nodes. Graphs are a natural fit for parallel, multithreaded execution.

A query of course should do more than just visit a node. In a simple case, we can count the number of unique two-hop neighbors or make a list of their IDs. How do you compute a total count when you have multiple parallel counters? The process is similar to what you would do in the real world: Ask each counter to do its share of the work, and then combine their results in the end.

Recall that the query asked for the number of unique nodes. There is a possibility that the same node has been counted by two different counters, because there is more than one path to reach that destination. This problem can occur even with a single-threaded design. The standard solution is to assign a temporary variable to each node. The variables are initialized to False. When one counter visits a node, it sets that node’s variable to True, so that other counters know not to count it again.

Storage and processing engines written in C++

Language choices also affect performance. TigerGraph’s graph storage engine and processing engine are implemented in C++. Within the family of general-purpose procedural languages, C and C++ are considered lower-level than languages like Java. What this means is that programmers who understand how the computer hardware executes their software commands can make informed choices to optimize performance. TigerGraph has been carefully designed to use memory efficiently and to release unused memory. Careful memory management contributes to TigerGraph’s ability to traverse many links, both in terms of depth and breadth, in a single query.

Many other graph database products are written in Java, which has pros and cons. Java programs run inside a Java Virtual Machine (JVM). The JVM takes care of memory management and garbage collection (freeing up memory that is no longer needed).  While this is convenient, it is difficult for the programmer to optimize memory usage or to control when unused memory becomes available.

GSQL graph query language

TigerGraph also has its own graph querying and update language, GSQL. While there are many nice details about GSQL, I will focus on two aspects that are key to supporting efficient parallel computation: the ACCUM clause and accumulator variables.

The core of most GSQL queries is the SELECT statement, modeled closely after the SELECT statement in SQL. The SELECT, FROM, and WHERE clauses are used to select and filter a set of links or nodes. After this selection, the optional ACCUM clause may be used to define a set of actions to be performed by each link or adjacent node. I say “performed by” rather than “performed on” because conceptually, each graph object is an independent computation unit. The graph structure is acting like a massively parallel computational mesh. The graph is not just your data storage; it is your query or analytics engine as well.

An ACCUM clause may contain many different actions or statements. These statements can read values from the graph objects, perform local computations, apply conditional statements, and schedule updates of the graph. (Updates do not take place until the query is over.)

To support these distributed, in-query computations, the GSQL language provides accumulator variables. Accumulators come in many flavors, but they are all temporary (existing only during query execution), shared (available to any of the execution threads), and mutually exclusive (only one thread can update a given accumulator at a time). For example, a simple sum accumulator would be used to perform the count of all the neighbors’ neighbors mentioned above. A set accumulator would be used to record the IDs of all those neighbors’ neighbors. Accumulators are available in two scopes: global and per-node. In the earlier query example, we mentioned the need to mark each node as visited or not. Here, per-node accumulators would be used.
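
As a concrete (and hedged) sketch, here is how the unique two-hop neighbor count described earlier might look in GSQL, using a global SumAccum for the running total and a per-node OrAccum as the visited flag. The Person vertex type and Friend edge type are assumptions, and exact syntax can vary between GSQL versions.

// Hypothetical schema: Person vertices connected by undirected Friend edges.
CREATE QUERY count_two_hop_neighbors(VERTEX<Person> src) FOR GRAPH Social {
  SumAccum<INT> @@total = 0;   // global accumulator: one shared running count
  OrAccum @visited = false;    // per-node accumulator: the "visited" flag

  Start = {src};

  // Hop 1: visit direct neighbors; mark them (and the start node) as visited
  FirstHop = SELECT t
             FROM Start:s -(Friend:e)-> Person:t
             ACCUM s.@visited += true,
                   t.@visited += true;

  // Hop 2: select neighbors-of-neighbors not yet seen, then count each
  // selected node exactly once in the POST-ACCUM phase
  SecondHop = SELECT t
              FROM FirstHop:s -(Friend:e)-> Person:t
              WHERE t.@visited == false
              POST-ACCUM @@total += 1;

  PRINT @@total;
}

Because each SELECT statement acts as a barrier, the @visited flags written during the first hop are all in place before the second hop’s WHERE clause reads them, which is how the parallel counters avoid counting the same node twice.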

MPP computational model

To reiterate what we have revealed above, the TigerGraph graph is both a storage model and a computational model. Each node and link can be associated with a compute function. Therefore, TigerGraph acts as a parallel unit of storage and computation simultaneously. This would be unachievable using a generic NoSQL data store or without the use of accumulators.

Automatic partitioning

In today’s big data world, enterprises need their database solutions to be able to scale out to multiple machines, because their data may grow too large to be stored economically on a single server. TigerGraph is designed to automatically partition the graph data across a cluster of servers, and still perform quickly. The hash index is used to determine not only where data is located within a server, but also which server it lives on. All the links that connect out from a given node are stored on the same server. Computer science theory tells us that finding the best overall graph partitioning, if we could even define “best,” is usually very slow, so we don’t try. Our default mode is to use random hashing, which works very well in most cases. The TigerGraph system also supports user-directed partitioning for users who have a particular partitioning scheme in mind.

Distributed computation mode

TigerGraph recently introduced a distributed computation mode that significantly improves performance for analytical queries that traverse a large portion of the graph. In distributed query mode, all servers are asked to work on the query; each server’s actual participation is on an as-needed basis. When a traversal path crosses from server A to server B, the minimal amount of information that server B needs is passed along. Since server B already knows about the overall query request, it can easily contribute its share of the work.
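
In GSQL this mode is requested when the query is created. The sketch below is a minimal illustration (the graph name, query body, and Friend edge type are assumptions); consult the TigerGraph documentation for the exact syntax your version supports.

// Adding the DISTRIBUTED keyword asks every server in the cluster to take part
// in executing the query; the query body itself does not change.
CREATE DISTRIBUTED QUERY count_friendships() FOR GRAPH Social {
  SumAccum<INT> @@edges = 0;
  AllPeople = {Person.*};
  R = SELECT s
      FROM AllPeople:s -(Friend:e)-> Person:t
      ACCUM @@edges += 1;
  PRINT @@edges;
}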
