Data Lakes Will Yield More Business Value When Combined with Graph Databases

Although traditional data warehouse environments have been around for years, scalability and performance have become a challenge in the face of the big data deluge. This is where data lakes can help: they enable diverse data sources to be joined at massive scale on commodity hardware and cloud platforms, using a variety of data processing and analytic tools. Data lakes also let users free capacity in their data warehouses, from both a storage and a processing standpoint, resulting in significant savings.

Data lakes can store massive amounts of data from diverse sources: relational and non-relational data, web logs, social media feeds, and other collected sources. However, while data lakes can store and process massive amounts of data, gaining insights into connections in real time remains a challenge. This is where graph databases can help.

In this article I am going to do the following:

1) Examine three representative use cases to show how graph databases can enable data lakes to yield more business value;

2) Provide an overview of how to get started with graph databases, including schema creation and data loading.

Use cases

Customer journey / customer 360

Customer 360 use cases are very popular in data lakes as they can combine multiple data sources into a single view. The data sources can include relational data, streaming data, web logs, emails and documents. Where a graph database can supplement an existing implementation is by providing deeper insights based on a larger set of connections between products, customers, locations, and even time-series data. Providing smarter recommendations based on additional factors can lead to increased revenues and higher customer satisfaction.

Fraud detection

Detecting fraud in the big data era improved significantly with the ability to join massive data sets and apply machine learning models. Predictive models can be built from historical data in the data lake, but uncovering fraud in real time requires speed and accuracy. A graph database can use a combination of entity resolution (identifying and merging records from different sources that refer to the same real-world entity), deep link analytics (analyzing relationships 3 to 10+ hops deep to uncover non-obvious or hidden connections), and pattern matching (uncovering conditional and frequently used patterns) to provide insights based on connections.
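To make "deep link analytics" concrete, here is a minimal sketch in plain Python: a bounded breadth-first search over a toy transaction graph that surfaces a hidden link between two seemingly unrelated accounts. The account, device, and card names are illustrative only; a graph database performs this kind of traversal natively and at scale.

```python
from collections import deque

def shortest_link(graph, start, target, max_hops=10):
    """Breadth-first search up to max_hops; returns the linking path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in graph.get(path[-1], []):
            if neighbor == target:
                return path + [neighbor]
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Toy graph: two "unrelated" accounts are linked via a shared device and card.
graph = {
    "acct:alice": ["device:123", "card:987"],
    "device:123": ["acct:alice", "acct:mule1"],
    "acct:mule1": ["device:123", "card:555"],
    "card:555": ["acct:mule1", "acct:mule2"],
    "acct:mule2": ["card:555"],
}

print(shortest_link(graph, "acct:alice", "acct:mule2"))
# ['acct:alice', 'device:123', 'acct:mule1', 'card:555', 'acct:mule2']
```

A 4-hop path like this is exactly the kind of non-obvious connection that is expensive to express as relational self-joins but cheap to traverse in a graph.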

Metadata management

With the large, diverse data sets joined in data lakes, data lineage and data governance have become a major challenge. Enter the graph database for metadata management. Metadata is merely “data about other data”. With a graph database it is easy to represent metadata such as data structures, table metadata, organizational units, and processes in a hierarchical structure, making it easier to understand data location, data lineage, and data redundancy, and even to demonstrate compliance with regulations such as the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR).
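To illustrate why lineage maps naturally onto a graph, here is a small Python sketch (asset names are hypothetical) that walks a metadata graph upstream to find every source a report transitively depends on:

```python
# Hypothetical metadata graph: edges point from a data set to the
# sources it was derived from.
lineage = {
    "report:quarterly_revenue": ["table:orders_agg"],
    "table:orders_agg": ["table:orders", "table:order_items"],
    "table:orders": ["source:oltp_mysql"],
    "table:order_items": ["source:oltp_mysql"],
}

def upstream(asset, graph):
    """Return every upstream dependency of an asset, transitively."""
    found = set()
    stack = [asset]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(upstream("report:quarterly_revenue", lineage)))
# ['source:oltp_mysql', 'table:order_items', 'table:orders', 'table:orders_agg']
```

The same traversal answers the GDPR/CCPA-style question in reverse: start from a source containing personal data and walk downstream to find every derived data set.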

Getting Started with TigerGraph

A graph database is a purpose-built solution for storing and analyzing connections between entities. Graphs provide a flexible means of representing relationships among connected data and have a flexible schema that can adapt to change. Where data lakes provide insights based on historical data, a graph database can power next-generation analytics based on connected data across historical, operational, and master data.

TigerGraph’s massively parallel processing (MPP) architecture, combined with an efficient query processing engine, allows entities and their relationships to be analyzed at scale. TigerGraph is fully ACID compliant, giving the ability to run both transactional and analytic workloads on the same platform. As with data lakes, TigerGraph’s scalability and flexible schema allow diverse data sets to be joined. TigerGraph should be considered a complementary technology to existing RDBMS and data lake technologies. Interoperability is achieved through industry-standard connectors and tools such as HDFS, S3, Kafka, and Spark via JDBC. Your data lake can now be used to load and maintain the graph, along with feature extraction for next-generation AI and ML engines.

Loading data into TigerGraph

In this exercise we will look at a common scenario: data stored in the lake is used to load and update the graph. For our test scenario, we will use the latest Cloudera Quickstart VM (5.13.0) and the TigerGraph Developer Edition (2.5). We will use Apache Spark on the Cloudera instance to read a Parquet file and send data to TigerGraph via JDBC.

Getting Started: Cloudera

To get started on the Cloudera side, we are going to use Tutorial Exercise 1 – Ingest and Query Relational Data, included with the Quickstart VM.

The workflow is common to many data lake scenarios: 

     ⇨ Use Sqoop to pull transactional data from the RDBMS

     ⇨ Build Hive tables and store them on HDFS as Parquet

     ⇨ Query the data with Impala, Hive, or Spark

The resulting tables and schema are typical, with customers, products, orders, and more.

Since the Cloudera Quickstart VM runs an outdated technology stack, we made some modifications in order to build the final solution:

     ⇨ Upgrade Java to 1.8.0-openjdk: required by Spark 2.x and the TigerGraph JDBC driver

     ⇨ Convert to parcels: allows for the upgrade of Spark from 1.6 to 2.4

     ⇨ Install the Spark 2.4.0 parcel: download a parcel for Spark2, activate it, and then go to Services -> Add new service

     ⇨ Build the TigerGraph JDBC driver (from GitHub): follow the instructions

     ⇨ Install the JDBC driver and add it to ‘classpath.txt’ under /opt/cloudera/parcels/SPARK2/lib/spark2/conf

Building Schema in TigerGraph

Once the data is ingested and queried for validation, we can build the same schema in the graph database. We represent each table as a vertex, columns as attributes, and table joins as edges. This may not be the optimal schema for the use case, but it is a good starting point.
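The table-to-graph mapping can be sketched in a few lines of Python. This is a toy illustration, not TigerGraph's loader; the column names are loosely based on the tutorial's retail schema, so treat them as assumptions.

```python
# Each row becomes a vertex keyed by its primary key; each foreign key
# (a table join) becomes an edge. Column names are illustrative.
customers = [{"customer_id": 1, "customer_fname": "Denise"}]
orders = [{"order_id": 10, "order_customer_id": 1, "order_status": "COMPLETE"}]

vertices = {"customers": {}, "orders": {}}
edges = []  # (source vertex, edge type, target vertex)

for row in customers:
    vertices["customers"][row["customer_id"]] = row

for row in orders:
    vertices["orders"][row["order_id"]] = row
    # The orders.order_customer_id foreign key becomes an "ordered" edge.
    edges.append((("customers", row["order_customer_id"]),
                  "ordered",
                  ("orders", row["order_id"])))

print(edges)  # [(('customers', 1), 'ordered', ('orders', 10))]
```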

Once the schema is created, we need to create a loading job in order to ingest data. We will create a separate loading job for each vertex / Parquet file. The loading job defines the mapping between the source columns and the vertex attributes. A handy reference for the source columns is the set of .csv files found on the Quickstart VM under /home/cloudera. Simply copy the files to the TigerGraph instance under /home/tigergraph/tigergraph/loadingData/.

Here is my loading job for the ‘customers’ vertex using the GSQL interface as the TigerGraph user:

$> gsql
GSQL-Dev > use graph MyGraph
Using graph 'MyGraph'
GSQL-Dev > CREATE LOADING JOB customers FOR GRAPH MyGraph { DEFINE FILENAME file1 = "/home/tigergraph/tigergraph/loadingData/customers.csv"; LOAD file1 TO VERTEX customers VALUES($0, $1, $2, $3, $4, $5, $6, $7, $8) USING SEPARATOR=",", HEADER="false", EOL="\n"; }

Once we have the loading job defined, we can use Cloudera / Spark to load data directly from the Parquet file to the Graph.

Loading Data into TigerGraph

Here is the workflow we will use to load the data:

  • Use Apache Spark to read the data on HDFS in parquet format
  • Use the TigerGraph Spark JDBC connector to load the data into the graph
  • Verify data has been loaded into the graph

We will use Apache Spark to read the file in as a DataFrame, connect to the TigerGraph instance, invoke the loading job, and save the data to the graph. Below is the command used with the spark2-shell running under YARN.

val df ="/user/hive/warehouse/customers")
// invoke loading job
    "driver" -> "com.tigergraph.jdbc.Driver",
    "url" -> "jdbc:tg:",
    "username" -> "tigergraph",
    "password" -> "tigergraph",
    "graph" -> "MyGraph",
    "dbtable" -> "job customers", // loading job name
    "filename" -> "file1", // filename defined in the loading job
    "sep" -> ",", // separator between columns
    "headers" -> "true",
    "eol" -> "\n", // End Of Line
    "schema" -> "value", // column definition, each line only has one column
    "batchsize" -> "10000",
    "debug" -> "0")).save()

The first line defines the filename and uses the Spark Parquet reader. We then configure the connection to the TigerGraph instance and invoke the loading job created in the previous step. Finally, we save the DataFrame to the graph. We can verify via GraphStudio, under the “Load Data” page, that the system loaded 12,345 vertices. (Note: the process can be repeated for additional data files, as long as a loading job has been defined with the proper mappings.)

Gaining Additional Insights Using a Graph Algorithm

TigerGraph extends the recommendations previously possible through its implementation of 20+ open-source graph algorithms expressed in GSQL, our Turing-complete query language. In the following example, we will calculate the 5 most similar customers based on purchase history, which could be used for targeted marketing campaigns and smarter recommendations. We will use the Jaccard similarity algorithm to calculate the most similar customers. The query takes a customer_id as input and limits the results based on the topK value:

CREATE QUERY recommendation(VERTEX<customers> sourceCustomer, set<string> productName, INT topK) FOR GRAPH MyGraph {

This query calculates the Jaccard Similarity between a given customer (of type customers) and every other customer who shares similar products (of type products). The topK “similar” customers are printed out.


Example input:

    sourceCustomer: 833
    productName (optional):
        Perfect Fitness Perfect Rip Deck
        O'Brien Men's Neoprene Life Vest
    topK: 5

A Jaccard similarity score is calculated for each customer who shares purchases with the input sourceCustomer. The set of similar customers is sorted, and the topK customers are printed out.

Jaccard similarity = intersection_size / (size_A + size_B - intersection_size)
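The formula can be checked with a few lines of Python (toy product sets, not the tutorial data):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    intersection_size = len(a & b)
    return intersection_size / (len(a) + len(b) - intersection_size)

# Two customers' purchased-product sets:
a = {"Rip Deck", "Life Vest", "Dumbbells"}
b = {"Rip Deck", "Life Vest", "Basketball", "Cleats"}

print(jaccard(a, b))  # 2 / (3 + 4 - 2) = 0.4
```

Identical sets score 1.0 and disjoint sets score 0.0, which is what makes the score useful for ranking.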

More info:

  • How to find Jaccard similarity?
  • Similarity Algorithms in GSQL

CREATE QUERY recommendation(VERTEX<customers> sourceCustomer, set<string> productName, INT topK) FOR GRAPH MyGraph {

  SumAccum<INT> @intersection_size, @@set_size_A, @set_size_B;
  SumAccum<FLOAT> @similarity;

  A(ANY) = {sourceCustomer};

  // size of the source customer's order set
  A = SELECT s
      FROM A:s
      ACCUM @@set_size_A += s.outdegree("ordered");

  // walk from the source customer out to the purchased products;
  // edge names other than "ordered" are illustrative (use the names
  // defined for the table joins in your schema), and undirected edges
  // are assumed so the same edge types can be traversed back
  ordersSet = SELECT t
      FROM A:s -(ordered)- orders:t;

  order_itemsSet = SELECT t
      FROM ordersSet:s -(order_contains)- order_items:t;

  productsSet = SELECT t
      FROM order_itemsSet:s -(item_product)- products:t
      WHERE productName.size() == 0 OR (t.product_name IN productName);

  // walk back from those products to the other customers who bought them
  order_itemsSet = SELECT t
      FROM productsSet:s -(item_product)- order_items:t;

  ordersSet = SELECT t
      FROM order_itemsSet:s -(order_contains)- orders:t;

  B = SELECT t
      FROM ordersSet:s -(ordered)- customers:t
      WHERE t != sourceCustomer
      ACCUM t.@intersection_size += 1,
            t.@set_size_B = t.outdegree("ordered")
      POST-ACCUM t.@similarity = t.@intersection_size*1.0/
                 (@@set_size_A + t.@set_size_B - t.@intersection_size)
      ORDER BY t.@similarity DESC
      LIMIT topK;

  PRINT B[B.customer_fname, B.customer_lname, B.@similarity];
}

The result set prints the top 5 similar customers, ranked by similarity, in JSON format:

{
  "B": [
    {
      "v_id": "1443",
      "v_type": "customers",
      "attributes": {
        "B.customer_fname": "Denise",
        "B.customer_lname": "Cohen",
        "B.@similarity": 2.16667
      }
    },
    {
      "v_id": "1464",
      "v_type": "customers",
      "attributes": {
        "B.customer_fname": "Amber",
        "B.customer_lname": "Dixon",
        "B.@similarity": 2
      }
    },
    {
      "v_id": "6950",
      "v_type": "customers",
      "attributes": {
        "B.customer_fname": "Nicholas",
        "B.customer_lname": "Smith",
        "B.@similarity": 2
      }
    },
    {
      "v_id": "10175",
      "v_type": "customers",
      "attributes": {
        "B.customer_fname": "Michael",
        "B.customer_lname": "Gibson",
        "B.@similarity": 1.83333
      }
    },
    {
      "v_id": "1492",
      "v_type": "customers",
      "attributes": {
        "B.customer_fname": "Gerald",
        "B.customer_lname": "Patel",
        "B.@similarity": 1.83333
      }
    }
  ]
}
Next Steps

Graph databases complement data lakes, enabling businesses to uncover relationships and insights that would otherwise be unattainable. An initial graph data model can leverage existing, known schemas and then evolve over time. TigerGraph enables this with flexible schema updates and interoperability with your existing infrastructure.

Now is an ideal time for you to get started with TigerGraph. Sign up here to attend one of our weekly live demonstrations, or here to request a demo personalized to the specific needs of your business.
