Data Lakes will Yield More Business Value when Combined with Graph Databases

  • Emily McAuliffe
  • May 7, 2020
  • blog, Business, Graph Databases

Although traditional data warehouse environments have been around for years now, scalability and performance have become a challenge in the face of a big data deluge. This is where data lakes can help – these enable diverse data sources to be joined with massive scale on commodity hardware and cloud solutions, with a variety of data processing and analytic tools. Data lakes allow users to free capacity in their data warehouses both from a storage and processing standpoint, resulting in significant savings.

Data lakes can store massive amounts of data from diverse sources, such as non-relational and relational data, web logs, social media feeds, and other collected sources. However, while data lakes can store and process massive amounts of data, gaining insights into connections in real time remains a challenge. This is where graph databases can help.

In this article I am going to do the following:

1) Examine three representative use cases to show how graph databases can enable data lakes to yield more business value;

2) Provide an overview of how to get started with graph databases, including schema creation and data loading.

Use cases

Customer journey / customer 360

Customer 360 use cases are very popular in data lakes because they combine multiple data sources into a single view. The data sources can include relational data, streaming data, web logs, emails and documents. A graph database can supplement an existing implementation by providing deeper insights based on a larger set of connections between products, customers, locations, and even time-series data. Smarter recommendations based on these additional factors can lead to increased revenue and higher customer satisfaction.

Fraud detection

Fraud detection in the big data era improved significantly with the ability to join massive data sets and apply machine learning models. Predictive models can be built from historical data in the data lake, but uncovering fraud in real time requires speed and accuracy. A graph database can combine entity resolution (identifying and merging entities from different sources that refer to the same real-world entity), deep link analytics (the ability to analyze 3 to 10+ relationships to uncover non-obvious or hidden relationships), and pattern matching (uncovering conditional and frequently used patterns) to provide insights based on connections.
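To make the entity resolution idea concrete, here is a minimal Python sketch that treats records and their identifiers as nodes in a graph and merges every record that shares an identifier. The record ids, identifier values, and the `resolve_entities` helper are all hypothetical and only illustrate the concept; this is not how TigerGraph implements entity resolution.

```python
from collections import defaultdict


def resolve_entities(records):
    """Group record ids that are linked through any shared identifier."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Connect each record node to the identifier nodes it carries.
    for rec_id, identifiers in records.items():
        find(("rec", rec_id))
        for ident in identifiers:
            union(("rec", rec_id), ("id", ident))

    # Records with the same root belong to the same real-world entity.
    groups = defaultdict(set)
    for rec_id in records:
        groups[find(("rec", rec_id))].add(rec_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records from three source systems.
records = {
    "crm-1": {"alice@example.com", "555-0100"},
    "web-7": {"alice@example.com"},
    "pos-3": {"555-0100", "a.smith@example.com"},
    "crm-2": {"bob@example.com"},
}
print(resolve_entities(records))
```

Here "pos-3" and "web-7" share no identifier directly, yet both are linked to "crm-1", so all three end up in one group. That transitive linking is the part that is awkward to express with table joins and natural to express as a graph.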

Metadata management

With large, diverse data sets joined in data lakes, data lineage and data governance have become major challenges. This is where a graph can serve as a metadata management layer. Metadata is simply “data about other data”. With a graph database it is easy to represent metadata such as data structures, table metadata, organizational units, and processes in a hierarchical structure, which makes it easier to understand data location, data lineage, data redundancy, and even compliance with regulations such as the California Consumer Privacy Act (CCPA) and the General Data Protection Regulation (GDPR).
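As a rough illustration of lineage as a graph problem, the Python sketch below stores "derived from" relationships as a directed graph and answers two typical governance questions: what is affected if a source changes, and where did a report come from. The dataset names are made up for the example.

```python
from collections import deque

# Hypothetical lineage: each key feeds the datasets listed in its value.
lineage = {
    "rdbms.customers": ["hdfs.customers_parquet"],
    "rdbms.orders": ["hdfs.orders_parquet"],
    "hdfs.customers_parquet": ["report.customer_360"],
    "hdfs.orders_parquet": ["report.customer_360", "report.revenue"],
}


def downstream(graph, start):
    """Return every dataset derived, directly or indirectly, from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


def upstream(graph, target):
    """Return every dataset that `target` depends on, by walking reversed edges."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return downstream(reverse, target)


print(downstream(lineage, "rdbms.orders"))
print(upstream(lineage, "report.revenue"))
```

The same traversal answers a compliance question such as "which reports contain data that originated in the customers table" without knowing in advance how many hops separate the two.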

Getting Started with TigerGraph

A graph database is a purpose-built solution to store and analyze connections between entities and their relationships. Graphs provide a flexible means for representing the relationships among connected data and have a flexible schema that can adapt to change. Where data lakes can provide insights based on historical data, a graph database can provide next-generation analytics based on connected data across historical, operational and master data.

TigerGraph’s massively parallel processing (MPP) architecture, combined with an efficient query processing engine, allows for analyzing entities and their relationships at scale. TigerGraph is fully ACID compliant, so transactional and analytic workloads can run on the same platform. As with data lakes, TigerGraph’s scalability and flexible schema allow the joining of diverse data sets. TigerGraph should be considered a complementary technology to existing RDBMS and data lake technologies. Interoperability is achieved by using industry standard connectors and tools, such as HDFS, S3, Kafka, and Spark via JDBC. Your data lake can now be used to load and maintain the graph, along with feature extraction for next-generation AI and ML engines.

Loading data into TigerGraph

In this exercise we will look into a common scenario: data stored in the lake is used to load and update the graph.  For our test scenario, we will use the latest Cloudera Quickstart VM (5.13.0) and the TigerGraph Developer Edition (2.5). We will use Apache Spark on the Cloudera instance to read a Parquet file and send data to TigerGraph via JDBC.

Getting Started: Cloudera

To get started on the Cloudera side, we are going to use the Tutorial Exercise 1 – Ingest and Query Relational Data included with the Quickstart VM.  

The workflow is common to many data lake scenarios: 

     ⇨ Use Sqoop to pull transactional data from the RDBMS

     ⇨ Build Hive tables and store them on HDFS as Parquet

     ⇨ Query the data with Impala, Hive, or Spark

The resulting tables and schema are typical of a retail data set, with customers, products, orders and more.

Since the Cloudera Quickstart VM runs an outdated technology stack, we made some modifications in order to build the final solution:

⇨ Upgrade Java to 1.8.0-openjdk: required by Spark 2.x and TigerGraph JDBC driver

⇨ Convert to parcels:  allows for the upgrade of Spark from 1.6 to 2.4

⇨ Install Spark 2.4.0 parcel:  download a parcel for Spark2, activate it, and then go to Services -> Add new service

⇨ Build the TigerGraph JDBC driver (from GitHub):  follow the instructions at https://github.com/tigergraph/ecosys/tree/master/tools/etl/tg-jdbc-driver

⇨ Install the JDBC driver and add it to ‘classpath.txt’:  under /opt/cloudera/parcels/SPARK2/lib/spark2/conf

Building Schema in TigerGraph

Once the data is ingested and queried for validation, we can build the same schema in the graph database. We represent each table as a vertex, its columns as attributes, and the table joins as edges. This may not be the optimal schema for the use case, but it is a good starting point.
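The table-to-vertex mapping can be sketched as a small Python generator that turns table descriptions into schema statements. The emitted text follows the general shape of GSQL's CREATE VERTEX and CREATE DIRECTED EDGE statements and should be checked against the documentation for your TigerGraph version. The column lists, types, and the `order_id` key are assumptions for illustration; only the `customers` and `orders` names, the `customer_fname` and `customer_lname` attributes, and the `ordered` edge name appear elsewhere in this article.

```python
def vertex_ddl(table, primary_key, columns):
    """Build a CREATE VERTEX statement: one table becomes one vertex type."""
    pk_name, pk_type = primary_key
    parts = [f"PRIMARY_ID {pk_name} {pk_type}"]
    parts += [f"{name} {gsql_type}" for name, gsql_type in columns]
    return f"CREATE VERTEX {table} ({', '.join(parts)})"


def edge_ddl(name, from_table, to_table):
    """Build a CREATE DIRECTED EDGE statement: one table join becomes one edge type."""
    return f"CREATE DIRECTED EDGE {name} (FROM {from_table}, TO {to_table})"


# Hypothetical, abbreviated table descriptions.
statements = [
    vertex_ddl("customers", ("customer_id", "INT"),
               [("customer_fname", "STRING"), ("customer_lname", "STRING")]),
    vertex_ddl("orders", ("order_id", "INT"), [("order_status", "STRING")]),
    edge_ddl("ordered", "customers", "orders"),
]
for stmt in statements:
    print(stmt)
```

Generating the statements from a table catalog keeps the first version of the graph schema mechanically aligned with the relational one, which is the "good starting point" described above.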

Once the schema is created, we need to create a loading job in order to ingest data. We will create a separate loading job for each vertex / Parquet file. The loading job defines the mapping between the source columns and the vertex attributes. Handy references for the source columns are the .csv files found on the Quickstart VM under /home/cloudera. Simply copy the files to the TigerGraph instance under /home/tigergraph/tigergraph/loadingData/.

Here is my loading job for the ‘customers’ vertex using the GSQL interface as the TigerGraph user:

$> gsql
GSQL-Dev > use graph MyGraph
Using graph 'MyGraph'
GSQL-Dev > CREATE LOADING JOB customers FOR GRAPH MyGraph { DEFINE FILENAME file1 = "/home/tigergraph/tigergraph/loadingData/customers.csv"; LOAD file1 TO VERTEX customers VALUES($0, $1, $2, $3, $4, $5, $6, $7, $8) USING SEPARATOR=",", HEADER="false", EOL="\n"; }

Once we have the loading job defined, we can use Cloudera / Spark to load data directly from the Parquet file to the Graph.

Loading Data into TigerGraph

Here is the workflow we will use to load the data:

  • Use Apache Spark to read the data on HDFS in parquet format
  • Use the TigerGraph Spark JDBC connector to load the data into the graph
  • Verify data has been loaded into the graph

We will use Apache Spark to read the file in as a DataFrame, connect to the TigerGraph instance, invoke the loading job, and save the data to the graph. Below is the command used with the spark2-shell running under YARN.

val df = spark.read.parquet("/user/hive/warehouse/customers")
    
// invoke loading job
df.write.mode("overwrite").format("jdbc").options(
  Map(
    "driver" -> "com.tigergraph.jdbc.Driver",
    "url" -> "jdbc:tg:http://192.168.1.12:14240",
    "username" -> "tigergraph",
    "password" -> "tigergraph",
    "graph" -> "MyGraph",
    "dbtable" -> "job customers", // loading job name
    "filename" -> "file1", // filename defined in the loading job
    "sep" -> ",", // separator between columns
    "headers" -> "true",
    "eol" -> "\n", // End Of Line
    "schema" -> "value", // column definition, each line only has one column
    "batchsize" -> "10000",
    "debug" -> "0")).save()

The first line reads the Parquet data into a DataFrame with the Spark Parquet reader. We then configure the connection to the TigerGraph instance and invoke the loading job created in the previous step. Finally, we save the DataFrame to the graph. We can verify on the “Load Data” page in GraphStudio that the system loaded 12,345 vertices. (Note: the process can be repeated for additional data files as long as a loading job has been defined with the proper mappings.)
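The "sep", "eol", "schema" and "batchsize" options suggest that each DataFrame row is flattened into a single delimited line and that lines are sent to the loading job in groups. The Python sketch below illustrates that reading only. It is an assumption about the behavior, not the driver's actual implementation, and the helper names and sample rows are made up.

```python
def rows_to_lines(rows, sep=","):
    """Flatten each row into one delimited string (a single 'value' column)."""
    return [sep.join("" if v is None else str(v) for v in row) for row in rows]


def make_payloads(lines, batch_size, eol="\n"):
    """Group lines into batches and join each batch with the end-of-line marker."""
    return [
        eol.join(lines[i:i + batch_size])
        for i in range(0, len(lines), batch_size)
    ]


# Made-up sample rows standing in for the DataFrame contents.
rows = [(1, "Mary", "Smith"), (2, "Ann", None), (3, "Li", "Wong")]
payloads = make_payloads(rows_to_lines(rows), batch_size=2)
print(payloads)
```

Under this reading, the separator passed from Spark has to match the SEPARATOR declared in the loading job, since the loading job is what splits each line back into the $0, $1, ... columns.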

Gaining Additional Insights Using a Graph Algorithm

TigerGraph extends the recommendations previously possible through 20+ open source graph algorithms expressed in GSQL, our Turing-complete programming language. In the following example, we will calculate the 5 most similar customers based on purchase history, which could be used for targeted marketing campaigns and smarter recommendations. We will use the Jaccard similarity algorithm to calculate the most similar customers. The query takes a customer_id as input and limits the results based on the topK value:

CREATE QUERY recommendation(VERTEX<customers> sourceCustomer, set<string> productName, INT topK) FOR GRAPH MyGraph {
/*

This query calculates the Jaccard Similarity between a given customer (of type customers) and every other customer who shares similar products (of type products). The topK “similar” customers are printed out.

SAMPLE INPUT:

 Customer: 833
	    productNames (optional):
		      Perfect Fitness Perfect Rip Deck
			O'Brien Men's Neoprene Life Vest
		  topK: 5

A Jaccard Similarity score is calculated for each similar customer (who share similar purchases with the input sourceCustomer). The set of similar customers is sorted with the topK # customers printed out.

Jaccard similarity = intersection_size / (size_A + size_B - intersection_size)

More info:

  How to find Jaccard similarity?

  https://www.youtube.com/watch?v=5RRyzjvC5z4

  Similarity Algorithms in GSQL

  https://github.com/tigergraph/gsql-graph-algorithms/tree/master/algorithms/examples/Similarity

*/

SumAccum<INT> @intersection_size, @@set_size_A, @set_size_B;
SumAccum<FLOAT> @similarity;

A(ANY) = {sourceCustomer};

A = SELECT s
FROM A:s
ACCUM @@set_size_A += s.outdegree("ordered");

ordersSet = SELECT t
FROM A:s > orders:t;

order_itemsSet = SELECT t
FROM ordersSet > order_items:t;

productsSet = SELECT t
FROM order_itemsSet > products:t
WHERE productName.size() == 0 OR (t.product_name in productName);

order_itemsSet = SELECT t
FROM productsSet:s > order_items:t;

ordersSet = SELECT t
FROM order_itemsSet:s > orders:t;

B = SELECT t
FROM ordersSet:s > customers:t
WHERE t != sourceCustomer
ACCUM t.@intersection_size += 1,
t.@set_size_B = t.outdegree("ordered")
POST-ACCUM t.@similarity = t.@intersection_size*1.0/
(@@set_size_A + t.@set_size_B - t.@intersection_size)
ORDER BY t.@similarity DESC
LIMIT topK;

//PRINT B;
PRINT B[B.customer_fname, B.customer_lname, B.@similarity];
}


The result set prints out the top 5 similar customers by similarity in JSON format:

[
  {
    "B": [
      {
        "v_id": "1443",
        "v_type": "customers",
        "attributes": {
          "B.customer_fname": "Denise",
          "B.customer_lname": "Cohen",
          "B.@similarity": 2.16667
        }
      },
      {
        "v_id": "1464",
        "v_type": "customers",
        "attributes": {
          "B.customer_fname": "Amber",
          "B.customer_lname": "Dixon",
          "B.@similarity": 2
        }
      },
      {
        "v_id": "6950",
        "v_type": "customers",
        "attributes": {
          "B.customer_fname": "Nicholas",
          "B.customer_lname": "Smith",
          "B.@similarity": 2
        }
      },
      {
        "v_id": "10175",
        "v_type": "customers",
        "attributes": {
          "B.customer_fname": "Michael",
          "B.customer_lname": "Gibson",
          "B.@similarity": 1.83333
        }
      },
      {
        "v_id": "1492",
        "v_type": "customers",
        "attributes": {
          "B.customer_fname": "Gerald",
          "B.customer_lname": "Patel",
          "B.@similarity": 1.83333
        }
      }
    ]
  }
]
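The Jaccard formula quoted in the query comment can be checked on a small in-memory example. The Python sketch below applies it to plain sets of purchased products and returns the top-k most similar customers. The customer names and baskets are hypothetical, and the code illustrates the formula only; it is not a translation of the GSQL query.

```python
def jaccard(a, b):
    """intersection_size / (size_A + size_B - intersection_size)"""
    intersection_size = len(a & b)
    union_size = len(a) + len(b) - intersection_size
    return intersection_size / union_size if union_size else 0.0


def top_k_similar(purchases, source, k):
    """Rank other customers who share at least one product with `source`."""
    source_items = purchases[source]
    scored = [
        (other, jaccard(source_items, items))
        for other, items in purchases.items()
        if other != source and source_items & items
    ]
    # Highest similarity first; ties broken by customer name for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical purchase histories.
purchases = {
    "alice": {"rip deck", "life vest", "golf glove"},
    "bob": {"rip deck", "life vest"},
    "carol": {"rip deck", "tent"},
    "dave": {"kayak"},
}
print(top_k_similar(purchases, "alice", 2))
```

For "alice" and "bob" the score is 2 / (3 + 2 - 2), or about 0.67. When both sizes count the same kind of item as the intersection does, the score always falls between 0 and 1.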

Next Steps

Graph databases complement data lakes and enable businesses to uncover relationships and insights which would otherwise be unattainable. An initial graph data model can build on existing, known schemas and then evolve over time. TigerGraph enables this model with flexible schema updates and interoperability with your existing infrastructure.

Now is an ideal time for you to get started with TigerGraph. Sign up here to attend one of our weekly live demonstrations, or here to request a demo personalized to the specific needs of your business.
