Graph-Based Customer Entity Resolution

  • Yu Xu
  • December 18, 2018
  • blog, Developers, Graph Databases

1. Problem Description

Enterprises know the benefits of merging data from multiple sources to build more detailed and more complete records about their customers, products, employees, and so on. The sources may be different departments or computer systems within one enterprise, a combination of internal and external data sources, or the result of a corporate merger or acquisition.

Merging data sources is not always easy, however. One challenge in particular is Entity Resolution: deciding when multiple records from different data sources actually represent the same real-world entity, and then merging them into one entity. Consider the following example:

Assume there are three data sources containing the following types of customer information:

     Source1 (SSN, Email, Address)
     Source2 (SSN, Phone, Name, Age)
     Source3 (Email, Phone, Gender)

Furthermore, assume that SSN, Email, and Phone are each sufficient to uniquely identify an individual (that is, they constitute PII, personally identifiable information). In the real world, phone numbers and email addresses might not uniquely identify an individual, but for this example we will assume they do. The problem is that the different sources use different identifiers, and that individual records might be missing some information. Over time, missing PII of a customer may show up later in another data source. The goal is to use whatever PII we have about a customer to find all information (attributes) of a customer across all data sources.

First, if we do not merge the data tables, then trying to piece together information about one customer at runtime is very challenging and very ad hoc. For example, if we want to search on an SSN 123-45-7890 to find all information about this customer, we need to do the following:

  1. Search Source1 on the SSN column to find Address and Email values.
  2. If an Email value is present in Source1, then use that Email value (say e1) to search Source3 on the Email column to find Gender and Phone values.
  3. If a Phone value is present in Source3, then use that Phone value (say p1) to search Source2 to find Name and Age.
  4. We also need to search Source2 on SSN to see if there is a matching record.

We can try different sequences of searching the three data sources to piece together all the information about one customer. For example, we can first search Source2 on the SSN column, then search Source3 and Source1. Still, every search strategy has the same complexity. This back-and-forth piecing together of customer information is not only slow, but also becomes unmanageable as more data sources are added.
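To make the cost of this runtime search concrete, here is a minimal Python sketch of the chained lookup. The three sources are modeled as hypothetical in-memory lists of dictionaries, and the rows, the placeholder values (e1, p1, a1, n1) and the function name `find_customer` are illustrative assumptions, not part of any real system. The point is that the number of passes over the sources is not known in advance: each newly discovered PII value can unlock further matches.

```python
# Hypothetical in-memory stand-ins for the three sources; None marks a missing value.
SOURCE1 = [{"SSN": "123-45-7890", "Email": "e1", "Address": "a1"}]   # (SSN, Email, Address)
SOURCE2 = [{"SSN": None, "Phone": "p1", "Name": "n1", "Age": 22}]     # (SSN, Phone, Name, Age)
SOURCE3 = [{"Email": "e1", "Phone": "p1", "Gender": "f"}]             # (Email, Phone, Gender)
SOURCES = [SOURCE1, SOURCE2, SOURCE3]
PII_FIELDS = ("SSN", "Email", "Phone")


def find_customer(field, value):
    """Chase PII values across all sources until no new PII value is found."""
    known = {(field, value)}   # PII values known to belong to this customer
    profile = {}               # attributes collected so far
    changed = True
    while changed:             # repeat: a new PII value may match more rows
        changed = False
        for source in SOURCES:
            for row in source:
                row_pii = {(f, row[f]) for f in PII_FIELDS if row.get(f) is not None}
                if not (row_pii & known):
                    continue               # row shares no PII with the customer
                if not row_pii <= known:
                    known |= row_pii       # learned a new identifier
                    changed = True
                for key, val in row.items():
                    if val is not None:
                        profile.setdefault(key, val)
    return profile


profile = find_customer("SSN", "123-45-7890")
print(profile)
```

In this sketch the Source2 row has no SSN, so it is only reachable through the phone number discovered in Source3, which is exactly the kind of indirect chain described in the steps above.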

To avoid this clumsy runtime searching, we can try to create a new unified table to store all information about customers. For example, we may try to create a unified Customer table like

Customer (SSN, Email, Phone, Name, Age, Gender, Address).

The immediate challenge of this approach with an RDBMS is that we don’t know which PII field should be the Primary Key (which cannot be missing or NULL), since any of the PII values could be missing at some time. We could add a system-generated unique CustomerID to each row in the Customer table to get around the Primary Key requirement. However, when we need to merge records from multiple data sources into a single record in the Customer table, we still need to piece together information back and forth from the multiple data sources (similar to the search example discussed earlier). Relational databases are not designed for complex ETL processing that requires ‘chaining’ and ‘merging’ records over an unknown number of steps. Every new data source with a new identifier or new attribute will require going through this slow and difficult process of searching on multiple attributes for a match, and then creating or altering a table. Experienced RDBMS developers and DBAs know the inherent limitations of an RDBMS in handling this type of ETL and entity resolution.

2. Graph-Based Solution

The good news is that graphs are a natural solution for problems such as entity resolution.

2.1 Graph Modeling

A recommended graph-based solution is to create a Customer Vertex for each customer, connected to various PII vertices such as SSN, Email, Phone.

Figure 1 shows an example graph schema (created in GraphStudio, TigerGraph’s browser-based visual SDK). Each Customer vertex represents a unique customer. Each PII vertex (Phone, SSN, or Email) represents a unique PII value. This schema represents our flexible and extensible concept of an “identifier” for a Customer. Any one of the three vertex types is sufficient to identify a customer, and more can be added. All the other attributes (such as Age, Gender, and Address) are stored with the Customer vertex.

Figure 1
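As a rough illustration of this schema outside of GraphStudio, the following Python sketch models the two kinds of vertices with dataclasses. The class names `PIIVertex` and `CustomerVertex`, the field names, and the sample values are assumptions made for this sketch; they are not TigerGraph's schema definition syntax.

```python
from dataclasses import dataclass, field

PII_TYPES = ("SSN", "Email", "Phone")  # more identifier types can be added here


@dataclass(frozen=True)
class PIIVertex:
    """One unique PII value, for example PIIVertex("SSN", "s1")."""
    pii_type: str
    value: str


@dataclass
class CustomerVertex:
    """One unique customer; non-identifying attributes live on the vertex."""
    customer_id: str
    attributes: dict = field(default_factory=dict)   # e.g. Age, Gender, Address
    identifiers: set = field(default_factory=set)    # edges to PIIVertex objects


# A customer connected to one SSN vertex. Adding the same PII value twice
# does not create a second vertex, because PIIVertex is hashable by value.
u1 = CustomerVertex("U1", attributes={"Address": "a1"})
u1.identifiers.add(PIIVertex("SSN", "s1"))
u1.identifiers.add(PIIVertex("SSN", "s1"))
print(len(u1.identifiers))
```

Keeping identifiers as separate vertices rather than as columns is what allows any of them to be absent and new identifier types to be added without restructuring the customer record.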

2.2 Data Loading and Graph Creation

We will first show, step by step, how we incrementally build and merge a graph as data are loaded. Then we’ll summarize the steps as an algorithm.

Assume we have the example data used in the steps below, where _ means that the information is missing or unavailable at the time.

Line 1: The first line of data in Source1, <s1, _, a1>, generates the graph shown below.

Since SSN s1 is not yet in the graph, an SSN vertex s1 is created. Since there is no email present in this record, we neither create an Email vertex nor try to match to an existing Email vertex. Every data record must match to a new or existing customer, so a new Customer vertex U1 is created with the attribute data Address=a1, and an edge between the two vertices (U1 and s1) is added.

Line 2: Loading the second line in Source1 <_, e1, a2> produces the graph shown below.

Since Email e1 is not yet in the graph, a new Email vertex e1 is created. Since no SSN value is present, there is no way we can connect this record to any existing customer. Thus a new Customer vertex U2 is created and an edge between the two vertices (U2—e1) is created.

Line 3: Loading the third line in Source1 <s2, e2, a3> produces the graph shown below.

Since neither s2 nor e2 is in the graph, two new vertices (SSN vertex s2, Email vertex e2) are created, a single Customer vertex U3 is created, and two edges (U3—s2 and U3—e2) are added.

So far, each data line has created a separate graph pattern. This is expected, because each line in a table should represent independent information. Next, as we start to read data from Source2, we have the potential to perform some merging.

Line 4: Loading the first line in Source2 <s1, p1, n1, 22> produces the graph shown below.

Since s1 is in the graph already, we note its Customer U1. Since p1 isn’t in the graph yet, a Phone Vertex p1 is created, and an edge is added between p1 and s1’s Customer vertex U1. Additionally, we add the attribute values Name=n1 and Age=22 to U1. Notice that here we are essentially ‘merging’ two records: the first record of Source1 and the first record of Source2.

Line 5: Loading the second line in Source2 <s2, p2, n2, 32> produces the graph shown below.

Since SSN vertex s2 is already in the graph, only a Phone vertex p2 is created, and no new Customer vertex is created. An edge between p2 and s2’s Customer vertex U3 is added. Again we are merging two records: the last record of Source1 and the last record of Source2.

Line 6: Finally, loading the first line in Source3 <e1, p1, f> produces the graph shown below.

Both e1 and p1 are in the graph already, but are connected to two different Customer vertices (U2 and U1 respectively). If we are certain that all our data values are correct and are unique identifiers, then Customer vertices U1 and U2 should be merged.

Notice that there is a data conflict issue here: U1 has an address a1, and U2 has an address a2. How should we handle data conflicts when we merge two vertices?

In TigerGraph, this is completely up to the user or each use case’s business requirements. User-defined functions can be used in conjunction with GSQL’s already rich semantics to decide whether to merge and then how to merge. For example, when merging multiple values into one attribute, you can keep the older value, use the newer value, or concatenate/append them as a list of values (recording their sources for data lineage/provenance management). In this example, we retain the vertex U1 and address a1 for simplicity.
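The three conflict policies mentioned above can be sketched in plain Python as follows. This is only an illustration of the idea, not a GSQL user-defined function: the function name `merge_attribute`, the policy names, and the convention of storing each value as a (value, source) pair are assumptions made for this sketch.

```python
def merge_attribute(old, new, policy="keep_old"):
    """Resolve a conflict between two values of one attribute.

    Each value is a (value, source) pair so that provenance is preserved.
    """
    if old is None:
        return new
    if new is None:
        return old
    if policy == "keep_old":
        return old
    if policy == "use_new":
        return new
    if policy == "append":
        # Keep every distinct value together with the source it came from.
        old_list = old if isinstance(old, list) else [old]
        new_list = new if isinstance(new, list) else [new]
        return old_list + [v for v in new_list if v not in old_list]
    raise ValueError(f"unknown policy: {policy}")


# Merging U2's address into U1's address under each policy.
a1 = ("a1", "Source1")
a2 = ("a2", "Source1")
print(merge_attribute(a1, a2, "keep_old"))
print(merge_attribute(a1, a2, "use_new"))
print(merge_attribute(a1, a2, "append"))
```

The `keep_old` policy corresponds to the choice made in the example, where U1 keeps address a1.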

2.3 A Data Merging and Entity Resolution Algorithm

The following pseudocode generalizes the steps we presented in our example above:
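The steps walked through in the example can be sketched in Python as shown here. This is a minimal illustration under stated assumptions, not TigerGraph's loading job: the graph is held in two in-memory dictionaries, the class name `CustomerGraph` and its methods are hypothetical, customer IDs are integers (1 stands for U1), and conflicts are resolved by keeping the existing value, as in the example.

```python
PII_FIELDS = ("SSN", "Email", "Phone")


class CustomerGraph:
    def __init__(self):
        self.pii_owner = {}   # (pii_type, value) -> customer id (a PII-Customer edge)
        self.customers = {}   # customer id -> attribute dict
        self._next_id = 1

    def load(self, record):
        """Load one record (a dict); None marks a missing value."""
        pii = [(f, record[f]) for f in PII_FIELDS if record.get(f) is not None]
        attrs = {k: v for k, v in record.items()
                 if k not in PII_FIELDS and v is not None}

        # Customers already connected to any PII value in this record.
        matched = sorted({self.pii_owner[p] for p in pii if p in self.pii_owner})

        if not matched:
            cid = self._next_id              # no match: create a new Customer
            self._next_id += 1
            self.customers[cid] = {}
        else:
            cid = matched[0]                 # retain the earliest Customer
            for other in matched[1:]:        # merge any other matches into it
                self._merge(cid, other)

        for p in pii:
            self.pii_owner[p] = cid          # add the PII vertex and edge if absent
        for k, v in attrs.items():
            self.customers[cid].setdefault(k, v)   # keep existing value on conflict
        return cid

    def _merge(self, keep, drop):
        """Re-point the dropped Customer's PII edges and fold in its attributes."""
        for p, owner in self.pii_owner.items():
            if owner == drop:
                self.pii_owner[p] = keep
        for k, v in self.customers.pop(drop).items():
            self.customers[keep].setdefault(k, v)


graph = CustomerGraph()
for rec in [
    {"SSN": "s1", "Email": None, "Address": "a1"},           # Source1, line 1
    {"SSN": None, "Email": "e1", "Address": "a2"},           # Source1, line 2
    {"SSN": "s2", "Email": "e2", "Address": "a3"},           # Source1, line 3
    {"SSN": "s1", "Phone": "p1", "Name": "n1", "Age": 22},   # Source2, line 1
    {"SSN": "s2", "Phone": "p2", "Name": "n2", "Age": 32},   # Source2, line 2
    {"Email": "e1", "Phone": "p1", "Gender": "f"},           # Source3, line 1
]:
    graph.load(rec)
print(graph.customers)
```

After the sixth record, customer 2 has been merged into customer 1, which keeps Address a1, mirroring the outcome of the walkthrough.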

Using TigerGraph’s GSQL high-level loading language, expressing this business logic as a parallel-processing loading job is a straightforward task.

2.4 Querying a customer

Querying a customer using any PII field or a combination of PII fields is straightforward. For example, given any email, phone, or SSN, we can quickly find the Customer vertex connected to that PII vertex, along with all attributes associated with the Customer vertex. More specifically, a GSQL query like

    getCustomer(SSN s1, Phone p1, Email e1)

takes a set of PII values as input and returns all customer information related to them. The only requirement on the query input is that at least one PII value is present.
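The lookup behind such a query can be sketched in Python as follows. This is not GSQL and not the actual `getCustomer` query: the function `get_customer`, its keyword parameters, and the two dictionaries standing in for the resolved graph are assumptions made for illustration. The sample data is the end state of the loading example above.

```python
# Hypothetical resolved graph: PII vertex -> Customer id, and Customer attributes.
PII_OWNER = {
    ("SSN", "s1"): "U1", ("Email", "e1"): "U1", ("Phone", "p1"): "U1",
    ("SSN", "s2"): "U3", ("Email", "e2"): "U3", ("Phone", "p2"): "U3",
}
CUSTOMERS = {
    "U1": {"Address": "a1", "Name": "n1", "Age": 22, "Gender": "f"},
    "U3": {"Address": "a3", "Name": "n2", "Age": 32},
}


def get_customer(ssn=None, phone=None, email=None):
    """Return {customer id: attributes} for every customer matching the given PII."""
    given = [(t, v) for t, v in (("SSN", ssn), ("Phone", phone), ("Email", email))
             if v is not None]
    if not given:
        raise ValueError("at least one PII value is required")
    # One hop from each PII vertex to its Customer vertex.
    ids = {PII_OWNER[p] for p in given if p in PII_OWNER}
    return {cid: CUSTOMERS[cid] for cid in ids}


print(get_customer(email="e1"))
print(get_customer(ssn="s2", phone="p2"))
```

Because every PII vertex is directly connected to its Customer vertex, the lookup is a single hop regardless of which identifier is supplied, in contrast to the multi-pass search needed over the unmerged sources.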

3. Summary

Graph databases provide an easily-understood and efficient way to merge data sets and to perform the critical entity resolution:

  1. The flexibility of graph schemas makes it easy to handle missing information and dynamic schema changes.
  2. Graph-based parallel loading and transformation makes data mapping and resolution efficient and easy to manage.
  3. Graph-based inference makes chaining/merging of records (entities) simple and efficient.

TigerGraph’s powerful Native Parallel Graph technology [1] offers the best platform for customer entity resolution. Graph-based customer entity resolution is only one example of harnessing the power of a graph database. Connecting your data and constructing a true Customer 360 or other knowledge graph is only the start. The real benefits of a fast, flexible, and scalable graph database arise from the queries, analytics, and resulting insights available to you. Graph-based customer profiling using techniques such as vertex clustering, vertex and neighbor similarity, influence ranking, temporal analysis, and churn prediction will deliver actionable insights, enriching and empowering any enterprise and putting you ahead of others.

[1] Native Parallel Graph: The Next Generation of Graph Database for Real-Time Deep Link Analytics.

 

 

    • Copyright © 2023 TigerGraph