Skip to content
START FOR FREE
START FOR FREE
  • SUPPORT
  • COMMUNITY
Menu
  • SUPPORT
  • COMMUNITY
MENUMENU
  • Products
    • The World’s Fastest and Most Scalable Graph Platform

      LEARN MORE

      Watch a TigerGraph Demo

      TIGERGRAPH CLOUD

      • Overview
      • TigerGraph Cloud Suite
      • FAQ
      • Pricing

      USER TOOLS

      • GraphStudio
      • Insights
      • Application Workbenches
      • Connectors and Drivers
      • Starter Kits
      • openCypher Support

      TIGERGRAPH DB

      • Overview
      • GSQL Query Language
      • Compare Editions

      GRAPH DATA SCIENCE

      • Graph Data Science Library
      • Machine Learning Workbench
  • Solutions
    • The World’s Fastest and Most Scalable Graph Platform

      LEARN MORE

      Watch a TigerGraph Demo

      Solutions

      • Solutions Overview

      INCREASE REVENUE

      • Customer Journey/360
      • Product Marketing
      • Entity Resolution
      • Recommendation Engine

      MANAGE RISK

      • Fraud Detection
      • Anti-Money Laundering
      • Threat Detection
      • Risk Monitoring

      IMPROVE OPERATIONS

      • Supply Chain Analysis
      • Energy Management
      • Network Optimization

      By Industry

      • Advertising, Media & Entertainment
      • Financial Services
      • Healthcare & Life Sciences

      FOUNDATIONAL

      • AI & Machine Learning
      • Time Series Analysis
      • Geospatial Analysis
  • Customers
    • The World’s Fastest and Most Scalable Graph Platform

      LEARN MORE

      CUSTOMER SUCCESS STORIES

      • Ford
      • Intuit
      • JPMorgan Chase
      • READ MORE SUCCESS STORIES
      • Jaguar Land Rover
      • United Health Group
      • Xbox
  • Partners
    • The World’s Fastest and Most Scalable Graph Platform

      LEARN MORE

      PARTNER PROGRAM

      • Partner Benefits
      • TigerGraph Partners
      • Sign Up
      TigerGraph partners with organizations that offer complementary technology solutions and services.​
  • Resources
    • The World’s Fastest and Most Scalable Graph Platform

      LEARN MORE

      BLOG

      • TigerGraph Blog

      RESOURCES

      • Resource Library
      • Benchmarks
      • Demos
      • O'Reilly Graph + ML Book

      EVENTS & WEBINARS

      • Graph+AI Summit
      • Graph for All - Million Dollar Challenge
      • Events &Trade Shows
      • Webinars

      DEVELOPERS

      • Documentation
      • Ecosystem
      • Developers Hub
      • Community Forum

      SUPPORT

      • Contact Support
      • Production Guidelines

      EDUCATION

      • Training & Certifications
  • Company
    • Join the World’s Fastest and Most Scalable Graph Platform

      WE ARE HIRING

      COMPANY

      • Company Overview
      • Leadership
      • Legal Terms
      • Patents
      • Security and Compliance

      CAREERS

      • Join Us
      • Open Positions

      AWARDS

      • Awards and Recognition
      • Leader in Forrester Wave
      • Gartner Research

      PRESS RELEASE

      • Read All Press Releases
      TigerGraph Recognized in 2022 Gartner® Critical Capabilities for Cloud Database Management Systems for Analytical Use Cases
      January 12, 2023
      Read More »

      NEWS

      • Read All News

      A Shock to the System: ShockNet Predicts How Economic Shocks Could Affect the World Economy

      TigerGraph Recognized for the First Time in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems

  • START FREE
    • The World’s Fastest and Most Scalable Graph Platform

      GET STARTED

      • Request a Demo
      • CONTACT US
      • Try TigerGraph
      • START FREE
      • TRY AN ONLINE DEMO

Using A Graph Database for Big Data Entity Resolution

  • Changran Liu
  • April 7, 2020
  • blog, Entity Resolution, Graph Databases
  • Blog >
  • Using A Graph Database for Big Data Entity Resolution

We’re Being Deluged By Data

According to market intelligence company IDC, the “Global Datasphere” in 2018 reached 18 zettabytes. (Yes, that zettabytes!) And the problem is only going to get worse – the amount of data being generated increases relentlessly. 

The data being collected by organizations can, however, give a misleading or fragmented view of the real world. A person, for example, could appear multiple times (or have multiple digital entities) within the same database, due to typos, name changes, aggregation of different systems and so on. If we try to merge two databases, how do we match entities, when the ID systems might be different or contain errors? 

Entity resolution (ER) helps get to the truth. Entity resolution, which is the disambiguation of real-world entities in a database, is an essential data quality tool for, for example, businesses wanting to see the full customer journey, and to make the right recommendations, or for law enforcement trying to uncover criminal activity and the perpetrators. 

Graph Databases Can Help You Disambiguate

The key of entity resolution task is to draw linkage between the digital entities referring to the same real-world entities. Graph is the most intuitive, and as we will also show later, the most efficient data structure used for connecting dots. Using graph, each digital entity or the entity attribute can be represented as a vertex, and the entity-has-attribute relationships can be represented as edges. In such an entity-attribute graph, when multiple entities share the same attributes, they will be linked together by the graph (see figure 1).  By representing the entity-attribute relationship as a graph, a variety of graph algorithms (e.g. cosine similarity, Jaccard similarity, shortest path, connected component, label propagation, Louvain) can be performed for the task of entity resolution. 

Figure 1. Each vertex in the graph represents either an entity or an attribute. The edges represent the entity-has-attribute relationship. The graph linked different entities together when they share common attributes. For example, Entity 3 and Entity 5 are linked by Attr. 4 and 5. 

Solving the entity resolution problem with graph can break down into two steps, namely linking and grouping. In the linking stage, graph algorithms, such as cosine similarity, Jaccard similarity, and shortest path, can be applied for pairwise matching. Once two entities are matched, an edge can be drawn between them to represent the same-as relationship. The same-as relationship can be deterministic as well as probabilistic. In the case of probabilistic relation, an weighted edge can be used to represent the likelihood of one entity being the same as the other entity. In the grouping stage, the inserted same-as edges can greatly accelerate the computation in which graph community detection algorithms, such as connected component, label propagation, and Louvain, can be used to group the the digital entities by their real-world entities. 

Graph provides an efficient approach for the entity resolution problem. A native graph database with massive parallel computing capability is the best tool to implement the approach. Solving entity resolution problems by running queries in the database not only saves the time for exporting the data to the other computing platform, but also addresses the challenge of ever-growing data size and data update. Additionally, because of the transitivity of the same-as relationship (i.e. if entity A is the same as entity B, and entity B is the same as entity C, then entity A is also the same as entity C), entity resolution requires multiple-hop queries. Such queries require multiple join operations which are computationally expensive for relational database. A graph database, on the other hand, is the ideal solution for the big data entity resolution task. In the next section, we will show how to implement a TigerGraph solution for entity resolution. 

TigerGraph is Especially Adept at Uncovering the Truth

Let’s examine a real-world example of how TigerGraph disambiguates the passengers through booking records of flight tickets. Each booking record can include the ticket number, passenger first/last name, email address, phone number, departure, destination as well as other related information. An airline company can have millions of ticket records in the database. One passenger may use different email addresses, phone numbers, or names to book flight tickets. In this scenario, each booking record is a digital entity with multiple attributes. The entity resolution task is to link the tickets to the real-world entity, passenger. Without losing the generality, in this simplified example, we assume each record contains the ticket number, passenger name, email, and phone number (see table 1). The five tickets in this toy example were actually booked by the same person. Now let’s build a graph in TigerGraph to find this person.   

Table 1

Defining Graph Schema

We need five types of vertices in this graph. The first type of vertices is for the flight ticket (i.e. digital entity), and they can be created using the ticket number. Similarly, each of the three attributes, passenger name, email and phone, can also be represented by a type of vertice. The last type of vertex is the passenger, the real-world entity. The grouping of the digital entities will be done by connecting all the matching ticket vertices to the same passenger vertices. For the edges, we have “has” edges between tickets and the attributes, “book” edges between passengers and flight tickets, “same as” edge between passengers. 

Loading Data

With the graph schema above, the data can be loaded to create the entity-attribute graph. For each record, we will use the ticket number as well as the attributes to create vertices, and connect the ticket vertex with its attribute vertices. Figure 2 shows the graph created by the sample data in table 1. 

Figure 2. The graph created by loading the data in table 1.

Running Queries

The task of entity resolution can be split into three queries, initialization, linking and grouping.

1. Initialization:

In the initialization query, we will first assume each ticket was booked by a different passenger and the passenger has all the attributes (i.e. names, phones, emails) connected to the ticket. This is done by creating a passenger vertex for each ticket vertex, then inserting a “book” edge from the passenger vertex to ticket vertex as well as “has” edges to the attributes of the flight ticket (see figure 3). This step is important for the data lineage purpose. As we will see in the third step we will merge/remove the duplicated passengers, the initialization step allows us to keep the original relations between tickets and attributes.   

Figure 3. The initialization query assumes each ticket is bought by a different passenger and let the ticket-buyer inherit all the attributes of its ticket.  For simplicity purposes, this figure only shows the passenger-attribute relations and does not show the ticket vertices in the graph. 

2. Linking

The linking query connects the entities that satisfy the matching condition. In this example, let’s assume two flight tickets were booked by the same person when they share at least two attributes. For example, ticket 1 and 3 were booked both using the same phone and name, thus we consider they were booked by the same person (see figure 3). In this case, the linking query traverses all the passenger – attribute – passenger paths and aggregates the count of paths for each pair of passengers. Lastly, the query will insert an edge between two passenger vertices if they are linked by at least two attributes. 

(a)

(b)

Figure 4. (a) In the above example, three same-as edges will be inserted between passenger (1, 3), (2, 3) and (4, 5) because each pair shares at least two common attributes. The connections between passenger 1 and 2 are highlighted. (b) The graph of passenger-passenger relationship after inserting the same-as edges in the linking query. For simplicity purposes, this figure does not show the ticket and attributes vertices in the graph.

3. Grouping

After the linking query finishes, all the matching passengers will be connected. The entity deduplication can be achieved by merging the connected passenger vertices together. This can be done by performing the connected component algorithm on the network formed by the passenger vertices and same-as edges. Briefly, the connected component algorithm will find the connected entities by assigning the same label to the entities in the same connected group. 

The connected component algorithm can have different variation, below is one example to show the overall idea:

1. Assign a unique integer as an ID to each entity. 

2. Propagate this ID to its connected neighbors, so that each vertex will receive all its neighbors’ IDs.

3. Update the old ID by using the smallest ID value received from the neighbors.

4. Repeat above steps 2 and 3 until no ID update happens. 

When iteration stops, the vertices in each connected component will have the same ID. And this ID will be the smallest ID within the connected component. Above algorithm not only finds the connected entities, but also selects a “lead” entity within each connected group. The last step of the grouping query is to let the lead entities inherit all the attributes in the group and remove the other entities. The graph of passenger-attribute relations after the grouping query finishes is shown in figure 5. 

Figure 5. After the grouping query finishes, passenger 1, 2, 3 will be merged into passenger 1. Passenger 4 and 5 will be merged into passenger 4. 

In some use cases, multiple iterations are needed to repeat the linking – grouping steps. For example (see figure 5) after we merge passengers 1, 2, 3 into 1, passenger 1 will have all the attributes from 2, 3 and itself. So passenger 1 can connect to other entities that it did not in the last iteration, thus new groups can be formed. Therefore, the linking and grouping shall be repeated until no more same-as edge needs to be inserted. It is also worth mentioning that in all above steps the operation on each vertex can be run in parallel, and the algorithm can fit into a map reduce framework. Above solution also works for incremental updates. When having new data or streaming data loaded into the database, we don’t need to rerun the process using all the existing data. Instead, we only need to label those newly loaded data, and process the newly loaded vertices and their neighbors.

Now Is The Time

We’ve shown that graph database is a powerful tool for entity resolution especially on a large data scale. We also examined a general approach for solving entity resolution tasks with graph algorithms has been introduced, and included a representative use case (i.e. flight ticket booking)  has been discussed in detail to show how the passenger entities can be resolved by running three queries in the graph database. 

Now is an ideal time for you to get started with graph databases. One of the best ways to get started with graph analytics is with TigerGraph Cloud. It’s free and you can create an account in a few minutes. Sign up here. 

 

You Might Also Like

TigerGraph Showcases Unrivaled Performance at Scale

TigerGraph Showcases Unrivaled Performance at Scale

January 12, 2023
How to Create a Visual Graph Analytics Application Using TigerGraph Insights in 30 mins

How to Create a Visual Graph...

November 14, 2022
Turbocharge your business intelligence with TigerGraph’s ML Workbench on TigerGraph Cloud

Turbocharge your business intelligence with TigerGraph’s...

November 14, 2022

Introducing TigerGraph 3.0

July 1, 2020

Everything to Know to Pass your TigerGraph Certification Test

June 24, 2020

Neo4j 4.0 Fabric – A Look Behind the Curtain

February 7, 2020

TigerGraph Blog

  • Categories
    • blogs
      • About TigerGraph
      • Benchmark
      • Business
      • Community
      • Compliance
      • Customer
      • Customer 360
      • Cybersecurity
      • Developers
      • Digital Twin
      • eCommerce
      • Emerging Use Cases
      • Entity Resolution
      • Finance
      • Fraud / Anti-Money Laundering
      • GQL
      • Graph Database Market
      • Graph Databases
      • GSQL
      • Healthcare
      • Machine Learning / AI
      • Podcast
      • Supply Chain
      • TigerGraph
      • TigerGraph Cloud
    • Graph AI On Demand
      • Analysts and Research
      • Customer 360 and Entity Resolution
      • Customer Spotlight
      • Development
      • Finance, Banking, Insurance
      • Keynote
      • Session
    • Video
  • Recent Posts

    • It’s Time to Harness the Power of Graph Technology [Infographic]
    • TigerGraph Showcases Unrivaled Performance at Scale
    • TigerGraph 101 An Introduction to Graph | Jan 26th @ 9am PST
    • Data Science Salon New York
    • Tech For Retail
    TigerGraph

    Product

    SOLUTIONS

    customers

    RESOURCES

    start for free

    TIGERGRAPH DB
    • Overview
    • Features
    • GSQL Query Language
    GRAPH DATA SCIENCE
    • Graph Data Science Library
    • Machine Learning Workbench
    TIGERGRAPH CLOUD
    • Overview
    • Cloud Starter Kits
    • Login
    • FAQ
    • Pricing
    • Cloud Marketplaces
    USEr TOOLS
    • GraphStudio
    • TigerGraph Insights
    • Application Workbenches
    • Connectors and Drivers
    • Starter Kits
    • openCypher Support
    SOLUTIONS
    • Why Graph?
    industry
    • Advertising, Media & Entertainment
    • Financial Services
    • Healthcare & Life Sciences
    use cases
    • Benefits
    • Product & Service Marketing
    • Entity Resolution
    • Customer 360/MDM
    • Recommendation Engine
    • Anti-Money Laundering
    • Cybersecurity Threat Detection
    • Fraud Detection
    • Risk Assessment & Monitoring
    • Energy Management
    • Network & IT Management
    • Supply Chain Analysis
    • AI & Machine Learning
    • Geospatial Analysis
    • Time Series Analysis
    success stories
    • Customer Success Stories

    Partners

    Partner program
    • Partner Benefits
    • TigerGraph Partners
    • Sign Up
    LIBRARY
    • Resources
    • Benchmark
    • Webinars
    Events
    • Trade Shows
    • Graph + AI Summit
    • Million Dollar Challenge
    EDUCATION
    • Training & Certifications
    Blog
    • TigerGraph Blog
    DEVELOPERS
    • Developers Hub
    • Community Forum
    • Documentation
    • Ecosystem

    COMPANY

    Company
    • Overview
    • Careers
    • News
    • Press Release
    • Awards
    • Legal
    • Patents
    • Security and Compliance
    • Contact
    Get Started
    • Start Free
    • Compare Editions
    • Online Demo - Test Drive
    • Request a Demo

    Product

    • Overview
    • TigerGraph 3.0
    • TIGERGRAPH DB
    • TIGERGRAPH CLOUD
    • GRAPHSTUDIO
    • TRY NOW

    customers

    • success stories

    RESOURCES

    • LIBRARY
    • Events
    • EDUCATION
    • BLOG
    • DEVELOPERS

    SOLUTIONS

    • SOLUTIONS
    • use cases
    • industry

    Partners

    • partner program

    company

    • Overview
    • news
    • Press Release
    • Awards

    start for free

    • Request Demo
    • take a test drive
    • SUPPORT
    • COMMUNITY
    • CONTACT
    • Copyright © 2023 TigerGraph
    • Privacy Policy
    • Linkedin
    • Facebook
    • Twitter

    Copyright © 2020 TigerGraph | Privacy Policy

    Copyright © 2020 TigerGraph Privacy Policy

    • SUPPORT
    • COMMUNITY
    • COMPANY
    • CONTACT
    • Linkedin
    • Facebook
    • Twitter

    Copyright © 2020 TigerGraph

    Privacy Policy

    • Products
    • Solutions
    • Customers
    • Partners
    • Resources
    • Company
    • START FREE
    START FOR FREE
    START FOR FREE
    TigerGraph
    PRODUCT
    PRODUCT
    • Overview
    • GraphStudio UI
    • Graph Data Science Library
    TIGERGRAPH DB
    • Overview
    • Features
    • GSQL Query Language
    TIGERGRAPH CLOUD
    • Overview
    • Cloud Starter Kits
    TRY TIGERGRAPH
    • Get Started for Free
    • Compare Editions
    SOLUTIONS
    SOLUTIONS
    • Why Graph?
    use cases
    • Benefits
    • Product & Service Marketing
    • Entity Resolution
    • Customer Journey/360
    • Recommendation Engine
    • Anti-Money Laundering (AML)
    • Cybersecurity Threat Detection
    • Fraud Detection
    • Risk Assessment & Monitoring
    • Energy Management
    • Network Resources Optimization
    • Supply Chain Analysis
    • AI & Machine Learning
    • Geospatial Analysis
    • Time Series Analysis
    industry
    • Advertising, Media & Entertainment
    • Financial Services
    • Healthcare & Life Sciences
    CUSTOMERS
    read all success stories

     

    PARTNERS
    Partner program
    • Partner Benefits
    • TigerGraph Partners
    • Sign Up
    RESOURCES
    LIBRARY
    • Resource Library
    • Benchmark
    • Webinars
    Events
    • Trade Shows
    • Graph + AI Summit
    • Graph for All - Million Dollar Challenge
    EDUCATION
    • TigerGraph Academy
    • Certification
    Blog
    • TigerGraph Blog
    DEVELOPERS
    • Developers Hub
    • Community Forum
    • Documentation
    • Ecosystem
    COMPANY
    COMPANY
    • Overview
    • Leadership
    • Careers  
    NEWS
    PRESS RELEASE
    AWARDS
    START FREE
    Start Free
    • Request a Demo
    • SUPPORT
    • COMMUNITY
    • CONTACT
    Dr. Jay Yu

    Dr. Jay Yu | VP of Product and Innovation

    Dr. Jay Yu is the VP of Product and Innovation at TigerGraph, responsible for driving product strategy and roadmap, as well as fostering innovation in graph database engine and graph solutions. He is a proven hands-on full-stack innovator, strategic thinker, leader, and evangelist for new technology and product, with 25+ years of industry experience ranging from highly scalable distributed database engine company (Teradata), B2B e-commerce services startup, to consumer-facing financial applications company (Intuit). He received his PhD from the University of Wisconsin - Madison, where he specialized in large scale parallel database systems

    Todd Blaschka | COO

    Todd Blaschka is a veteran in the enterprise software industry. He is passionate about creating entirely new segments in data, analytics and AI, with the distinction of establishing graph analytics as a Gartner Top 10 Data & Analytics trend two years in a row. By fervently focusing on critical industry and customer challenges, the companies under Todd's leadership have delivered significant quantifiable results to the largest brands in the world through channel and solution sales approach. Prior to TigerGraph, Todd led go to market and customer experience functions at Clustrix (acquired by MariaDB), Dataguise and IBM.