Infection Chains: Discovering the Unknown using Graph Analytics


Quick shoutout to Simon Sinek for his Golden Circle of WHY, HOW, WHAT. To learn more about that framework, check out Simon’s website.


Starting with the WHY:  Your Motivation, Your Purpose

This article doesn’t cover the WHY of contact tracing analytics, but if you have 39 minutes, I highly encourage you to read Tomas Pueyo’s article, Coronavirus: How to Do Testing and Contact Tracing.

TL;DR on “Coronavirus: How to Do Testing and Contact Tracing”: Most countries, due to their limited capabilities to provide a precision response to COVID-19, have to take a “hammer” approach to slowing the spread of COVID-19, essentially locking everyone into their homes and restricting travel. This isn’t a sustainable solution.

What can we do? We can solve this by leveraging a mix of strategy, technology and human intervention. In this article, we will discuss the HOW of the technological approach, which will eventually lead into the WHAT: an effective solution that allows the economy to stabilize and rebound.

Navigating to the HOW (the process: what specific actions need to be taken)

When looking at the HOW, we don’t need to look further than South Korea. In a matter of weeks, South Korea was able to enter the “Dance” phase, as Tomas Pueyo put it (an inside term if you read the WHY). Essentially, South Korea was able to bend the infection curve by enacting the HOW.

The HOW can be broken down into 3 parts:


Part 1: Data Capture – Method of capturing data

Part 2: Data Modeling – Method of making your data actionable

Part 3: Data Analytics – Method of deriving insights on your data


Part 1: Capturing Data

South Korean authorities captured data from sources such as CCTV, credit card transactions, transit passes, immigration records and health records. Once this data was captured, it was distributed to healthcare providers, government agencies and the public.

Image Source: Information Technology–Based Tracing Strategy in Response to COVID-19 in South Korea—Privacy Controversies

South Korea provides some enriched data to demonstrate the magic that happens when you model your data and run graph analytics on it. 

The closest available “open-source” dataset (KCDC Data) originates from the Korea Centers for Disease Control and Prevention and was published by Jihoo Kim.

The data files we will be exploring are:

–  PatientInfo.csv – Records on individual patients (including but not limited to): patient_id, sex, birth_year, symptom_onset_date, confirmed_date, released_date, disease, city, province, country, infected_by, & case.

–  PatientRoute.csv – Records of the places an individual traveled to (including but not limited to): patient_id, latitude, longitude, location_type

–  Case.csv – Records of the infection cases an individual is tied to.
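To get a feel for these files before modeling them, here is a minimal Python sketch of reading rows in the shape of PatientInfo.csv. The two sample rows are made up for illustration (they are not real KCDC records), only a handful of the columns are shown, and load_rows is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the shape of PatientInfo.csv (only a few of the columns).
SAMPLE_PATIENT_INFO = """patient_id,sex,birth_year,symptom_onset_date,confirmed_date,infected_by
P1,male,1964,2020-01-22,2020-01-23,
P2,female,1991,2020-01-27,2020-01-30,P1
"""


def load_rows(handle):
    """Read CSV rows into dicts, turning empty strings into None."""
    return [
        {key: (value or None) for key, value in row.items()}
        for row in csv.DictReader(handle)
    ]


patients = load_rows(io.StringIO(SAMPLE_PATIENT_INFO))
for patient in patients:
    print(patient["patient_id"], "infected by", patient["infected_by"])
```

With the real file you would pass an open file handle instead of the in-memory sample.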

Now that we have data to examine, let’s explore how we should model that data to allow us to find insights.

Part 2 Modeling the Data:

When doing data modeling you want to look at two things:

  • The first being the most obvious, which is to look at the data itself. The main purpose of this is to get a basic understanding and start to ask yourself what it is you CAN and CANNOT do with the data. 
  • The second is to think of questions that could be derived from having the data arranged in a particular way. 

The first we examined in Part 1, but for the second there were two elements that stuck out.

Element 1

The particular piece of data that stood out the most was INFECTED_BY. This means that South Korean authorities traced back and confirmed who infected whom. This creates an “infection chain” or, simply put, the path or paths the virus took while moving from host to host.
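To make the idea of an infection chain concrete, here is a small Python sketch that follows infected_by links from one patient back to the earliest known source. The patient IDs are made up, and infection_chain is a hypothetical helper, not part of the dataset or of any TigerGraph API.

```python
def infection_chain(patient_id, infected_by):
    """Walk infected_by links upward from a patient to the earliest known source."""
    chain = [patient_id]
    seen = {patient_id}
    current = patient_id
    while infected_by.get(current) is not None:
        current = infected_by[current]
        if current in seen:  # guard against bad data forming a cycle
            break
        chain.append(current)
        seen.add(current)
    return chain


# Made-up mapping: patient -> the patient who infected them (None if unknown).
infected_by = {"P1": None, "P2": "P1", "P3": "P2", "P4": "P2"}
print(infection_chain("P3", infected_by))
```

The chain stops wherever the authorities have no confirmed source, which is exactly where the undiscovered connections might be hiding.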

Element 2

The other interesting data element that stood out to me was the PatientRoute data, which had these four elements: INDIVIDUAL, TIMESTAMP, LATITUDE and LONGITUDE, describing where and when individuals traveled.

This had me thinking:

“Okay, I know the South Korean government did track down some people that infected other people, but is there a possible infection chain that remains undiscovered?”

So how do you model the data to derive those insights?

Let’s take a close look at the important parts of how we would model this.

But before moving on, we should quickly mention a few graph database terms you may not have heard of.

Schema: This describes how the data is all interrelated. Another important element of this schema is that graphs take a “Join First” approach, meaning that all elements are joined upon loading data. Yes, that means NO JOINS 🙂  Think of the Schema as the MAP you will follow to find your data.

Attributes: Elements that describe the object they roll up to. A good example would be an apple that is the color red. Color(Attribute) = Red(value) would be an attribute of apple in this example.

Edge: Also known as a “relationship”, an edge describes how things are related to one another. In graphs, relationships can have attributes or properties of their own. If that is the case, your graph is a Property Graph.

Vertex: Sometimes referred to as a “node”, a vertex is a data object that can have multiple relationships with other data objects. These data objects can also hold multiple attributes.
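If it helps to see these terms as code, here is a rough Python sketch using plain dataclasses. This is only an illustration of the vocabulary, not how a graph database stores data; the class names, field names and sample values are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A data object (node) with a type, an ID and any number of attributes."""
    vertex_type: str
    vertex_id: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A relationship between two vertices; it may carry attributes too."""
    edge_type: str
    source: Vertex
    target: Vertex
    attributes: dict = field(default_factory=dict)


# The apple example: Color (attribute) = red (value).
apple = Vertex("Fruit", "apple", {"color": "red"})

# Two made-up patients joined by an edge. The direction here is an assumption:
# the edge points from the infected patient to the patient who infected them.
patient_a = Vertex("Patient", "P1", {"sex": "male"})
patient_b = Vertex("Patient", "P2", {"sex": "female"})
infected = Edge("INFECTED_BY", patient_b, patient_a)
print(infected.source.vertex_id, infected.edge_type, infected.target.vertex_id)
```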


Back to our graph modeling.

Patient –  (INFECTED_BY) -> Patient

Modeling data in the graph is simpler than you might think. Below you see that we have a “Patient” (VERTEX) and an EDGE called “INFECTED_BY” connecting it to another “Patient” (VERTEX).

This connection simply represents that a Patient can have a relationship with another patient and that relationship is that they’ve infected that patient. 

Looking closer this is what the edge “INFECTED_BY” looks like:

Another aspect is the vertex “Patient”. Remember above how I mentioned that a Vertex can have attributes. This shows an example of what that would look like:

Patient –  (PATIENT_TRAVELED) – TravelEvent

Remember the data elements INDIVIDUAL, TIMESTAMP, LATITUDE and LONGITUDE we saw above? That is what we will be representing now in the schema. Because we truly don’t fully understand quantum existence, let’s assume that a person cannot be in two places at the same time.

With that assumption we can create a unique ID for a TravelEvent using the lat + long + date attributes.

Then, using that unique ID and the Patient ID, let’s assign an edge called “PATIENT_TRAVELED” if the patient has been somewhere.
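Here is a minimal Python sketch of that idea: build a TravelEvent key from lat + long + date, then record a PATIENT_TRAVELED edge from each patient to the event. The coordinates, dates and the “|” separator are made up for illustration, and the helper names are hypothetical.

```python
def travel_event_id(latitude, longitude, date):
    """Combine place and day into one key (values kept as strings from the CSV)."""
    return f"{latitude}|{longitude}|{date}"


def build_travel_graph(route_rows):
    """Return TravelEvent vertices and PATIENT_TRAVELED edges for the given rows."""
    events = {}
    edges = []
    for row in route_rows:
        event_id = travel_event_id(row["latitude"], row["longitude"], row["date"])
        # Two patients at the same place on the same day share one TravelEvent.
        events.setdefault(
            event_id,
            {"latitude": row["latitude"], "longitude": row["longitude"], "date": row["date"]},
        )
        edges.append((row["patient_id"], "PATIENT_TRAVELED", event_id))
    return events, edges


# Made-up route rows.
routes = [
    {"patient_id": "P1", "latitude": "37.50", "longitude": "127.00", "date": "2020-02-01"},
    {"patient_id": "P2", "latitude": "37.50", "longitude": "127.00", "date": "2020-02-01"},
    {"patient_id": "P1", "latitude": "35.10", "longitude": "129.00", "date": "2020-02-03"},
]
events, edges = build_travel_graph(routes)
print(len(events), "travel events,", len(edges), "edges")
```

Because P1 and P2 land on the same TravelEvent, the shared vertex is what later lets us spot people who were at the same place at the same time.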

What that looks like, as it relates to the schema, is something like this:

Also let’s not forget to add the actual attributes of that vertex called TravelEvent: 

Patient – (BELONGS_TO) – InfectionCase

Another element that could lead us to discovering unknown patient connections is the recorded Infection Case data. This data represents known hotspots, which could be anything from a kindergarten, church or gym to a call center.

We already have Patient modeled so this last part is adding the connection between a Patient and a known “InfectionCase”:

Again, we will want to add some attributes like the infection_case which includes the name assigned to the case:

Adding it all together, we get the schema below. I’ve taken the creative liberty of adding a few items that our data supports but that aren’t used in this exercise: a time-tree with the vertices Day, Month and Year, and a location-tree with City, Province and Country.

Part 3 Analyzing the Data:

At this point we have..

Located data relevant to our use case

Explored the data

Thought about how we might want to use the data

Modeled the data.

Now we want to get to the EXCITING PART, which is making sense of the data. 

Can we do what we set out to do and discover the UNKNOWN connections in this well-known dataset?

The answer is YES. 

Don’t mind my artistic liberties, but I’ve pulled out the patterns related to the areas we wanted to explore. This is a bird’s-eye view of the data:

Query 1 edgeCrawl:

If you insist on knowing the exact query, I’ll show you this time around. It’s pretty simple: we select each of the edge types mentioned above and add the matching edges to an accumulator. Think of an accumulator as a variable that collects values while the query runs; here, a ListAccum gathers every edge we touch.

CREATE QUERY edgeCrawl() FOR GRAPH MyGraph {
	/*
	 * S1 = Grab all Patients that a Patient infected
	 * S2 = Grab all Patients belonging to a well-known Case
	 * S3 = Grab all Patients that were at the same place at the same time
	 * Return all EDGEs
	 */
	ListAccum<EDGE> @@edgeList;
	seed = {Patient.*};
	S1 = SELECT s
	       FROM seed:s -(INFECTED_BY:e)-> :t
	       ACCUM @@edgeList += e;
	// Assumption: the case name is stored in the infection_case attribute of InfectionCase.
	S2 = SELECT s
	       FROM seed:s -(BELONGS_TO_CASE:e)- :t
	       WHERE t.infection_case NOT IN ("etc", "contact with patient", "overseas inflow")
	       ACCUM @@edgeList += e;
	// A TravelEvent with more than one PATIENT_TRAVELED edge was visited by several patients.
	S3 = SELECT s
	       FROM seed:s -(PATIENT_TRAVELED:e)- :t
	       WHERE t.outdegree("PATIENT_TRAVELED") > 1
	       ACCUM @@edgeList += e;
	PRINT @@edgeList;
}
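If GSQL is new to you, the same three selections in the query above can be sketched in plain Python. This is only an approximation of the logic over made-up edge lists, not TigerGraph code; edge_crawl and the sample IDs (including the “Gym A” case name) are hypothetical.

```python
from collections import Counter

# Generic case labels that say nothing about a shared hotspot.
EXCLUDED_CASES = {"etc", "contact with patient", "overseas inflow"}


def edge_crawl(infected_by_edges, case_edges, travel_edges):
    """Collect the edges the three SELECT blocks would add to the accumulator."""
    edge_list = []
    # S1: every known INFECTED_BY relationship.
    edge_list.extend(("INFECTED_BY", s, t) for s, t in infected_by_edges)
    # S2: patients tied to a specific, named infection case.
    edge_list.extend(
        ("BELONGS_TO_CASE", patient, case)
        for patient, case in case_edges
        if case not in EXCLUDED_CASES
    )
    # S3: travel events that more than one PATIENT_TRAVELED edge points at.
    visits = Counter(event for _, event in travel_edges)
    edge_list.extend(
        ("PATIENT_TRAVELED", patient, event)
        for patient, event in travel_edges
        if visits[event] > 1
    )
    return edge_list


# Made-up sample edges.
infected = [("P2", "P1")]
cases = [("P1", "Gym A"), ("P2", "etc"), ("P3", "overseas inflow")]
travel = [("P1", "E1"), ("P2", "E1"), ("P3", "E2")]
result = edge_crawl(infected, cases, travel)
for edge in result:
    print(edge)
```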

Query 1 edgeCrawl results:

Right away I SEE IT… the answer we were looking for!

Can we find connections based on someone traveling to the same place and same time as another person?

Do you see it?

Who Infected Patient 1000000069? 🧐🤔

I don’t even need to tell you the answer because I know you’re smart enough to figure this puzzle out!

Massive Infection Chain

Next, let’s write a query that uses the datetime attribute Symptom Onset Date. What is interesting about this? It’s when a patient first appeared sick, which gives us a starting point for reasoning about an individual’s incubation period.

With that, we start from a given patient and, from the time they first experienced symptoms, grab everyone that patient is known to have infected and every place they traveled to while they were sick.

BUT WAIT! THERE’S MORE! (always wanted to say that, but never had the opportunity)

Let’s grab all those people and repeat the process based on their symptom onsets, and repeat the process again and again until we run out of people that fit our criteria mentioned above.

What this does is give you the infection chain that the virus COULD HAVE traveled. 
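The repeat-until-done idea can be sketched in Python as a breadth-first expansion. This is a rough illustration and not the actual GSQL query: the data is made up, possible_infection_chain is a hypothetical name, and the “sick period” is assumed to run from symptom onset onward with no end date.

```python
from collections import deque
from datetime import date


def possible_infection_chain(start, onset, infected, visits):
    """Return every patient the virus COULD have reached starting from `start`.

    onset:    patient -> symptom onset date (missing if unknown)
    infected: patient -> list of patients they are known to have infected
    visits:   list of (patient, place, day) tuples
    """
    # Index who was at each place on each day.
    visitors = {}
    for patient, place, day in visits:
        visitors.setdefault((place, day), set()).add(patient)

    reached = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        candidates = set(infected.get(current, []))
        sick_from = onset.get(current)
        if sick_from is not None:
            for patient, place, day in visits:
                # Only visits on or after symptom onset count.
                if patient == current and day >= sick_from:
                    candidates |= visitors[(place, day)]
        for other in candidates - reached:
            reached.add(other)
            queue.append(other)
    return reached


# Made-up data.
onset = {"A": date(2020, 2, 1), "B": date(2020, 2, 5), "C": date(2020, 2, 3)}
infected = {"A": ["B"]}
visits = [
    ("A", "gym", date(2020, 2, 2)),
    ("C", "gym", date(2020, 2, 2)),
    ("C", "cafe", date(2020, 2, 4)),
    ("D", "cafe", date(2020, 2, 4)),
    ("A", "gym", date(2020, 1, 20)),
    ("E", "gym", date(2020, 1, 20)),
]
chain = possible_infection_chain("A", onset, infected, visits)
print(sorted(chain))
```

In this sample, E shared a gym visit with A before A showed symptoms, so E is left out of the chain.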

Given the input of Patient 4100000006 this is the result we get:

Obviously it’s too hard to make out all the relationships between data elements, so I zoomed in a bit.

Again, what do you see? BAM! Right there in front of us there are a number of patients that traveled to the same place at the same time in multiple cases. 


4100000036, 4100000043, 4100000006 



4100000056, 4100000038, 4100000007

Even though the Korea Centers for Disease Control and Prevention didn’t track these as “confirmed” INFECTED_BY cases, the overlap in time and place makes it highly probable that these patients are connected.

And, WITH THIS YOUR HONOR… I rest my case (always wanted to say that as well)

Concluding on the WHAT (What do you do. The results of your WHY)

By now you know the WHY and the HOW, the WHAT is… 

What are you going to do now that you know what I know? 

Let me know in the comments below. I would love to hear your thoughts! 

Blog Encore:

Wait… is there such a thing?

Explore Graph View:

Zooming in closer you can start to see all the data come alive, you begin to see the connection from one individual to another. An interesting data point here is that you can see one patient went to Gym1 and then went to Gym2. 

Did they get infected at one gym and bring it to another on a different day?

The view I’m currently on is called “Explore Graph”, a very helpful feature that lets you find interesting data.

One thing you could try is to find all other Patients with a similar pattern. 

This kit has been published and is available for anyone to spin up: discover your own insights, write your own queries, add your own data. The possibilities are endless until you run out of the free hard drive space on the free instance 🙂


Information Technology–Based Tracing Strategy in Response to COVID-19 in South Korea

KCDC Data originates from the Korea Centers for Disease Control and Prevention
