The big-data era has brought not only ever-growing data volumes but also growing data complexity and variety. More and more real-world business decisions rely on understanding how data points are related to, correlated with, or associated with each other.
Traditional database systems such as RDBMSs were not designed to tackle this challenge, and neither were newer big-data frameworks, which may scale out well horizontally but offer limited real-time processing capacity. You may assume that we are referring to Hadoop, given its lack of real-time capability, and argue that Spark already solves this. Please read on, and we’ll show you how Ultipa Graph compares with Spark head-to-head in the following real-time credit-card or loan application decision-making scenario.
A credit-card or loan application review involves a few sub-scenarios:
- Scanning the application data repository for all phone numbers that are each used by more than 5 applications.
- Sifting through the entire application data set for multiple applications sharing the same company, referral, mailbox, or device ID (e.g., phone ID, IP address, etc.).
- Strengthening the previous sifting (filtering) rule by combining the conditions with AND instead of OR, and finding out how many applications share the same attributes.
- Finding circles. For instance, Application #1 uses Phone #2, which is used by Application #3, which uses Email #4, which is also used by Application #1. In short:
- Application [X] → Phone → Application [Y] → Email → Application [X]
- This sub-scenario checks whether a 5-node circle (a 4-hop path, counting Application [X] at both ends) exists. It can be made harder by querying for much deeper/longer circles, such as 10+ nodes forming a circle.
- Community detection to categorize applicants into different communities will help the credit-card/loan company to better understand
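To make the first four sub-scenarios concrete, here is a minimal Python sketch over a handful of in-memory records. The record layout, field names, and function names are hypothetical and chosen only for illustration; this is neither Ultipa's nor Spark's API, and the pairwise comparison is quadratic, so it would not scale to a real application repository. Note that the circle Application [X] → Phone → Application [Y] → Email → Application [X] is the same as saying that X and Y share both a phone and an email, so it reduces to the AND rule.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical application records; the field names are illustrative assumptions.
applications = [
    {"id": "A1", "phone": "P2", "email": "E4", "company": "C1"},
    {"id": "A3", "phone": "P2", "email": "E4", "company": "C2"},
    {"id": "A5", "phone": "P2", "email": "E9", "company": "C1"},
    {"id": "A6", "phone": "P7", "email": "E8", "company": "C3"},
]


def group_by(records, field):
    """Map each value of `field` to the set of application IDs that use it."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec[field]].add(rec["id"])
    return groups


def overused_values(records, field, threshold):
    """Sub-scenario 1: values of `field` used by more than `threshold` applications."""
    return {
        value: ids
        for value, ids in group_by(records, field).items()
        if len(ids) > threshold
    }


def linked_pairs(records, fields, require_all):
    """Sub-scenarios 2 and 3: pairs sharing any (OR) or all (AND) of `fields`."""
    pairs = set()
    for a, b in combinations(records, 2):
        matches = [a[f] == b[f] for f in fields]
        if all(matches) if require_all else any(matches):
            pairs.add((a["id"], b["id"]))
    return pairs


def phone_email_circles(records):
    """Sub-scenario 4: X -> Phone -> Y -> Email -> X, i.e. X and Y share both."""
    return linked_pairs(records, ["phone", "email"], require_all=True)


if __name__ == "__main__":
    # The text uses a threshold of 5; the toy data set needs a smaller one.
    print(overused_values(applications, "phone", threshold=2))
    print(linked_pairs(applications, ["phone", "email"], require_all=False))
    print(phone_email_circles(applications))
```

Longer circles, such as the 10+ node case, cannot be expressed as a simple pairwise rule like this and call for a graph traversal.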
To tackle the above scenarios, we first have to think about how to construct a data model that best addresses the data-correlation needs of these sub-scenarios. Here is one simple way to define the schema as a graph:
- Each application is considered a node;
- Each attribute of an application, such as email, company, device, phone, or ID#, is also taken as a node.
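The schema described above can be sketched as a plain adjacency map. This is a minimal illustration, assuming that each application is linked to its attribute nodes by an undirected edge; the `build_graph` function, the record layout, and the `(type, value)` node encoding are hypothetical and do not reflect Ultipa's storage format.

```python
from collections import defaultdict


def build_graph(records, attribute_fields):
    """Return an undirected adjacency map over application and attribute nodes.

    Nodes are (type, value) tuples, e.g. ("application", "A1") or ("phone", "P2").
    """
    adjacency = defaultdict(set)
    for rec in records:
        app_node = ("application", rec["id"])
        for field in attribute_fields:
            value = rec.get(field)
            if value is None:
                continue  # skip attributes the application did not supply
            attr_node = (field, value)
            adjacency[app_node].add(attr_node)
            adjacency[attr_node].add(app_node)
    return adjacency


if __name__ == "__main__":
    # Hypothetical sample records for illustration only.
    records = [
        {"id": "A1", "phone": "P2", "email": "E4"},
        {"id": "A3", "phone": "P2"},
    ]
    graph = build_graph(records, ["phone", "email"])
    for node, neighbours in graph.items():
        print(node, "->", sorted(neighbours))
```

Because an attribute value becomes a single shared node, two applications that use the same phone number end up as neighbours of that node.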
The above schema design is entirely different from a traditional table- or column-based SQL-style schema design. To find the shortest correlation between any two applications, we simply check whether the two applications share a common attribute node, such as email, phone, ID#, device ID, or company.
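As a rough sketch of that check, the snippet below intersects the neighbour sets of two application nodes, and falls back to a breadth-first search to measure how many hops separate them when no attribute is shared directly. The edge list, node labels, and function names are hypothetical; a graph database would run this natively rather than in application code.

```python
from collections import defaultdict, deque

# Hypothetical edges between application nodes and attribute nodes.
edges = [
    ("app:1", "phone:2"), ("app:1", "email:4"),
    ("app:3", "phone:2"), ("app:3", "email:4"),
    ("app:5", "email:9"),
    ("app:6", "email:9"), ("app:6", "phone:2"),
]

graph = defaultdict(set)
for left, right in edges:
    graph[left].add(right)
    graph[right].add(left)


def common_attributes(graph, app_a, app_b):
    """Attribute nodes adjacent to both applications (a 2-hop correlation)."""
    return graph.get(app_a, set()) & graph.get(app_b, set())


def hop_distance(graph, start, goal):
    """Length of the shortest path in hops, or None if the nodes are unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    print(common_attributes(graph, "app:1", "app:3"))
    print(hop_distance(graph, "app:1", "app:5"))
```

In this toy graph, applications 1 and 3 share two attribute nodes directly, while applications 1 and 5 are only connected through application 6.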
Real-time Pattern Matching in a Large Graph Dataset
The above screenshot shows that, in a large dataset of over 200 million applications and associated attribute nodes, Ultipa Graph achieves a 1.x-second turnaround time for sub-scenario 1, while a Spark system would need at least 13 minutes to return, a 400-to-500-times performance edge. Beyond the performance advantage, completing this scenario takes only 1 line of code: