Use Case: Realtime Decision Making (Anti Fraud)

The big-data era has brought not only ever-growing data volumes but also greater complexity and variety of data. More and more real-world business decisions rely on understanding how data are related to, correlated with, or associated with each other.

Traditional database systems such as RDBMSs were not designed to tackle this challenge, and neither were newer big-data frameworks, which may scale well horizontally but offer little real-time processing capacity. You might think that we're referring primarily to Hadoop, with its lack of real-time capability, and argue that Spark already solves this. Please read on, and we'll show you how Ultipa Graph compares head-to-head with Spark in the following real-time credit-card or loan application decision-making scenario.

A credit card or loan application involves several sub-scenarios:

  1. Scanning the application data repository for all phone numbers that are each used by more than 5 applications.
  2. Sifting through the entire application data for multiple applications sharing the same company, referral, mailbox, or device ID (i.e., phone ID, IP address, etc.).
  3. The previous sifting (filtering) rule can be strengthened to use AND instead of OR operators, to find out how many applications share all of the same attributes.
  4. Finding circles. For instance, Application #1 uses Phone #2, which is used by Application #3, which uses Email #4, which is also used by Application #1. In short:
    1. Application [X] → Phone → Application [Y] → Email → Application [X]
    2. The requirement in this sub-scenario is to determine whether a 5-node (4-hop path) circle exists. It can be made harder by querying for much deeper (longer-path) circles, such as 10+ nodes forming a circle.
  5. Community detection to categorize applicants into different communities will help the credit-card/loan company to better understand
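
The circle pattern in sub-scenario 4 (Application [X] → Phone → Application [Y] → Email → Application [X]) can be sketched in plain Python. This is only an illustration on made-up records, not Ultipa's query language or engine; the record layout (one phone and one email per application) and the function name `find_phone_email_circles` are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records: (application, phone, email).
APPLICATIONS = [
    ("app1", "phone2", "email4"),
    ("app3", "phone2", "email4"),
    ("app5", "phone2", "email9"),
    ("app6", "phone7", "email4"),
]


def find_phone_email_circles(records):
    """Return sorted pairs (X, Y) forming X -> Phone -> Y -> Email -> X."""
    by_phone = defaultdict(set)
    by_email = defaultdict(set)
    for app, phone, email in records:
        by_phone[phone].add(app)
        by_email[email].add(app)

    # Pairs of applications that share a phone number.
    phone_pairs = set()
    for apps in by_phone.values():
        phone_pairs.update(combinations(sorted(apps), 2))

    # Pairs of applications that share an email address.
    email_pairs = set()
    for apps in by_email.values():
        email_pairs.update(combinations(sorted(apps), 2))

    # A 4-hop circle exists exactly when a pair shares both.
    return sorted(phone_pairs & email_pairs)


circles = find_phone_email_circles(APPLICATIONS)
print(circles)
```

In the sample data only app1 and app3 share both a phone and an email, so they are the only pair reported. Longer circles (10+ nodes) would need a general path search rather than this pairwise intersection.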

To tackle the above scenarios, we have to first think about how to construct a data model that best addresses the data correlation needs in these sub-scenarios. Here is one simple way to define the schema graphically:

  • Each application is considered a node;
  • Each attribute of an application, such as email, company, device, phone, or ID#, is also taken as a node, connected to the applications that use it.

The above schema design is entirely different from a traditional table or column-based SQL-style schema design. To find the shortest correlation between any two applications, simply check whether they share a common attribute node, such as email, phone, ID#, Device-ID, or company.
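
As a minimal sketch of this data model, the following Python builds an adjacency map in which applications and attribute values are both nodes, then checks two applications for a shared attribute node. The sample applications, the attribute names, and the helper names `build_graph` and `shared_attributes` are hypothetical and are not part of any Ultipa API.

```python
from collections import defaultdict

# Hypothetical applications; attribute names and values are made up.
RAW = {
    "app1": {"email": "a@example.com", "phone": "555-0101", "device": "dev-1"},
    "app2": {"email": "b@example.com", "phone": "555-0101", "device": "dev-2"},
    "app3": {"email": "c@example.com", "phone": "555-0199", "device": "dev-3"},
}


def build_graph(raw):
    """Build an undirected adjacency map of application and attribute nodes."""
    adj = defaultdict(set)
    for app, attrs in raw.items():
        app_node = ("Application", app)
        for kind, value in attrs.items():
            attr_node = (kind, value)
            adj[app_node].add(attr_node)
            adj[attr_node].add(app_node)
    return adj


def shared_attributes(adj, app_a, app_b):
    """Attribute nodes adjacent to both applications (a 2-hop connection)."""
    return adj[("Application", app_a)] & adj[("Application", app_b)]


graph = build_graph(RAW)
print(shared_attributes(graph, "app1", "app2"))  # the shared phone node
print(shared_attributes(graph, "app1", "app3"))  # empty set: nothing shared
```

Because attributes are nodes rather than columns, the check is a set intersection over neighbors instead of a self-join across a table.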

Real-time Pattern Matching in a large Graph Dataset

The above screenshot shows that in a large dataset of over 200 million applications and associated attribute nodes, Ultipa Graph has a turnaround time of 1.x seconds for sub-scenario 1, whereas the Spark system needs at least 13 minutes to return: a 400-to-500-times performance edge. Beyond the performance advantage, completing this scenario takes only one line of code:

analyzeCollect().src({"type":"Phone"}).dest({"type":"Application"}).moreThan(5)

The code takes a little effort to digest: the analyzeCollect() function starts from all nodes of type "Phone", searches for ending nodes of type "Application", counts the number of applications tied to each phone number, and returns the list of phone numbers that have been used by more than 5 applications. The result is staggering: over 1.8 million phone numbers are repeatedly used by over 9 million applications, which are considered fraudulent.
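
For readers who want to see the same logic outside a graph engine, here is a plain-Python analogue of the count-and-filter step. It is not how Ultipa implements `analyzeCollect()`; the function name `phones_used_by_more_than` and the sample edges are invented for illustration, and the strict comparison assumes that `moreThan(5)` means "more than 5", as stated in sub-scenario 1.

```python
from collections import defaultdict


def phones_used_by_more_than(edges, threshold):
    """Count distinct applications per phone; keep phones above the threshold."""
    apps_by_phone = defaultdict(set)
    for phone, application in edges:
        apps_by_phone[phone].add(application)
    return {
        phone: len(apps)
        for phone, apps in apps_by_phone.items()
        if len(apps) > threshold  # strict: "more than", not "at least"
    }


# Hypothetical (phone, application) edges: phone-A has 6 users, phone-B has 5.
edges = [("phone-A", f"app-{i}") for i in range(6)]
edges += [("phone-B", f"app-{i}") for i in range(5)]

flagged = phones_used_by_more_than(edges, 5)
print(flagged)
```

Only phone-A is flagged here, because phone-B is used by exactly 5 applications and the comparison is strict.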

Near-Real-time Pattern Matching in a large Graph Dataset

In sub-scenario 2 (the first part of the above screenshot), we ran through the entire dataset for applications sharing the same device, which is a common high-risk or fraudulent case, and found over 45M applications with this problem.

In the second part of the above screenshot, the query is very restrictive, as described in sub-scenario 3: there is only 1 pair of applications sharing all of the same attributes (Device-ID, Company, Email, Reference, Reference's Application).

This last query is more time-consuming and computationally intensive, and is usually considered large-scale batch processing. Ultipa Graph, running in a public cloud environment, takes about 40 seconds to return; a Spark cluster would take more than an hour to finish.
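
The difference between the OR rule of sub-scenario 2 and the AND rule of sub-scenario 3 can be illustrated with a small Python sketch. The records, the three attribute keys, and the function name `shared_keys_per_pair` are assumptions made for the example; the real dataset and query are not shown in this article.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; the attribute names are illustrative only.
APPS = {
    "app1": {"device": "d1", "company": "c1", "email": "e1"},
    "app2": {"device": "d1", "company": "c1", "email": "e1"},
    "app3": {"device": "d1", "company": "c2", "email": "e3"},
    "app4": {"device": "d4", "company": "c2", "email": "e4"},
}
KEYS = ("device", "company", "email")


def shared_keys_per_pair(apps, keys):
    """For each pair of applications, collect the keys whose values match."""
    shared = defaultdict(set)
    for key in keys:
        # Group applications by the value they hold for this key.
        by_value = defaultdict(list)
        for app, attrs in apps.items():
            by_value[attrs[key]].append(app)
        for group in by_value.values():
            for pair in combinations(sorted(group), 2):
                shared[pair].add(key)
    return shared


shared = shared_keys_per_pair(APPS, KEYS)
# OR rule: the pair shares at least one attribute.
or_pairs = sorted(shared)
# AND rule: the pair shares every attribute.
and_pairs = sorted(pair for pair, keys in shared.items() if keys == set(KEYS))
print(or_pairs)
print(and_pairs)
```

On this sample the OR rule matches four pairs while the AND rule matches only app1 and app2, which mirrors how the restrictive query narrows a large result down to very few pairs.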

To verify that the above result is correct, simply run a path-finding query between the two nodes found. The resulting subgraph is shown in the next diagram: the two highlighted applications share 6 common attribute nodes.

Correctness Verification of Sub-Scenario-#3 via Path Finding

Louvain is a fairly recent addition to the graph algorithm family (invented in 2008), and it's very useful for helping data analysts understand how many communities are formed among the data: closely connected data are considered to be within a community, and many communities can be formed, each containing a certain number of members. It's very popular in social network analytics, fraud detection, and other scenarios.

The caveat with Louvain is that the original algorithm is serial: it is computed sequentially and is very slow, whereas an anti-fraud setup demands real-time results. Traditional solutions may take hours or days to calculate. In Ultipa, Louvain has been re-architected and re-engineered to be highly parallel. The result is a stunningly fast parallel Louvain, with a highly visual built-in Louvain DV module embedded in Ultipa Manager (see next diagram) for intuitive comprehension.
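
Louvain works by greedily improving a score called modularity, which rewards partitions whose communities have many internal edges relative to what node degrees alone would predict. The sketch below only computes that score for a given partition of a small, made-up graph; it is neither the Louvain algorithm itself nor Ultipa's parallel implementation, and the function name `modularity` and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Assumes no self-loops and no duplicate edges.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in a community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single edge (c-d).
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

print(modularity(EDGES, two_groups))
print(modularity(EDGES, one_group))
```

Splitting the graph into its two triangles scores higher than lumping every node into one community, which is the kind of improvement Louvain searches for when it moves nodes between communities.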

Scenario-#5: Real-time Louvain Community Detection & Visualization (screenshot)

 

In many scenarios, specifically sub-scenarios 4 and 5, Ultipa's highly parallel compute engine outperformed other systems by 3 orders of magnitude. The takeaway is this:

Better Performance = Higher Throughput = Smaller Cloud/Cluster = Lowered TCO

Here is a highlighted comparison matrix between Ultipa Graph, Spark, Neo4j and Python on a large graph with 200+ million nodes/edges:

| Comparison | Metrics | Spark | Ultipa | Neo4J | Python NetworkX |
|---|---|---|---|---|---|
| OLAP | Scenario 1 | 780 seconds | 1.6 seconds | Not-Tested | N/A |
| OLAP | Scenario 2/3 | 3600 seconds | 44 seconds | Not-Tested | N/A |
| Loop-Finding | Scenario 4 | N/A | 30,000 QPS | 30 QPS | N/A |
| Louvain | Community (Scenario 5) | N/A; Not-Tested | 10 min | Not-Tested | Unable to Finish (weeks w/ unlimited resources) |

 

 