Overview
The Pearson correlation coefficient measures the linear correlation between two variables. To calculate it between two nodes in a graph, N numeric properties of each node are used to form two N-dimensional vectors.
Basic Concept
Vector
The vector is one of the basic concepts in advanced mathematics. Vectors in low-dimensional spaces are relatively easy to understand and express. The following diagram shows the relationship between vectors A, B and the coordinate axes in 2- and 3-dimensional spaces respectively, as well as the angle θ between them:

When comparing two nodes in a graph, N properties of each node are used to form two N-dimensional vectors.
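As a minimal sketch of how node properties become vectors, the Python snippet below builds one vector per node from a list of property names. The node names, the property values and the helper `to_vector` are hypothetical and only serve as an illustration; they are not taken from the example graph or from the algorithm's implementation.

```python
# Hypothetical node records; the property values are illustrative only.
nodes = {
    "nodeA": {"price": 50.0, "weight": 160.0, "width": 20.0, "height": 152.0},
    "nodeB": {"price": 42.0, "weight": 90.0, "width": 30.0, "height": 90.0},
}


def to_vector(node_props, property_names):
    """Return the values of the chosen numeric properties, in the given order."""
    return [float(node_props[name]) for name in property_names]


# The same ordered property list must be used for both nodes,
# so that component i of each vector refers to the same property.
properties = ["price", "weight", "width", "height"]
vec_a = to_vector(nodes["nodeA"], properties)
vec_b = to_vector(nodes["nodeB"], properties)
```

With N = 4 properties, each node is represented by a 4-dimensional vector.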
Pearson Correlation Coefficient
The Pearson correlation coefficient takes values in the range [-1,1]. Let r denote the Pearson correlation coefficient; then:
- r > 0 indicates positive correlation, i.e. as one variable becomes larger, the other variable also becomes larger;
- r < 0 indicates negative correlation, i.e. as one variable becomes larger, the other variable becomes smaller;
- r = 1 or r = -1 indicates that the two variables can be described by a linear equation, i.e. all data points fall on the same line;
- r = 0 indicates that there is no linear correlation (although some other kind of correlation may exist).
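For two variables X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), the Pearson correlation coefficient (r) is defined as the ratio of their covariance to the product of their standard deviations:

$$
r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the means of X and Y.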
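The covariance-over-standard-deviations definition can be sketched in plain Python as follows. This is an illustrative implementation of the textbook formula, not the algorithm's actual code; the function name `pearson` is hypothetical. Raising an error for a constant vector (zero standard deviation) is a choice made for this sketch, since the source does not say how that case is handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the standard deviations cancel,
    # so plain sums of deviations are sufficient.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption of this sketch: treat a constant vector as an error.
        raise ValueError("undefined when either vector is constant")
    return cov / math.sqrt(var_x * var_y)


r_pos = pearson([1, 2, 3], [2, 4, 6])          # points on a rising line
r_neg = pearson([1, 2, 3], [6, 4, 2])          # points on a falling line
r_none = pearson([1, 2, 3, 4], [1, -1, -1, 1])  # symmetric, no linear trend
```

The three calls correspond to the cases r = 1, r = -1 and r = 0 respectively.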

Special Case
Isolated Node, Disconnected Graph
Theoretically, the calculation of Pearson Correlation Coefficient between two nodes does not depend on the existence of edges in the graph. Regardless of whether the two nodes to be calculated are isolated nodes or whether they are in the same connected component, it does not affect the calculation of their Pearson Correlation Coefficient.
Self-loop Edge
The calculation of Pearson Correlation Coefficient has nothing to do with edges.
Directed Edge
The calculation of Pearson Correlation Coefficient has nothing to do with edges.
Command and Configuration
- Command: algo(similarity)
- Configurations for the parameter params():
Name | Type | Default | Specification | Description |
---|---|---|---|---|
ids / uuids | []_id / []_uuid | / | Mandatory | IDs or UUIDs of the first set of nodes to be calculated |
ids2 / uuids2 | []_id / []_uuid | / | Optional | IDs or UUIDs of the second set of nodes to be calculated |
type | string | cosine | jaccard / overlap / cosine / pearson / euclideanDistance / euclidean | Measurement of the similarity; jaccard: Jaccard Similarity; overlap: Overlap Similarity; cosine: Cosine Similarity; pearson: Pearson Correlation Coefficient; euclideanDistance: Euclidean Distance; euclidean: Normalized Euclidean Distance |
node_schema_property | []@<schema>?.<property> | / | Numeric node property; LTE needed; schema can be either carried or not | When type is cosine / pearson / euclideanDistance / euclidean, two or more node properties must be specified to form the vector; when type is jaccard / overlap, this parameter is invalid |
limit | int | -1 | >=-1 | Number of results to return; returns all results if set to -1 |
top_limit | int | -1 | >=-1 | Only available in the selection mode; limits the length of the selection results (top_list) of each node; returns the full top_list if set to -1 |
Calculation Mode
This algorithm has two calculation modes:
- Pairing mode: when two sets of valid nodes are configured, each node in the first set is paired with each node in the second set (Cartesian product), and the similarity is calculated for every node pair.
- Selection mode: when only one set (the first) of valid nodes is configured, the similarity between each node in the set and every other node in the graph is calculated; results with similarity > 0 are returned in descending order of similarity.
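The two modes can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the engine's implementation: the functions `pearson`, `pairing_mode` and `selection_mode`, the `vectors` dictionary and its values are all hypothetical, and `top_limit` is modeled only as "-1 returns everything, otherwise truncate".

```python
import math
from itertools import product


def pearson(x, y):
    """Textbook Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairing_mode(vectors, first_ids, second_ids):
    """Score every (node1, node2) pair of the Cartesian product."""
    return [(a, b, pearson(vectors[a], vectors[b]))
            for a, b in product(first_ids, second_ids)]


def selection_mode(vectors, first_ids, top_limit=-1):
    """For each node, rank all other nodes with similarity > 0, best first."""
    results = {}
    for a in first_ids:
        scored = [(b, pearson(vectors[a], vectors[b]))
                  for b in vectors if b != a]
        top_list = sorted((s for s in scored if s[1] > 0),
                          key=lambda s: s[1], reverse=True)
        results[a] = top_list if top_limit == -1 else top_list[:top_limit]
    return results


# Hypothetical property vectors keyed by node ID (illustrative values only).
vectors = {1: [1, 2, 3, 4], 2: [2, 4, 6, 9], 3: [4, 3, 2, 1], 4: [1, 3, 2, 4]}

pairs = pairing_mode(vectors, [1], [2, 3, 4])
ranked = selection_mode(vectors, [1])
best = selection_mode(vectors, [1], top_limit=1)
```

In this sketch, node 3 is negatively correlated with node 1, so it appears in the pairing results but is filtered out of node 1's `top_list` in selection mode.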
Examples
Example Graph
The example graph has product1, product2, product3 and product4 (UUIDs are 1, 2, 3 and 4 in order; edges are ignored). Each product node has the properties price, weight, width and height:

Task Writeback
1. File Writeback
Calculation Mode | Configuration | Data in Each Row |
---|---|---|
Pairing mode | filename | node1, node2, similarity |
Selection mode | filename | node, top_list |
Example: Calculate Pearson correlation coefficient between product UUID = 1 and products UUID = 2,3,4 through properties price, weight, width and height, write the algorithm results back to file
algo(similarity).params({
uuids: [1],
uuids2: [2,3,4],
node_schema_property: [price,weight,width,height],
type: "pearson"
}).write({
file:{
filename: "pearson"
}
})
Results: File pearson
product1,product2,0.998785
product1,product3,0.474384
product1,product4,0.210494
Example: Calculate Pearson correlation coefficient between products UUID = 1,2,3,4 and all other products in the graph respectively through properties price, weight, width and height, write the algorithm results back to file
algo(similarity).params({
uuids: [1,2,3,4],
node_schema_property: [price,weight,width,height],
type: "pearson"
}).write({
file:{
filename: "list"
}
})
Results: File list
product1,product2:0.998785;product3:0.474384;product4:0.210494;
product2,product1:0.998785;product3:0.507838;product4:0.253573;
product3,product2:0.507838;product1:0.474384;product4:0.474021;
product4,product3:0.474021;product2:0.253573;product1:0.210494;
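When post-processing a selection-mode file, each row can be split into the node and its `top_list`. The Python sketch below assumes the row layout shown in the results above (`node,other:score;other:score;`) and that node identifiers contain no commas or semicolons; the helper `parse_selection_row` is hypothetical and not part of any Ultipa tooling.

```python
def parse_selection_row(row):
    """Split 'node,other:score;other:score;' into (node, [(other, score), ...])."""
    node, _, rest = row.strip().partition(",")
    top_list = []
    for entry in rest.split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        name, _, score = entry.rpartition(":")
        top_list.append((name, float(score)))
    return node, top_list


# Two rows copied from the file results above.
rows = [
    "product1,product2:0.998785;product3:0.474384;product4:0.210494;",
    "product2,product1:0.998785;product3:0.507838;product4:0.253573;",
]
parsed = dict(parse_selection_row(r) for r in rows)
```

After parsing, `parsed["product1"]` holds the ranked list for product1 as (node, similarity) tuples.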
2. Property Writeback
Not supported by this algorithm.
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Calculation Mode | Alias Ordinal | Type | Description | Column Name |
---|---|---|---|---|
Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection mode | 0 | []perNode | Node and its selection results | node , top_list |
Example: Calculate Pearson correlation coefficient between product UUID = 1 and products UUID = 2,3,4 through properties price, weight, width and height, order the results by ascending similarity
algo(similarity).params({
uuids: [1],
uuids2: [2,3,4],
node_schema_property: [price,weight,width,height],
type: "pearson"
}) as p
return p order by p.similarity
Results:
node1 | node2 | similarity |
---|---|---|
1 | 4 | 0.210494150169583 |
1 | 3 | 0.474383803132863 |
1 | 2 | 0.998785121601255 |
Example: Select the product with the highest Pearson correlation coefficient with products UUID = 1,2 respectively through properties price, weight, width and height
algo(similarity).params({
uuids: [1,2],
type: "pearson",
node_schema_property: [price,weight,width,height],
top_limit: 1
}) as top
return top
Results:
node | top_list |
---|---|
1 | 2:0.998785, |
2 | 1:0.998785, |
Streaming Return
Calculation Mode | Alias Ordinal | Type | Description | Column Name |
---|---|---|---|---|
Pairing mode | 0 | []perNodePair | Node pair and its similarity | node1 , node2 , similarity |
Selection mode | 0 | []perNode | Node and its selection results | node , top_list |
Example: Calculate Pearson correlation coefficient between product UUID = 3 and products UUID = 1,2,4 through properties price, weight, width and height, only return results that have similarity above 0.5
algo(similarity).params({
uuids: [3],
uuids2: [1,2,4],
node_schema_property: [price,weight,width,height],
type: "pearson"
}).stream() as p
where p.similarity > 0.5
return p
Results:
node1 | node2 | similarity |
---|---|---|
3 | 2 | 0.50783775659896 |
Example: Select the product with the highest Pearson correlation coefficient with products UUID = 1,3 respectively through properties price, weight, width and height
algo(similarity).params({
uuids: [1,3],
node_schema_property: [price,weight,width,height],
type: "pearson",
top_limit: 1
}).stream() as top
return top
Results:
node | top_list |
---|---|
1 | 2:0.998785, |
3 | 2:0.507838, |
Real-time Statistics
This algorithm has no statistics.