Overview
The k-Means algorithm classifies all nodes in the graph into k clusters, such that the distance from each node to the centroid (geometric center) of its own cluster is shorter than its distance to the centroid of any other cluster. The distance can be calculated by 1) Euclidean distance or 2) cosine similarity. The idea of the algorithm dates back to 1957, and the term 'k-means' was coined by J. MacQueen in 1967. The k-Means algorithm is used in vector quantization, cluster analysis, feature learning, computer vision, etc., and is often executed as a preprocessing step for other algorithms.
Related material of the algorithm:
- J. MacQueen, Some methods for classification and analysis of multivariate observations (1967)
Basic Concept
Euclidean Distance
In a graph, the Euclidean distance between two nodes is calculated in the N-dimensional vector space formed by N node properties. The formula is as below; please refer to the chapter Numerical Similarity for more details.
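For reference, a standard form of the Euclidean distance between two nodes $x=(x_1,\dots,x_N)$ and $y=(y_1,\dots,y_N)$ in this vector space (the notation here is assumed, not taken from the Numerical Similarity chapter) is:

$$d(x,y)=\sqrt{\sum_{i=1}^{N}(x_i-y_i)^2}$$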

Cosine Similarity
Cosine similarity uses the cosine of the angle formed by two N-dimensional vectors in the vector space to indicate the similarity between them. In a graph, the N-dimensional vector space is formed by N node properties, and the cosine similarity between two nodes is calculated accordingly. The formula is as below; please refer to the chapter Numerical Similarity for more details.
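For reference, a standard form of the cosine similarity between two nodes $x$ and $y$ (notation assumed) is:

$$s(x,y)=\frac{\sum_{i=1}^{N}x_i\,y_i}{\sqrt{\sum_{i=1}^{N}x_i^2}\;\sqrt{\sum_{i=1}^{N}y_i^2}}$$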

Centroid
The centroid or geometric center of an object in N-dimensional space is the mean position of all the points in all of the coordinate directions. Informally, it is the point at which an infinitesimally thin cutout of the shape could be perfectly balanced on the tip of a pin (assuming uniform density and a uniform gravitational field).

A finite set of points always has a centroid, which is obtained by calculating the arithmetic mean ('average') of each coordinate component of these points. The centroid is the unique point that minimizes the sum of squared distances to this finite set of points.
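For reference, the centroid of a finite set of points $x^{(1)},\dots,x^{(n)}$ (notation assumed) is their arithmetic mean:

$$m=\frac{1}{n}\sum_{j=1}^{n}x^{(j)}$$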
The formula for selecting a node's centroid by Euclidean distance is as below, where $d(x,m_i)$ is the Euclidean distance between the current node $x$ and each centroid $m_i$; the centroid with the shortest distance is selected as the centroid of the current node $x$:

$$c(x)=\arg\min_{i}\,d(x,m_i)$$

The formula for selecting a node's centroid by cosine similarity is as below, where $s(x,m_i)$ is the cosine similarity between the current node $x$ and each centroid $m_i$; the centroid with the maximum similarity is selected as the centroid of the current node $x$:

$$c(x)=\arg\max_{i}\,s(x,m_i)$$

The algorithm starts with k nodes specified as the initial centroids of the clusters. In each iteration, the distance between each node and each centroid is calculated and the node is assigned to its closest centroid; then a new centroid is recalculated for each cluster. The iteration ends when the variation of the clusters meets the preset accuracy requirement, or when the number of iterations reaches the limit.
Please note that the selection of the initial centroids affects the final classification results; if more than one initial centroid has exactly the same property values, only one of them takes effect, while the other equivalent centroid(s) end up with empty cluster(s).
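To illustrate the iteration described above, here is a minimal Python sketch of the k-means loop over node property vectors; it is a conceptual illustration, not the implementation behind algo(k_means), and all names in it are assumed:

```python
import random

def euclidean(a, b):
    # Euclidean distance between two property vectors
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def k_means(vectors, k, loop_num, start_indices=None):
    # Choose initial centroids: the given node indices, or k random ones
    if start_indices is None:
        start_indices = random.sample(range(len(vectors)), k)
    centroids = [list(vectors[i]) for i in start_indices]

    for _ in range(loop_num):  # convergence check omitted; only the iteration limit is used
        # 1. Assign every vector to its closest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: euclidean(v, centroids[i]))
            clusters[nearest].append(v)
        # 2. Recompute each centroid as the arithmetic mean of its cluster
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters, centroids

# Toy usage: 4 nodes with 2 numeric properties each, 2 clusters, 3 iterations
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
clusters, centroids = k_means(points, k=2, loop_num=3, start_indices=[0, 2])
print(clusters)
```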
Special Case
Lonely Node, Disconnected Graph
Theoretically, the calculation of Euclidean distance or cosine similarity between two nodes does not depend on the existence of edges in the graph. Whether the two nodes are lonely nodes, or whether they belong to the same connected component, does not affect the calculation of their Euclidean distance or cosine similarity.
Self-loop Edge
The calculation of k-Means has nothing to do with edges.
Directed Edge
The calculation of k-Means has nothing to do with edges.
Results and Statistics
Following the example above, specify properties property_1, property_2 and property_3 to form the vector space, measure by Euclidean distance, select nodes 2, 4 and 5 as the initial centroids, and set the maximum number of iterations to 3:

Algorithm results: Nodes in the graph are classified into 3 clusters, returning community and ids
community | ids |
---|---|
0 | 9,10 |
1 | 1,2,4,5 |
2 | 3,6,7,8 |
Algorithm statistics: N/A
Command and Configuration
- Command: algo(k_means)
- Configurations for the parameter params():
Name | Type | Default | Specification | Description |
---|---|---|---|---|
start_ids | []_uuid | / | / | Specifies k nodes as the initial centroids by UUID; a centroid whose property values duplicate another centroid's yields an empty cluster; the system selects k initial centroids if not set |
k | int | 1 | [1, number of nodes]; mandatory | Classify nodes into k clusters |
distance_type | int | 1 | 1 or 2 | Regulates how the distance is calculated: 1 or not set means Euclidean distance, 2 means cosine similarity |
node_schema_property | []@<schema>?.<property> | / | Numeric node property, LTE needed; mandatory | At least two node properties that participate in the calculation of distance; schema can be either carried or not; nodes without any of the specified properties do not participate in the calculation |
loop_num | int | / | >=0; mandatory | The maximum number of iterations |
limit | int | -1 | >=-1 | Number of results to return; return all results if set to -1 or not set |
Example: Classify nodes into 3 clusters with the k-Means algorithm based on Euclidean distance, use property_1, property_2 and property_3 for the calculation, specify nodes with UUID 2, 4 and 5 as the initial centroids, and iterate 4 rounds at maximum
algo(k_means).params({
start_ids: [2,4,5],
k: 3,
distance_type: 1,
node_schema_property:["property_1", "property_2", "property_3"],
loop_num: 4
}) as k3 return k3
Algorithm Execution
Task Writeback
1. File Writeback
Configuration | Data in Each Row |
---|---|
filename | community:_id,_id,... |
Example: Classify nodes into 2 clusters with the k-Means algorithm based on cosine similarity, use property_1 and property_2 for the calculation, specify nodes with UUID 2 and 5 as the initial centroids, iterate 3 rounds at maximum, and write the algorithm results back to a file named communities
algo(k_means).params({
start_ids: [2,5],
k: 2,
distance_type: 2,
node_schema_property: ["property_1", "property_2"],
loop_num: 3
}).write({
file:{
filename: "communities"
}
})
2. Property Writeback
Not supported by this algorithm.
3. Statistics Writeback
This algorithm has no statistics.
Direct Return
Alias Ordinal | Type | Description | Column Name |
---|---|---|---|
0 | []perCommunity | UUID of nodes contained in each cluster | community, ids |
Example: Classify nodes into 3 clusters with the k-Means algorithm based on Euclidean distance, use property_1, property_2 and property_3 for the calculation, specify nodes with UUID 2, 4 and 5 as the initial centroids, iterate 4 rounds at maximum, define the algorithm results as an alias named k3, and return the results
algo(k_means).params({
start_ids: [2,4,5],
k: 3,
distance_type: 1,
node_schema_property:["property_1", "property_2", "property_3"],
loop_num: 4
}) as k3 return k3
Streaming Return
Alias Ordinal | Type | Description | Column Name |
---|---|---|---|
0 | []perCommunity | UUID of nodes contained in each cluster | community, ids |
Example: Classify nodes into 3 clusters with the k-Means algorithm based on Euclidean distance, use property_1, property_2 and property_3 for the calculation, specify nodes with UUID 2, 4 and 5 as the initial centroids, iterate 4 rounds at maximum, define the algorithm results as an alias named k3, and stream the results back
algo(k_means).params({
start_ids: [2,4,5],
k: 3,
distance_type: 1,
node_schema_property:["property_1", "property_2", "property_3"],
loop_num: 4
}).stream() as k3
return k3
Real-time Statistics
This algorithm has no statistics.