K-Means Clustering and Its Real Use Cases in the Security Domain

Divyansh garg
Jul 19, 2021

📌 What is Clustering?

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). The goal of this unsupervised machine learning technique is to find similarities among the data points and group similar data points together. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

📌 K-means clustering

The K-means clustering algorithm computes centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The number of clusters that the algorithm identifies in the data is represented by ‘K’ in K-means.

📌 Working of K-Means Algorithm

We can understand the working of the K-Means clustering algorithm with the help of the following steps −

Step 1 − First, specify the number of clusters, K, that the algorithm should generate.

Step 2 − Next, randomly select K data points as the initial centroids and assign every data point to the cluster of its nearest centroid.

Step 3 − Now compute the centroid of each cluster.

Step 4 − Next, keep iterating the following until the centroids are optimal, that is, until the assignment of data points to clusters no longer changes:

  • 4.1 − First, compute the squared distance between each data point and every centroid.
  • 4.2 − Now, assign each data point to the cluster whose centroid is closest to it.
  • 4.3 − At last, recompute each centroid by taking the average of all data points in that cluster.
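The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function names (`k_means`, `squared_distance`, `mean_point`), the `max_iter` and `seed` parameters, and the toy 2-D data are all assumptions made for this example.

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance between two points (step 4.1).
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    # Coordinate-wise average of a non-empty list of points (step 4.3).
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 4.1 and 4.2: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4.3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, labels

# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. On real data the result can depend on the random initial centroids, so it is common to run the algorithm several times with different seeds.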

📌 Uses of K-Means Clustering in Security Domains

1. Crime document classification

Documents can be clustered into multiple categories based on their tags, topics, and content. This is a standard document-grouping problem, and k-means is well suited to it. Each document first has to be pre-processed and represented as a vector, using term frequency to identify the commonly used terms that characterize the document. The document vectors are then clustered to identify groups of similar documents.
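To make the "document as a vector" step concrete, here is a small sketch that turns text into term-frequency vectors using only the standard library. The function name `term_frequency_vectors`, the whitespace tokenizer, and the sample sentences are assumptions for illustration; a real pipeline would usually add steps such as stop-word removal.

```python
from collections import Counter

def term_frequency_vectors(documents):
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Shared vocabulary so every document maps to the same dimensions.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        # Term frequency = occurrences of the term / tokens in the document.
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

# Hypothetical sample documents.
documents = [
    "burglary reported at night",
    "night burglary at warehouse",
    "phishing email reported",
]
vocabulary, vectors = term_frequency_vectors(documents)
print(vocabulary)
for vector in vectors:
    print(vector)
```

The resulting vectors all have the same length, so they can be passed to any k-means implementation as ordinary data points.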

2. Call detail record analysis

A call detail record (CDR) is the information captured by telecom companies during a customer’s calls, SMS, and internet activity. Combined with customer demographics, this information provides greater insight into the customer’s needs. Using the unsupervised k-means clustering algorithm, we can cluster customers’ activity over 24 hours to understand segments of customers with respect to their hourly usage.
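As a rough sketch of what "usage by hours" could look like as input to k-means, the snippet below builds one 24-value profile per customer, where each value is the share of that customer's activity in a given hour. The record layout `(customer_id, hour)`, the function name `hourly_profiles`, and the sample records are hypothetical; real CDRs carry many more fields.

```python
def hourly_profiles(records):
    # records: iterable of (customer_id, hour) pairs, hour in 0..23.
    counts_by_customer = {}
    for customer_id, hour in records:
        counts = counts_by_customer.setdefault(customer_id, [0] * 24)
        counts[hour] += 1
    profiles = {}
    for customer_id, counts in counts_by_customer.items():
        total = sum(counts)
        # Normalize so heavy and light users are comparable by pattern.
        profiles[customer_id] = [c / total for c in counts]
    return profiles

# Hypothetical activity log: customer A is active in the morning and
# evening, customer B only at night.
records = [("A", 9), ("A", 9), ("A", 21), ("B", 2)]
profiles = hourly_profiles(records)
print(profiles["A"][9], profiles["A"][21], profiles["B"][2])
```

Each profile is a 24-dimensional point, so customers with similar daily rhythms end up close together and tend to fall into the same cluster.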

3. Insurance Fraud Detection

Machine learning has a critical role to play in fraud detection, with numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
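The "proximity to clusters" idea can be sketched as a nearest-centroid lookup. Everything in this example is an assumption for illustration: the two centroids, their names, and the two scaled features are made up, and k-means itself does not label a cluster as fraudulent. That label would come from checking which clusters contain known fraudulent claims in the historical data.

```python
import math

def nearest_cluster(point, centroids):
    # Return the name of the closest centroid and the distance to it.
    name = min(centroids, key=lambda n: math.dist(point, centroids[n]))
    return name, math.dist(point, centroids[name])

# Hypothetical centroids from an earlier k-means run on historical claims,
# with both features scaled to the range 0..1.
centroids = {
    "typical": (0.2, 0.8),
    "fraud_like": (0.9, 0.1),
}

# A new claim described by the same two scaled features.
new_claim = (0.85, 0.2)
cluster, distance = nearest_cluster(new_claim, centroids)
print(cluster, round(distance, 3))
```

A claim that lands close to a cluster dominated by past fraud cases can then be routed for manual review instead of being rejected automatically.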

4. Identifying crime localities

Given data on crimes in specific localities of a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or locality.

5. Cyber-Profiling Criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides the investigation division with information to classify the types of criminals who were at the crime scene.

Thanks for Reading!!!😄
