K-means Clustering and Cyber Security

4 min readSep 26, 2021

Clustering

Clustering is the process of dividing the data space or data points into a number of groups, such that data points in the same groups are more similar to other data points in the same group, and dissimilar to the data points in other groups. A cluster refers to a collection of data points aggregated together because of certain similarities.

K-Means Clustering

K means is one of the most popular Unsupervised Machine Learning Algorithms Used for Solving Classification Problems. K Means segregates the unlabeled data into various groups, called clusters, based on having similar features, common patterns.

In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.

The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.

Algorithm steps Of K Means Algorithm

The working of the K-Means algorithm is explained in the below steps:

Step 1: Select the value of K, to decide the number of clusters to be formed.

Step 2: Select random K points which will act as centroids.

Step 3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.

Step 4: place a new centroid of each cluster.

Step 5: Repeat step no.3, which reassign each data point to the new closest centroid of each cluster.

Step 6: If any reassignment occurs, then go to step-4 else go to Step 7.

Step 7: FINISH

Applications of K-Means Clustering:

K-means can be applied to data that has a smaller number of dimensions, is numeric, and is continuous. such as document clustering, identifying crime-prone areas, customer segmentation, insurance fraud detection, public transport data analysis, clustering of IT alerts…etc.

K- Means Clustering and Cyber Security

In Cyber Security domain, K- Means clustering is used in various systems. For example, Malware Detection System, Analyzing Logs from Proxy Server and Captive Portal, Spam Filtering, Cyber Profiling, etc.

MALWARE DETECTION SYSTEM

Malware detection refers to the process of detecting the presence of malware on a host system or of distinguishing whether a specific program is malicious or benign. Malware detection technique plays vital role in detecting malware attack that can give high impact towards the cyber world. By using clustering, unsupervised machine learning is able to detect malware attack by identifying the behavior of the malware.

Clustering detection model by using K-Means clustering approach to detect malware behavior of data based on the features of the malware. Clustering techniques that use unsupervised algorithm in machine learning plays an important role in grouping similar malware characteristics by studying the behavior of the malware which results in, model is capable to cluster normal and suspicious data into two separate groups with high detection rate which is more than 90 percent accuracy.

Cyber Profiling

The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene. Profiling is more specifically based on what is known and not known about the criminal .

Profiling is information about an individual or group of individuals that are accumulated, stored, and used for various purposes, such as by monitoring their behavior through their internet activity .

Spam Filtering

Electronic mail (email) has become an essential element for Internet users. The unwanted emails are known as spam email. These emails are sent in bulk to large number of recipients. This increased volume of spam email results a most common problem i.e. maintaining email inbox. Spam Email is major issue for internet community because it causes wastage of resources and also pollutes our environment. To prevent these adverse effects of spam email, spam filtering is essential task.

K-means Clustering is an effective way of identifying spam. The way that it works is by looking at the different sections of the email (header, sender, and content). The data is then grouped together. These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%.

Thanks For Reading