K-Means Clustering in Security Domain

Samriddhi Mishra
4 min readAug 12, 2021

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best value for K center points or centroids by an iterative process.
  • Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create a cluster.
Elbow Method
Silhouette Method

One of the most challenging tasks in this clustering algorithm is to choose the right values of k. What should be the right k-value? How to choose the k-value? Let us find the answer to these questions. If you are choosing the k values randomly, it might be correct or may be wrong. If you will choose the wrong value then it will directly affect your model performance. So there are two methods by which you can select the right value of k.

1) Elbow Method.
2) Silhouette Method.

Use-Cases in Security Domain

Cyber Profiling:

The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene. Profiling is more specifically based on what is known and not known about the criminal. Profiling is information about an individual or group of individuals that are accumulated, stored, and used for various purposes, such as by monitoring their behavior through their internet activity . Difficulties in implementing cyber profiling is on the diversity of user data and behavior when online is sometimes different from actual behavior. Given the privilege in personal behavior, inductive generalizations can be very reliable but can also lead to a misunderstanding of behavior analysis. Therefore the cyber-profiling process is via a combination of deductive and inductive methods. For investigation, the cyber-profiling process gives a good, contributing to the field of forensic computer science. Cyber Profiling is one of the efforts made by the investigator, to know the alleged offenders through the analysis of data patterns that include aspects of technology, investigation, psychology, and sociology.

The flow of application of the K-Means algorithm in the profiling process

Establishment of Intrusion Detection Model:

Four general intrusion detection model is set up, the first to use collection system, guarantee the connection records in the process of use, and can get clustering analysis of data sets, and then with the help of clustering algorithm distribution connection records, distinguish normal and abnormal connection records. In this study, the k-means algorithm was used to complete cluster analysis. The clustering algorithm results in more clustering, so there are some connection records in each cluster. According to the properties of a given connection record, the properties can be used to determine the two kinds of abnormal clustering and normal clustering. The exception clustering represents the clustering of the abnormal connection records, and the normal clustering represents the clustering of the normal connection records.

The vast amount of data generated in the Internet era undoubtedly challenges the technology of large-scale data processing and data mining. In this paper, we study the problem of network security by using a k-means clustering algorithm in data mining. Analyses the network security problems and performance better intrusion detection system in network security analysis simulation, let more people know the network intrusion behavior produces a variety of ways and means. In this way, we can ensure the security of the network information in the network information leak serious today.

--

--