K-Means Clustering - Beginner's Guide

Author: Billy Peterson
Posted: Jul 31, 2021

This blog post discusses the K Means clustering machine learning algorithm. K Means clustering, unlike the KNN algorithm, is an unsupervised learning algorithm. In unsupervised learning there is no target output, which means the system receives no labeled examples to train on. Instead, the system must learn on its own by identifying and responding to structural properties in the input patterns. The unsupervised learning method uses unlabeled data to produce output that depends only on the observations. Because there are no labels to check against, the results of unsupervised learning are generally harder to validate than those of supervised learning.

What Is Clustering?

Clustering is a popular exploratory data analysis tool for gaining an understanding of the data's structure. It is the task of identifying subgroups in data so that data points within the same subgroup (cluster) are very similar while data points in different clusters are very dissimilar. To put it another way, we try to discover homogeneous subgroups within the data so that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean-based distance or correlation-based distance. The choice of similarity measure depends on the application.
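
To make the difference between the two similarity measures concrete, here is a minimal sketch in plain Python. The function names and the two example vectors are illustrative assumptions, not part of any library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 minus the Pearson correlation of two equal-length vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Hypothetical data: the same spending pattern at two different scales
low_spender = [1.0, 2.0, 3.0]
high_spender = [10.0, 20.0, 30.0]

print(euclidean_distance(low_spender, high_spender))    # roughly 33.67: far apart
print(correlation_distance(low_spender, high_spender))  # effectively 0: same shape
```

Euclidean distance treats these two vectors as very different because of their scale, while correlation distance treats them as nearly identical because they rise in the same way. Which behaviour is preferable depends on the application.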

Clustering analysis can be performed on the basis of features (finding subgroups of samples based on features) or on the basis of samples (finding subgroups of features based on samples). Clustering based on features will be discussed here. Clustering is used in market segmentation, where we try to find customers with similar behaviours or traits; in image segmentation and compression, where we try to group similar regions together; in document clustering based on topics; and so on.

Clustering is considered an unsupervised learning method because, unlike in supervised learning, there is no ground truth against which to compare the clustering algorithm's output and evaluate its performance. We just want to look into the data's structure by dividing the data points into distinct subgroups.

In this blog, we'll look only at K Means, which is one of the most popular clustering algorithms due to its ease of use.

K-Means Algorithm:

The K Means algorithm is an iterative technique that attempts to split a dataset into K separate non-overlapping subgroups (clusters), where each data point belongs to only one cluster. It attempts to make the data points within a cluster as similar as possible while keeping the clusters as distinct (far apart) as possible. It assigns data points to clusters in such a way that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points in that cluster) is as small as possible. The less variance there is within a cluster, the more homogeneous (similar) its data points are.
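
The quantity being minimised can be computed directly. The sketch below is a minimal plain-Python illustration; the function names and the four example points are assumptions made for this example.

```python
def centroid(points):
    """Arithmetic mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its own cluster's centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total

# Hypothetical data: the same four points grouped in two different ways
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
mixed = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

print(within_cluster_ss(tight))  # 1.0
print(within_cluster_ss(mixed))  # 99.0
```

The grouping that keeps nearby points together has a far smaller sum of squared distances, which is why K Means prefers it.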

Learn how to build K Means clustering in Python.

The K Means algorithm works as follows:

  • Specify the number of clusters, K.

  • Initialize the centroids by shuffling the dataset and then picking K data points at random, without replacement, to serve as the centroids.

  • Repeat the following three steps until the centroids stop changing, i.e. until the assignment of data points to clusters no longer changes.

  • Calculate the squared distance between each data point and every centroid.

  • Assign each data point to the cluster whose centroid is closest to it.

  • Recalculate each cluster's centroid by averaging all of the data points assigned to that cluster.
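
The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function names, the `max_iter` and `seed` parameters, and the six example points are assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of points and a cluster count k."""
    rng = random.Random(seed)
    # Initialize: pick k data points at random, without replacement
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster whose centroid is closest
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary and depend on the random initialization; what matters is which points end up sharing a label.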

Applications:

The K Means technique is widely used in a range of applications, including market segmentation, document clustering, and image segmentation and compression, among others. When we do cluster analysis, we normally want to achieve one of two things:

  • Get a good sense of the structure of the data we're working with.

  • Cluster-then-predict: if we assume there is wide variation in the behaviors of distinct subgroups, we first cluster the data and then develop a different model for each subgroup. Clustering patients into distinct subgroups and developing a model for each subgroup to predict the likelihood of having a heart attack is an example of this.
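
A rough sketch of the cluster-then-predict idea is shown below. Everything in it is an assumption made for illustration: the data is synthetic (not medical data), the feature is one-dimensional, the centroids are assumed to come from an earlier K Means run, and a simple least-squares line stands in for whatever model would be used per subgroup.

```python
def nearest_centroid(x, centroids):
    """Index of the centroid closest to a 1-D value x."""
    return min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)

def fit_line(xs, ys):
    """Least-squares slope and intercept; assumes xs are not all identical."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cluster_then_predict(xs, ys, centroids):
    """Fit one line per cluster and return a predict(x) function."""
    groups = {j: ([], []) for j in range(len(centroids))}
    for x, y in zip(xs, ys):
        j = nearest_centroid(x, centroids)
        groups[j][0].append(x)
        groups[j][1].append(y)
    # One model per subgroup (clusters with fewer than 2 points are skipped)
    models = {j: fit_line(gx, gy) for j, (gx, gy) in groups.items() if len(gx) >= 2}

    def predict(x):
        slope, intercept = models[nearest_centroid(x, centroids)]
        return slope * x + intercept

    return predict

# Synthetic data: two subgroups whose outcome follows opposite trends
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [2.0, 4.0, 6.0, 10.0, 9.0, 8.0]
centroids = [2.0, 11.0]  # assumed output of a previous K Means run

predict = cluster_then_predict(xs, ys, centroids)
print(predict(2.5))   # 5.0, from the first subgroup's rising line
print(predict(10.5))  # 9.5, from the second subgroup's falling line
```

A single line fitted to all six points would describe neither subgroup well, which is the motivation for fitting a separate model per cluster.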

Conclusion:

K Means clustering is one of the most popular clustering methods, and it is frequently the first method people try when working on a clustering problem to gain a sense of the dataset's structure. The purpose of K Means is to divide data into discrete, non-overlapping groups. It performs well when the clusters have a roughly spherical shape.

About the Author

Teaching at Favtutor - an online tutoring platform. Java, Python, C++, R, Php, Data Science, Machine Learning.

Member since: Jul 28, 2021
Published articles: 14
