🎓K-Means Clustering

Lovely Shrivas


📙What is Unsupervised Learning?

Unsupervised learning is a type of machine learning used to draw inferences from datasets that consist of input data without labelled responses. The training data is a collection of examples with no labels attached.

📘What is Clustering?

“Clustering is the process of dividing a dataset into groups consisting of similar data points”. It means grouping objects based on the information found in the data that describes the objects or their relationships.

Why is Clustering Used?

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes the partitioning itself is the goal.

Where is it used?

  • Market segmentation
  • Document clustering
  • Image segmentation
  • Image compression
  • Vector quantization
  • Cluster analysis
  • Feature learning or dictionary learning
  • Identifying crime-prone areas
  • Insurance fraud detection
  • Public transport data analysis
  • Clustering of IT assets
  • Customer segmentation
  • Identifying Cancerous data
  • Used in search engines
  • Drug Activity Prediction

Types of Clustering

  1. Exclusive Clustering (K-Means)
  2. Overlapping Clustering (Fuzzy C-Means)
  3. Hierarchical Clustering

📗What is K-Means Clustering?

K-means is the process of classifying objects into a predefined number of groups so that objects are as dissimilar as possible from one group to another, but as similar as possible within the same group.

K-means clustering is a very famous and powerful unsupervised machine learning algorithm. It is used to solve many complex unsupervised machine learning problems.

The K-means clustering algorithm is an iterative unsupervised learning method that groups a dataset into k predefined, non-overlapping clusters (subgroups). It makes the points inside each cluster as similar as possible while keeping the clusters as far apart as possible. It allocates the data points to clusters so that the sum of the squared distances between each data point and its cluster's centroid is at a minimum. At that point, the centroid of each cluster is the arithmetic mean of the data points in that cluster.
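To see why the centroid is taken to be the arithmetic mean, here is a minimal sketch in plain Python. The three points and the helper names (`sum_squared_distance`, `centroid`) are made up for illustration. It checks that moving the center away from the mean only increases the sum of squared distances.

```python
def sum_squared_distance(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, center))
        for point in points
    )


def centroid(points):
    """Arithmetic mean of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


# Made-up example cluster (not from any real dataset).
cluster = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]

mean = centroid(cluster)
sse_at_mean = sum_squared_distance(cluster, mean)
print(mean, sse_at_mean)

# Any other candidate center gives a larger sum of squared distances.
for shifted in [(2.5, 2.0), (2.0, 1.5), (1.0, 1.0)]:
    assert sum_squared_distance(cluster, shifted) > sse_at_mean
```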

How the K-Means Algorithm Works

The K-Means clustering algorithm needs the following inputs:

  • K = number of subgroups or clusters
  • Sample or training set = {x1, x2, x3, ..., xn}

STEPS:-

  1. First, decide the number of clusters to create (an initial guess).
  2. Then provide an initial centroid for each cluster (also a guess).
  3. The algorithm calculates the Euclidean distance from each point to every centroid and assigns the point to the closest cluster.
  4. Next, the centroids are recalculated from the newly formed clusters.
  5. The distances from the points to the cluster centers are calculated again, and each point is reassigned to the closest cluster.
  6. The new centroid of each cluster is then calculated again.
  7. These steps are repeated until the centroids repeat or the new centroids are very close to the previous ones.
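The steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not a production implementation. The function names, the choice of sampling the initial centroids from the data points, the `tol` and `max_iters` parameters, and the rule for empty clusters are all assumptions made for this example.

```python
import math
import random


def closest_centroid(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean_point(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iters=100, tol=1e-9, seed=0):
    """Basic k-means. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Steps 1-2: k is given; guess the initial centroids by sampling k data points.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Steps 3 and 5: assign every point to its closest centroid.
        labels = [closest_centroid(p, centroids) for p in points]
        # Steps 4 and 6: recompute each centroid as the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Assumption: a cluster that ends up empty keeps its old centroid.
            new_centroids.append(mean_point(members) if members else centroids[i])
        # Step 7: stop once the centroids have (almost) stopped moving.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Made-up data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initial guess.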

How to decide the number of clusters?

This can be done by two methods:

Elbow Method:- First, compute the sum of squared errors (SSE) for several values of K. The SSE is defined as the sum of the squared distances between each member of a cluster and that cluster's centroid.
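Written as a formula, the definition above looks like this. The symbols are notation assumed here for illustration: C_j is the j-th cluster, μ_j is its centroid, and x_i is a data point.

```latex
\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```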

The elbow method is one of the best-known ways to select a good value of k and improve model performance. Choosing k is a form of hyperparameter tuning. Let us see how the elbow method works.

It is an empirical method for finding the best value of k. It tries a range of values of k and computes the SSE for each one. The SSE keeps shrinking as k grows, so the value to pick is the one at the “elbow” of the SSE-versus-k curve, where adding another cluster stops reducing the SSE by much.
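Here is a small sketch of the elbow idea on made-up one-dimensional data with three obvious groups. The helper names (`kmeans_sse_1d`, `best_sse`), the number of restarts, and the data are assumptions for this example. A compact 1-D k-means is included only so that the snippet runs on its own.

```python
import random


def kmeans_sse_1d(values, k, iters=50, seed=0):
    """Run a basic k-means on 1-D data and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # initial guess: k of the data values
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centroids[i]) ** 2)
            clusters[nearest].append(v)
        # New centroid = mean of the members; an empty cluster keeps its old one.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # SSE: squared distance from each value to its closest centroid.
    return sum(min((v - c) ** 2 for c in centroids) for v in values)


def best_sse(values, k, restarts=20):
    """Lowest SSE over several random initial guesses."""
    return min(kmeans_sse_1d(values, k, seed=s) for s in range(restarts))


# Made-up data: three groups, around 1, 5 and 9.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.8]

sse_by_k = {k: best_sse(values, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(f"k={k}: SSE={sse:.3f}")
```

In the printed values the SSE should fall sharply up to k=3 and then flatten out. That bend is the elbow, and it suggests k=3 for this data.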

Purpose Method:- In this method, the data is clustered according to different criteria, and each result is then judged on how well it serves the purpose at hand. For example, the shirts in the men’s clothing department of a mall may be arranged by size, but they could also be arranged by price or by brand. The arrangement that best suits the purpose determines the number of clusters, i.e. the value of K.

Pros of the K-Means Clustering Algorithm

  • It is fast
  • Robust
  • Comparatively efficient
  • Flexible
  • Easy to interpret
  • Better computational cost
  • Enhances Accuracy

Cons of the K-Means Clustering Algorithm

  • Sensitive to outliers and noisy data
  • Does not work well when the clusters are not linearly separable
  • Lacks consistency: different initial centroids can give different results
  • Sensitive to the scale of the features
  • Very large data sets can be slow to process and may exceed the available memory
  • Prediction issues