Hierarchical clustering comes in two forms: the agglomerative technique (a bottom-up hierarchy of clusters) and the divisive technique (a top-down hierarchy of clusters).
In agglomerative clustering, start by treating each data point as a separate cluster and keep merging records or clusters until all records have been combined into a single large cluster.
- Start with 'n' clusters, where 'n' is the number of data points
- At each step, merge two records, a record and a cluster, or two clusters, based on a distance criterion and a linkage function.
- Divisive clustering starts by considering all data points as one single cluster and keeps splitting groups in two until each data point is a single cluster.
- Divisive clustering can be more efficient than agglomerative clustering when the hierarchy does not need to be built all the way down to singleton clusters.
- Split the cluster with the largest SSE value.
- The splitting criterion can be Ward's criterion, or the Gini index in the case of categorical data.
- A stopping criterion, such as a desired number of clusters or a minimum cluster size, determines when to terminate.
After executing the algorithm, the number of clusters is chosen by examining the dendrogram. A dendrogram is a tree diagram that records the sequence of merges (or splits), presenting the data points as a multi-level nested partitioning into clusters.
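The agglomerative steps above can be sketched with SciPy's hierarchical-clustering utilities; this is a minimal sketch on made-up toy points, and Ward's linkage with a two-cluster cut is an illustrative choice, not the only option:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2-D points (illustrative values).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Agglomerative clustering: start with one cluster per point and
# repeatedly merge the closest pair under Ward's linkage criterion.
Z = linkage(X, method="ward")

# Cut the hierarchy so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree from which the cut is usually chosen by eye.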
Disadvantages of Hierarchical Clustering
Merges and splits are greedy and cannot be undone once made, and the algorithm does not work well on large datasets.
Types of Hierarchical Clustering
- BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies
- CURE - Clustering Using REpresentatives
- CHAMELEON - Hierarchical Clustering using Dynamic Modeling. This is a graph partitioning approach used in clustering complex structures.
- Probabilistic Hierarchical Clustering
- Generative Clustering Model
Density-Based Clustering: DBSCAN
- Clustering based on a local cluster criterion
- Can discover clusters of arbitrary shape and can handle outliers
- Density parameters are provided as the stopping condition
DBSCAN - Density-Based Spatial Clustering of Applications with Noise
Works on the basis of two parameters:
Eps - Maximum Radius of the neighbourhood
MinPts - Minimum number of points in the Eps-neighbourhood of a point
It works on the principle of density
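A minimal DBSCAN sketch with scikit-learn, assuming it is available; its `eps` and `min_samples` arguments correspond directly to the Eps and MinPts parameters above, and the data values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense groups plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [4.0, 4.0], [4.1, 4.1], [4.0, 4.2],
              [10.0, 10.0]])

# eps         -> Eps, the maximum radius of the neighbourhood
# min_samples -> MinPts, the minimum points in an Eps-neighbourhood
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # noise points are labelled -1
```

Note that scikit-learn counts the point itself toward `min_samples`, so each point in a group of three with `min_samples=3` is a core point here.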
OPTICS - Ordering Points To Identify the Clustering Structure
Works on the principle of varying density of clusters
OPTICS rests on two aspects: the core distance of a point and its reachability distance.
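A minimal OPTICS sketch with scikit-learn, assuming it is available; the two generated blobs of different density and the `min_samples` value are illustrative choices. The fitted model exposes both aspects named above as `core_distances_` and `reachability_`, in the visit order `ordering_`:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Toy data: two clusters of different densities, one tight and one loose.
rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
loose = rng.normal(loc=10.0, scale=1.0, size=(30, 2))
X = np.vstack([tight, loose])

# OPTICS orders points by reachability distance; each point also gets a
# core distance (the smallest radius that makes it a core point).
opt = OPTICS(min_samples=5).fit(X)
print(opt.labels_)  # cluster labels extracted from the reachability plot
```

Because OPTICS adapts to local density, it can separate these two blobs even though no single Eps value suits both.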
Grid-Based Clustering Methods
Create a grid structure by dividing the data space into a fixed number of cells.
From the grid's cells, identify clusters.
Uneven data distributions are challenging to handle.
The curse of dimensionality makes it challenging to cluster high-dimensional data.
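The two grid-based steps above can be sketched in plain NumPy; this is a toy illustration under our own simplifications, not an implementation of a published method such as STING: bin the points into fixed-size cells, keep the dense cells, and flood-fill side-adjacent dense cells into clusters.

```python
import numpy as np
from collections import deque

def grid_cluster(X, cell_size, density_threshold):
    # Step 1: partition the data space into fixed-size cells.
    cells = {}
    for i, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(i)

    # Step 2: keep only cells holding enough points ("dense" cells).
    dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}

    # Step 3: flood-fill side-adjacent dense cells into clusters.
    labels = np.full(len(X), -1)  # -1 marks points in sparse cells
    cluster_id = 0
    seen = set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for i in cells[cell]:
                labels[i] = cluster_id
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nb = (cell[0] + dx, cell[1] + dy)
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        cluster_id += 1
    return labels

# Toy data: two dense regions and one isolated point.
X = np.array([[0.1, 0.1], [0.2, 0.2], [0.3, 0.1],
              [5.1, 5.1], [5.2, 5.2], [5.3, 5.1],
              [9.9, 0.1]])
labels = grid_cluster(X, cell_size=1.0, density_threshold=2)
print(labels)
```

The design choice worth noting: after binning, the work is proportional to the number of occupied cells rather than pairwise point distances, which is the efficiency argument for grid-based methods.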
CLIQUE - CLustering In QUEst - This is both a density-based and a grid-based subspace clustering algorithm.
Three broad categories of measures in clustering: external, internal, and relative.
External Measures - used to compare the clustering output against ground truth provided by subject-matter expertise.
Four criteria for External Methods are:
Cluster Homogeneity - the purer the clusters, the better the cluster formation.
Cluster Completeness - objects that the ground truth places in the same cluster should be assigned to the same cluster.
Rag bag better than alien - assigning a heterogeneous object, one very different from the remaining points of a cluster, to that cluster should be penalized more than assigning it to a rag bag/miscellaneous/"other" category.
Small cluster preservation - splitting a large cluster into smaller clusters is much better than splitting a small cluster into smaller clusters.
Most Common External Measures
- Matching-based measures
- Maximum Matching
- F-measure (Precision & Recall)
- Entropy-based measures
- Entropy of Clustering
- Entropy of Partitioning
- Conditional Entropy
- Mutual Information
- Normalized Mutual Information (NMI)
- Pairwise measures
- True Positive
- False Negative
- False Positive
- True Negative
- Jaccard Coefficient
- Rand Statistic
- Fowlkes - Mallow Measure
- Correlation measures
- Discretized Hubert Statistic
- Normalized Discretized Hubert Statistic
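Several of the external measures listed above are implemented in scikit-learn, assuming it is available; the two label vectors below are toy values for illustration:

```python
from sklearn.metrics import (normalized_mutual_info_score, rand_score,
                             fowlkes_mallows_score)

# Ground-truth partition vs. a clustering output (toy labels).
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]  # one point placed in the wrong cluster

print(normalized_mutual_info_score(truth, pred))  # entropy-based: NMI
print(rand_score(truth, pred))                    # pairwise: Rand statistic
print(fowlkes_mallows_score(truth, pred))         # pairwise: FM measure
```

All three scores lie in [0, 1], with 1 meaning the clustering matches the ground truth exactly.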
Internal Measures - used to evaluate the goodness of a clustering without ground truth; the Silhouette coefficient is an example.
Most common internal measures:
- Beta-CV measure
- Normalized Cut
- Relative measure - Silhouette Coefficient
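A minimal Silhouette computation with scikit-learn, assuming it is available; k-means with two clusters on toy points is an illustrative setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two compact, well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The Silhouette coefficient lies in [-1, 1]; values near +1 mean each
# point sits well inside its own cluster, far from the neighbouring one.
score = silhouette_score(X, labels)
print(round(score, 3))
```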
Relative Measures - used to compare the results of clustering obtained from different parameter settings of the same algorithm.
Clustering Tendency Assessment Methods
- Spatial Histogram
- Distance Distribution
- Hopkins Statistic
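Of the three, the Hopkins statistic is easy to sketch directly. Below is a minimal NumPy/SciPy version under the common formulation; the function name and sampling choices are ours. Values near 0.5 suggest uniformly random data, while values near 1 suggest a strong clustering tendency.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, sample_size=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    tree = cKDTree(X)

    # Nearest-neighbour distances from uniform random points to the data.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(sample_size, d))
    u_dist, _ = tree.query(uniform, k=1)

    # Nearest-neighbour distances from sampled data points to the rest.
    idx = rng.choice(n, size=sample_size, replace=False)
    w_dist, _ = tree.query(X[idx], k=2)  # k=2: the first hit is the point itself
    w_dist = w_dist[:, 1]

    return u_dist.sum() / (u_dist.sum() + w_dist.sum())

# Toy data: two tight blobs, i.e. strongly clustered.
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.1, (50, 2)),
                       rng.normal(5, 0.1, (50, 2))])
print(round(hopkins_statistic(clustered), 3))  # close to 1 for clustered data
```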
Finding the K Value (Number of Clusters)
- Bootstrapping Approach
- Empirical Method
- Elbow Method
- Cross-Validation Method
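The elbow method, for instance, can be sketched with scikit-learn's KMeans, assuming it is available; three generated blobs are a toy setup, so the SSE curve should bend at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0, 5, 10)])

# Elbow method: compute the within-cluster SSE (inertia) for a range of k
# and look for the "elbow" where adding clusters stops paying off.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)}
for k, sse in inertias.items():
    print(k, round(sse, 1))  # the drop flattens sharply after k = 3
```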
360DigiTMG - Data Science, Data Scientist Course Training in Bangalore
No 23, 2nd Floor, 9th Main Rd, 22nd Cross Rd, 7th Sector, HSR Layout, Bengaluru, Karnataka 560102