Clustering is a method that groups a set of objects according to their characteristics and similarities. It is an unsupervised learning technique and is commonly used for statistical data analysis in many fields.
Data Clustering Models
There are different ways to implement this partitioning. Each model relies on distinct algorithms, which differ in how they organize clusters and in the kind of relationship they assume among members. The most important ones are:
- Centroid: Each cluster is represented by a single mean vector. Each object's values are compared with these means
- Distribution: Clusters are modeled using statistical distributions
- Connectivity: Clusters are built from a distance function between elements
- Group: The algorithm provides only the grouping information, without a refined model of the result
- Graph: The organization of the cluster and the relationships between its members are defined by a graph structure
- Density: Members of a cluster are grouped by regions where observations are dense and similar
Clustering Algorithms in Data Mining
Data mining is a set of techniques and technologies that allow people to explore large databases and find recurring patterns, trends, or rules that explain the behavior of the data in a given context. Clustering can be applied to a data set to partition the information. The choice of algorithm always depends on the features of the data set and on the goal of the analysis.
In the centroid-based method, every cluster is represented by a vector of values (its centroid). Each object belongs to the cluster whose centroid is closest to it. The number of clusters must be defined in advance, which is the main drawback of this kind of algorithm. This approach is the closest to classification and is widely used for optimization problems.
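The centroid-based idea can be illustrated with k-means, a well-known algorithm of this family. The following is a minimal sketch, not a reference implementation: the function name `kmeans`, the sample points, and the choice of the first k points as starting centroids are all assumptions made for the example (practical implementations usually pick starting centroids at random).

```python
import math


def kmeans(points, k, iterations=100):
    """Alternate between assigning points and updating centroids."""
    # Assumption for this sketch: the first k points are the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

Note that `k` has to be supplied by the caller, which is exactly the limitation described above: the algorithm cannot discover the number of clusters on its own.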
The distribution-based method groups objects whose values belong to the same statistical distribution. Because the values are assumed to be generated at random from those distributions, this approach needs a well-defined and often complex model to fit real data well.
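One common instance of this approach is a Gaussian mixture fitted with expectation-maximization (EM). The sketch below handles one-dimensional data only and assumes that every cluster follows a normal distribution; the function names, the sample data, and the initial guesses for the means are hypothetical choices for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, initial_means, iterations=50):
    """EM for a 1-D Gaussian mixture; one component per initial mean."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: how responsible is each component for each observation?
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances, resp


# Hypothetical sample data: values scattered around 1.0 and around 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = fit_gmm_1d(data, initial_means=[0.0, 6.0])
# Hard assignment: each value goes to its most responsible component.
labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
```

Unlike a centroid method, each object receives a probability of belonging to every cluster (`resp`), and the hard labels are derived from those probabilities only at the end.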
In connectivity-based algorithms, every object is related to its neighbors. Based on this assumption, clusters are built from nearby objects and can be represented as a hierarchy. The choice of distance function depends on the focus of the analysis.
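A simple way to see how nearby objects form a hierarchy is agglomerative clustering with single linkage, sketched below. This is a deliberately naive version for small inputs; the function name `single_linkage`, the use of Euclidean distance, and the sample points are assumptions made for the example.

```python
import math


def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []  # merge history, which encodes the hierarchy

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters))
             for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b], linkage(clusters[a], clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two small groups that are far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, merges = single_linkage(points, num_clusters=2)
```

Replacing `math.dist` or the `min` inside `linkage` (for example with `max`, which gives complete linkage) changes which objects count as close, which is the sense in which the distance function depends on the focus of the analysis.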
Density-based algorithms create clusters from regions of the data set where members are densely concentrated. They combine a notion of distance with a minimum density threshold to group members into clusters. These methods may be less effective at detecting the border areas of a cluster.
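DBSCAN is a widely used algorithm of this kind, and a compact sketch of it follows. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold), the brute-force neighbor search, and the sample points are assumptions made for the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id, or NOISE (-1) for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # point i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point, keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical sample data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The number of clusters is not given in advance; it emerges from `eps` and `min_pts`. A border point that lies within reach of two clusters is assigned to whichever cluster reaches it first, which is one reason the edges of a cluster are detected less reliably.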
Cluster Analysis Main Applications
One of the most important applications is image processing: detecting distinct kinds of patterns in image data. This is very effective in biology research, for distinguishing objects and identifying patterns.
Personal data combined with location, interests, actions, and many other indicators can be analyzed with this methodology, providing valuable information for market research, web analytics, and marketing strategies, among others.