№2, 2022
COMPARATIVE ANALYSIS OF CLUSTER VALIDITY INDICES IN TERMS OF CONSISTENCY
Cluster analysis is one of the key problems in data mining, the most important stage of the Knowledge Discovery from Data (KDD) process, and is widely used. It comprises three main tasks: determining the optimal number of clusters, applying a clustering algorithm, and evaluating the quality of the resulting clustering. Evaluating clustering quality is one of the most important of these steps, and a number of indices have been proposed for it. Analysis shows, however, that these indices often give inconsistent results, so extensive research has recently been devoted to studying existing indices and proposing new ones. This article examines a number of internal and external evaluation indices. Data sets of different sizes are taken, and the k-means, k-medoids, agglomerative hierarchical, BIRCH and OPTICS algorithms are applied to them. The clustering results are assessed with these internal and external indices and analyzed comparatively. The experiments show that the Ac, Pr, Rc and F-m indices give similar results in determining the groups in a given clustering structure (pp. 24-39).
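As an illustration of how external indices of this kind can be computed, the following is a minimal sketch that assumes the common pair-counting definitions of accuracy, precision, recall and F-measure. The abstract does not state which definitions the article uses, so the formulas, the function names `pair_counts` and `external_indices`, and the toy labels are all assumptions made for this example.

```python
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Classify every pair of points by whether the two points share a
    reference class and whether they share a predicted cluster."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            tp += 1  # grouped together in both partitions
        elif same_cluster:
            fp += 1  # grouped together only by the clustering
        elif same_class:
            fn += 1  # grouped together only by the reference classes
        else:
            tn += 1  # separated in both partitions
    return tp, fp, fn, tn


def external_indices(true_labels, pred_labels):
    """Pair-counting accuracy, precision, recall and F-measure
    (assumed definitions, not taken from the article)."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical toy data: six points, two reference classes,
    # and a clustering that misplaces one point.
    reference = [0, 0, 0, 1, 1, 1]
    clustering = [0, 0, 1, 1, 1, 1]
    for name, value in external_indices(reference, clustering).items():
        print(f"{name}: {value:.3f}")
```

Because all four values are derived from the same four pair counts, they tend to move together when a clustering changes, which is one plausible reason such indices might agree with one another. Under these definitions, the pair-counting accuracy coincides with the Rand index.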