№2, 2021

COMPARATIVE ANALYSIS OF K-MEANS, K-MEANS++ AND MINI BATCH K-MEANS ALGORITHMS IN PYTHON ENVIRONMENT

Elton Y. Ahmadov

This article examines the application of the k-means algorithm and its modifications to datasets of different dimensions in the Python environment. It also reviews the current state, capabilities, shortcomings, and open problems of the traditional k-means clustering algorithm and its modifications, and offers suggestions for addressing them. The k-means++ algorithm removes a key weakness of traditional k-means, the purely random selection of initial centers. The mini-batch k-means algorithm processes big data in small batches, which speeds up the analysis of large and complex datasets. A hybrid of PCA and the elbow method is proposed to reduce dimensionality during clustering and to find the optimal number of clusters. To evaluate the effectiveness of this approach, the algorithms are tested on several datasets of different sizes. The silhouette and Davies-Bouldin indices are used to evaluate the algorithms' clustering quality. The experimental results show that the proposed approach is more efficient when clustering big data. The proposed hybrid PCA and elbow method creates new opportunities for solving problems that require large computing resources when analyzing large, multidimensional data (pp.119-128).

Keywords: data mining, clustering, k-means, k-means++, mini batch k-means, elbow, PCA.
DOI: 10.25045/jpit.v12.i2.11
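
The difference between random and k-means++ initialization described in the abstract, and the inertia values that the elbow method inspects, can be illustrated with a minimal sketch. It is not the article's code: it uses only the Python standard library, and the function names (`random_init`, `kmeans_pp_init`, `kmeans`), the synthetic three-blob data, and the fixed seed are all assumptions made for illustration. It assumes the data contain at least k distinct points, and it does not cover PCA, mini-batch updates, or the silhouette and Davies-Bouldin indices.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_init(points, k, rng):
    """Traditional k-means start: k centers drawn uniformly at random."""
    return rng.sample(points, k)


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: the first center is uniform; each later center is
    drawn with probability proportional to its squared distance from the
    nearest center chosen so far. Assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def kmeans(points, k, init, rng, max_iter=100):
    """Lloyd's iterations from the given initializer.
    Returns (centers, inertia), where inertia is the within-cluster
    sum of squared distances."""
    centers = init(points, k, rng)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, inertia


# Hypothetical data: three Gaussian blobs in two dimensions.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in blob_centers
    for _ in range(50)
]

# Elbow method: print inertia for increasing k and look for the bend.
for k in range(1, 7):
    _, inertia = kmeans(points, k, kmeans_pp_init, rng)
    print(k, round(inertia, 1))
```

Passing `random_init` instead of `kmeans_pp_init` to `kmeans` gives the traditional variant, so the two initializations can be compared on the same data.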