№1, 2024

IMPROVED PARALLEL BIG DATA CLUSTERING BASED ON K-MEDOIDS AND K-MEANS ALGORITHMS

Rasim Alguliyev, Ramiz Aliguliyev, Lyudmila Sukhostat

In recent years, the amount of data created worldwide has grown exponentially. The increased computational complexity of working with Big data calls for new clustering approaches. The problem of clustering massive amounts of data can be solved using parallel processing: dividing the data into batches makes it possible to perform clustering in a reasonable time, and the reliability of the result obtained for each batch then determines the quality of the result for the entire dataset. The main idea of the proposed approach is to apply the k-medoids and k-means algorithms to parallel Big data clustering. The advantage of this hybrid approach is that it represents each cluster by its central object and is therefore less sensitive to outliers than k-means clustering. Experiments are conducted on two real datasets, YearPredictionMSD and Phone Accelerometer, and the proposed approach is compared with the k-means and MiniBatch k-means algorithms. The experimental results show that the proposed parallel implementation of k-medoids combined with k-means achieves higher accuracy and runs faster than the k-means algorithm (pp. 18-25).

Keywords: k-means, k-medoids, Big data, Parallel clustering, Batch clustering
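To make the batch-parallel scheme described in the abstract concrete, the following is a minimal Python sketch of one way such a hybrid could be arranged: the data are split into batches, each batch is clustered independently and in parallel, each batch centroid is snapped to its nearest actual data point (a medoid-style representative), and a final k-means over the pooled representatives produces the global centers. The cluster count, batch count, parameters, and the merge step are illustrative assumptions using scikit-learn, not the paper's exact procedure.

```python
# Illustrative sketch of batch-parallel hybrid clustering (assumed scheme,
# not the paper's exact algorithm). Requires NumPy and scikit-learn;
# multiprocessing stands in for the parallel execution of batches.
from multiprocessing import Pool

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

K = 5          # number of clusters (assumed for illustration)
N_BATCHES = 8  # number of parallel batches (assumed for illustration)

def batch_medoids(batch):
    """Cluster one batch with k-means, then return medoid-style
    representatives: the actual points nearest the fitted centroids."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(batch)
    idx = pairwise_distances_argmin(km.cluster_centers_, batch)
    return batch[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80_000, 10))      # placeholder data, not a real dataset
    batches = np.array_split(X, N_BATCHES)

    # Cluster all batches in parallel; pool the per-batch representatives.
    with Pool(processes=N_BATCHES) as pool:
        reps = np.vstack(pool.map(batch_medoids, batches))

    # Merge step: k-means over the pooled representatives yields the
    # final cluster centers, then every point is assigned to its nearest one.
    final = KMeans(n_clusters=K, n_init=10, random_state=0).fit(reps)
    labels = pairwise_distances_argmin(X, final.cluster_centers_)
    print("final centers shape:", final.cluster_centers_.shape)
```

Snapping each centroid to its nearest actual point keeps every representative inside the data, which is what gives medoid-based methods the robustness to outliers claimed in the abstract.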