№1, 2024

IMPROVED PARALLEL BIG DATA CLUSTERING BASED ON K-MEDOIDS AND K-MEANS ALGORITHMS

Rasim Alguliyev, Ramiz Aliguliyev, Lyudmila Sukhostat

In recent years, the amount of data created worldwide has grown exponentially. The increased computational complexity of working with Big data calls for new clustering approaches. The problem of clustering massive amounts of data can be solved using parallel processing: dividing the data into batches makes it possible to perform clustering in a reasonable time, and the reliability of the result obtained for each batch then determines the quality of the result for the entire dataset. The main idea of the proposed approach is to apply the k-medoids and k-means algorithms to parallel Big data clustering. The advantage of this hybrid approach is that it represents each cluster by its central object and is therefore less sensitive to outliers than k-means clustering. Experiments are conducted on two real datasets, YearPredictionMSD and Phone Accelerometer, and the proposed approach is compared with the k-means and MiniBatch k-means algorithms. The experimental results show that the proposed parallel implementation of k-medoids combined with k-means achieves higher accuracy and runs faster than the k-means algorithm (pp. 18-25).

Keywords: k-means, k-medoids, Big data, Parallel clustering, Batch clustering
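To make the batch-parallel scheme described in the abstract concrete, the following is a minimal Python sketch of one way such a hybrid could be arranged: the data are split into batches, each batch is clustered independently and in parallel, each batch centroid is snapped to its nearest actual data point (a medoid-style representative), and a final k-means over the pooled representatives produces the global centers. The cluster count, batch count, parameters, and the merge step are illustrative assumptions using scikit-learn, not the paper's exact procedure.

```python
# Illustrative sketch of batch-parallel hybrid clustering (assumed scheme,
# not the paper's exact algorithm). Requires NumPy and scikit-learn;
# multiprocessing stands in for the parallel execution of batches.
from multiprocessing import Pool

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

K = 5          # number of clusters (assumed for illustration)
N_BATCHES = 8  # number of parallel batches (assumed for illustration)

def batch_medoids(batch):
    """Cluster one batch with k-means, then return medoid-style
    representatives: the actual points nearest the fitted centroids."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(batch)
    idx = pairwise_distances_argmin(km.cluster_centers_, batch)
    return batch[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80_000, 10))      # placeholder data, not a real dataset
    batches = np.array_split(X, N_BATCHES)

    # Cluster all batches in parallel; pool the per-batch representatives.
    with Pool(processes=N_BATCHES) as pool:
        reps = np.vstack(pool.map(batch_medoids, batches))

    # Merge step: k-means over the pooled representatives yields the
    # final cluster centers, then every point is assigned to its nearest one.
    final = KMeans(n_clusters=K, n_init=10, random_state=0).fit(reps)
    labels = pairwise_distances_argmin(X, final.cluster_centers_)
    print("final centers shape:", final.cluster_centers_.shape)
```

Snapping each centroid to its nearest actual point keeps every representative inside the data, which is what gives medoid-based methods the robustness to outliers claimed in the abstract.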