№1, 2022

MODIFICATION OF THE DBSCAN ALGORITHM FOR BIG DATA CLUSTERING

Aygul F. Fakhraddingizi

The development of Information and Communication Technologies (ICT) has led to the rapid growth of digital information and, consequently, to the emergence of the concept of big data. This creates a need to study big data and its essence, as well as the possibilities and problems of analytical technologies. Clustering is one of the main methods of analyzing big data. Its main purpose is to separate data into clusters according to certain characteristics. When clusters differ in size, density, and shape, detecting them becomes difficult. The article explores the density-based DBSCAN clustering algorithm for working with big data. A key feature of this algorithm is that it forms effective clusters by detecting noise points in big data. The algorithm was tested on real datasets containing noise points. The adjusted Rand index, homogeneity, and the Davies-Bouldin index were used to evaluate the experimental results. The proposed method was more effective than the traditional DBSCAN algorithm in detecting noise points (pp. 28-37).

Keywords: Big data, Clustering algorithms, Density-based clustering, DBSCAN algorithm
DOI: 10.25045/jpit.v13.i1.04
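For readers unfamiliar with the baseline, the following is a minimal sketch of the traditional DBSCAN algorithm and of the adjusted Rand index named in the abstract. It is not the modification proposed in the article, whose details are not given on this page. The function names, the toy points, the parameter values (eps = 1.5, min_pts = 3), and the choice to treat noise as its own class when scoring are all assumptions made for illustration.

```python
from collections import Counter
from math import comb, dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)          # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may become a border point later
            continue
        cluster += 1                       # i is a core point: start a cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also core: keep expanding
    return labels

def adjusted_rand_index(true, pred):
    """Adjusted Rand index (Hubert & Arabie, 1985) of two labelings."""
    n = len(true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:              # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
truth = [0, 0, 0, 0, 1, 1, 1, 1, 2]        # the isolated point is its own class

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)                              # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(adjusted_rand_index(truth, labels))  # 1.0
```

The neighborhood search here is a linear scan, so the whole run is quadratic in the number of points. That is adequate for a toy example but not for big data, where a spatial index or a partitioning scheme would normally replace `region_query`.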