MODIFICATION OF THE DBSCAN ALGORITHM FOR BIG DATA CLUSTERING - Problems of Information Technology

AZERBAIJAN NATIONAL ACADEMY OF SCIENCES

№1, 2022

MODIFICATION OF THE DBSCAN ALGORITHM FOR BIG DATA CLUSTERING

Aygul F. Fakhraddingizi

The development of Information and Communication Technologies (ICT) has led to rapid growth in digital information and, consequently, to the emergence of the concept of big data. This creates a need to examine the nature of big data and the possibilities and problems of the technologies used to analyze it. Clustering is one of the main methods of big data analysis; its purpose is to partition data into clusters according to certain characteristics. Detecting clusters becomes difficult when they differ in size, density, and shape. The article explores the density-based DBSCAN clustering algorithm for working with big data. One of the main features of this algorithm is that it forms effective clusters by detecting noise points in the data. The algorithm was tested on real datasets containing noise points. Metrics such as the adjusted Rand index, homogeneity, and the Davies-Bouldin index were used to evaluate the experimental results. The proposed method was more effective than the traditional DBSCAN algorithm in detecting noise points (pp. 28-37).
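To illustrate how density-based clustering separates noise points from clusters, the following is a minimal sketch of the traditional DBSCAN procedure in plain Python. It is not the modification proposed in the article, whose details are not given in this abstract. The function names, the parameter names `eps` and `min_pts`, the toy data, and the brute-force neighborhood search are all assumptions made for illustration; a quadratic-time search like this one would not scale to big data.

```python
import math

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # marker for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Traditional DBSCAN sketch: one label per point, -1 meaning noise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these assumed parameters, the two squares form clusters 0 and 1, and the isolated point (5, 5) receives the noise label -1. Both `eps` and `min_pts` strongly affect which points are treated as noise.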

Keywords: Big data, Clustering algorithms, Density-based clustering, DBSCAN algorithm
DOI: 10.25045/jpit.v13.i1.04