MODIFICATION OF THE DBSCAN ALGORITHM FOR BIG DATA CLUSTERING

Aygul F. Fakhraddingizi

doi:http://doi.org/10.25045/jpit.v13.i1.04

№1, 2022

aygul.fexreddin@gmail.com

The development of information and communication technologies (ICT) has led to rapid growth in digital information and, consequently, to the emergence of the concept of large-scale data (big data). This creates a need to examine big data, its nature, and the possibilities and problems of the technologies used to analyze it. Clustering is one of the main methods of big data analysis; its purpose is to partition data into clusters according to certain characteristics. When clusters differ in size, density, and shape, detecting them becomes difficult. The article explores the density-based DBSCAN clustering algorithm for working with big data. One of the main features of this algorithm is that it forms effective clusters by detecting the noise points in the data. Real datasets containing noise points were used in the experiments. The adjusted Rand index, homogeneity, and the Davies-Bouldin index were used to evaluate the results. The proposed method was more effective than the traditional DBSCAN algorithm in detecting noise points (pp. 28-37).

Keywords: Big data, Clustering algorithms, Density-based clustering, DBSCAN algorithm
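
For orientation, the baseline that the abstract compares against can be sketched in a few lines of Python. This is the textbook DBSCAN procedure, not the modification proposed in the article, whose details are not given on this page. The toy points, the values eps=1.5 and min_pts=3, and all function names are illustrative assumptions. The sketch also computes the adjusted Rand index, one of the evaluation metrics named in the abstract, treating noise as its own label.

```python
from collections import Counter
from math import comb, dist

NOISE = -1  # label given to points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index (Hubert and Arabie, 1985) from pair counts."""
    sum_ij = sum(comb(n, 2) for n in Counter(zip(truth, pred)).values())
    sum_a = sum(comb(n, 2) for n in Counter(truth).values())
    sum_b = sum(comb(n, 2) for n in Counter(pred).values())
    expected = sum_a * sum_b / comb(len(truth), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
truth = [0, 0, 0, 0, 1, 1, 1, 1, NOISE]

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)                              # [0, 0, 0, 0, 1, 1, 1, 1, -1]
print(adjusted_rand_index(truth, labels))  # 1.0
```

The neighbourhood search here is a linear scan, so the whole run is quadratic in the number of points. That is acceptable for a toy example but not for big data, where a spatial index or a partitioned variant would normally be used.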
