Comparative Analysis of Supervised Learning and Unsupervised Anomaly Detection in Security Log Analysis for Post-Incident Digital Forensic Investigation

Iwan Indramana; Asto Purwanto

doi:10.59261/jbt.v7i2.605

Authors

Iwan Indramana, STMIK Indonesia Mandiri
Asto Purwanto, STMIK Indonesia Mandiri

Keywords:

anomaly detection, digital forensics, logistic regression, machine learning, security log analysis

Abstract

Background: Post-incident digital forensic investigation of large-scale security logs generated by enterprise firewalls and servers poses several challenges. As log data grows in volume and complexity, manual analysis is no longer feasible. Methodologically, little empirical work has directly compared supervised and unsupervised paradigms within a post-incident forensic framework on operational-scale, real-world logs.

Objective: This paper compares the classification performance of supervised and unsupervised machine learning methods for forensic analysis of security logs, and examines how each approach prioritizes security anomalies.

Methods: The analysis used a dataset of more than 359,000 firewall and server log entries collected over a 30-day period. Labeled events were used to train a supervised model, Logistic Regression. Isolation Forest, the unsupervised anomaly detection method that performed best among the models trained on normal baseline logs, was used for unsupervised detection. Evaluation metrics included accuracy, precision, recall, ROC-AUC, and ranking-based anomaly assessment.
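The evaluation metrics named above can be illustrated on a toy example. The following is a minimal sketch and not the authors' pipeline: the labels, scores, the 0.5 decision threshold, and all function names are hypothetical, chosen only to show how accuracy, precision, recall, and ROC-AUC are derived from labeled events.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = suspicious)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random suspicious event
    scores higher than a random normal event (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 suspicious events, 5 normal events.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, precision, and recall depend on the chosen threshold, whereas ROC-AUC depends only on how the scores order the events. This distinction also matters for the ranking-based assessment.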

Results: Logistic Regression achieved an accuracy of 0.99, a ROC-AUC of 0.9998, and precision and recall of 1.00 and 0.99 for suspicious events, demonstrating near-perfect discriminability of labeled behavioral features within a 24-hour period. Isolation Forest achieved 86% overall accuracy, 93% precision, and 59% recall, and showed an excellent forensic triage property: 197 of the top 200 anomaly-ranked entries (98.5%) were confirmed suspicious events. Sensitivity analysis of the contamination parameter showed that ranking precision at the top 200 remained stable across the 0.05 to 0.30 range (Fig. 7A, 7B), demonstrating the robustness of rank-based prioritization despite variability in global recall across contamination values.
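One way to see why top-k ranking precision can stay stable while global recall varies is sketched below. It assumes that the contamination parameter acts only as a cutoff on a fixed anomaly-score ranking, as it does in scikit-learn's IsolationForest, where contamination sets the decision threshold and leaves the scores unchanged. Whether the study's pipeline behaves this way is not stated in the abstract, and the scores, labels, and function names are hypothetical.

```python
def rank_by_score(scores):
    """Indices ordered from most to least anomalous (higher = more anomalous)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(scores, labels, k):
    """Fraction of the k top-ranked events that are truly suspicious."""
    top = rank_by_score(scores)[:k]
    return sum(labels[i] for i in top) / k


def recall_at_contamination(scores, labels, contamination):
    """Recall when the top `contamination` fraction of events is flagged.
    Simplifying assumption: contamination only moves the cutoff."""
    n_flagged = max(1, round(contamination * len(scores)))
    flagged = rank_by_score(scores)[:n_flagged]
    return sum(labels[i] for i in flagged) / sum(labels)


# Hypothetical toy data: 20 events with strictly decreasing anomaly scores.
scores = [1.0 - 0.05 * i for i in range(20)]
labels = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Precision at the top 4 does not take contamination as an input at all.
p_at_4 = precision_at_k(scores, labels, 4)

# Global recall changes with the contamination cutoff.
recalls = {c: recall_at_contamination(scores, labels, c) for c in (0.05, 0.10, 0.30)}
```

In this sketch, precision at the top of the ranking is a function of the score ordering alone, while recall grows as the contamination cutoff admits more events. An analyst who works down a ranked list is therefore less exposed to the choice of contamination value than one who relies on the binary flag.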

Conclusion: Our results demonstrate high predictive performance for supervised classification and efficient forensic triage, with low false-positive rates, for unsupervised anomaly detection on both time-series logs and free-text security event logs.
