Published October 29, 2024 | Version v1
Conference proceeding Open

A Semi-Supervised Approach to Anomaly Detection for Tax Compliance

Description

A semi-supervised autoencoder model is applied to detect anomalies in tax form data consisting of hundreds of sparse, irregularly distributed features. Historical tax data generally contains a small number of labeled examples (i.e., returns with known compliance outcomes) and a much larger number of unlabeled examples (i.e., returns with unknown compliance status). Compliance outcomes can be measured by various success metrics; these metrics tend to have right-skewed distributions, with higher values indicating outcomes that are more beneficial to the IRS. Building on recent literature on semi-supervised anomaly detection and on our previous work applying unsupervised autoencoders to tax data, we introduce new techniques for incorporating labeled data into model training and for using both binary (i.e., compliant vs. non-compliant) and continuous (i.e., success metric) outcomes to update the model. These techniques provide flexibility in how much influence the labeled data has on the training process, from no influence at all to near-total influence, while also addressing the sparsity of the data. We also apply novel ensemble methods to improve anomaly detection. Testing across different datasets and population segments shows favorable performance for the semi-supervised autoencoder compared to existing operational models.
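As a rough illustration of how labeled data might be blended into autoencoder training with adjustable influence, the sketch below combines an unsupervised reconstruction term with a label-driven term in the spirit of Deep SAD (Ruff et al.). It is not the paper's method. The function name `semi_supervised_loss`, the blending weight `lam`, the label coding (0, +1, -1), and the `log1p` weighting of the success metric are all assumptions made for illustration.

```python
import numpy as np

def semi_supervised_loss(recon_err, labels, metric=None, lam=0.5, eps=1e-6):
    """Blend an unsupervised and a label-driven loss term (illustrative only).

    recon_err : per-return autoencoder reconstruction error, shape (n,)
    labels    : 0 = unlabeled, +1 = labeled compliant, -1 = labeled non-compliant
    metric    : optional non-negative success metric per return, shape (n,)
    lam       : hypothetical influence of labeled data, 0 (none) to just under 1
    """
    recon_err = np.asarray(recon_err, dtype=float)
    labels = np.asarray(labels)
    unlabeled = labels == 0
    labeled = ~unlabeled

    # Unsupervised term: mean reconstruction error over unlabeled returns.
    unsup = recon_err[unlabeled].mean() if unlabeled.any() else 0.0
    if not labeled.any() or lam == 0.0:
        return (1.0 - lam) * unsup

    # Deep SAD-style term: err**(+1) for compliant, err**(-1) for non-compliant,
    # so a well-reconstructed non-compliant return is penalized.
    y = labels[labeled].astype(float)
    sup_terms = (recon_err[labeled] + eps) ** y

    # Assumed weighting: non-compliant returns with larger success metrics count
    # more; log1p dampens the long right tail of the metric.
    w = np.ones(int(labeled.sum()))
    if metric is not None:
        m = np.asarray(metric, dtype=float)[labeled]
        noncompliant = y == -1
        w[noncompliant] += np.log1p(m[noncompliant])

    sup = np.average(sup_terms, weights=w)
    return (1.0 - lam) * unsup + lam * sup
```

Setting `lam=0` recovers a purely unsupervised objective, while values close to 1 let the labeled returns dominate, loosely mirroring the range of influence described above. In practice the same expression would be written in a differentiable framework so it can drive gradient updates.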

Files

A Semi-Supervised Approach to Anomaly Detection for Tax Compliance_vF.pdf

Additional details

References

  • A. S. Parker, D. Gewurz, and W. J. J. Roberts. "Quality and Validity Testing of Sparse Form Data using Gaussian Mixture Models," JSM Proceedings, Social Statistics Section, 2018.
  • A. S. Parker, D. Gewurz, and W. J. J. Roberts. "Recommender Algorithms for Form Anomaly Detection," JSM Proceedings, Government Statistics Section, 2020.
  • C. Acton, L. Corman, J. Bono, D. Gewurz, C. Walsh, and E. Schulz. "Anomaly Detection on Sparse Data with Autoencoders," JSM Proceedings, 2023, DOI: 10.5281/zenodo.10001050.
  • N. Merrill and A. Eskandarian. "Modified Autoencoder Training and Scoring for Robust Unsupervised Anomaly Detection in Deep Learning," IEEE Access, vol. 8, pp. 101824-101833, 2020, DOI: 10.1109/ACCESS.2020.2997327.
  • L. Ruff, R. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft. "Deep Semi-Supervised Anomaly Detection," International Conference on Learning Representations, 2020, URL: https://doi.org/10.48550/arXiv.1906.02694.