Published June 1, 2017 | Version 10007534
Journal article | Open Access

FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Description

Instance selection (IS) techniques reduce the size of a data set to improve
the performance of data mining methods. Recently, to process very large data
sets, several proposed methods divide the training set into disjoint subsets
and apply IS algorithms independently to each subset. In this paper, we analyze
the limitations of these methods and give our viewpoint on how to divide and
conquer in the IS procedure. Then, based on the fast condensed nearest neighbor
(FCNN) rule, we propose an instance selection method for large data sets built
on the MapReduce framework. Besides preserving prediction accuracy and reduction
rate, it has two desirable properties: first, it reduces the workload on the
aggregation node; second, and most importantly, it produces the same result as
the sequential version, which other parallel methods cannot achieve. We evaluate
the performance of FCNN-MR on one small data set and two large data sets. The
experimental results show that it is effective and practical.
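The divide-and-conquer idea the abstract describes can be sketched in a few lines. The snippet below is an illustration only, not the authors' FCNN-MR algorithm: it uses Hart's classic condensed nearest neighbor (CNN) rule as a stand-in for the faster FCNN variant, and a naive two-phase scheme (condense each disjoint partition independently, then condense the union of the partial results) to mimic the map/reduce split. Note that, unlike FCNN-MR, this naive scheme does not guarantee the same output as running the rule sequentially on the whole set.

```python
import numpy as np

def cnn_select(X, y):
    """Hart-style condensed nearest neighbor rule: keep only instances
    that a 1-NN classifier over the kept set would otherwise misclassify."""
    keep = [0]  # seed the condensed set with the first instance
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # 1-NN prediction for X[i] using the current condensed set
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][np.argmin(d)] != y[i]:
                keep.append(i)      # misclassified -> must be kept
                changed = True
    return sorted(keep)

def parallel_cnn(X, y, n_parts=2):
    """'Map' phase: condense each disjoint partition independently.
    'Reduce' phase: condense the union of the partial results."""
    parts = np.array_split(np.arange(len(X)), n_parts)
    selected = np.concatenate([p[cnn_select(X[p], y[p])] for p in parts])
    final = selected[cnn_select(X[selected], y[selected])]
    return np.sort(final)
```

On a toy two-cluster data set, both routines keep one representative per class; on harder data, the partitioned version can diverge from the sequential one, which is exactly the limitation the paper addresses.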

Files

10007534.pdf (254.2 kB, md5:0eeff9daab64c7c39239f1ea4bd47d88)
