Abstract
N-ary error correcting output codes (ECOC) decompose a multi-class problem into simpler multi-class problems by splitting the classes into N subsets (meta-classes) to form an ensemble of N-class classifiers and combine them to make predictions. It is one of the most accurate ensemble learning methods for traditional classification tasks. Deep learning has gained increasing attention in recent years due to its successes on various tasks such as image classification and speech recognition. However, little is known about N-ary ECOC with deep neural networks (DNNs) as base learners, probably due to the long computation time. In this paper, we show by experiments that N-ary ECOC with DNNs as base learners generally exhibits superior performance compared with several state-of-the-art ensemble learning methods. Moreover, our work contributes to a more efficient setting of the two crucial hyperparameters of N-ary ECOC: the value of N and the number of base learners to train. We also explore valuable strategies for further improving the accuracy of N-ary ECOC.
References
Ahmed, E., Jones, M., Marks, T.K. (2015). An improved deep learning architecture for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3908–3916.
Ahonen, T., Hadid, A., Pietikainen, M. (2006). Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12), 2037–2041.
Allwein, E.L., Schapire, R.E., Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Buckland, M., & Gey, F. (1994). The relationship between recall and precision. Journal of the American Society for Information Science, 45(1), 12.
Buda, M., Maki, A., Mazurowski, M.A. (2017). A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381.
Chawla, N.V. (2009). Data mining for imbalanced datasets: An overview. In: Data mining and knowledge discovery handbook, pp. 875–886. Springer.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Dietterich, T.G. (2002). Ensemble learning. In: The handbook of brain theory and neural networks, 2nd edn. Cambridge: MIT Press.
Dietterich, T.G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2(1), 263–286.
Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning. London: MIT Press.
Han, J., Pei, J., Kamber, M. (2011). Data mining: Concepts and techniques, 3rd edn. Waltham: Morgan Kaufmann.
Hastie, T., Rosset, S., Zhu, J., Zou, H. (2009). Multi-class AdaBoost. Statistics and Its Interface, 2(3), 349–360.
Ho, T.K. (1995). Random Decision Forests. In: Proceedings of the 3rd international conference on document analysis and recognition, pp. 278–282.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia, pp. 675–678.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, Canada.
Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In: Proceedings of the advances in neural information processing systems, pp. 1097–1105.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278–2324.
Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D. (2015). Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314.
Lin, M., Chen, Q., Yan, S. (2014). Network in network. In: Proceedings of the international conference on learning representations.
Masko, D., & Hensman, P. (2015). The impact of imbalanced training data for convolutional neural networks. Bachelor's thesis, School of Computer Science and Communication, KTH Royal Institute of Technology, Sweden.
Müller, H., Michoux, N., Bandon, D., Geissbuhler, A. (2004). A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics, 73(1), 1–23.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y. (2011). Reading digits in natural images with unsupervised feature learning. In: Proceedings of the NIPS workshop on deep learning and unsupervised feature learning.
Nilsback, M.E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In: Proceedings of the 6th Indian conference on computer vision, graphics & image processing, pp. 722–729.
Opitz, M., Possegger, H., Bischof, H. (2016). Efficient model averaging for deep neural networks. In: Proceedings of the 13th Asian conference on computer vision (ACCV’16).
Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Schapire, R.E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2008). Resampling or reweighting: A comparison of boosting implementations. In: Proceedings of the IEEE international conference on tools with artificial intelligence, pp. 445–451.
Skalak, D.B. (1996). The sources of increased accuracy for two proposed boosting algorithms. In: Proceedings of the american association for artificial intelligence workshop on integrating multiple learned models.
Van Rijsbergen, C. (1979). Information Retrieval. London: Butterworth.
Vriesmann, L.M., de Souza Britto Jr, A., De Oliveira, L.E.S., Sabourin, R., Ko, A.H.R. (2012). Improving a dynamic ensemble selection method based on oracle information. International Journal of Innovative Computing and Applications, 4(3-4), 184–200.
Wei, X.S., Gao, B.B., Wu, J. (2015). Deep spatial pyramid ensemble for cultural event recognition. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 38–44.
Xie, J., Xu, B., Chuang, Z. (2013). Horizontal and vertical ensemble with deep representation for classification. In: Proceedings of the ICML workshop on representation learning.
Yosinski, J., Clune, J., Bengio, Y., Lipson, H. (2014). How transferable are features in deep neural networks?. In: Proceedings of the advances in neural information processing systems, pp. 3320–3328.
Yule, G.U. (1900). On the association of attributes in statistics: With illustrations from the material of the childhood society. Philosophical Transactions of the Royal Society of London Series A, 194, 257–319.
Zhou, J.T., Tsang, I.W., Ho, S.S., Muller, K.R. (2016). N-ary error correcting coding scheme. arXiv:1603.05850.
Zhou, Z.H. (2012). Ensemble methods: Foundations and algorithms. Boca Raton: Chapman & Hall/CRC.
Appendix
A. Supporting Evidence for the Effect of the Third Factor in Hypothesis 1
To verify the effect of merging classes on the feature learning of the DNN base learners of N-ary ECOC, we train AlexNet (Krizhevsky et al. 2012) on the ILSVRC2012 dataset with two different encoding strategies. The first is ECOC encoding, in which the degree of merging classes is large. We follow Yosinski et al. (2014) to split the 1,000 classes into two meta-classes, each containing approximately half (551 classes versus 449 classes) of the 1,000 original classes. The second is RI encoding (keeping the original 1,000 classes), in which the degree of merging classes is small.
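A meta-class split of this kind can be sketched as follows. This is a minimal illustration with names of our own choosing; it assumes a uniform random assignment, whereas the split used above follows Yosinski et al. (2014):

```python
import random

def nary_ecoc_column(n_classes, n_meta, rng=None):
    """Randomly split n_classes original classes into n_meta non-empty
    meta-classes. Returns labels where labels[c] is the meta-class
    assigned to original class c."""
    rng = rng or random.Random()
    # Seed each meta-class with one class so none is empty,
    # then assign the remaining classes at random.
    labels = list(range(n_meta))
    labels += [rng.randrange(n_meta) for _ in range(n_classes - n_meta)]
    rng.shuffle(labels)
    return labels

# ECOC encoding in the experiment above: N = 2 meta-classes over 1,000 classes
column = nary_ecoc_column(1000, 2, random.Random(0))
```

With N = 2 the two meta-classes come out roughly balanced, mirroring the 551-versus-449 split used in the experiment.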
Figure 5 visualizes the learned features of the first convolutional layer at different iterations (5,000, 50,000 and 450,000) for the two encoding strategies. There are six plots: Fig. 5a, b, c and Fig. 5d, e, f correspond to the ECOC and RI encodings, respectively. Each plot contains 96 square blocks, each of which represents a learned feature map (also called a filter) of size 11 (width) × 11 (height) × 3 (channels).
We see from Fig. 5 that a DNN with two meta-classes (Fig. 5a, b, c) tends to learn less representative features compared with a DNN with 1,000 classes (Fig. 5d, e and f). For instance, if we compare Fig. 5a with Fig. 5d, we see that most of the learned features of the DNN with ECOC encoding at the 5,000th iteration look like noisy features (e.g., the lower half of the rows of the learned features in Fig. 5a). However, the DNN with RI encoding at the 5,000th iteration has learned many representative features (e.g., the edges and the color patterns in Fig. 5d).
If we compare Fig. 5b with Fig. 5e, i.e., at the 50,000th iteration, we see that the DNN with ECOC encoding (Fig. 5b) has learned more representative features (compared with itself at the 5,000th iteration in Fig. 5a), while the DNN with RI encoding (Fig. 5e) keeps refining its learned features.
If we compare Fig. 5c with Fig. 5f, i.e., at the 450,000th iteration (the end of the training), we see that the learned features of the DNN with ECOC encoding (Fig. 5c) are not as representative as those of the DNN with RI encoding (Fig. 5f) in terms of the complexity of their patterns. For instance, most of the color patches of the DNN with ECOC encoding represent simple gradient maps along an edge direction, while many of the color patches of the DNN with RI encoding represent more complex gradient maps, such as a color patch surrounded by different color patches (e.g., the features in the 6th row and the 5th column, the 6th row and the 7th column, and the last row and the 5th column of Fig. 5f), which are useful in detecting a sharp change in color in almost all directions.
To verify the effect of merging classes on the generalization ability of the DNN base learners of N-ary ECOC, we show in Fig. 6 the average base model accuracies (BAs) of the tested DNN models (as listed in Table 2 in Section 3.1) with respect to N, as the BA indicates how well the learned DNNs generalize to the test data. We see from Fig. 6 that, for the DNNs that follow Pattern 1 (MNIST-LeNet, SVHN, CIFAR10-QUICK and CIFAR10-FULL), the BA generally decreases as N increases. Moreover, the base learners with N < N_{C} have higher BAs than the BA of RI (N = N_{C}). Two possible reasons behind these observations are that, for a dataset with a small number of classes (i.e., N_{C} < 25), the effect of merging classes is small, and the BA is practically determined by the value of N (the smaller N is, the easier the classification task).
On the other hand, for the DNNs that follow Pattern 2 (CIFAR100-NIN, CIFAR100-QUICK and FLOWER102-AlexNet), the BA first decreases by a large amount (more than 10%), then slowly increases as N continues to increase. Moreover, most of the base learners with N < N_{C} have BAs that are much lower than the BA of RI. Intuitively, an N-class (N < 100) classification problem is easier than a 100-class classification problem; however, the results show the opposite. One possible reason behind these observations is that, for a dataset with a large number of classes (e.g., N_{C} ≥ 100), when N is small (N ≪ N_{C}), N_{C} − N is large (i.e., the degree of merging classes is large), so the BA is generally low compared with the BA of RI. Note that the DNN with N = 2 (a binary classification problem) corresponds to the easiest classification task among all possible values of N (2 ≤ N ≤ N_{C}); a random guess is 50% correct if the class distribution is balanced, and thus it can still have a high BA. As N begins to increase, the accuracy of a random guess drops sharply (for instance, to 33% for N = 3 and 25% for N = 4), even if the class distribution is balanced, so there is a drop in the BA. When N increases further, the effect of merging classes weakens, so the BA begins to increase and finally saturates.
The above observations suggest that both the feature learning and the generalization ability of the DNN base learners of N-ary ECOC are largely influenced by the effect of merging classes.
B. Our Extended Disagreement Measure
As defined in Section 4.3, m is the number of examples in the test dataset, C_{i} and C_{j} are two base learners of N-ary ECOC, and y_{k,i} and y_{k,j} are the predictions of C_{i} and C_{j}, respectively, on the k-th test example. We propose to map the predicted meta-class label y_{k,i} (each label takes a value from 0, ..., N − 1) back to the original class labels (each label takes a value from 0, ..., N_{C} − 1), and then construct a one-by-N_{C} bit vector \(y_{k,i}^{\prime }\) indicating whether each original class label is included in the predicted meta-class y_{k,i}. Thus the mapping from y_{k,i} to \(y_{k,i}^{\prime }\) is \(\lbrace 0,\ ...,\ N - 1 \rbrace \rightarrow \lbrace 0,\ 1 \rbrace ^{N_{C}}\).
Take Table 1(b) in Section 2 as an example: suppose the base learners trained with f_{0} and f_{2} both predict the meta-class 0 for the k-th test example, i.e., y_{k,0} = 0 and y_{k,2} = 0. Since the meta-class 0 of f_{0} consists of the classes c_{0} and c_{8} while the meta-class 0 of f_{2} consists of the classes c_{2} and c_{6}, the corresponding bit vectors are \(y_{k,0}^{\prime }\) = [1 0 0 0 0 0 0 0 1 0] and \(y_{k,2}^{\prime }\) = [0 0 1 0 0 0 1 0 0 0]. That is, the c_{0} and c_{8} positions of \(y_{k,0}^{\prime }\) and the c_{2} and c_{6} positions of \(y_{k,2}^{\prime }\) are 1, while all other positions are 0.
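The bit-vector construction above can be sketched as follows (the function name and the dictionary representation of the meta-class partitions are our own):

```python
def meta_to_bitvector(meta_label, partition, n_classes):
    """Map a predicted meta-class label to a one-by-N_C bit vector whose
    c-th entry is 1 iff original class c belongs to that meta-class."""
    v = [0] * n_classes
    for c in partition[meta_label]:
        v[c] = 1
    return v

# Meta-class 0 of f_0 = {c_0, c_8}; meta-class 0 of f_2 = {c_2, c_6}
y_k0 = meta_to_bitvector(0, {0: [0, 8]}, 10)  # [1,0,0,0,0,0,0,0,1,0]
y_k2 = meta_to_bitvector(0, {0: [2, 6]}, 10)  # [0,0,1,0,0,0,1,0,0,0]
```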
Then we take the Hamming distance between \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\), i.e., \((y_{k,i}^{\prime } - y_{k,j}^{\prime } ) (y_{k,i}^{\prime } - y_{k,j}^{\prime } )^{T}\), as the distance between y_{k,i} and y_{k,j}. For instance, in the above example, the distance between the predictions of the meta-class 0 by the DNNs trained with f_{0} and f_{2} is 4, as four bit positions of \(y_{k,0}^{\prime }\) and \(y_{k,2}^{\prime }\) are different.
The advantage of our proposal is that it fits our intuition: the more the classes included in the predicted meta-classes overlap, the smaller the distance. In the general case where the predicted meta-classes of two base learners share common classes, the distance measured by our proposal is smaller than in the above example, in which no common class exists. For instance, if the predicted meta-classes are 2 and 3 for f_{0} and f_{2}, respectively (the meta-class 2 of f_{0} and the meta-class 3 of f_{2} share the common class c_{1}), then the distance is 2, as only two bit positions of \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are different. In the extreme case that the classes included in the predicted meta-classes are identical, the distance is 0; for instance, both the meta-class 0 of f_{0} and the meta-class 4 of f_{2} consist of the classes c_{0} and c_{8}.
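For 0/1 bit vectors, the product \((u - v)(u - v)^{T}\) is exactly the number of differing bit positions, which a short sketch makes concrete (the function name is ours):

```python
def bit_distance(u, v):
    """(u - v)(u - v)^T for two 0/1 row vectors, i.e., the Hamming
    distance: the number of bit positions where u and v differ."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# No shared class (meta-class 0 of f_0 vs meta-class 0 of f_2): distance 4
d_disjoint = bit_distance([1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
                          [0, 0, 1, 0, 0, 0, 1, 0, 0, 0])
# Identical class sets: distance 0
d_identical = bit_distance([1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
                           [1, 0, 0, 0, 0, 0, 0, 0, 1, 0])
```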
Finally, we propose to estimate the disagreement measure between two base learners of N-ary ECOC by

\(dis_{i,j} = \frac{1}{m} {\sum }_{k=1}^{m} (y_{k,i}^{\prime } - y_{k,j}^{\prime }) (y_{k,i}^{\prime } - y_{k,j}^{\prime })^{T},\)
where \(y_{k,i}^{\prime }\) and \(y_{k,j}^{\prime }\) are the mapped back predictions of the base learners C_{i} and C_{j}, respectively, on the k th test example.
By averaging the disagreement measures over all pairs of base learners based on Skalak (1996), we obtain

\(dis = \frac{2}{T(T-1)} {\sum }_{i=1}^{T-1} {\sum }_{j=i+1}^{T} dis_{i,j},\)

where T is the number of base learners.
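The full computation can be sketched as follows, assuming the distance is averaged over the m test examples for each pair and then over all pairs of base learners (the function names and this normalization are our assumptions):

```python
from itertools import combinations

def bit_distance(u, v):
    """Hamming distance between two 0/1 bit vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pair_disagreement(preds_i, preds_j):
    """Average extended Hamming distance between two base learners'
    mapped-back predictions over the m test examples."""
    m = len(preds_i)
    return sum(bit_distance(u, v) for u, v in zip(preds_i, preds_j)) / m

def ensemble_disagreement(all_preds):
    """Average pair_disagreement over all pairs of base learners,
    following the pairwise averaging of Skalak (1996)."""
    pairs = list(combinations(range(len(all_preds)), 2))
    return sum(pair_disagreement(all_preds[i], all_preds[j])
               for i, j in pairs) / len(pairs)

# Two base learners, one test example: reproduces the distance of 4
# from the f_0 / f_2 example above
preds = [[[1, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
         [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]]]
dis = ensemble_disagreement(preds)  # 4.0
```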
Cite this article
Zhao, K., Matsukawa, T. & Suzuki, E. Experimental validation for N-ary error correcting output codes for ensemble learning of deep neural networks. J Intell Inf Syst 52, 367–392 (2019). https://doi.org/10.1007/s10844-018-0516-5
Keywords
- N-ary error correcting output codes
- Ensemble learning
- Deep neural networks
- Image classification