
Capturing the dependences among circular variables within supervised classification models is a challenging task. In this paper, we propose four supervised Bayesian classification algorithms in which the predictor variables all follow wrapped Cauchy distributions. The bivariate wrapped Cauchy distribution is the only known bivariate circular distribution whose marginals and conditionals are also wrapped Cauchy distributions, a property that makes it possible to define these models easily. Furthermore, the wrapped Cauchy tree-augmented naive Bayes (TAN) classifier requires the definition of a conditional circular mutual information measure between variables that follow wrapped Cauchy distributions. Synthetic data is used to illustrate, compare and evaluate the classification algorithms (including a comparison with the Gaussian TAN classifier, decision tree, random forest, multinomial logistic regression, support vector machine and a simple neural network), leading to satisfactory predictive results. We also apply the classifiers to a real neuromorphological dataset obtained from juvenile rat somatosensory cortex cells, in which we measure the bifurcation angles of the dendritic basal arbors.


Introduction
Circular data is ubiquitous, arising in many different areas such as biology, geology, medicine, oceanography, geophysics, meteorology, astronomy, ecology, neuroscience and geography. Some examples are the flight directions of homing pigeons [1], the characterization of the phenology of species [2], the formation of feldspar laths in basalt rocks [3], paleomagnetism in red silts and claystones [4,5], the study in political science of gun crimes committed during a specific period of time [6], directional word vectors in text mining [7], wildfire orientation for preventing fire propagation [8], wind and wave direction analysis [9,10], the study and prediction of protein dihedral angle structure [11,12], and the analysis of neuronal basal dendritic bifurcation angles [13,14], among many others. The natural periodicity of circular data sometimes makes traditional statistical methods ineffective, since they ignore this characteristic. For instance, when dealing with circular data, 0° and 360° are considered the same point, whereas for non-circular data they are different points. Thus, circular data analysis is distinct from, and more challenging than, the analysis of non-circular data. It should be noted that circular data has been studied extensively [3,15,16].
Probabilistic graphical models [17] are useful tools for data modeling that connect probability theory with graph theory. They offer many advantages: they are easily interpreted, they handle missing data effectively, and they treat inference and learning tasks jointly. Bayesian networks [18] are among the most commonly used probabilistic graphical models due to their factorization and domain representation properties.
The univariate wrapped Cauchy distribution has density
$f(\theta) = \frac{1}{2\pi}\,\frac{1-\rho^2}{1+\rho^2-2\rho\cos(\theta-\mu)}, \quad 0 \le \theta < 2\pi,$ (1)
where $\mu$ is the mean angle and $\rho \in [0,1)$ the concentration parameter. The density $f$ in Eq. (1) is unimodal and symmetric about $\mu$ unless $\rho = 0$, which yields the circular uniform distribution (i.e., $f(\theta) = 1/2\pi$).
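As a quick illustration, the density in Eq. (1) takes only a few lines of base R. The helper name `dwc` is ours; the circular R package used later in this paper offers dwrappedcauchy as a packaged equivalent.

```r
# Wrapped Cauchy density of Eq. (1); theta in [0, 2*pi), mu in [0, 2*pi),
# 0 <= rho < 1. Vectorized in theta.
dwc <- function(theta, mu, rho) {
  (1 - rho^2) / (2 * pi * (1 + rho^2 - 2 * rho * cos(theta - mu)))
}

# Sanity check: rho = 0 yields the circular uniform density 1/(2*pi).
dwc(1.3, mu = 0, rho = 0)  # 0.1591549
```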
As far as we know, there is no other bivariate circular distribution for which the conditional and marginal distributions belong to the same family. Therefore, the wrapped Cauchy is the natural choice, since the classifier structures that we will develop are at most tree-structured (i.e., only bivariate, marginal and conditional densities are required). Furthermore, we require the definition of a conditional circular mutual information measure between variables that follow wrapped Cauchy distributions.

Parameter estimation
Working with the density given by Eq. (2), numerical methods would have to be used to find maximum likelihood estimates, since no closed-form expression exists for them. Kato and Pewsey [35] showed that the method of moments [37] is an efficient alternative: it is computationally very fast, easy to implement, and provides closed-form formulas for the parameter estimates.
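For the univariate case, the moment estimators admit a minimal sketch: the first trigonometric moment of a wrapped Cauchy variable is $\rho e^{i\mu}$, so the sample mean direction estimates $\mu$ and the mean resultant length estimates $\rho$. The bivariate estimators of [35] are analogous but omitted here; the function name is ours.

```r
# Method-of-moments fit for a univariate wrapped Cauchy sample (radians).
fit_wc <- function(theta) {
  C <- mean(cos(theta)); S <- mean(sin(theta))
  list(mu  = atan2(S, C) %% (2 * pi),   # sample mean direction
       rho = sqrt(C^2 + S^2))           # mean resultant length
}
```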

Wrapped Cauchy classifiers
Let $\Theta = (\Theta_1, \ldots, \Theta_n)$ be a vector of circular predictor random variables or features, and let $C$ be a discrete class variable that takes values (labels) in the set $\Omega(C)$. Given a sample of labeled instances $(\theta^1, c^1), \ldots, (\theta^N, c^N)$, the supervised classification problem consists of building a model capable of assigning a class label to a new instance based on the values of its features.
Bayesian network classifiers [13] have been used to solve classification problems with linear data because they represent the problem domain easily and because of the efficiency of the algorithms associated with Bayesian network techniques. Our aim is to develop the circular-domain counterparts of four well-known Bayesian network classifiers (naive Bayes, selective naive Bayes, semi-naive Bayes and tree-augmented naive Bayes) when the underlying variables follow wrapped Cauchy distributions.

Wrapped Cauchy naive Bayes
The wrapped Cauchy naive Bayes (wCNB) classifier is the simplest of the four Bayesian network classifier models that we present in this paper: $C$ is the parent of all circular features, and these are assumed to be conditionally independent of one another given $C$ (Fig. 1). The wCNB determines the class value $c^*$ for a new instance using the maximum a posteriori decision rule
$c^* = \arg\max_{c \in \Omega(C)} p(C = c \mid \Theta = \theta).$ (6)
Since each predictor variable $\Theta_i$ given $C = c$ follows a wrapped Cauchy distribution with location parameter $\mu_{i,c}$ and concentration parameter $\rho_{i,c}$, we can express Eq. (6) as
$c^* = \arg\max_{c \in \Omega(C)} p(C = c) \prod_{i=1}^{n} \frac{1}{2\pi}\,\frac{1-\rho_{i,c}^2}{a_{i,c} - b_{i,c}},$ (7)
where $a_{i,c} = 1 + \rho_{i,c}^2$ and $b_{i,c} = 2\rho_{i,c}\cos(\theta_i - \mu_{i,c})$.
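A compact sketch of the wCNB in R, reusing the dwc and fit_wc helpers from Section 2's sketches (all names are ours, and the computation is done in log space to avoid underflow in the product of Eq. (7)):

```r
# Train: per class c and feature i, fit (mu, rho) by the method of moments,
# and estimate p(C = c) by the relative class frequency.
wcnb_fit <- function(theta, y) {              # theta: N x n matrix, y: factor
  setNames(lapply(levels(y), function(c) {
    rows <- y == c
    list(prior = mean(rows),
         pars  = lapply(seq_len(ncol(theta)),
                        function(i) fit_wc(theta[rows, i])))
  }), levels(y))
}

# Predict: Eq. (7), i.e., argmax_c log p(c) + sum_i log f(theta_i | c).
wcnb_predict <- function(model, theta_new) {  # theta_new: length-n vector
  scores <- sapply(model, function(m)
    log(m$prior) + sum(mapply(function(p, th) log(dwc(th, p$mu, p$rho)),
                              m$pars, theta_new)))
  names(which.max(scores))
}
```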

Wrapped Cauchy selective naive Bayes
Sometimes several predictor variables do not contribute to classification (i.e., they are redundant), and the naive Bayes classifier is affected by such variables [38]. Determining which of them are unnecessary via feature subset selection (FSS) techniques [39] can increase the accuracy of the classification model significantly [40]. Wrapped Cauchy selective naive Bayes (wCsNB) is a classification model with a structure similar to that of wCNB, but not all the variables are necessarily used by the classifier. FSS techniques were previously employed in a circular classification model with von Mises and von Mises-Fisher distributions in [28], where a filter-wrapper algorithm ranks the variables according to the mutual information between them and the class; using the ranking provided by the filter step, variables are then added to induce new classifiers until the best model is achieved.
We also use a filter-wrapper algorithm here. The filter step is based on the computation of the mutual information (MI) between each circular variable and the class. There is no closed-form expression for the MI between a circular variable and a discrete variable, so, as in [28], we approximate it using Monte Carlo methods, modeling the conditional density functions of $\Theta_i \mid C = c$ as wrapped Cauchy distributions. Hence
$\mathrm{MI}(\Theta_i, C) \approx \sum_{c \in \Omega(C)} \hat{p}(C = c) \frac{1}{T} \sum_{t=1}^{T} \log \frac{\hat{f}(\theta^{(t)} \mid C = c)}{\sum_{c' \in \Omega(C)} \hat{p}(C = c') \hat{f}(\theta^{(t)} \mid C = c')},$ (8)
where $T$ is the number of Monte Carlo samples $\theta^{(t)}$ drawn from the fitted wrapped Cauchy density $\hat{f}(\cdot \mid C = c)$ of the conditional distribution of $\Theta_i$ given $C = c$, and $\hat{p}(C = c)$ is the relative frequency of instances that belong to class $c$ in the training set.
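A sketch of this Monte Carlo estimate in R, under our reading of Eq. (8). The sampler rwc uses the wrapping construction described in Section 4, and the default number of samples is illustrative.

```r
# Wrapped Cauchy sampler: wrap a Cauchy with location mu and scale -log(rho).
rwc <- function(n, mu, rho) {
  rcauchy(n, location = mu, scale = -log(rho)) %% (2 * pi)
}

# MI(Theta_i; C) ~= sum_c p(c) * mean_t log( f(x_t|c) / sum_c' p(c') f(x_t|c') ),
# with the x_t drawn from the fitted class-conditional density f(.|c).
mi_wc <- function(pars, priors, nsim = 5000) {  # pars[[c]] = list(mu, rho)
  mix <- function(x) Reduce(`+`, lapply(seq_along(pars), function(k)
           priors[k] * dwc(x, pars[[k]]$mu, pars[[k]]$rho)))
  sum(sapply(seq_along(pars), function(c) {
    x <- rwc(nsim, pars[[c]]$mu, pars[[c]]$rho)
    priors[c] * mean(log(dwc(x, pars[[c]]$mu, pars[[c]]$rho) / mix(x)))
  }))
}
```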
The predictive variables are then ranked according to their MI values. The wrapper step decides whether or not to include the ranked predictive variables from the filter step: each iteration induces a new classifier by adding the next predictive variable from the list. If no accuracy improvement is achieved by including the next variable, the wrapper step finishes. This model is similar to the wCNB, but includes only the selected wrapped Cauchy variables (the set $S$) (Fig. 2). As for the wCNB, the wCsNB determines the class value $c^*$ for a new instance using the maximum a posteriori decision rule
$c^* = \arg\max_{c \in \Omega(C)} p(C = c \mid \Theta_S = \theta_S).$ (9)
Likewise for Eq. (7), we can express Eq. (9) as
$c^* = \arg\max_{c \in \Omega(C)} p(C = c) \prod_{i \in S} \frac{1}{2\pi}\,\frac{1-\rho_{i,c}^2}{a_{i,c} - b_{i,c}},$
with $a_{i,c} = 1 + \rho_{i,c}^2$ and $b_{i,c} = 2\rho_{i,c}\cos(\theta_i - \mu_{i,c})$.

Wrapped Cauchy semi-naive Bayes
In practice, the assumption of conditional independence between predictive variables given the class variable is often violated. The semi-naive Bayes classification model [41] goes one step further and considers dependencies between predictive variables.
Our proposal for this model, called the wrapped Cauchy semi-naive Bayes (wCsmNB) classifier, takes into account the possible dependence between predictive wrapped Cauchy variables by introducing new features obtained as the Cartesian product of two of the original circular predictor variables, which we model with a bivariate wrapped Cauchy distribution. These new features remain conditionally independent given the class variable.
Given features $X_j$, $j = 1, \ldots, m$, with $X_j$ representing the $j$th feature (an original variable or a newly created feature), we must determine which original variables are candidates for creating new features from the Cartesian product between them. To do so, we develop an adaptation of the forward sequential selection and joining (FSSJ) algorithm [42], described in Algorithm 1. It is important to note that once a new feature has been created by joining two original features, it cannot be used to create further features. However, such a feature can be separated so that one of its two original features is joined with a different original feature that has not yet been added to the model. The algorithm may select a subset of variables that provides the best achievable solution before all of the original variables are included in the model (Fig. 3). Again, as for the previous models presented in this section, the wCsmNB determines the class value $c^*$ for a new instance using the maximum a posteriori decision rule $c^* = \arg\max_{c \in \Omega(C)} p(C = c \mid \Theta = \theta)$.
Algorithm 1 Adaptation of the FSSJ algorithm of [42]
1: Let $S$ be the variable list, initialized as $S = \emptyset$.
2: Given circular wrapped Cauchy predictor variables $\Theta_1, \Theta_2, \ldots, \Theta_n$ in a variable list $L$, move the first variable from $L$ to $S$.
3: Move the next variable from $L$ to $S$, considering:
• joining the variable to another variable currently in $S$; if the latter variable was previously joined to another variable from $S$, remove that one from $S$, add it back to $L$, and consider adding it later;
• adding the variable to the current classifier as conditionally independent of the other variables given $C$.
4: Repeat Step 3 until the best model is achieved.

Wrapped Cauchy tree-augmented naive Bayes
The tree-augmented naive Bayes (TAN) classifier [43] is a well-known Bayesian classifier with a tree-structured network over the predictive features. The wrapped Cauchy tree-augmented naive Bayes (wCTAN) classifier is a variation of the TAN classifier whose novelty is that the predictive features may be wrapped Cauchy circular variables. The wCTAN assumes that the class variable has no parents and that every other variable has at most one other variable as parent apart from $C$ (Fig. 4).
The process for building a wCTAN is summarized in the following three steps:
• Step 1: The structure of the tree over the predictive features is learned using Algorithm 2. We use the conditional circular mutual information, denoted $\mathrm{CMI}(\Theta_i, \Theta_j \mid C)$, which is defined as
$\mathrm{CMI}(\Theta_i, \Theta_j \mid C) = \sum_{c \in \Omega(C)} p(C = c) \int_0^{2\pi}\!\int_0^{2\pi} f(\theta_i, \theta_j \mid c) \log \frac{f(\theta_i, \theta_j \mid c)}{f(\theta_i \mid c) f(\theta_j \mid c)}\, d\theta_i\, d\theta_j,$
where the marginal density functions given the class, $f(\theta_i \mid c)$ and $f(\theta_j \mid c)$, and the joint density function given the class, $f(\theta_i, \theta_j \mid c)$, have been previously estimated from data. This structure learning algorithm (Algorithm 2) is based on score and search, where structure learning is posed as an optimization problem solved with a maximum weighted spanning tree algorithm (with weights given by the CMI), a variant of the Chow-Liu algorithm [44].
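The CMI can be approximated in the same Monte Carlo spirit as the MI of Eq. (8). The sketch below treats the fitted bivariate density $f(\theta_i, \theta_j \mid c)$, its marginals and a sampler for it as black-box arguments, since the closed-form bivariate wrapped Cauchy expressions of [35] are not reproduced in this paper; all argument names are illustrative placeholders.

```r
# Monte Carlo CMI(Theta_i, Theta_j | C): draw pairs from the fitted bivariate
# density per class and average the log ratio against the product of marginals.
cmi_wc <- function(f12, f1, f2, r12, priors, nsim = 5000) {
  sum(sapply(seq_along(priors), function(c) {
    s  <- r12(nsim, c)                   # nsim x 2 matrix: (theta_i, theta_j)
    lr <- log(f12(s[, 1], s[, 2], c) / (f1(s[, 1], c) * f2(s[, 2], c)))
    priors[c] * mean(lr)
  }))
}
```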

Algorithm 2 Adaptation of the Chow-Liu algorithm of [44]
1: Given wrapped Cauchy variables $\Theta_1, \Theta_2, \ldots, \Theta_n$, estimate the bivariate joint density functions $f(\theta_i, \theta_j \mid c)$ for all pairs of variables, and the marginals $f(\theta_i \mid c)$, for each $c \in \Omega(C)$, $i, j = 1, \ldots, n$
2: Using these, compute all $\mathrm{CMI}(\Theta_i, \Theta_j \mid C)$ values (i.e., the $n(n-1)/2$ edge weights) and order them
3: Assign the largest two edges to the undirected tree to be constructed
4: Examine the next-largest edge, and add it to the tree unless it forms a loop, in which case discard it and examine the next-largest edge
5: Repeat Step 4 until $n - 1$ edges have been selected (the spanning undirected tree is then finished)
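Steps 2-5 of Algorithm 2 amount to computing a Kruskal-style maximum weighted spanning tree over the CMI edge weights; a base-R sketch (the function name is ours):

```r
# Maximum weighted spanning tree over n nodes from a symmetric matrix W of
# CMI edge weights: take edges by decreasing weight, skipping loop-formers.
mwst <- function(W) {
  n     <- nrow(W)
  edges <- t(combn(n, 2))                    # the n*(n-1)/2 candidate edges
  edges <- edges[order(W[edges], decreasing = TRUE), , drop = FALSE]
  comp  <- seq_len(n)                        # connected-component labels
  tree  <- matrix(0L, 0, 2)
  for (k in seq_len(nrow(edges))) {
    a <- comp[edges[k, 1]]; b <- comp[edges[k, 2]]
    if (a != b) {                            # no loop is formed: keep edge
      comp[comp == b] <- a
      tree <- rbind(tree, edges[k, ])
      if (nrow(tree) == n - 1) break         # spanning tree finished
    }
  }
  tree                                       # (n-1) x 2 matrix of edges
}
```

A directed tree is then obtained by choosing a root node and orienting every edge away from it, as discussed below.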

For Step 1 in Algorithm 2, the estimates of the bivariate and marginal densities are computed for each $c$ using the methods explained in Section 2. Like the traditional mutual information measure for linear variables, $\mathrm{CMI}(\Theta_i, \Theta_j \mid C)$ denotes the entropy reduction of $\Theta_i$ when the value of $\Theta_j$ is known given $C$, and it represents the weight of the edge that links $\Theta_i$ and $\Theta_j$ (in the example of Fig. 4, the associated tree-structured Bayesian network has $\Theta_4$ as its root node).
Once we have learned the undirected structure, a root node must be selected in order to direct the tree by following the structure learned with Algorithm 2. Given the undirected tree structure with $n$ nodes, there are $n$ possible resulting directed trees, one per choice of root node.
• Step 2: We add a class node $C$ to the network structure and connect it to every other node with an arc from $C$ (Fig. 4).
• Step 3: Finally, we complete the classification model by estimating the parameters of each node given its parent node(s).
Therefore the conditional probability of $C$ given the predictors is
$p(C = c \mid \Theta = \theta) \propto p(C = c)\, f(\theta_r \mid c) \prod_{i \ne r} f(\theta_i \mid \theta_{pa(i)}, c),$
where $\Theta_{pa(i)}$ is the wrapped Cauchy parent of variable $\Theta_i$ and $\Theta_r$ is the root node of the tree. As in the rest of the models presented in this paper, the maximum a posteriori decision rule is used to determine the predicted class: $c^* = \arg\max_{c \in \Omega(C)} p(C = c \mid \Theta = \theta)$.

Experimental results
In this section, we report experiments carried out to show the behavior of each classification model proposed in Section 3. We compare the four circular classifiers with one another and also with some of the best-known classification algorithms for linear data: decision tree (DTree), random forest (Rfor), multinomial logistic regression (MLG), support vector machine (SVM), a simple neural network (Nnet) and the Gaussian tree-augmented naive Bayes classifier (GTAN) for continuous data, whose structure is learned with the algorithm in [45] and whose predictor variables given the class value are assumed to follow Gaussian distributions.
The experiments were run using R software [46]. To generate the artificial datasets to test the models, we used the circular R package for the simulation of circular data, and to implement the structure of the wCTAN classifier, we adapted the bnclassify R package [47]. Simulating data that follow wrapped Cauchy distributions is easy and computationally very fast: given the parameters, the circular R package simulates wrapped Cauchy data by wrapping a simulation of a Cauchy distribution whose location parameter equals the wrapped Cauchy location parameter and whose scale parameter is the negative logarithm of the wrapped Cauchy concentration parameter. If the wrapped Cauchy concentration parameter equals 1, the simulated value is the location parameter, whereas if the concentration parameter equals 0, the simulation is performed from a uniform distribution on $[0, 2\pi)$.
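For reference, the two equivalent ways of drawing such a sample (the parameter values are illustrative, and rwrappedcauchy's argument conventions may vary slightly across versions of the circular package):

```r
library(circular)

mu <- pi / 4; rho <- 0.8; n <- 1000

# Wrapping construction described above: a Cauchy with location mu and
# scale -log(rho), wrapped onto [0, 2*pi).
theta_wrap <- rcauchy(n, location = mu, scale = -log(rho)) %% (2 * pi)

# The same distribution drawn through the circular package.
theta_pkg <- rwrappedcauchy(n, mu = circular(mu), rho = rho)
```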
In order to test the algorithms, we enforced dependence between nodes by giving the dependence parameter absolute values $|\rho|$ in $[0.5, 1)$. The remaining parameters were assigned randomly to each node, with location parameters $-\pi < \mu < \pi$ and concentration parameters $0 < \rho < 1$. For each classifier, we simulated 10 datasets for each combination of sample size (1000, 200 or 50 instances), number of wrapped Cauchy predictor variables (3, 5, 10, 20, 30, 45, 65 or 100) and number of labels of the discrete class variable (3, 6, 10, 15 or 20), i.e., $10 \times 3 \times 8 \times 5 = 1200$ different datasets per type of classifier. A 10-fold cross-validation was used to estimate the classification accuracy. Results for the Bayesian network classifiers are shown in Table 1, and results for the traditional linear classification algorithms in Table 2.
We also applied the non-parametric Friedman test to detect statistically significant differences among our classification models as a whole set [48]. When the null hypothesis was rejected, we proceeded with post-hoc tests. We chose the Nemenyi test [49], as suggested by [50]. The significance level for all tests was 0.05.
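In R, this testing pipeline can be sketched as follows. friedman.test ships with base R; the Nemenyi post-hoc step assumes the PMCMRplus package, and the function name below is taken from that package rather than from this paper, so treat it as an assumption.

```r
# acc: one row per dataset (block), one column per classifier (group).
# Placeholder data; in our experiments each entry is a 10-fold CV accuracy.
acc <- matrix(runif(40, 0.6, 0.9), nrow = 10,
              dimnames = list(NULL, c("wCNB", "wCsNB", "wCsmNB", "wCTAN")))

friedman.test(acc)            # omnibus test on the within-block ranks

# Nemenyi all-pairs post-hoc comparisons (only run after rejection).
# install.packages("PMCMRplus")
library(PMCMRplus)
frdAllPairsNemenyiTest(acc)
```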
Since multiple classifiers are compared, it is useful to represent the results of the post-hoc tests visually. The graph proposed by Demšar [50] is a simple diagram that easily represents these results. The top line is the axis on which we plot the average Friedman test ranks of the classifiers. The lowest (best) ranks are to the right, and we therefore consider the classifiers to the right as better. In the comparison of all classifiers against each other, those that are not significantly different ($p$-value ≥ 0.05 in the Nemenyi post-hoc test) are connected.

Table 3: Mean ± standard deviation accuracy of the wCNB, wCsNB, wCsmNB, wCTAN, GTAN, DTree, Rfor, MLG, SVM and Nnet classifiers for different numbers of variables. Results are averaged from the classification performance in Tables 1 and 2 with 3, 6, 10, 15 and 20 class labels and 1000 instances.

Comparison of classification models
In this section, we compare the performance of the wCNB, wCsNB, wCsmNB and wCTAN models, as well as the DTree, Rfor, MLG, SVM, Nnet and GTAN algorithms, which ignore the circular nature of the data. We analyze the results of the simulation with 1000 instances, and additionally the performance of the Bayesian network classifiers for 50 and 200 instances. Table 3 shows the mean ± standard deviation accuracy of each classifier for different numbers of variables. Each mean ± standard deviation accuracy value was obtained from the results of 50 independent 10-fold cross-validation procedures varying the number of labels of the class variable (3, 6, 10, 15 and 20 different labels) with 1000 instances.
The statistical analysis after Friedman test rejection ($p$-value = 0.000000005) reveals (Fig. 5A) that, varying the number of variables, the best classifiers are wCsmNB, wCTAN, wCNB, Rfor and wCsNB, with no statistically significant differences among them, whereas the DTree, GTAN, Nnet and MLG classifiers are the worst, presenting significant differences with respect to the rest of the classifiers except for the SVM (which does not present statistical differences with the Nnet and MLG), demonstrating that treating circular data as linear-continuous is not effective. The SVM also presents statistical differences when compared with the wCsmNB, wCTAN and wCNB classifiers, which outperform the SVM results. Nevertheless, there were no significant differences between the SVM and either the remaining circular classifier (i.e., wCsNB) or the Rfor classifier.
Performing the same statistical analysis among the Bayesian network classifiers with 50 and 200 instances yields similar results. The Friedman test null hypothesis that there is no significant difference is rejected in both cases ($p$-value = 0.00021 and $p$-value = 0.00004, respectively). The post-hoc analysis displays results quite similar to those for 1000 instances: in both cases, there are no statistically significant differences among the wCsmNB, wCTAN and wCNB classifiers, which are the best. Nevertheless, for 50 and 200 instances, there are no significant differences between the GTAN and wCsNB classifiers. Furthermore, there are significant differences between the wCsNB and the wCsmNB classifiers in the analysis with 50 instances, whereas for 200 instances, statistical differences were found between the wCsNB classifier and both the wCsmNB and the wCTAN.
We also calculated the mean accuracy of each classifier for different numbers of labels in the class variable (see Table 4). Each mean accuracy value was obtained from the results of 60 independent 10-fold cross-validation procedures varying the number of variables used (3, 5, 10, 20, 30 and 45) with 1000 instances. We do not include the results of the experiments with more than 45 variables because of the high mean accuracy values obtained by most of the classifiers in Tables 1 and 2, which would bias the results.
Since the Friedman test null hypothesis that there is no significant difference was rejected ($p$-value = 0.0000011), we performed the corresponding Nemenyi post-hoc analysis. The statistical test results (Fig. 5B) reveal that, varying the number of labels, the best classifiers are the circular Bayesian network classifiers (i.e., wCsmNB, wCTAN, wCNB and wCsNB), with no statistically significant differences between them. Again, DTree, GTAN, Nnet and MLG are the worst, with no significant differences among them. These classifiers show significant differences with wCsmNB, wCTAN, wCNB, Rfor and wCsNB, whereas the SVM only presents significant differences with the wCsmNB, wCTAN, wCNB, DTree and GTAN classifiers.
The analysis of the Bayesian network classifiers with 50 and 200 instances again yielded results quite similar to those obtained with 1000 instances. After Friedman test rejections ($p$-value = 0.0325 for 50 instances, and $p$-value = 0.00066 for 200 instances), the post-hoc tests for 200 instances reveal the same statistically significant differences as for 1000 instances: there are no statistical differences among the wCsmNB, wCTAN and wCNB classifiers, which are the best. For 50 instances, wCsmNB, wCTAN and wCNB are also the best classifiers, together with the wCsNB, with no statistically significant differences among them. As for 1000 instances, GTAN is the worst in the analyses with both 50 and 200 instances, with no significant differences with the wCsNB classifier.

Real data example
We applied our classifiers to a dataset of 3027 combinations of dendritic bifurcation angles coming from the basal arbors of 288 3D pyramidal neurons in layers II, III, IV, Va, Vb and VI (48 neurons per layer) of the 14-day-old (P14) rat hind limb somatosensory (S1HL) neocortex, recently published in [14] (Fig. 6).
We used the Bayesian network classification models presented in Section 3 and wrapped Cauchy distributions to model the bifurcation angles produced by the splitting of the dendritic segments of basal dendritic trees. The dendritic bifurcation angles are an important part of the geometry of pyramidal cell arbors. Since these angles are thought to determine the space to be filled by the dendritic wiring, understanding and modeling them is crucial for advances in neuroscience that seek to replicate brain functioning and structure and to make further progress on how the brain processes information. This is important not only for understanding the brain biologically (i.e., thoughts, emotions, feelings) but also technologically, making essential contributions to new computing paradigms. Moreover, knowledge of the brain is basic for treating brain diseases such as Parkinson's or Alzheimer's disease.
Predicting which layer a neuron belongs to is an important task in understanding any neural circuit, and it represents part of the picture regarding the identification and characterization of all its components. To the best of our knowledge, there is no supervised classification model that predicts the layer using circular predictive variables. Thus, we developed a classification model to predict which layer a given neuron belongs to, i.e., $\Omega(C) = \{II, III, IV, Va, Vb, VI\}$. Following the notation used in [14], $\Theta_1$ corresponds to the first bifurcation angle (Order 1), generated by the first split of the dendritic segments starting from the soma. The second angle, generated by the next consecutive splits, is represented as variable $\Theta_2$ (Order 2), etc. (Fig. 7). Angles of orders higher than six, which were relatively scarce, were not included in the model. For each set of angles of the same order, a wrapped Cauchy distribution was fitted (Table 5). We performed a goodness-of-fit test by transforming the variables into circular uniform variables via $2\pi F(\Theta_1), \ldots, 2\pi F(\Theta_6)$, where $F$ is the fitted cumulative distribution function, and applied Kuiper's test [51] for circular uniformity with a significance level of $\alpha = 0.05$.
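A sketch of this goodness-of-fit procedure in R, reusing the dwc, fit_wc and rwc helpers from the earlier sketches; the simulated sample stands in for one order's angles, and kuiper.test comes from the circular package.

```r
library(circular)

theta <- rwc(500, mu = 2.1, rho = 0.91)  # stand-in for one order's angles

# Fit a wrapped Cauchy by the method of moments.
fit <- fit_wc(theta)

# CDF of Eq. (1) by numerical integration; 2*pi*F(Theta) is circular uniform
# exactly when the fitted wrapped Cauchy is the true distribution.
Fwc <- function(q) sapply(q, function(z)
  integrate(dwc, lower = 0, upper = z, mu = fit$mu, rho = fit$rho)$value)

u <- circular(2 * pi * Fwc(theta))
kuiper.test(u, alpha = 0.05)             # Kuiper's test of circular uniformity
```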
Note in Table 5 that the circular mean tends to decrease as the order increases. A neuroscientific explanation for this behavior is that it is the first bifurcation orders that determine the volume of space to be filled by the dendritic trees [14]. This regulates the dendritic branching development rules that seem to determine the synaptic connectivity of pyramidal neurons. We also observe that the concentration values are high (around 0.91) and quite similar across bifurcation orders. This suggests that the dendritic structure (in terms of bifurcation angles) is determined by the location parameter.
Since not all dendritic arbors present angles of all orders, a single classifier for the whole dataset is not suitable. Therefore, for each classification model proposed in this paper, we created a battery of five classifiers depending on the maximum bifurcation order of the arbor, when this is higher than 1 (Fig. 8). Before predicting class $c^*$, we check the maximum bifurcation order of the instance to be classified. For the wCTAN and GTAN structures (which require a root node in addition to the class node), we select $\Theta_2$ as root node for every classifier of the battery. We performed 10-fold cross-validation procedures to obtain the mean classification accuracy values for each classifier and maximum bifurcation order (Table 6).
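The dispatch logic of the battery can be sketched as follows (the names are illustrative; wcnb_predict is the wCNB sketch from Section 3, and the analogous predictor would be used for each of the other models):

```r
# battery: list of fitted classifiers indexed by maximum bifurcation order
# (positions 2..6); angles: per-order angles, padded with NA beyond the
# arbor's maximum order.
predict_battery <- function(battery, angles) {
  k <- max(which(!is.na(angles)))   # maximum bifurcation order of the arbor
  stopifnot(k >= 2)                 # the battery starts at order 2
  wcnb_predict(battery[[k]], angles[seq_len(k)])
}
```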
We observe in Table 6 that the wCsNB classifier gives the best results for arbors with a maximum branching order of 2 or 3. For arbors with a maximum branching order of 4, 5 or 6, the wCsmNB seems to perform best in terms of classification accuracy. The wCTAN and wCNB classifiers also report acceptable values in comparison with the highest ones for each maximum bifurcation order, although the wCsNB or wCsmNB classification models always perform better on this neuronal dataset. Comparing the accuracy results with random label assignment (i.e., $1/6 \approx 0.167$), we observe that all of these results are above that value. In addition, in every case the GTAN classifier exhibits the lowest accuracy values, below $1/6$ except for maximum order 4. This classifier was especially inaccurate for arbors with a maximum branching order of 6, where the mean accuracy value was 0.047.

Table 6: Mean ± standard deviation of the classification accuracy over layers II, III, IV, Va, Vb and VI for the battery of classifiers of each type, applied to the dataset of dendritic bifurcation angles coming from the basal arbors of 288 3D pyramidal neurons of the P14 rat S1HL neocortex. Bolded results correspond to the best-performing classifiers.

We applied the Friedman non-parametric test to detect statistically significant differences in the results provided by our algorithms. Since the null hypothesis that there is no significant difference was rejected ($p$-value = 0.004), we used the Nemenyi post-hoc test to determine which pairs of algorithms caused the Friedman test rejection. In Fig. 9, the statistically significant differences between our classifiers are represented as a Demšar diagram. We note that there are no statistically significant differences between our classification algorithms except in two cases: between the wCTAN and the wCsNB, and between the GTAN and the wCsmNB.
Therefore, we can conclude that (i) despite the difficulty of identifying the layer to which a case belongs, it seems reasonable to use any of our four proposed circular classifiers for this neuronal dataset, since there are no statistically significant differences between them, and (ii) the GTAN is never recommended.

Conclusions and future work
Introducing the first set of supervised Bayesian classification models capable of dealing with circular wrapped Cauchy predictive variables was the main objective of this paper. We have presented four models and their associated algorithms, designed to perform classification. Using synthetic data, we demonstrated that these models can classify circular datasets accurately. We also provided evidence of the improvement of the circular classifiers over linear classifiers for datasets of a circular nature that follow wrapped Cauchy distributions.
We performed statistical comparisons among the classifiers using synthetic data with 50, 200 and 1000 instances. Based on the results, we found that the wCsmNB, the wCTAN and the wCNB are the best classification models for circular data that follow wrapped Cauchy distributions, with no statistically significant differences among them. The linear classifiers never significantly outperformed any of the wrapped Cauchy classifiers. For each of our new proposals, we evaluated a battery of classifiers on a real-world neuroscience dataset in order to predict the layer to which an instance belongs. The results revealed that all four of our classification models are suitable. Performing the Friedman test and its corresponding Nemenyi post-hoc test after rejection, we found no statistically significant differences between wCNB, wCsNB, wCsmNB and wCTAN on this dataset. The wrapped Cauchy classifiers always outperformed their linear (Gaussian) counterparts.
The models shown in this paper are limited to at most bivariate relationships. In future work, we intend to develop multivariate models in order to extend the Bayesian network classifiers for circular data to other, more sophisticated Bayesian network models (such as k-dependence Bayesian network classifiers) capable of representing and exploiting multivariate relationships between circular variables, a difficult task given the non-closed nature of the circular families known to date.