A generative semi-supervised classifier for datasets with unknown classes

Classification has been tackled by a large number of algorithms, predominantly in a supervised learning setting. Surprisingly little research has been devoted to the problem setting where a dataset is only partially labeled, possibly including instances of entirely unlabeled classes. Algorithmic solutions suited for such problems are especially important in practical scenarios, where the labeling of data is prohibitively expensive, the understanding of the data is lacking, or only a subset of the classes is known. We present a generative method to address the problem of semi-supervised classification with unknown classes, following a Bayesian perspective. In detail, we apply a two-step procedure based on Bayesian classifiers and exploit information from a small set of labeled data in combination with a larger set of unlabeled training data, allowing the labeled dataset to miss samples from some of the present classes. This represents a common practical application setup, where the labeled training set is not exhaustive. We show in a series of experiments that our approach outperforms state-of-the-art methods tackling similar semi-supervised learning problems. Since our approach yields a generative model, which aids the understanding of the data, it is particularly suited for practical applications.


INTRODUCTION
The task of classification has been tackled by a large number of supervised machine learning approaches, yielding very accurate and impressive results on different types of datasets. These methods are successful if the modelled scenario fulfills an assumption implicitly imposed on the training data: the availability of a sufficiently large amount of labeled samples for training, covering the whole variety of all possible classes. While elementary approaches only handle binary problems (i.e., identify one class against another), more recent techniques using deep learning in particular have beaten state-of-the-art classification benchmarks on thousands of different categories, but at the price of increasing the need for labeled training data even further. Contemporary big data applications deliver this huge amount of data, but often lack consistency and correct labeling in advance.
In this work, we propose a semi-supervised classifier that is able to deal with unknown classes and that is generative. Semi-supervised classification ranges between classification in a supervised setting and semi-supervised clustering: we assume that we obtain labeled as well as unlabeled data for training and aim to identify specific classes. In contrast to supervised learning, we have unlabeled data available; in contrast to semi-supervised clustering, where the supervised information is usually introduced by partition-based or must-link/cannot-link constraints, we do not allow reordering, merging or splitting of the predefined classes. As a further difference to clustering, we evaluate our results with the F1-measure instead of clustering quality metrics.
A conventional classifier would assume that each class to be identified is represented by a sufficiently large number of labeled instances in the training set. In our case, we assume that some classes might not be present in the labeled part of the training dataset at all, but only in the unlabeled part - we refer to them as unknown classes. This restriction is especially challenging if the amount of labeled training data is small compared to the amount of unlabeled training data. Similar constraints appear in one-class classification or PU learning (learning from positive and unlabeled examples), see e.g. [8]. Both of these fields assume that classes exist which are not covered by the labeled part of the training set, but restrict attention to one class of interest (which is usually represented in the labeled training data by a sufficiently large number of instances). In contrast, in this work we consider multiple classes that are represented in both the labeled and unlabeled training data.
As a third aspect, our method is generative and therefore allows a deeper understanding of the data by a statistical model, which presents an advantage over discriminative approaches.
We present a two-step approach, which comprises Bayesian classifiers as elementary components. In the first step, the S-EM algorithm, introduced in [9] for PU learning, is used to fit a classifier to the known classes and an artificial additional class accounting for unknown classes (classes not present in the labeled training data). Therefore, the first step identifies a subset of unlabeled data points as "likely unknowns", i.e., as being drawn from unknown classes. In the second step, a Gaussian mixture model is fit to these likely unknowns, combining the likelihoods of the known and unknown classes to obtain an improved Bayesian classifier.
As many similar (but not identical) problems exist in the literature (see Section 2), we will demonstrate the abilities of our method in comparison to these in two distinct settings: the multiclass classification problem for which the proposed method is developed, and the PU learning problem, which is a special case of our multiclass classification problem. We compare the results of the applied methods on different open datasets from the UCI repository, including the two famous datasets MNIST [6] and LETTER [4], as well as on a wafer-test dataset from the semiconductor industry [10]. While these datasets might be considered "easy" classification problems in the machine learning community, the extension to semi-supervised classification with unknown classes massively complicates the problem, since only a subset of the classes is considered for training. Our method achieves a competitive F1-measure when compared to state-of-the-art algorithms from related fields (see Section 5).

RELATED WORK
Similar problem formulations to our setting were addressed in the machine learning literature. Related fields include exploratory learning, semi-supervised clustering, PU learning and open set recognition.
In [3], a semi-supervised multiclass problem called "exploratory learning" was attacked with the well-known EM algorithm to detect elements from unknown classes. The resulting elements are used to extend the set of classes. As elementary classifiers, methods such as Naive Bayes, seeded K-Means or seeded von Mises-Fisher are suggested. A similar problem is investigated in [2], where a Support Vector Machine (SVM) is generalized to the so-called LACU-SVM, which is able to detect new classes by incorporating knowledge from unlabeled data. The approaches of exploratory learning and LACU-SVM additionally consider incremental learning of previously unseen classes, which is not intended in our problem.
[5] presents a statistical physics approach to semi-supervised learning, replacing classifiers minimizing some cost by a Boltzmann distribution (parameterized by a temperature) over all such classifiers. Their approach copes with unknown classes and degenerates to clustering if no labeled data is available. Similarly, the semi-supervised clustering method from [16], which combines model-based cross-entropy clustering with partition-level side information, has the capability to model unknown classes. A generative approach utilizing a constraint-based version of Gaussian mixture models (GMMs) for semi-supervised clustering was presented by [15]. However, semi-supervised clustering does not yield an assignment between the class labels in the training data and the produced clusters. Moreover, the number of produced clusters may be different from the number of classes in the labeled training data.
PU learning and one-class classification solve classification problems, where instances of one positive class of interest must be distinguished from an arbitrary number of other classes, based on labeled data from the positive class only. While one-class classification is restricted to a supervised setup, PU learning additionally considers unlabeled examples from both positive and unknown classes, i.e., a semi-supervised setup. For this purpose, [9] proposed the generative I-EM and the S-EM algorithms for detecting unknown samples in the unlabeled data and used them to train a Naive Bayes classifier.
Step 1 of our work is based on their S-EM algorithm. Discriminative approaches towards PU learning exist as well, e.g. [7], where the Rocchio method is combined with support vector machines. Further, a typical representative of one-class classification is the one-class SVM by [14], which aims to identify the closest possible class boundaries of one single class. However, in contrast to our problem, they do not use unlabeled training data, and their algorithm operates in a discriminative way.
Scheirer et al. formulated the problem of open set recognition, i.e., a classification problem with a rejection option for elements from a new class. They developed the 1-vs-set machine [12], as well as the Weibull-calibrated SVM (W-SVM) [13], for this purpose. The major difference between the presented approach and open set recognition is that in the latter, the test set contains instances from classes that were not present in the training set at all. Instead, we assume that all classes are present in the unlabeled training set, but unknown classes do not occur in the labeled training set. In general, open set recognition is considered supervised (typically containing more labeled training data than in our case) and attacked by discriminative methods, while our focus is on a semi-supervised, generative model.
Furthermore, [18] demonstrated how new classes observed in a data stream (i.e., one-pass learning) can be integrated into a classifier. In contrast to our proposed approach, these methods are discriminative. Another difference is that online learning systems see every instance of the training data only once, which is not required in our case.

A BAYESIAN PERSPECTIVE ON SEMI-SUPERVISED CLASSIFICATION
Formally, our investigated problem is given as follows: we assume that a set L ⊂ F of labeled elements and a larger set U ⊂ F of unlabeled elements are provided, where F ⊂ R^d is a continuous feature space. While the elements of L are assigned a class label out of C := {c_1, . . . , c_k}, we assume that U contains elements from the same (known) classes {c_1, . . . , c_k}, as well as from unknown classes {c_{k+1}, . . . , c_{k+u}}, i.e., from C_+ := {c_1, . . . , c_{k+u}}. In terms of notation, we aggregate all unknown classes into one residual class, denoted by 0, and therefore aim to classify new elements by one of the classes {c_1, . . . , c_k} ∪ {0} =: C_0 (classification with unknown classes). In a mathematical sense, our intention is to find a measurable classification function f: F → C_0, which, to each element in F, either assigns one of the known classes or responds with class 0. The scheme of this situation is depicted in Fig. 1. Given the problem setting stated above, we provide a generative method to solve it. As a starting point, we look at the classification problem from the viewpoint of Bayesian decision theory: we assume that a decision corresponds to assigning a new sample x to a known class c* ∈ C or to the unknown class c* = 0. Hence, given a bounded loss function λ: C_0 × C_+ → [0, M], M < ∞, the optimal Bayesian posterior decision is obtained by the class which minimizes the expected value of the loss function,

c*(x) = argmin_{c ∈ C_0} Σ_{c̃ ∈ C_+} λ(c, c̃) p(c̃ | x).

This approach is usually referred to as Bayesian posterior expected loss. We will next adapt the 0-1-loss to our purposes, i.e., we propose the following loss function: the loss λ(c, c̃) is defined as 0 if either the decision c and the true class c̃ represent the same known class, or c̃ is any unknown class and c is 0. Otherwise, the loss is set to 1.
The likelihood and the prior for c ∈ C will now be selected in the same way as for the Bayes classifier. In detail, we assume that the likelihood p(x | c) corresponds to a (potentially multivariate) Gaussian distribution, and the prior is calculated as the relative frequency of each class within the labeled dataset, p(c) = |L_c| / |L|, where L_c is the subset of labeled elements from class c. Alternatively, a uniform prior is also applicable. However, as we need to account for the "unknown" classes as well, the prior distribution is modified by a probability of unknown classes π_0. The resulting components of the model (restricted to c ∈ C) are thus the Gaussian likelihoods p(x | c) and the rescaled priors (1 − π_0) p(c). The challenge of the problem is now to select an appropriate prior and likelihood for the "unknown" classes, which should be mapped to the class 0. We propose a model in which the likelihood is the same for all unknown classes, i.e., p(x | c_{k+1}) = · · · = p(x | c_{k+u}) =: p(x | 0). Specifically, we model the data for unknown classes using a GMM. It follows that the probability of unknown classes evaluates to p(0) := π_0 = Σ_{ℓ=1}^{u} p(c_{k+ℓ}), i.e., the prior probability for the unknown classes. With p(0) and p(x | 0) defined accordingly, (4) is solved by what resembles the well-known maximum a-posteriori (MAP) estimator,

c*(x) = argmax_{c ∈ C_0} p(c) p(x | c).
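As an illustration, the decision rule above can be sketched in a few lines of Python. This is our sketch, not the authors' implementation: it assumes diagonal covariances for brevity, rescales the known-class priors by (1 − π_0), and represents the residual class 0 by an equally weighted Gaussian mixture.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def map_classify(x, known, pi0, unknown_components):
    """MAP rule with a residual class 0.

    known: list of (prior, mean, var) per known class 1..k; the priors sum
           to 1 and are rescaled here by (1 - pi0).
    unknown_components: list of (mean, var) of the equally weighted GMM
           representing the unknown classes.
    Returns a class index in {0, 1, ..., k}.
    """
    scores = [np.log(1 - pi0) + np.log(p) + log_gauss(x, m, v)
              for p, m, v in known]
    # log p(x | 0): equally weighted mixture over the unknown components
    mix = [log_gauss(x, m, v) for m, v in unknown_components]
    log_p_x_0 = np.logaddexp.reduce(mix) - np.log(len(unknown_components))
    scores.append(np.log(pi0) + log_p_x_0)   # class 0 is the last entry
    best = int(np.argmax(scores))
    return 0 if best == len(known) else best + 1
```

A sample close to a known class mean is assigned that class; a sample close to an unknown-class component is rejected to class 0.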

A SEMI-SUPERVISED APPROACH TO CLASSIFICATION WITH UNKNOWN CLASSES
We aim at defining a prior p(0), as well as a likelihood p(x | 0). At this stage, we depart from a purely supervised setting and take advantage of the information provided by the unlabeled data.
Our approach consists of two major steps (see Fig. 2): (1) detect "likely unknown" elements in the unlabeled data - this will provide us with the prior probability; (2) fit an adequate mixture model, which is able to divide these "likely unknowns" into subclasses - this will serve as the likelihood model.

Step 1: Detecting "Likely Unknowns"
The S-EM algorithm by Liu et al. [9] detects a subset U_N ⊂ U which is likely to belong to a new class. While originally developed in the context of PU learning, we generalize it to distinguishing multiple classes c_1, . . . , c_k against an unknown residual class 0. As a first step, a random subset S ⊂ L, comprising a predefined percentage s ∈ [0, 1] of the labeled training data, is selected as so-called spies. While the elements of L \ S retain their labels from {c_1, . . . , c_k}, the labels of U ∪ S are set to 0. Applying the I-EM algorithm (a version of EM, introduced in [9]) delivers a Bayes classifier for {c_1, . . . , c_k, 0}, which produces a likelihood function p(x | c). A datapoint x is considered a "likely unknown" if its likelihood under the known classes falls below a certain threshold θ. We set the threshold to θ = min_{x ∈ S} p(x | c(x)), where c(x) denotes the correct label of the spy element x. In the noise-affected version, Liu et al. suggest to use the empirical 0.1-quantile instead of the minimum. Now, instead of selecting a classifier from the iterative I-EM procedure, our interest lies in the set U_N of unknown samples identified by the procedure. We assume that these are representatives of one or more unknown classes, which are not yet captured by the labeled training data. In case U_N is empty, we conclude that no unknown classes are detectable and the method reduces to a standard Bayes classifier over the known classes.
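The spy mechanism of step 1 can be sketched as follows. The helper below is a simplified stand-in, not the paper's code: it assumes the I-EM likelihood p(x | c) is already available as a caller-supplied function `lik`, and flags an unlabeled point as a likely unknown if even its best known class falls below the spy threshold θ.

```python
import numpy as np

def likely_unknowns(lik, labeled, labels, unlabeled, spy_frac=0.1, rng=None):
    """Spy-based detection of likely unknowns (sketch of S-EM step 1).

    lik(x, c) stands in for the class-conditional likelihood p(x | c)
    produced by the I-EM classifier. Returns the unlabeled points whose
    likelihood under every known class falls below the spy threshold.
    """
    rng = np.random.default_rng(rng)
    n_spies = max(1, int(spy_frac * len(labeled)))
    spy_idx = rng.choice(len(labeled), size=n_spies, replace=False)
    # threshold: minimum likelihood of any spy under its true class
    theta = min(lik(labeled[i], labels[i]) for i in spy_idx)
    classes = sorted(set(labels))
    return [x for x in unlabeled
            if max(lik(x, c) for c in classes) < theta]
```

In a toy 1-D setting with a single known class centered at 0, a point far from the labeled data is flagged while a nearby point is not.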

Step 2: Fitting a Mixture Model
We next build a generative model for data originating from the unknown classes +1 , . . . , + . An intuitive choice for this purpose is a GMM, which has proven to be useful in many applications.
The GMM is commonly defined as a linear combination of ℓ Gaussian distributions, i.e.,

p(x | 0) = Σ_{i=1}^{ℓ} α_i N(x; μ_i, Σ_i),

where Σ_{i=1}^{ℓ} α_i = 1. As we are interested in reducing the number of model parameters due to the potentially few likely unknown elements, we set α_i = 1/ℓ for all i = 1, . . . , ℓ. It is worth mentioning that ℓ is not necessarily equal to u, the number of unknown classes. Indeed, ℓ may be strictly smaller (modeling multiple classes with a single Gaussian component) or strictly larger (modeling a single class with multiple Gaussian components) than u.
For a selected number of components ℓ, we initialize the partition of U_N via k-means and proceed with an EM algorithm, which provides us with the final Gaussian components. However, during EM, we also introduce the labeled data (and their classes) as fixed elements. Hence, elements that were selected as likely unknowns during S-EM although they belong to a known class can be reassigned to their original classes without introducing a "new" Gaussian component.
Our first interest concerns the number ℓ of mixture components: selecting this number via a maximum likelihood approach is not possible, as the joint likelihood of the model increases with the number of components ℓ. Hence, it is necessary to apply an appropriate criterion, which provides a trade-off between the likelihood and the model complexity (number of estimated parameters). In line with statistical theory, we select a tailored version of the Bayesian information criterion (BIC) to obtain the number of mixture components in the GMM, defined as

BIC = p ln(n) − 2 ln(Λ),

where n is the number of training samples, p denotes the number of estimated parameters (i.e., the number of entries in the mean vectors and covariance matrices {μ_i, Σ_i}) and Λ denotes the likelihood of the whole classifier, regarded as a single GMM (each Gaussian component, regardless of being associated with a known or unknown class, is weighted equally). At this point, note that the likelihood p(x | c) remains unchanged for all known classes c ∈ C. The BIC is therefore always calculated over all known and unknown components, but minimized over models of different complexity in order to retrieve an optimal choice. The BIC is known to overestimate the number of parameters of a model [1]. Observations from experiments, where the number of components selected via BIC is compared to the true number of unknown classes, underline this behavior. From a theoretical perspective, this overestimation is unproblematic in our setting, since it is preferable to model the unknown classes with too many mixture components rather than with too few. By selecting the model with the lowest BIC value, we obtain the model parameters μ_i and Σ_i, i = 1, . . . , ℓ, and thereby create a generative model for the previously unknown classes. Using this model, we can determine the likelihood p(x | 0) in (10) for a new element x and finally obtain the posterior probabilities p(c | x) for all c ∈ C_0 in (4).
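A minimal one-dimensional sketch of this component selection may look as follows (our illustration, not the paper's implementation): equal weights α_i = 1/ℓ as above, a quantile-based initialization in place of k-means, and BIC = p ln(n) − 2 ln(Λ) minimized over candidate values of ℓ.

```python
import numpy as np

def fit_gmm_1d(x, l, iters=100, eps=1e-3):
    """EM for a 1-D Gaussian mixture with fixed equal weights 1/l.
    Returns (means, variances, log-likelihood); eps is a variance floor."""
    mu = np.quantile(x, (np.arange(l) + 0.5) / l)  # quantile init (paper: k-means)
    var = np.full(l, np.var(x) + eps)

    def logp(mu, var):
        # log N(x; mu_i, var_i) for every sample (rows) and component (cols)
        return (-0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)

    for _ in range(iters):
        lp = logp(mu, var)
        r = np.exp(lp - lp.max(axis=1, keepdims=True))  # E-step
        r /= r.sum(axis=1, keepdims=True)
        nk = r.sum(axis=0)                              # M-step
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + eps
    loglik = np.log(np.exp(logp(mu, var)).mean(axis=1)).sum()
    return mu, var, loglik

def select_l_by_bic(x, l_max=5):
    """Choose the number of components minimizing BIC = p*ln(n) - 2*ln(L)."""
    best_l, best_bic = 1, np.inf
    for l in range(1, l_max + 1):
        _, _, loglik = fit_gmm_1d(x, l)
        p = 2 * l                        # one mean and one variance per component
        bic = p * np.log(len(x)) - 2 * loglik
        if bic < best_bic:
            best_l, best_bic = l, bic
    return best_l
```

On well-separated data the penalty term keeps the selected ℓ close to the true number of clusters, while the mild overestimation discussed above appears once the clusters overlap.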
As a result, we can calculate the minimum posterior expected loss (i.e., the best Bayesian decision). At this stage, we have constructed a full classification model that solves a multiclass classification problem while accounting for unknown classes. More formally, the proposed algorithm is described as Alg. 1. For practical reasons, the number of components ℓ is bounded by a maximum integer ℓ_max. With regard to Alg. 1, the main novelty of our work is the extension of S-EM to the multiclass case, so that training with more than one known class is possible. Thereby, we combine two sophisticated approaches, S-EM and a GMM, into a powerful method for performing multiclass classification under the presence of previously unknown classes. As a third aspect, we select the number of mixture components by a version of BIC tailored to the case of both known and unknown classes, where only the number of unknown-class components is variable in the model.

EXPERIMENTS AND RESULTS
In this section we analyze the sensitivity of our proposed method to its parameters and compare it with state-of-the-art methods. In our experiments, we consider several datasets, which are characterized in Table 1. Specifically, we perform experiments on UCI datasets, as well as a wafer test dataset from semiconductor industry [10].
In order to provide evidence for the performance of our proposed method, we perform two different experiments. The first experiment demonstrates the ability to discriminate between "known" and "unknown" classes and compares our method to state-of-the-art methods on different UCI datasets, including ecoli, glass, iris, segmentation, user, vertebral and wine, as well as the two famous datasets MNIST [6] and LETTER [4] and a wafer-test dataset from the semiconductor industry [10], represented by tailored image features [11]. In the second experiment, we compare our method to other multiclass algorithms that are able to handle unknown classes in a semi-supervised way. However, not all datasets are suitable for this experiment, as some contain too few samples for splitting them into labeled, unlabeled and test datasets.
We will judge the quality of our results by the F1-measure, defined as the harmonic mean between precision and recall of the regarded classes, and calculate the macro-averaged F1-measure (see, e.g., [17]). If c(x) represents the real class of x and c*(x) the assigned class, we define the following sets for a known class c:

TP_c = {x : c(x) = c and c*(x) = c},
FP_c = {x : c(x) ≠ c and c*(x) = c},
FN_c = {x : c(x) = c and c*(x) ≠ c}.

According to these definitions, the F1-score for a class c, c = 1, . . . , k, is defined as

F1_c = 2 |TP_c| / (2 |TP_c| + |FP_c| + |FN_c|),

leading to the macro-averaged F1-score

F1 = (1/k) Σ_{c=1}^{k} F1_c.
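The macro-averaged F1-measure amounts to the following computation (a plain-Python sketch of ours; `known_classes` holds the labels c = 1, . . . , k, so the residual class 0 influences the score only through false positives and false negatives):

```python
def macro_f1(true, pred, known_classes):
    """Macro-averaged F1 over the known classes only; predictions of the
    residual class 0 count as false negatives of the true known class."""
    scores = []
    for c in known_classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, predicting one class-1 sample as unknown lowers only the recall (and hence F1) of class 1.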
We compare our semi-supervised classifier for unknown classes (SSC-UC) to the following methods: the generative Naive Bayes version of exploratory learning by [3] (EL; implemented by the authors), the S-EM algorithm for PU learning by [9] (S-EM; implemented by the authors), the GMM-based semi-supervised clustering method with must-link and cannot-link constraints by [15] (cGMM; implementation from http://www.scharp.org/thertz/code.html) and the semi-supervised cross-entropy clustering method with information bottleneck regularization by [16] (CEC-IB; implementation from https://github.com/mareksmieja/CEC-IB). For the clustering algorithms cGMM and CEC-IB, the assignment of the class labels to the resulting clusters was done in favor of the F1-scores, as they provide the clusters in an arbitrary order. In the setting where multiple classes are known, we apply a committee of several S-EM algorithms, where each member is trained with a different known class as the positive class. If all committee members decide that a sample belongs to an unknown class, the sample is assigned to class 0; otherwise, the sample is assigned to the positive class with the highest score among all committee members. By this adaptation, we directly generalize the S-EM algorithm, representing a state-of-the-art PU learning method.
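The committee decision rule for this generalized S-EM baseline can be written compactly. The sketch below is ours, assuming each member c reports a positive-class score and a binary "unknown" vote:

```python
def committee_predict(scores, unknown_flags):
    """Committee of S-EM members, one per known class.

    scores[c]: positive-class score of the member trained with class c as
    the positive class; unknown_flags[c]: True if that member assigns the
    sample to the unknown class. Returns 0 only if all members vote
    'unknown'; otherwise the class with the highest score wins."""
    if all(unknown_flags[c] for c in scores):
        return 0
    return max(scores, key=scores.get)
```

Note that a single member claiming the sample suffices to prevent the assignment to class 0.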

Parameter Sensitivity
For our method, the employed S-EM algorithm proved to be highly dependent on a good sampling of the spy elements. This aspect is even more crucial if we investigate small labeled training datasets. In case of a bad selection of the threshold θ (caused by an unfortunate sampling of the spies), two failure modes can be observed: if the threshold is selected too small, the returned set U_N may be empty even though unknown classes are present in the unlabeled data; if it is selected too large, elements from known classes are wrongly flagged as unknown. To resolve this issue and increase the robustness of the S-EM algorithm, we run the procedure m times for independently sampled spy sets of equal size and obtain m sets of likely unknown elements U_N^(1), . . . , U_N^(m). We continue with the set with the median number of elements, i.e., U_N = U_N^(⌈m/2⌉), where the U_N^(i) are ordered by increasing cardinality. To investigate the sensitivity of SSC-UC w.r.t. the parameters, we observe the F1-scores when varying the fraction s of spies in step 1 in a range from 0.01 to 0.2 and the number m of repetitions of this step in a range from 5 to 50. The results of this analysis are shown in Table 2. It can be seen that neither s nor m has a significant influence on the overall performance of the SSC-UC classifier for most datasets. The fact that the parameter s has only minor influence on the result is in accordance with the observations on document classification described in [9], where the authors claimed that, if within a reasonable range, the percentage of spies does not matter to the S-EM step. In the remainder of this work, we thus set s = 0.1. During the experiments, we observed that for an increasing m, the results became more stable (less affected by random initialization). However, as the runtime also increases linearly with this parameter, the setting m = 10 was deployed as a default for further experiments.
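The median-cardinality selection used to robustify step 1 is a one-liner; in the sketch below (our notation), `candidate_sets` holds the m sets of likely unknowns obtained from the independent spy samples:

```python
def median_run(candidate_sets):
    """Keep the candidate set with the median number of elements, i.e.,
    the entry at position ceil(m/2) (1-indexed) after ordering the m sets
    by increasing cardinality."""
    ordered = sorted(candidate_sets, key=len)
    return ordered[(len(ordered) - 1) // 2]
```

This guards against both failure modes at once: an empty set and an overly large set are each outvoted by the remaining runs.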
Among the datasets, the best results were achieved for the dna, ecoli and iris datasets, while the method performed worst on the LETTER dataset. This can be explained by the total number of classes comprised in these datasets: as LETTER contains many unknown classes, modelling these via a GMM is difficult due to inhomogeneity. In contrast, iris and dna are datasets with only 3 classes, i.e., only one class is unknown during training.

Multi-Class Classification with Unknown Classes
In order to demonstrate the abilities of our method for multiclass classification with unknown classes, we evaluate the given datasets in comparison with the state-of-the-art methods described in the related work section. We select the following experimental setup: we define an experimental dataset size n ∈ N for each dataset, which is equal to the full size of the dataset except for some datasets (e.g. LETTER), where the selected labeled training classes are not sufficiently populated. Then, we set the number of labeled samples to approximately 0.1 · n, the number of unlabeled training data to 0.6 · n and the number of test data to 0.3 · n. We train the classifier using a balanced, randomized sample of representatives from the known classes {c_1, . . . , c_k} of each dataset as labeled data and samples of all present classes {c_1, . . . , c_{k+u}} as unlabeled data. We fix the number of known classes k to 2, to half the total number of classes (i.e., k ≈ u), and to the total number of classes minus one (i.e., u = 1), respectively. The unlabeled training dataset and test dataset are sampled out of all available classes. Although regarded as baseline datasets for multiclass classification, these datasets are still challenging in the investigated problem setup.
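The sampling scheme above can be sketched as follows (our helper, not the authors' code; it only reproduces the approximate 10 / 60 / 30 proportions, not the balanced per-class sampling of the labeled part):

```python
import numpy as np

def split_indices(n, frac=(0.1, 0.6, 0.3), rng=None):
    """Random labeled / unlabeled / test split of n sample indices."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(n)
    n_lab, n_unl = int(frac[0] * n), int(frac[1] * n)
    return idx[:n_lab], idx[n_lab:n_lab + n_unl], idx[n_lab + n_unl:]
```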
The results of this experiment are shown in Table 3. The cGMM did not converge for all datasets. Furthermore, as datasets containing merely 3 or 4 classes are present, some settings are redundant - hence, only the setup with k = 2 is investigated for datasets with 3 classes, while results for k = 2 and u = 1 are presented for datasets with 4 classes. Note that the distinct settings cannot be ranked w.r.t. their complexities due to two counteracting aspects: on the one hand, increasing the number of known classes leads to more options in the classification step; while for k = 2, a decision among the classes 0, 1 and 2 must be taken, the number of options is much larger if, e.g., the predicted class is selected from {0, . . . , 10} for k = 10. On the other hand, an increasing k leads to a lower number of unknown classes, i.e., the set of unknown elements is more homogeneous and easier to separate from the known classes.
In the multiclass setting, our method outperformed the clustering approaches, as classification algorithms are able to make effective use of labeled data, in contrast to (semi-supervised) clustering algorithms, especially in high-dimensional feature spaces, where clustering relies more on the distance measure than classification methods do. In many cases, the number of clusters was selected very low, which caused bad results. Regarding the EL method, the algorithm assigns an unknown class to many samples belonging to known classes. This leads to a high number of "false negatives" and a low recall. However, the performance of EL tends to improve if the number of known classes is increased. With regard to the S-EM method, we observe the contrary behavior: while for k = 2 this method delivers very accurate results, the F1-scores decrease for larger k. SSC-UC, however, shows a more stable behavior with regard to the number of known classes.
Concerning the number ℓ of mixture components in step 2 of the SSC-UC method, we observed that, as expected, BIC tends to overestimate ℓ. While only one class is unknown in the case of u = 1, the parameter ℓ is estimated to be approximately 3 or 4 for most datasets. This effect guarantees that the inhomogeneous set of unknown classes is modelled in a flexible way, improving results.
To investigate how our method is affected by the size of the labeled training dataset, we additionally carried out experiments with an increasing proportion of labeled vs. unlabeled training data. As this comparison is not possible with datasets containing sparsely populated classes, we conduct this experiment on a few larger datasets, i.e., LETTER, pendigits, satimage and wafer-test. In detail, the proportions of sampling training and test data from the datasets are varied: while the standard setting comprised a p / (0.7 − p) / 0.3 split (labeled / unlabeled / test) w.r.t. the total experimental dataset size with p = 0.1, we investigate the cases of p ∈ {0.23, 0.35} in addition. The results are depicted in Fig. 3.
When increasing the proportion of labeled data, not all methods are able to profit from this additional information. In particular, SSC-UC delivers significantly better results for the pendigits and satimage datasets, while the results for LETTER and wafer-test remain in a similar range. Unexpectedly, CEC-IB and EL do not benefit from a larger amount of labeled training data; on the contrary, their performance drops. S-EM continuously delivers results of similar quality in all setups.

PU Classification
We finally assess our proposed classification method in a problem setting from PU learning. To this end, we assume that instances from only one known (positive) class are provided in the labeled training dataset, while instances from the known and multiple unknown classes (positives and negatives) are available in the unlabeled dataset. The goal is to train a classifier that is able to distinguish between the positives and negatives in the test set. Note that this setting is a special case of the setting presented in the previous experiments. By default, the positive class is randomly chosen from all classes represented in the dataset. Again, we reduce the number of labeled samples to approximately 10% of the dataset size, and sample 60% as unlabeled training data and 30% for testing. The results are evaluated by the F1-measure, considering the prediction labels "positive" (known) and "negative" (unknown). Table 4 demonstrates the performance of our proposed algorithm (SSC-UC) compared to that of existing methods. SSC-UC performed equally well or better than the other algorithms in most cases, while the performance of the competing methods varies strongly across the datasets. The clustering algorithms in particular have difficulties with larger or higher-dimensional datasets (such as MNIST) - as before, no convergence could be achieved for the cGMM method in some cases. While S-EM performed well in this experiment, EL did not assign any unknown class labels for most datasets, which explains its low F1-scores. From the perspective of computational complexity and runtime, all compared methods were in an equal range. In detail, cGMM had the shortest and CEC-IB the longest processing time; SSC-UC was in a medium range, depending on the number of repetitions m. The main factor influencing the runtime of the experiments was the number of features contained in the datasets - hence, MNIST and usps had the longest processing times.
For all other datasets, results could be obtained within seconds.

DISCUSSION
The experiments in the previous section illustrated that our approach to semi-supervised classification with unknown classes successfully competes with the state-of-the-art in semi-supervised clustering, exploratory learning and PU learning. For a small ratio between the numbers of known and unknown classes, a committee of S-EM algorithms achieves top performance; if this ratio is large, then exploratory learning performs well. In both extremes, SSC-UC performs on par with these schemes, indicating the wide applicability of our method. In addition to this, SSC-UC achieves competitive performance on PU learning problems, where it is only outperformed by S-EM, a method that is tailored towards this very problem. On top of these selling points, SSC-UC requires the setting of only a few parameters, and its performance is not very sensitive even to those.
Compared to related methods that are able to tackle the problem of semi-supervised classification with unknown classes, we extend the current state-of-the-art by augmenting the Bayes classifier with an option to assign a sample to an unknown class. The generative nature of the method provides not only an appropriate prediction for the test dataset, but also a basic understanding of the distribution of the classes (unknown, as well as previously known classes). Furthermore, due to its low number of parameters, our model is capable of handling low amounts of labeled data, which is a clear benefit in many practical applications. Possible extensions of our model are soft classification and the adaptation to mixtures of other probability distributions or even distinct (generative) classifiers. In contrast to clustering methods, we conserve the distinction between known classes, but also - in contrast to supervised classifiers - detect unknown elements. Our model shares the limitations that are known for all generative models: high-dimensional data leads to more complex computations and results in a degraded runtime and accuracy compared to discriminative models. In addition, single outliers (unlabeled elements that neither belong to a known class nor to any cluster) are a challenging problem, which is not yet completely solved by our method. Finally, we implicitly assume that the unlabeled data are a representative sample of the unknown classes, which - on the one hand - requires a large unlabeled dataset and - on the other hand - might not hold in non-stationary scenarios, where new classes emerge over time. Open set classifiers such as those proposed by Scheirer et al. might be a solution, but to the best of our knowledge all of these classifiers are discriminative. Future work will therefore be dedicated to the generalization of our generative model to the open set problem.

CONCLUSION
We presented a generative approach to solve the multiclass semi-supervised classification problem with unknown classes. The classifier is based on the well-established Bayes classifier, but can be adapted to other generative frameworks (such as GMMs). The algorithm for semi-supervised classification consists of two steps: first, the unlabeled dataset is screened for so-called likely unknown elements, i.e., samples that are drawn from the unknown classes with high probability. Then, these likely unknown elements are used to set up a new classifier, which fits a mixture of Gaussians to describe the unknown elements.
In the provided experiments, our approach performs on par with state-of-the-art methods from different branches of semi-supervised learning. Further, we presented a number of notable properties that underline the quality and promising features of our method. The ability to handle many constraints in the problem setup, especially the detection of new classes, paves the way for practical applicability in many domains, such as decision support systems. A future goal is to extend our method towards incremental and life-long learning, i.e., to continuously improve the performance of the method after its initial training.