MARLINE: Multi-Source Mapping Transfer Learning for Non-Stationary Environments

Concept drift is a major problem in online learning due to its impact on the predictive performance of data stream mining systems. Recent studies have started exploring data streams from different sources as a strategy to tackle concept drift in a given target domain. These approaches assume that at least one of the source models represents a concept similar to the target concept, which may not hold in many real-world scenarios. In this paper, we propose a novel approach called Multi-source mApping tRansfer LearnIng for Non-stationary Environments (MARLINE). MARLINE can benefit from knowledge from multiple data sources in non-stationary environments even when source and target concepts do not match. This is achieved by projecting the target concept to the space of each source concept, enabling multiple source sub-classifiers to contribute towards the prediction of the target concept as part of an ensemble. Experiments on several synthetic and real-world datasets show that MARLINE was more accurate than several state-of-the-art data stream learning approaches.


I. INTRODUCTION
The need for efficient streaming data analytics has rapidly grown in recent years [1]. A data stream can be defined as a sequence of observations that continuously arrive over time, occurring in many applications, such as credit card approval, fraud detection, and software defect prediction [2]. A key challenge in data stream learning is that the joint probability distribution of an application may change over time, i.e., there may be concept drift [3]. Learning from data streams that may suffer from concept drift is frequently referred to as learning in non-stationary environments [4], [5], where a given joint probability distribution can be treated as a concept [2], [6]. Data stream learning algorithms must be able to adapt and swiftly react to concept drifts to avoid poor predictions [1].
Using information learned from different data sources is a feasible way to speed up the learning of a new target concept and improve the accuracy of the predictions. This can be considered as transfer learning [7]. However, transfer learning has usually been used offline, requiring the entire training set to be available before training commences. While a few recent studies applied transfer learning in non-stationary data streaming environments [5], most of the approaches presume a similarity between the source and target concepts [2], [8], [9]. This assumption often fails to hold in practice. For example, the concepts underlying the prediction of bike rental demand in Washington D.C. and London are different due to different weather patterns and consumer behaviours in these two cities. However, data streams for these two locations are available [10], [11] and could potentially be used to improve the predictive performance of data stream learning approaches. Another example is software effort estimation, where data streams describing software projects developed by different companies may be used to improve software effort estimation in a given company, despite having different underlying distributions [12].
Therefore, this paper aims to answer the following research questions: Can multi-source transfer learning help us to improve the predictive performance in non-stationary environments where source and target data streams do not share the same concept? If so, how?
To answer these questions, we hereby propose a novel approach, namely Multi-source mApping tRansfer LearnIng for Non-stationary Environments (MARLINE). MARLINE is the first approach designed to benefit from multiple source data streams even when sources and target may have considerably different concepts. It achieves that by projecting the target concept to the space of each source concept through a novel mapping mechanism, enabling multiple source sub-classifiers to contribute towards the prediction of the target concept as part of an ensemble. Our experiments show that MARLINE can achieve better predictive performance than existing approaches over time, and can quickly obtain good performance at the early learning stage or after a concept drift occurs, even when only a few training examples are available.

II. RELATED WORK
Several approaches have been proposed for data stream learning in non-stationary environments [4], [6]. Among these, approaches able to learn example-by-example (online) rather than chunk-by-chunk [13] are particularly relevant to our paper. They have the potential to adapt to concept drifts faster than chunk-based approaches. Such approaches can be further divided into active and passive approaches [4], [13]. Active approaches trigger adaptation mechanisms when a concept drift detection method raises an alarm [4], [14]. Examples of drift detection methods include the Drift Detection Method (DDM) [15] and the Drift Detection Methods based on Hoeffding's inequality (HDDM A and HDDM W) [16]. Passive approaches continuously adapt to concept drift without relying on explicit concept drift detection [4], [13]. A popular passive approach is Dynamic Weighted Majority (DWM) [17]. In spite of their learning capacity, none of these approaches uses transfer learning or operates in multi-source scenarios.
Very few approaches have used transfer learning in non-stationary environments [5]. Online inductive parameter transfer learning approaches include Dynamic Cross-company Mapped Model Learning (Dycom) [12], Diversity for Dealing with Drifts (DDD) [18] and the Online Window Adjustment Algorithm (OWA) [19]. Dycom treats source data offline, whereas DDD and OWA do not use source data, transferring knowledge only from the immediate previous target concept to the current target concept. A new chunk-based inductive parameter transfer approach called Diversity and Transfer-based Ensemble Learning (DTEL) was proposed in [20]. Similar to DDD, DTEL does not consider different sources, but uses historical target concepts; moreover, it is a chunk-based approach. Recently, two transductive transfer learning approaches called MultiStream Classification using Relative Density Ratio (MSCRDR) [8] and Cross-domain Multistream Classification (COMC) [9] have been proposed to handle multiple sources, under the assumption that the target stream is unlabelled and the sources are labelled. However, these two algorithms require both the source and the target to share the same task, even after a concept drift happens. Multi-source transfer learning for non-stationary environments (Melanie) [2] is another online transfer learning method that can learn from multiple sources. However, Melanie only benefits from source concepts that are similar to the target.
Overall, there is no existing approach to perform transfer learning using multiple non-stationary data sources that may have different concepts from those of the target.

III. PROBLEM STATEMENT
Let {x_i, y_i} denote an example received at a given point in time at a data stream i with domain D_i = {X_i, p_i(x)} and task T_i = {Y_i, p_i(y|x)}, where X_i is the input space, p_i(x) is the marginal probability distribution, Y_i is the output space, and p_i(y|x) is the posterior probability distribution.
All source and target streams may suffer concept drift. We will enumerate the concepts p_i^j seen in a data stream i using a sequential identifier j. Whenever a concept drift occurs, we increment j. We use J_i to denote the number of concepts observed so far in data stream i. Note that a given training example, domain and task are all associated with a given concept. Therefore, they are actually all indexed by j, as in {x_i^j, y_i^j}, D_i^j and T_i^j, but we will leave this index implicit. At the beginning of the data stream or after a concept drift, due to the lack of target data representing the new concept, the performance of the predictive models is usually poor. Therefore, the aim of multi-source transfer is to improve predictive performance in non-stationary environments and speed up learning, especially at the beginning of the data stream or after the occurrence of concept drifts, by using the data from multiple sources. We will investigate inductive transfer learning (i.e., T_Si ≠ T_T, while D_Si = D_T or D_Si ≠ D_T), as concept drifts may cause changes in T_T and T_Si over time.

IV. PROPOSED METHOD
This section introduces MARLINE. An overview of MARLINE's training framework is given in Figure 1, and its Java implementation is available in [21]. MARLINE considers that we have multiple input data streams (represented in grey in the figure), where the concepts of the source and target streams may be different. However, classifiers learned from the source concepts may still be used to improve the predictive performance in the target domain. For that, each concept observed from each data stream is learnt by an independent online learning ensemble, which we refer to as a base learning ensemble (shown in yellow). To identify different concepts, a concept drift detection method is used. Whenever a new target example needs to be predicted, MARLINE uses a mapping between the current target concept and each source concept, so that the target example is geometrically projected onto the space of the source concept (the projections are shown in purple). Through this mapping, the target example can be predicted by different sub-classifiers (shown in orange) trained by different base learning ensembles. MARLINE weights each sub-classifier depending on how useful it is for predicting the projected target (shown in cyan-blue). The prediction by MARLINE is based on the weighted majority vote of the predictions of all the sub-classifiers that compose all the source and target ensembles. The set of all weighted sub-classifiers is referred to as the MARLINE ensemble.
Section IV-A introduces MARLINE's training procedure. Section IV-B presents the mapping procedure for concept projections, which is undertaken with respect to the centroids of each concept. The centroids calculation is explained in Section IV-C. Section IV-D discusses the weighting of sub-classifiers that compose the MARLINE ensemble used for predictions. Section IV-E explains the voting procedure used for making predictions. Section IV-F shows the time complexity.

A. Training
The pseudo-code of MARLINE's training process is shown in Algorithm 1. The set M ⊆ {S_1, ..., S_n, T} is the set of all the sources and the target for which an online base learning ensemble has already been generated. When a new example (x_i, y_i) is received from a source or target data stream i for the first time, the proposed method creates one online base learning ensemble H_i^1 for this source or target (lines 3 to 6). Any online learning ensemble method can be used, e.g., online boosting or bagging [22]. Ensembles are used here because their diversity increases the chances that at least some of the sub-classifiers become useful for predicting the target [2].
As the concept of each data stream i may change due to concept drifts, each source/target i is associated with a pool of base learning ensembles H_i, where each ensemble may represent a different concept observed from that source/target. Each ensemble H_i^j within the pool contains K sub-classifiers h_i^{j,k}, where 1 ≤ k ≤ K. The newly created ensemble H_i^1 is added to its corresponding ensemble pool H_i (line 7). This pool receives an additional ensemble whenever i suffers a concept drift, as explained later in this section. After adding H_i^1 to H_i, the method discussed in Section IV-C is used to calculate the centroid associated with the concept p_i^1 (line 8). This centroid is used to create the mapping from the target to the source concepts, as explained in Section IV-B. Line 9 initialises the weights associated with each sub-classifier. These weights are used to identify which sub-classifiers are more important for predicting the projected target, and their calculation is explained in Section IV-D. Whenever a new training example (x_i, y_i) is received from stream i, MARLINE runs a drift detection method on i. Any drift detection method could be used, e.g., HDDM [16]. If the new training example belongs to the target stream, the sub-classifiers' weighting scheme discussed in Section IV-D is applied (line 23). The mapping procedure shown in Section IV-B is used in the weighting scheme.
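The pool-of-ensembles bookkeeping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `make_ensemble` is a hypothetical factory for whatever online base learning ensemble is plugged in (e.g., online bagging of K Hoeffding trees), and `fit_one` is an assumed incremental-training method.

```python
from collections import defaultdict

class EnsemblePools:
    """Keeps one pool of base learning ensembles per stream,
    with one ensemble per concept observed in that stream."""
    def __init__(self, make_ensemble):
        self.make_ensemble = make_ensemble
        self.pools = defaultdict(list)   # stream id -> list of ensembles (one per concept)

    def train(self, stream_id, x, y, drift_detected=False):
        pool = self.pools[stream_id]
        if not pool or drift_detected:          # first example seen, or a new concept:
            pool.append(self.make_ensemble())   # start a fresh ensemble for it
        pool[-1].fit_one(x, y)                  # only the current concept's ensemble learns
        return pool[-1]
```

Older ensembles in each pool are kept frozen so that they can still vote on projected target examples later.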

B. Mapping Procedure
The main purpose of the mapping procedure is to create the projection x_i^j of the current target example x_T on the source concept p_i^j. Therefore, given a current target example, the classifiers trained with a certain source concept can make a prediction on the projection of this target example based on the knowledge learned from the source concept. After a concept drift has been detected in the target data stream, any previous concept in that stream is also regarded as a source concept. From this point onward, we will use the term "source+" instead of "source" when the past target concepts are included as sources. Therefore, i and j in the source+ concept p_i^j are such that either i is a source stream S_1, ..., S_n with 1 ≤ j ≤ J_i, or i is the target stream T with 1 ≤ j < J_T. To seek the projection x_i^j of target example x_T on p_i^j, a mapping function between p_i^j and p_T^{J_T} is required. However, for online learning we only store the latest example in memory over time. To build the mapping function between the source+ and target concepts without retrieving the overall historical data, we propose the following procedure. Consider a pair of reference points in a given source+ concept, and another pair in the target concept. For a single concept p_i^j, the pair of reference points can be the centroids of the class-conditional distributions, denoted c_{i,j}^{y=0} and c_{i,j}^{y=1}. The calculation of c_{i,j}^y is shown in Section IV-C.
We connect the pair of reference points of a given concept using a vector. Let V_{i,j} = c_{i,j}^{y=0} − c_{i,j}^{y=1} and V_{T,J_T} = c_{T,J_T}^{y=0} − c_{T,J_T}^{y=1}. A transformation matrix R satisfying V_{i,j} = R · V_{T,J_T} can be considered as the mapping function of any two vectors between p_i^j and p_T^{J_T}. Therefore, the source+ vector V_{i,j} can be seen as a projection of the target vector V_{T,J_T}. When a new example x_T from the target domain is received, the vector between the new example and one centroid (c_{T,J_T}^{y=1}) of the target concept can be computed as V_T = x_T − c_{T,J_T}^{y=1}. The projection of V_T on p_i^j can then be calculated as x_i^j = R · V_T + c_{i,j}^{y=1}.
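In the two-feature case used by the artificial datasets, the mapping can be sketched as below. This is a minimal illustration under assumptions: the rotation-plus-uniform-scaling form of R (fully determined by one vector pair in 2-D) and the function names are ours, not necessarily the paper's exact construction of R.

```python
import numpy as np

def rotation_scaling_matrix(v_src, v_tgt):
    """2-D matrix R with R @ v_tgt = v_src: rotate v_tgt onto v_src's
    direction and scale by the ratio of their lengths."""
    scale = np.linalg.norm(v_src) / np.linalg.norm(v_tgt)
    angle = np.arctan2(v_src[1], v_src[0]) - np.arctan2(v_tgt[1], v_tgt[0])
    c, s = np.cos(angle), np.sin(angle)
    return scale * np.array([[c, -s], [s, c]])

def project_to_source(x_t, c_tgt, c_src, R):
    """Project a target example onto a source+ concept's space: shift by the
    target reference centroid, map with R, re-anchor at the source centroid."""
    return R @ (np.asarray(x_t, dtype=float) - c_tgt) + c_src
```

Because R is rebuilt only from the current centroids, the mapping needs no access to historical examples, which matches the online setting described above.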

C. Calculating and Updating the Centroids
To save computational time, we update the centroids of each concept instead of updating the transformation matrix R at each time step. The transformation matrix R is calculated whenever a target example needs to be predicted.
The centroids of the concept p_i^j(x, y) are dynamically updated based on examples (x_i, y_i) received from data stream i during the time window since concept j became active. During this time window, if no example with label y_i has been seen before, the centroid c_{i,J_i}^{y=y_i} is set to x_i. Otherwise, it is updated as c_{i,J_i}^{y=y_i} = (Σ_{l=1}^{L} θ^{L−l} · x_l) / (Σ_{l=1}^{L} θ^{L−l}), where the x_l are the examples with label y_i, L is the number of training examples received in data stream i since the concept p_i^j became active, and θ, 0 < θ ≤ 1, is a pre-defined forgetting factor used to reduce the weight given to historical examples. It helps to deal with non-stationary environments. The summation in the denominator is a normalisation factor, which can be updated in an online manner.
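The decayed average above can be maintained online by storing only the running weighted sum and the running normalisation factor. The sketch below assumes this incremental form (class and method names are ours):

```python
import numpy as np

class DecayedCentroid:
    """Online class centroid with forgetting factor theta (0 < theta <= 1)."""
    def __init__(self, theta=0.9):
        self.theta = theta
        self.weighted_sum = None   # running sum of theta^(L-l) * x_l
        self.norm = 0.0            # running sum of theta^(L-l): normalisation factor

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.weighted_sum is None:
            self.weighted_sum = x.copy()       # first example of this class: centroid = x
        else:
            # decay all previous contributions by theta, then add the new example
            self.weighted_sum = self.theta * self.weighted_sum + x
        self.norm = self.theta * self.norm + 1.0
        return self.centroid

    @property
    def centroid(self):
        return self.weighted_sum / self.norm
```

With theta = 1 this reduces to the plain running mean; smaller theta down-weights older examples, helping after drifts.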

D. Sub-Classifiers Weighting
When an example (x_T, y_T) from the target domain is received, all the sub-classifiers' weights ω_{h_i^{j,k}} are updated. We assign larger weights to the sub-classifiers that perform well on harder-to-classify examples. The sub-classifier's weight ω_{h_i^{j,k}} depends on the corresponding sub-classifier's performance, updated based on the corresponding projection (x_i^j, y_T) of the current target example (x_T, y_T), which is obtained by the mapping procedure explained in Section IV-B. To update each sub-classifier's performance, the current example's weight needs to be calculated. We get the prediction of each sub-classifier on the corresponding target example projection. The example's weight SW_SC can be calculated as SW_SC = 1 − (Σ_{i,j,k} α_{h_i^{j,k}} · P(h_i^{j,k}(x_i^j) = y_T)) / (Σ_{i,j,k} α_{h_i^{j,k}}), where P(h(x) = y) is the probability of class y estimated by the sub-classifier h for x, and α_{h_i^{j,k}} is the performance computed based on the target examples received before (x_T, y_T). The example's weight SW_SC indicates how confident the MARLINE ensemble is on the current example: the more confident the MARLINE ensemble is, the smaller the weight the example receives, and vice versa.
Consider that λ_sc is the accumulation of the example weights SW_SC credited to sub-classifier h_i^{j,k} on the projections it classified correctly, discounted by a forgetting factor θ, 0 < θ ≤ 1, at each time step. It represents how much contribution sub-classifier h_i^{j,k} makes in the MARLINE ensemble to vote for the projection x_i^j. The sub-classifier's weight is then ω_{h_i^{j,k}} = (α_{h_i^{j,k}} ≥ σ ? λ_sc : 0), where σ is a pre-defined performance index and (testCondition ? v1 : v2) retrieves v1 if testCondition is true, and v2 otherwise.
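A minimal sketch of this weighting, under the stated assumption that λ is a θ-discounted accumulation of example weights credited on correct predictions (the exact accumulation rule is our reading, not a verbatim reproduction of the paper's equation):

```python
def update_contribution(lam, example_weight, correct, theta=0.9):
    """Decay the accumulated contribution, then credit the current example's
    weight only when the sub-classifier predicted its projection correctly."""
    return theta * lam + (example_weight if correct else 0.0)

def subclassifier_weight(alpha, lam, sigma):
    """Final voting weight: zero unless the sub-classifier's performance
    alpha reaches the performance index sigma."""
    return lam if alpha >= sigma else 0.0
```

The σ threshold explains the sensitivity results later in the paper: with a large σ, few sub-classifiers reach the required performance and most receive weight zero.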

E. Voting Procedure for Making Predictions
When a prediction is needed for a target instance x_T, we multiply the corresponding weights of the sub-classifiers with the probabilistic prediction made by each sub-classifier on its corresponding projection x_i^j of the current target example. All sub-classifiers h_i^{j,k}, i ∈ M, j ∈ {1, ..., J_i}, k ∈ {1, ..., K}, are considered for this purpose. Afterwards, we obtain the sum of the weighted prediction probabilities of each class and use weighted majority voting to decide the predicted class.
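The voting procedure can be sketched as follows; `predict_proba` is an assumed interface returning a class-probability vector for a given input, and the function name is ours:

```python
import numpy as np

def marline_vote(subclassifiers, projections, weights, n_classes=2):
    """Weighted soft vote: sum each sub-classifier's class probabilities on
    its own projection of the target example, scaled by its weight, and
    return the class with the largest total."""
    totals = np.zeros(n_classes)
    for h, x_proj, w in zip(subclassifiers, projections, weights):
        totals += w * np.asarray(h.predict_proba(x_proj))
    return int(np.argmax(totals))
```

Sub-classifiers whose weight is zero (performance below σ) contribute nothing to the vote, effectively pruning unhelpful source+ concepts.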

F. Time Complexity Analysis
When learning a target training example, MARLINE's training time complexity is O(f_DD + f_H + (J_S1 + J_S2 + ... + J_Sn + J_T)·d^2 + (J_S1 + J_S2 + ... + J_Sn + J_T)·K·f_h), where f_H, f_DD and f_h are the time complexities for training the base learning ensemble with the example, running the drift detection method and getting the prediction from a sub-classifier, respectively, and d is the number of input features.
When learning a source training example, MARLINE's training time complexity is O(f_DD + f_H + d).
MARLINE's time complexity for prediction is O((J_S1 + J_S2 + ... + J_Sn + J_T)·d^2 + (J_S1 + J_S2 + ... + J_Sn)·K·f_h). Details on the complexity estimation can be found in the Supplementary Material of this paper [23].

V. EXPERIMENTS SETUP
We evaluate MARLINE under several different conditions, including stationary environments, non-stationary environments with different types of concept drifts, and different target data stream sizes. Artificial datasets enable us to better understand when and how MARLINE can be helpful. Real world datasets enable us to check whether MARLINE can work well in practice.

A. Datasets
We use the same three artificial datasets with similar target and sources as those of [2], and generated additional datasets where the source was non-similar to the targets. These datasets have two numeric features and one binary output. The examples belonging to each output class were generated by a Gaussian distribution as shown in Table II, where each dataset is composed of several (target and source) data streams. The three datasets with similar sources use only the target and similar source data streams from Table II, whereas the three datasets with non-similar sources use only the target and non-similar source data streams from Table II.
The real-world datasets are the London bike sharing dataset [10] and the Bike Sharing in Washington D.C. dataset [11]. The task is to classify whether rental bikes are in low or high demand. We use the median of the total count of rental bikes of a given dataset to distinguish low and high demand in this dataset. We select the features shared by the two datasets (actual temperature, feeling temperature, humidity and wind speed) to unify the feature dimension. Each dataset is divided into three sub-datasets based on holidays, weekends and weekdays, composing the following three scenarios: we choose weekdays from Washington D.C. as the source, with (1) holidays and (2) weekends in London as the targets; and we use weekends in Washington D.C. as the source, with (3) weekdays in London as the target. The aim of these three different target sub-datasets is to create small, medium and large stream sizes (Holiday: 384, Weekend: 4970, Weekday: 12060).

B. Benchmark Methods and Evaluation Measures
MARLINE was compared against Melanie [2], Adaptive Random Forest (ARF) [14], Dynamic Weighted Majority (DWM) [17], Online Bagging [22], Online Boosting [22], Online Bagging with Drift Detection, and Online Boosting with Drift Detection. Melanie was chosen for the comparison because it is the state-of-the-art multi-source transfer learning approach for non-stationary data streams. Like MSCRDR [8] and COMC [9], Melanie is only able to benefit from source concepts when they share the same task as the target concept. However, different from these approaches, Melanie has the advantage of being able to detect when source tasks are dissimilar to the target, avoiding hindering predictive performance on the target when that is the case. Comparing MARLINE against Melanie will reveal whether MARLINE's mapping function helps to improve predictive performance over an approach that can only benefit from source concepts similar to the target concept.
Adaptive Random Forest and DWM were chosen because they are popular data stream learning approaches available in the Massive Online Analysis (MOA) tool [24]. Comparing against them shows whether MARLINE can outperform popular approaches. Online bagging and online boosting [22] were included as baseline ensemble approaches with no strategy to deal with concept drift. They provide a lower bound for the predictive performance expected from MARLINE and any other approaches for non-stationary environments. Additionally, they were also applied in combination with a drift detection method, which resets the models upon a drift alarm, to enable these approaches to cope with drifts. Comparing against this combination shows whether or not MARLINE is able to benefit from sources in general. MARLINE without source data streams was also used in the comparison, because mapping is also performed between the old target concepts and the current target concept. Including this approach shows whether or not it is beneficial to use different sources, especially in the initial learning stage, rather than only mapping between old and new target concepts.
Both MARLINE and Melanie were investigated with online bagging and online boosting as base learning ensemble methods [22]. All ensemble approaches used Hoeffding trees [25] as the basic units of learning, except for ARF, which is based on the ARF Hoeffding Tree [14]. Two drift detection methods (DDM [15] and HDDM A [16]) were used for all the approaches that require drift detection. DDM is a well-known method. HDDM A has recently been shown to perform well compared to other drift detection methods when configuring ensembles [26].
Thirty runs were performed for all the compared approaches, except for DWM [17], which is deterministic and requires a single run. The average accuracy across the 30 runs is reported.
Grid search was used to tune the hyperparameters of each approach on each dataset based on a preliminary run; details are given in the Supplementary Material [23].
For the artificial data streams, the predictive performance is calculated prequentially and reset at the real location of the drifts [18]. In the artificial datasets, we know exactly when the concept drifts happen. This evaluation framework resets the accuracy to zero when a concept drift (or an increment of an incremental drift) occurs. This enables us to measure the performance on each concept separately, without being affected by the previous concepts. For real-world data streams, the predictive performance of all the approaches is evaluated using sliding windows [27] with a size of 10% of the target data stream. Friedman tests followed by Nemenyi post-hoc tests were used to compare the predictive performance of all approaches on each dataset.
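The prequential evaluation with reset at known drift locations can be sketched as below; `predict` and `fit_one` are hypothetical online-model methods, and the function name is ours:

```python
def prequential_accuracy_with_reset(stream, model, drift_points):
    """Test-then-train (prequential) accuracy, with counters reset at the
    known drift locations of an artificial stream, so that each concept
    is evaluated separately."""
    correct = total = 0
    accuracies = []
    for t, (x, y) in enumerate(stream):
        if t in drift_points:            # real drift location: restart measurement
            correct = total = 0
        correct += int(model.predict(x) == y)   # test first...
        total += 1
        accuracies.append(correct / total)
        model.fit_one(x, y)                     # ...then train on the example
    return accuracies
```

Without the reset, high accuracy accumulated on an old concept would mask poor performance right after a drift.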

VI. EXPERIMENTAL RESULTS

A. Comparison on Artificial Datasets

1) Experiments with Non-Similar Source: This experiment aims to investigate whether or not the use of very different concept sources by MARLINE can help to improve the predictive performance. The Friedman ranking of the approaches on each dataset is shown in Table III. (Table III caption: Friedman's p-values were always < 2.2 × 10^−16. The best approach has its ranking in red with grey background, and the approaches not significantly different from it according to the Nemenyi test are in bold with grey background. Mean accuracies and standard deviations are in the Supplementary Material [23].) We can see that MARLINE with source is amongst the best performers under different amounts of training data and different types of drifts, as shown by the table cells highlighted in grey, except for the incremental drifts with the class size of 500. MARLINE without source is sometimes amongst the best, demonstrating that mapping the new target concept to the space of the historical target concepts is also beneficial. Figure 2 shows some representative results across time. Other figures were omitted due to space restrictions.
2) Experiments with Similar Source: Melanie was designed to transfer knowledge between similar source and target concepts, being thus expected to achieve the best performance for these data streams. Based on the Friedman and Nemenyi tests shown in Table III, Melanie outperforms the other approaches in most scenarios. However, MARLINE with source also achieves competitive results, similar to those of Melanie. In some cases, the performance of MARLINE with source is better than that of Melanie, e.g., MARLINE (HDDM A (Online Boosting)) with source outperforms Melanie (HDDM A (Online Boosting)) with source for the class size of 500 on the abrupt drift dataset.

B. Comparison on Real World Datasets
London and Washington D.C. bike sharing data were collected from different sources, so their input and output spaces are quite different (see Table IV). Therefore, the concepts (both in terms of domains and tasks) of the two datasets are expected to be very different. From Table III, MARLINE (HDDM A (Online Bagging)) with source has the best performance on all the real-world datasets. MARLINE without source is the second best. From Figure 3, we can see that the accuracy of MARLINE with source is quite similar to that of MARLINE without source. This may be due to the adaptive mechanism of MARLINE. It is worth noting that when the concept of the stream was easier to learn (as for the artificial datasets), MARLINE was most helpful at the beginning of the stream and right after drifts. This is because, as time passes, every method can learn the concept well (Figure 2). However, when the concept was more complex (as in the real-world datasets), MARLINE provided great help throughout time (Figure 3). Additional related analyses are in the Supplementary Material [23].

C. Contribution of Source+ Sub-classifiers
To further investigate the importance of individual sub-classifiers in MARLINE, we select two datasets (Abrupt with non-similar source and class size of 5000; and Weekday) and plot the average weight ratios of all the source+ sub-classifiers over 30 runs in Figure 4. The total weight ratio of the source+ sub-classifiers is calculated as the sum of the weights assigned to the source and historical target sub-classifiers divided by the sum of the weights of all sub-classifiers in the MARLINE ensemble.
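This ratio can be computed as a simple sketch (the function name and boolean-mask interface are ours):

```python
def source_plus_weight_ratio(weights, is_source_plus):
    """Fraction of the MARLINE ensemble's total sub-classifier weight held by
    source and historical-target (source+) sub-classifiers."""
    total = sum(weights)
    source_plus = sum(w for w, flag in zip(weights, is_source_plus) if flag)
    return source_plus / total if total > 0 else 0.0
```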
From Figure 4a, before the concept drift occurs, the mean total weight of the source sub-classifiers is 25.86%. After the concept drift, due to past target sub-classifiers joining the MARLINE ensemble, the importance of the source+ sub-classifiers increases, and their mean total weight is 48.24%. The average total weight is even larger for the real-world dataset (see Figure 4b). From Figure 4b, we notice that the source+ classifiers significantly contribute towards the predictions throughout time (the mean of the total weights over the whole data stream is 94.88%). This may be because the real-world dataset poses more challenges to the target sub-classifiers, which struggle to maintain the level of performance they achieve on the artificial datasets. We also notice some spikes and sudden drops in the total weights over time. This suggests that the weighting mechanism is affected by noise. In future work, we will investigate whether other weighting mechanisms can improve the predictive performance of MARLINE further.

VII. SENSITIVITY ANALYSIS
As MARLINE has a few hyperparameters, it is important to conduct a study to understand (Q1) how different MARLINE ensemble compositions (different types of base learning ensemble with different sizes K and performance index σ) affect the predictive performance, and (Q2) the influence of different drift detection methods and forgetting factors θ used to handle different types of concept drift. Section VII-B1 answers these questions based on artificial datasets. The real world datasets are also used to support the analysis in Section VII-B2.

A. Experimental Design
To investigate (Q1) and (Q2), Analysis of Variance (ANOVA) [28] was performed to analyse the influence of each hyperparameter as well as its interactions with others on the average prequential accuracy. The step-wise changes of each hyperparameter are defined to cover the range of the best hyperparameter values selected by the grid search for different datasets in Section V.
For (Q1), the following factors are investigated: base learning ensemble method with two levels (BLM: Online Bagging and Online Boosting), base learning ensemble size K ∈ {10, 20, 30} and performance index σ ∈ {0.0, 0.2, 0.4, 0.6}. As these are all within-subject factors, a Repeated Measures ANOVA design is used. For (Q2), the following factors are investigated: drift detection method (DD: DDM and HDDM A), forgetting factor θ ∈ {0.9, 0.92, 0.94, 0.96, 0.98, 1} and drift type (DT: No Drift, Abrupt, Incremental). The last factor is only considered for the artificial datasets. As the first two factors are within-subject factors and the last is a between-subjects factor, a split-plot (mixed) ANOVA design is adopted for the artificial datasets and a Repeated Measures ANOVA design for the real-world datasets. Thirty runs for each combination of the factors are carried out on each dataset.
Mauchly's sphericity test [29] is used with a level of significance of 0.05 to evaluate whether or not the sphericity assumption is violated. If violated, the ANOVA's p-values are corrected to take that into account. If the epsilon estimate is below 0.75, the Greenhouse-Geisser correction [30] is adopted to correct the degrees of freedom of the F-distribution. Otherwise, the Huynh-Feldt correction [31] is adopted to make it less conservative [32].

B. Results
Tables V and VI present the ANOVA results for the artificial and real-world datasets, respectively. The sum of squares (SS), degrees of freedom (DF), mean squares (MS), F statistics (F) and partial eta-squared (η²p) are reported.

1) Analysis Using Artificial Datasets:
As can be observed from Table V, the performance index σ, the base learning ensemble size K and the interaction K * σ have a large impact (η²p ≥ 0.131), whereas the other factors and interactions have a small impact (η²p ≤ 0.017). Therefore, the MARLINE ensemble composition factors σ and K have more influence on the accuracy. Figures 5a and 5b illustrate the impact of the factors σ, K, the base learning ensemble method and their interaction. The two plots are fairly similar to each other, confirming that the interaction BLM * K * σ has a small impact. A large σ = 0.6 is detrimental to the accuracy, with worse accuracy obtained especially with a smaller ensemble size K. This is because this performance index is difficult for the sub-classifiers to reach. Therefore, most sub-classifiers will have weight zero in the MARLINE ensemble, effectively decreasing its size and diversity. When σ ≤ 0.4, different ensemble sizes K and σ values lead to similar accuracy, where a smaller ensemble size, e.g., K = 10, leads to slightly worse accuracy and σ = 0.4 leads to slightly better accuracy when Online Bagging is used.
Table V also shows that the forgetting factor θ, the interaction between the drift detection method and the drift type DD * DT, and the interaction θ * DT have a medium impact (0.071 ≤ η²p ≤ 0.076). So, the drift detection method and θ play important and likely distinct roles when handling different types of concept drift. Figures 5c and 5d illustrate the effect of the factors θ, DD, and DT and their interaction. The two plots show fairly similar patterns, confirming that the interaction DD * θ * DT has a small impact.
When the drift appears abruptly, independent of the drift detection method (DD), θ = 0.9 results in the best accuracy. As θ increases, the accuracy slightly decreases. When the drift type is incremental, accuracy drops more sharply with θ ≥ 0.96 than under abrupt concept drift. As the concept drifts in the incremental datasets are more difficult to detect than those in the abrupt datasets, it is reasonable that θ takes more responsibility for coping with concept drifts when drift detection does not perform well. When the dataset has no drift, there is no significant difference between the accuracies obtained by the different drift detection methods, which we confirmed with additional paired t-tests with Bonferroni corrections. Furthermore, accuracy changes only very slightly when θ changes.
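The role of the forgetting factor θ can be illustrated with a standard fading-factor accuracy estimator. This is an assumption for illustration (not necessarily MARLINE's exact update rule): a smaller θ discounts old performance faster, so the estimate reacts more quickly after a drift that the detector misses.

```python
def update_fading_accuracy(acc, correct, theta):
    """Fading-factor accuracy estimate (illustrative sketch): smaller
    theta forgets past performance faster, tracking the current concept."""
    return theta * acc + (1.0 - theta) * (1.0 if correct else 0.0)

# After an undetected drift the classifier starts mispredicting.
# A small theta lets the estimate drop quickly; theta near 1 reacts slowly.
acc_fast, acc_slow = 0.9, 0.9
for _ in range(10):
    acc_fast = update_fading_accuracy(acc_fast, False, 0.90)
    acc_slow = update_fading_accuracy(acc_slow, False, 0.98)
print(round(acc_fast, 3), round(acc_slow, 3))  # the theta=0.9 estimate is far lower
```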
Therefore, we summarise that: • Q1: A large performance index σ and small ensemble sizes K can be detrimental to the predictive performance of MARLINE, whereas in general σ = 0.4 combined with K ≥ 20 led to better results. • Q2: If there are concept drifts in the data stream and the drift detection method cannot detect them accurately, a small value of the forgetting factor (0.9 ≤ θ ≤ 0.94) normally helps MARLINE to improve its predictive performance in handling concept drifts. When there is no concept drift, a small value of the forgetting factor does not hurt performance either.

2) Analysis Using Real-World Datasets:
Table VI shows the within-subjects tests performed on the real-world datasets. The plots of marginal means are shown in Figure 6. The results are in general similar to those on the artificial datasets, though certain effects and differences in the magnitude of the predictive performance were larger. We can see that σ, K, and the interaction σ * K have large effect sizes (η²p ≥ 0.161). Meanwhile, θ has a very large effect size (η²p = 0.495) and the choice of drift detection method also has a large effect size (η²p = 0.111). This could be because, in the real-world datasets, the concept drifts are a mix of different drift types, making them more difficult to detect. Therefore, MARLINE relies more on θ to cope with the concept drifts. Figures 6a and 6b show similar trends to the artificial datasets. However, when σ ≤ 0.4, the improvement in accuracy with a greater σ is more pronounced, confirming that both the size and the quality (performance index) of the sub-classifiers are important. Figure 6c also shows that HDDM_A performs better on the real-world datasets, in line with the experiments shown in Section V. Also, we find that smaller θ values benefit accuracy more.

VIII. CONCLUSION
In this paper, we focus on a general and challenging problem: learning from very different concepts in data stream mining. By mapping the target concept to the space of each source concept, the sub-classifiers that closely match the projection of the target concept are given higher weights in the MARLINE ensemble, enabling it to achieve better predictive performance in non-stationary environments. We carried out extensive experiments, and the results demonstrate that our proposed MARLINE is effective. A sensitivity analysis is also presented. Future work includes the investigation of strategies to reduce the size of MARLINE's classifier pool; the investigation of different weighting schemes to further improve accuracy; an analysis of the computational time taken to run the approach, complementing its time complexity analysis; experiments with more data streams, base learners, and drift detection methods; and an investigation of sensitivity to noise.