Learning from evolving data streams

Ubiquitous data challenges current machine learning systems to store, handle, and analyze data at scale. Traditionally, this task is tackled by dividing the data into (large) batches. Models are trained on a data batch and then used to obtain predictions. As new data becomes available, new models are created which may or may not contain previous data. This training-testing cycle is repeated continuously. Stream learning is an active field where the goal is to learn from infinite data streams. This gives rise to challenges additional to those found in the traditional batch setting: First, data is not stored (it is infinite), so models are exposed only once to single samples of the data, and once processed those samples are not seen again. Models must be ready to provide predictions at any time. Compute resources such as memory and time are limited and must consequently be carefully managed. Finally, the data can drift over time and models must be able to adapt accordingly. This is a key difference with respect to batch learning, where data is assumed static and models will fail in the presence of change. Model degradation is a side-effect of batch learning in many real-world applications, requiring additional efforts to address it. This paper provides a brief overview of the core concepts of machine learning for data streams and describes scikit-multiflow, an open-source Python library specifically created for machine learning on data streams. scikit-multiflow is built to serve two main purposes: to make it easy to design and run experiments, and to make it easy to extend and modify existing methods.


Introduction
The minimum pipeline in machine learning is composed of: (1) data collection and processing, (2) model training, and (3) model deployment. Conventionally, data is collected and processed in batches. Although this approach is state-of-the-art in multiple applications, it is not suitable in the context of evolving data streams. The batch learning approach assumes that data is sufficiently large and accessible. This is not the case in streaming data, where data is available one sample at a time and storing it is impractical given its (theoretically) infinite nature. In addition, non-stationary environments require running the pipeline multiple times in order to minimize model degradation, in other words, to maintain optimal performance. This is especially challenging in fast-changing environments where efficient and effective adaptation is vital.
Multiple real-world machine learning applications exhibit the characteristics of evolving data streams; in particular we can mention:
• Financial markets. Financial markets generate huge volumes of data daily. For instance, the New York Stock Exchange captures 1 terabyte of information each day 1 . Depending on the state of such markets and multiple external factors, data can become obsolete quickly, rendering it useless for creating accurate models. Predictive models must be able to adapt fast to be useful in this dynamic environment.
• Predictive maintenance. The contribution of IoT to the digital universe is substantial. Data only from embedded systems accounted for 2% of the world's data in 2013, and is expected to hit 10% by 2020 2 . IoT sensors are used to monitor the health of multiple systems, from complex systems such as airplanes to simpler ones such as household appliances. Predictive systems are required to react fast to prevent disruptions from malfunctioning elements.
• Online fraud detection. The speed of reaction of an automatic system is also an important factor in multiple applications. As a case in point, VisaNet has a capacity (as of June 2019) to handle more than 65,000 transactions per second 3 . Fraud detection in online banking involves additional challenges beside data collection and processing. Fraud detection systems must adapt quickly to changes such as consumer behavior (for example during holidays), the stability of the financial markets, as well as the fact that attackers constantly change their behavior to beat these systems.
• Supply chain. Several sectors use automatic systems in their supply chain to cope with the demand for products efficiently. However, the COVID-19 pandemic brought to attention the fragility of these systems to sudden changes, e.g., in less than 1 week, products related to the pandemic such as face masks filled the top 10 searched terms in Amazon 4 . Many automatic systems failed to cope with change resulting in the disruption in the supply chain.
• Climate change. Environmental science data is a quintessential example of the five v's of big data: volume, velocity, variety, veracity, and value. In particular, NASA's Earth Science Data and Information System project holds 24 petabytes of data in its archive and distributed 1.3 billion files in 2017 5 . Understanding environmental data has many implications in our daily lives, e.g., food production can be severely impacted by climate change, and disruption of the water cycle has resulted in a rise of heavy rains with the associated risk of flooding. IoT sensors are now making environmental data available at a faster rate and machine learning systems must adapt to this new norm.

Fig. 1: Batch learning systems are characterized by the investment in resources like memory and training time as the volume of data increases. Once a reasonable investment threshold is reached, data becomes unusable, turning into a missed opportunity. On the other hand, efficient management of resources makes stream learning an interesting alternative for big data applications.
As shown in the previous examples, dynamic environments pose an additional set of challenges to batch learning systems. Model degradation is a predominant problem in multiple real-world applications. Once enough data has been generated and collected, proactive users might decide to retrain their models to make sure that they agree with the current data. This is complicated for two reasons: First, batch models are (in general) not able to take new data into account, so the machine learning pipeline must be run multiple times as data is collected over time. Second, the decision to take such action is not trivial and involves multiple aspects. For example, should a new model be trained only on new data? This depends on the amount of variation in the data. Small variations might not be enough to justify retraining and re-deploying a model. This is why a reactive approach is predominantly employed in the industry: model degradation is monitored and corrective measures are enforced if a user-defined threshold is exceeded (on accuracy, type I and type II errors, etc.). Fig. 1 depicts another important aspect to consider, the trade-off between the investment in resources such as memory and time (and their associated cost) and the pay-off in predictive performance. In stream learning, resource-wise efficiency is fundamental; predictive models not only must be accurate but must also be able to handle theoretically infinite data streams. Models must fit in memory no matter the amount of data seen (constant memory). Additionally, training time is expected to grow sub-linearly with respect to the volume of data processed. New samples must be processed as soon as they become available, so it is vital to process them as fast as possible in order to be ready for the next sample in the stream.

Machine learning for streaming data
Formally, the task of supervised learning from evolving data streams is defined as follows. Consider a stream of data S = {(x_t, y_t)}, t = 1, . . . , T, where T → ∞. The input x_t is a feature vector and y_t the corresponding target, where y is continuous in the case of regression and discrete for classification. The objective is to predict the target ŷ for an unknown sample x. For illustrative purposes, this paper focuses on the classification task.
In stream learning, models are trained incrementally, one sample at a time, as new samples (x_t, y_t) become available. Since streams are theoretically infinite, the training phase is non-stop and predictive models continuously update their internal state in agreement with incoming data. This is fundamentally different from the batch learning approach, where models have access to all (available) data during training. As previously mentioned, in the stream learning paradigm, predictive models must be resource-wise efficient. For this purpose, a set of requirements [BHKP11] must be fulfilled by streaming methods:
• Process one sample at a time, and inspect it only once. The assumption is that there is neither enough time nor space to store multiple samples; failing to meet this requirement implies the risk of missing incoming data.
• Use a limited amount of memory. Data streams are assumed infinite, thus storing data for further processing is impractical.
• Work in a limited amount of time. In other words, avoid bottlenecks generated by time-consuming tasks which in the long run could make the algorithm fail.
• Be ready to predict at any point. Stream models are continuously updated and must be able to provide predictions at any point in time.

Concept drift
A challenging element of dynamic environments is the chance that the underlying relationship between the features X and the target(s) y can evolve (change) over time. This phenomenon is known as Concept Drift. Real concept drift is defined as changes in the posterior distribution of the data p(y|X), and happens independently of changes in p(X). In contrast, virtual drift (data evolution) refers to changes in the unconditional data distribution p(X) that do not affect p(y|X). In batch learning, the joint distribution of the data p(X, y) is, in general, assumed to remain stationary. In the context of evolving data streams, concept drift between two points in time t₀ and t₁ is defined as

p_t₀(X, y) ≠ p_t₁(X, y)

Concept drift is known to harm learning [GZB + 14]. The following drift patterns, shown in Fig. 2, are usually considered:
• Abrupt. A new concept is introduced immediately and the transition between concepts is minimal. In this case, adaptation time is vital since the old concept is no longer valid.
• Recurring. An old concept is seen again as the stream progresses, e.g., when the data corresponds to a periodic phenomenon such as the circadian rhythm.
• Outliers. Not to be confused with true drift. A drift detection method must be robust to noise, in other words, minimize the number of false positives in the presence of outliers or noise.
Although the continuous learning nature of stream methods provides some robustness to concept drift, specialized methods have been proposed to detect drift; [GZB + 14] provides a thorough survey of this topic. In general, the goal of drift detection methods is to accurately detect changes in the data distribution while being robust to noise and resource-wise efficient. Drift-aware methods use specialized detection mechanisms to react to drift faster and more efficiently. For example, the Hoeffding Tree algorithm [DH00], a kind of decision tree for data streams, does not handle concept drift explicitly, whereas the Hoeffding Adaptive Tree [BG09] uses ADaptive WINdowing (ADWIN) [BG07] to detect drifts. If a drift is detected at a given branch, an alternate branch is created and eventually replaces the original branch if it shows better performance on new data.
ADWIN, a popular drift detection method with mathematical guarantees, keeps a variable-length window of recent items; guaranteeing that there has been no change in the data distribution within the window. Internally, two sub-windows (W 0 ,W 1 ) are used to determine if a change has happened. With each new item observed, the average values of items in W 0 and W 1 are compared to confirm that they correspond to the same distribution. If the distribution equality no longer holds, then an alarm signal is raised indicating that drift has occurred. Upon detecting a drift, W 0 is replaced by W 1 and a new W 1 is initialized.

Performance evaluation
The predictive performance P of a given model h is usually measured using some loss function ℓ that evaluates the difference between expected (true) class labels y and the predicted class labels ŷ.

P(h) = ℓ(y, ŷ)
A popular and straightforward loss function for classification is the zero-one loss function which corresponds to the notion of whether the model made a mistake or not when predicting.
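The zero-one loss can be sketched in a few lines (an illustrative helper, not part of any library):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Return the fraction of predictions that do not match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Two mistakes out of four predictions -> loss of 0.5
print(zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0]))  # → 0.5
```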
Due to the incremental nature of stream learning methods, special considerations are needed to evaluate their performance. Two prevalent methods in the literature are holdout [Koh95] and prequential [Daw84] evaluation. Holdout evaluation is popular in both batch and stream learning; testing is performed on an independent set of samples. Prequential evaluation, on the other hand, is specific to the stream setting: tests are performed on new data samples before they are used to train (update) the model. The benefit of this approach is that all samples are used for both testing and training. This is just a brief overview of machine learning for streaming data. It is important to mention that the field covers other tasks as well, such as regression, clustering, and anomaly detection, to name a few. We direct the reader to [GRB + 19] for an extensive and deeper description of this field, the state-of-the-art, and its active challenges.
The scikit-multiflow package

scikit-multiflow [MRBA18] is a machine learning library for multi-output/multi-label and stream data written in Python. It is developed as free and open-source software and distributed under the BSD 3-Clause License. Following the SciKits philosophy, scikit-multiflow extends the existing set of tools for scientific purposes. It features a collection of state-of-the-art methods for classification, regression, concept drift detection and anomaly detection, alongside a set of data stream generators and evaluators. scikit-multiflow is designed to seamlessly interact with NumPy [vCV11] and SciPy [VGO + 20]. Additionally, it contributes to the democratization of stream learning by leveraging the popularity of the Python language. scikit-multiflow is mainly written in Python, and some core elements are written in Cython [BBC + 11] for performance.
scikit-multiflow is intended for users with different levels of expertise. Its conception and development follow two main objectives: 1) Easy to design and run experiments. This follows the need for a platform that allows fast prototyping and experimentation. Complex experiments can be easily performed using evaluation classes. Different data streams and models can be analyzed and benchmarked under multiple conditions, and the amount of code required from the user is kept to a minimum. 2) Easy to extend existing methods. Advanced users can create new capabilities by extending or modifying existing methods. This way users can focus on the details of their work rather than on the overhead of working from scratch.
scikit-multiflow is not intended as a stand-alone solution for machine learning. It integrates with other Python libraries such as Matplotlib [Hun07] for plotting, scikit-learn [PVG + 11] for incremental learning 6 compatible with the streaming setting, Pandas [pdt20] for data manipulation, and NumPy and SciPy for numerical and scientific computations. However, it is important to note that scikit-multiflow does not extend scikit-learn, whose main focus is on batch learning. A key difference is that estimators in scikit-multiflow are incremental by design and training is performed by calling the partial_fit() method multiple times. The majority of estimators implemented in scikit-multiflow are instance-incremental, meaning single instances are used to update their internal state. A small number of estimators are batch-incremental, where mini-batches of data are used. On the other hand, calling fit() multiple times on a scikit-learn estimator will result in it overwriting its internal state on each call.
As of version 0.5.0, the following sub-packages are available:
• anomaly_detection: anomaly detection methods.
• data: data stream methods, including methods for batch-to-stream conversion and generators.
6. Only a small number of methods in scikit-learn are incremental.
• evaluation: evaluation methods for stream learning.
• lazy: methods in which generalization of the training data is delayed until a query is received, e.g., neighbors-based methods such as kNN.
• meta: meta learning (also known as ensemble) methods.

In a nutshell
In this section, we provide a quick overview of different elements of scikit-multiflow and show how to easily define and run experiments. Specifically, we provide examples of classification and drift detection.

Architecture
Here we describe the basic components of scikit-multiflow. All estimators in scikit-multiflow are created by extending the BaseSKMObject base class and the corresponding task-specific mixin(s): ClassifierMixin, RegressorMixin, MetaEstimatorMixin and MultiOutputMixin.
The ClassifierMixin defines the following methods:
• partial_fit -- incrementally train the estimator with the provided labeled data.
• fit -- interface used for passing training data as batches.
• predict -- predict the class value for the passed unlabeled data.
• predict_proba -- calculate the probability of a sample pertaining to each class.
During a learning task, three main operations are performed: data is provided by the stream, the estimator is trained on incoming data, and the estimator's performance is evaluated. In scikit-multiflow, data is represented by the Stream class, where the next_sample() method is used to request new data. The StreamEvaluator class provides an easy way to set up experiments. Implementations of the holdout and prequential evaluation methods are available. A stream and one or more estimators can be passed to an evaluator.

Classification task
In this example, we will use the SEA generator. A stream generator does not store any data but generates it on demand. The SEAGenerator class creates data corresponding to a binary classification problem. The data contains 3 numerical features, of which only 2 are relevant for learning 7 . We will use the data from the generator to train a Naive Bayes classifier. For compactness, the following examples do not include import statements, and external libraries are referenced by standard aliases.
As previously mentioned, a popular method to monitor the performance of stream learning methods is prequential evaluation. When a new data sample (X, y) arrives:
1. Predictions are obtained for the new data sample (X) to evaluate how well the model performs.
2. The new data sample (X, y) is then used to train the model so it updates its internal state.
The prequential evaluation can be easily implemented as a loop. In our example, the Naive Bayes classifier achieves an accuracy of 93.95% after processing all the samples. However, learning from data streams is a continuous task and best practice is to monitor the performance of the model at different points of the stream. In this example, we use an instance of the Stream class as it provides the next_sample() method to request data, and the returned data is a tuple of numpy.ndarray. Thus, the loop can be easily modified to read from other data structures such as numpy.ndarray or pandas.DataFrame. For real-time applications where data is actually represented as a stream (e.g. Google's protocol buffers), the Stream class can be extended to wrap the necessary code to interact with the stream.
The prequential evaluation method is implemented in the EvaluatePrequential class. This class provides extra functionality, including support for multiple performance metrics, benchmarking of several models concurrently, and visualization of performance over time. We can run the same experiment on the SEA data, this time comparing two classifiers: NaiveBayes and SGDClassifier (linear SVM with SGD training). We use the SGDClassifier in order to demonstrate the compatibility with incremental methods from scikit-learn. We set two metrics to measure predictive performance: accuracy and the kappa statistic [Coh60], which benchmarks classification accuracy under class imbalance by comparing the model's accuracy against that of a random classifier. During the evaluation, a dynamic plot displays the performance of both estimators over the stream, Fig. 3. Once the evaluation is completed, a summary is displayed in the terminal. In Fig. 3, we observe the evolution of both estimators as they are trained on data from the stream. Although NaiveBayes has better performance at the beginning of the stream, SGDClassifier eventually outperforms it. In the plot we show performance at multiple points, measured by the given metric (accuracy, kappa, etc.) in two ways: Mean corresponds to the average performance on all data seen previously, resulting in a smooth line. Current indicates the performance over a sliding window containing the latest data from the stream. The size of the sliding window can be defined by the user and is useful to analyze the 'current' performance of an estimator. In this experiment, we also measure resources in terms of time (training + testing) and memory. NaiveBayes is faster but uses slightly more memory. On the other hand, SGDClassifier is slower and has a smaller memory footprint.

Concept drift detection
For this example, we will generate a synthetic data stream. The first 1000 samples of the stream contain a sequence from a normal distribution with µ_a = 0.8, σ_a = 0.05, followed by 1000 samples from a normal distribution with µ_b = 0.4, σ_b = 0.2, and the last 1000 samples from a normal distribution with µ_c = 0.6, σ_c = 0.1. The distribution of data in the described synthetic stream is shown in Fig. 4. To see the impact of drift on learning, we compare two popular stream models: the HoeffdingTreeClassifier, and its drift-aware version, the HoeffdingAdaptiveTreeClassifier.
For this comparison, we will load the data from a csv file using the FileStream class. The data corresponds to the output of the AGRAWALGenerator with 3 gradual drifts at the 5K, 10K, and 15K marks. A gradual drift means that the old concept is gradually replaced by a new one; in other words, there exists a transition period in which the two concepts are both present. The result of this experiment is shown in Fig. 5. During the first 5K samples, we see that both methods behave in a very similar way, which is expected as the HoeffdingAdaptiveTreeClassifier essentially works as the HoeffdingTreeClassifier when there is no drift. At the 5K mark, the first drift is observable by the sudden drop in the performance of both estimators. However, notice that the HoeffdingAdaptiveTreeClassifier has the edge and recovers faster. The same behavior is observed after the drift at the 15K mark. Interestingly, after the drift at 10K, the HoeffdingTreeClassifier is better for a small period but is quickly overtaken. In this experiment, we can also see that the current performance evaluation provides richer insights into the performance of each estimator. It is worth noting the difference in memory between these estimators. The HoeffdingAdaptiveTreeClassifier achieves better performance while requiring less space in memory. This indicates that the branch replacement mechanism triggered by ADWIN has been applied, resulting in a less complex tree structure representing the data.

Real-time applications
We recognize that the previous examples use static synthetic data for illustrative purposes. However, the goal is to work on real-world streaming applications where data is continuously generated and must be processed in real-time. In this context, scikit-multiflow is designed to interact with specialized streaming tools, providing the flexibility to deploy streaming models and tools in different environments. For instance, an IoT architecture on an edge/fog/cloud computing environment is proposed in [CW19]. This architecture is designed to capture, manage, process, analyze, and visualize IoT data streams. In this architecture, scikit-multiflow is the stream machine learning library inside the processing and analytics block. In the following example, we show how we can leverage existing Python tools to interact with dynamic data. We use Streamz 8 to get data from Apache Kafka. The data from the stream is used to incrementally train, one sample at a time, a HoeffdingTreeClassifier model. The output on each iteration is a boolean value indicating whether the model correctly classified the last sample from the stream. Alternatively, we could define two nodes, one for training and one for predicting. In this case, we just need to make sure that we maintain the test-then-train order.

Conclusions and final remarks
In this paper, we provide a brief overview of machine learning for data streams. Stream learning is an alternative to standard batch learning in dynamic environments where data is continuously generated (and potentially infinite) and is non-stationary, i.e., it evolves over time (concept drift). We present examples of applications and describe the challenges and requirements that machine learning techniques must meet to be used on streaming data effectively and efficiently. We describe scikit-multiflow, an open-source machine learning library for data streams in Python. The design of scikit-multiflow is based on two principles: to make it easy to design and run experiments, and to make it easy to extend and modify existing methods. We provide a quick overview of the core elements of scikit-multiflow and show how it can be used for the tasks of classification and drift detection.