Database techniques for resilient network monitoring and inspection

Network connection logs have long been recognized as integral to proper network security, maintenance, and performance management. However, even a moderately sized network generates logs in large volumes and at very high rates. This paper explains why many storage methods are insufficient for providing real-time analysis over sizable datasets and examines database techniques that attempt to address this challenge. We argue that sufficient methods include distributing storage and computation and employing write optimized data structures (WODs). Diventi, a project developed by Sandia National Laboratories, is used here to evaluate the potential of WODs to manage large datasets of network connection logs. It can ingest billions of connection logs at rates over 100,000 events per second while allowing most queries to complete in under one second. Storage and computation distribution are then evaluated using Elasticsearch, an open-source distributed search and analytics engine. Then, to provide an example application of these databases, we develop a simple analytic which collects statistical information and classifies IP addresses based upon their behavior. Finally, we examine the results of running the proposed analytic in real time upon bro-conn (now Zeek) flow data collected by Diventi at IEEE/ACM Supercomputing 2019.


INTRODUCTION
The ability to analyze traffic allows network operators to identify network errors, detect anomalous traffic, classify malware [1], find botnets [2], and more. This analysis can be performed as part of an investigation into a detected breach or as part of a real-time analytic to give network insight or detect zero-day attacks as they occur. In order to enable these applications, the system storing network flows faces many challenges. It must simultaneously store a colossal amount of data, ingest event logs at an extremely high rate, and quickly respond to queries. The importance of many events is not known at the time the associated log is generated; therefore, a large amount of data must be stored. An event could be indicative of a misconfiguration or of malicious activity, but the system may not know immediately. This is especially true in the case of security monitoring. In 2018, a majority of successful breaches were not discovered for weeks (<20%), months (40%), or years (20%) [3]. Therefore, forensic analysis of intrusions could necessitate the long-term storage of these events.
Additionally, data must be ingested at a high rate in order to keep pace with network events. The system must ingest at least one second's worth of network data in one second so that it does not rapidly fall behind. Queries upon stored logs need to complete quickly so that analytics running in real time get actionable results, to enable rapid automated analysis of past events, and to preserve the sanity of the network security operators using the system. Many existing methods for storing network flows fail to manage these competing interests. -Problem statement A system storing network flows must balance data ingestion rate, query performance, and capacity. This work argues that two approaches can manage these competing interests: distributing storage and computation across many nodes in a cluster, and employing write optimized data structures that allow a single node to keep pace with the network.

-Contributions
The main contributions of this work involve improving upon the results of previous systems in the domain of networking and databases. While we compare Diventi and Elasticsearch here, it is worth noting that they are not mutually exclusive, and both provide unique benefits. As an example use case of these databases, we created a simple analytic to run through the network history of an IP address. This analytic produces histograms of the following metrics: number of packets in and out, number of bytes in and out, source and destination port numbers, and number of connections with each neighbor. We utilize these metrics to classify IP addresses as exhibiting one or more network behaviors. We discuss methods for storing network flows, then compare and contrast Elasticsearch and Diventi in section 1.6. Then, in section 2, current and potential network flow analyses of large flow datasets are discussed. Finally, in section 3, we present the results of the analytic we developed, the performance of the analytic running in real time at SC19 (Supercomputing 2019), and the conclusions we gleaned from this information. Elasticsearch [4] and Diventi [5] are two examples of systems which properly address these concerns, yet the means by which they do so are very different. Elasticsearch is a "real-time distributed search and analytics engine" which allows for both rapid ingestion and query response by sharding data across many nodes in a cluster. Diventi, on the other hand, leverages the write optimizations of a B to the epsilon tree (Bε-tree) to keep up with data ingestion needs while utilizing the underlying B-tree structure to ensure timely queries. This allows Diventi to store a large amount of data on a single node. -Network flow database A network flow is a unidirectional stream of packets with a common source and destination. Netflow and IPFIX, two common flow export protocols, aggregate packets from this stream within a given window of time into a single flow.
Bro-conn logs [6] are not truly flows, as each log refers to a single bidirectional connection which may be composed of multiple packets to open the connection, send data, and close the connection. All the packets in this connection are aggregated, and for the purposes of this research paper we will refer to bro-conn logs as flows for the sake of simplicity. Packet aggregation inherently results in the loss of packet-specific information [7]. This was an early sacrifice made to address the extremely high volume of events created by logging every packet. However, in spite of this reduction in volume, network flow data still suffers from Big Data complexity: "Big Data is data whose complexity hinders it from being managed, queried and analyzed through traditional data storage architectures, algorithms, and query mechanisms." This complexity is defined by the data's volume - the quantity of data to be stored; variety - the system must simultaneously hold unstructured, semi-structured, and structured data; and velocity - the pace at which data is generated [8]. Figure 1 illustrates how flow data generated on an enterprise network feeds a NIDS. Thankfully, variety is not a concern, as network flows all follow a similar structure. Unfortunately, any system tasked with storing this data will still have to contend with high volume and velocity. This complexity hinders unsophisticated efforts to manage, query, and analyze the data.
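As a sketch of this aggregation step, the following hedged example (field names and records are illustrative, not the Netflow/IPFIX wire format) rolls packets sharing a 5-tuple into a single flow record, discarding per-packet detail:

```python
from collections import defaultdict

# Hedged sketch of flow aggregation (field names are illustrative, not the
# Netflow/IPFIX wire format): packets sharing a 5-tuple are rolled up into
# one flow record, keeping only aggregate totals.
def aggregate(packets):
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        flows[key]["packets"] += 1          # per-packet detail is lost here
        flows[key]["bytes"] += pkt["size"]
    return dict(flows)

pkts = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 53,
     "proto": "udp", "size": 64},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 53,
     "proto": "udp", "size": 80},
]
flows = aggregate(pkts)  # two packets collapse into one flow record
```

The volume reduction is exactly this collapse: many packets become one record, at the cost of per-packet fields.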
The fields of network flows which are commonly used as database keys include the timestamp, originating IP address, and responding IP address. This research paper is primarily concerned with queries upon the history of IP addresses and subnets; as such, the methods discussed here use IP addresses as keys. While not necessary for these applications, full-text search is certainly an attractive feature, as it allows for specific queries searching across multiple fields simultaneously.

Figure 1. The design for network intrusion detection: network flow data and stored historical data in the database answer forensic queries posed by the security analyst for network monitoring and threat inspection [9]

-Naïve methods The purpose of presenting the following insufficient storage methods is not to make an argument that distributing data or WODs are the only two solutions to this challenge, but rather to further illustrate its complexity [10]. A first attempt to match the speed of the network could be to directly write each log to a file; however, querying for an individual log would then require an O(n) search through the database. For real-time analysis and responsive queries this is unacceptably slow, especially as n grows to months or years of data. In response to this setback, we might attempt to utilize a data structure such as a hash table to allow a query for a single log to complete in O(1) time. However, this approach fails to take volume into account, as it is only effective while a large portion of the stored data can fit within RAM [11]. This limitation to RAM is a direct result of the benefits hashing the key normally provides. The avalanche effect, whereby small changes to the text create a large difference in the hash value, ensures that hashed keys are spread across the used storage space.
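A small illustration of this scattering (the bucket function here is hypothetical, not taken from any of the tools discussed) shows that even consecutive IP keys hash to widely separated bucket positions, destroying locality:

```python
import hashlib

# Hypothetical bucket function illustrating the avalanche effect:
# consecutive IP keys land in scattered bucket positions, so an on-disk
# hash table gets no spatial locality on insert.
def bucket(key: str, nbuckets: int = 2**20) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % nbuckets

positions = [bucket(f"192.168.1.{i}") for i in range(4)]
```

Although the four keys differ only in the last octet, their bucket positions will almost certainly be spread across the table, which is what forces the structure into RAM for acceptable insert rates.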
This is an important feature of hashing, allowing the hash table to reduce collisions, but it inevitably results in an increasing ratio of disk IOs per inserted log [12]. A large number of disk IOs causes rapid degradation of the insertion rate to below the pace of the network. The only solution would be to keep a majority of the data in RAM, but this limits data retention, as increasing the size of RAM becomes prohibitively expensive. NfSen and Flowscan are two common tools used to analyze flows, and both suffer from the performance limitations noted above. These tools use a Round Robin database (RRDtool) [8, 13] to store the data. As a consequence, the IP addresses are unordered and, as discussed above, must be kept within RAM in order to facilitate quick searches. RRDtool is a time series database which maintains a constant system footprint by automatically overwriting the oldest values with the newest once the maximum size of the database is reached [14]. However, this severely limits how long network administrators can store potentially important network flows. Many other potential solutions fall into one of these two traps. Elasticsearch and Diventi serve as two examples of how to properly avoid them.

-Elastic search
Elasticsearch uses an inverted index as its underlying data structure to allow data to be stored in the order it arrives while maintaining query performance. An inverted index, shown in Figure 2 [4], is a structure which lists all the unique values that appear in any document and the documents in which each value appears. The index can be imagined as a map through the unorganized data which allows queries to quickly find what they are looking for. When logging network flows, documents are individual logs, and indices include fields such as the originating IP address. Elasticsearch creates an index for each field in the log, allowing operators to query by timestamps, IP addresses, ports, and more.
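As a minimal sketch of this idea (the field names and API are illustrative, not Elasticsearch's), an inverted index maps each field value to the set of log IDs containing it:

```python
from collections import defaultdict

# Minimal sketch of an inverted index over flow logs (field names and API
# are illustrative, not Elasticsearch's): each indexed field value maps to
# the set of document IDs in which it appears.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))
        self.docs = {}

    def add(self, doc_id, log):
        self.docs[doc_id] = log
        for field, value in log.items():      # index every field of the log
            self.postings[field][value].add(doc_id)

    def query(self, field, value):
        # The postings list gives matching IDs directly; no scan of raw data.
        return [self.docs[d] for d in sorted(self.postings[field][value])]

idx = InvertedIndex()
idx.add(0, {"orig_ip": "10.0.0.1", "resp_ip": "8.8.8.8", "resp_port": 53})
idx.add(1, {"orig_ip": "10.0.0.2", "resp_ip": "8.8.8.8", "resp_port": 53})
hits = idx.query("resp_ip", "8.8.8.8")
```

Because every field is indexed, the same structure answers queries by timestamp, IP address, or port without reorganizing the stored logs.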
When performing a query, the inverted index returns a list of matching documents which can then be retrieved with minimal search cost. However, one drawback of this structure is that for queries on indices other than time, the matching documents will be scattered throughout storage. This means that a single node is unable to take advantage of spatial locality and will likely have to contend with a high number of disk IOs to access all the documents. In order to improve data ingestion, query performance, and capacity, Elasticsearch shards its data among the many nodes in a cluster. This allows insertions and queries to different shards to run in parallel and additionally allows operators to provide the system with more resources without scaling up one device. Scale-up quickly becomes prohibitively expensive, but scale-out is much more cost effective. Data integrity is also ensured by duplicating the data across nodes. If the amount of data stored on each node is not too large, the drawback noted above can become an advantage, as sharding increases the likelihood that matching data is found on many shards instead of just one.
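A hedged sketch of the sharding idea follows; the `route` and `placements` functions are hypothetical (Elasticsearch itself routes with murmur3 on the routing key), but they illustrate how hashing a document ID picks a shard and how replicas land on additional nodes:

```python
import hashlib

# Hedged sketch of hash-based shard routing (Elasticsearch itself uses
# murmur3 on the routing key; md5 here is illustrative only).
def route(doc_id: str, num_shards: int) -> int:
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical placement: the primary copy lands on one node and each
# replica on a different node, duplicating data for integrity.
def placements(doc_id: str, num_shards: int, num_nodes: int, replicas: int = 1):
    primary = route(doc_id, num_shards) % num_nodes
    return [(primary + r) % num_nodes for r in range(replicas + 1)]

nodes = placements("log-42", num_shards=5, num_nodes=3)  # primary + 1 replica
```

Because routing is deterministic, insertions and queries for different documents land on different shards and can proceed in parallel across the cluster.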

-Diventi
Write optimized data structures are designed to provide efficient write performance in exchange for a limited query performance penalty. A Bε-tree is a B-tree with an insertion buffer placed at each node. Data is inserted to the root buffer which, when filled, flushes its contents to its children, as shown in Figure 3 [15]. This structure has a few key benefits over a B-tree in exchange for paying a small time penalty on queries. The amortized cost of inserting is O(log_B N / B^(1-ε)), as opposed to a B-tree's cost of O(log_B N), where N is the number of entries in the tree and B^(1-ε) is the size of the buffer. This may seem to be a small difference in performance, but it has much larger implications for the capacity the database can maintain while keeping pace with the network data. If the system must complete X insertions in one second, then the maximum number of logs N is reached when X · (log_B N) / B^(1-ε) = 1. log_B N grows logarithmically with respect to N; therefore, the maximum value of N increases exponentially with respect to the buffer size [16].

Figure 3. Bε-tree: insertion of the red data item triggers a flush to the child nodes; data is inserted to the root buffer which, when filled, flushes its contents to the children that make up the tree [5]

As another performance benefit, writes to disk are amortized because buffers higher in the tree are held within caches and RAM. This means that only those flushes which reach the lower levels of the tree trigger blocking disk IOs, which increases the rate of ingestion. The time penalty to query performance is a consequence of the need to search through the buffer of each node visited while traversing the tree, so there is a positive correlation between buffer size and query latency. For more details, see Raizes et al. [17]. Diventi orders data first upon the source IP address and then the timestamp.
The benefit of this is that queries on IP addresses and subnets are quick. The logs which match the query will be contiguous within the database and quick to identify as a result of the B-tree structure. The system can take advantage of spatial locality when performing these queries in addition to quickly identifying the matching logs. As such, disk IO penalties during queries should be minimized. The drawback is that the system cannot efficiently query on any other field; however, the workload we are concerned with is primarily investigations into individual IP addresses or subnets. Another drawback is that, in order to retrieve logs when the queried IP address matches either the source or the responding IP address, two logs of each event must be inserted: one with normal IP ordering and the other with the ordering reversed. Diventi provides efficient use of resources, allowing a large amount of data to be stored on a single node while maintaining high performance. Multiple Diventi nodes, responsible for different network taps or for data that is split between them by another process, may be deployed if even more capacity is required.
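To make the buffered-insert mechanism concrete, the following toy two-level sketch (not Diventi's implementation) shows writes accumulating in a node's buffer and flushing in batches toward the leaves, which is what amortizes the disk IOs described above:

```python
# Toy two-level sketch of a Bε-tree insertion buffer (illustrative only,
# not Diventi's implementation): writes land in a node's buffer and are
# flushed in batches toward the leaves, amortizing disk IOs.
BUFFER_SIZE = 4  # stands in for the B^(1-ε) buffer capacity

class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.buffer = []    # pending (key, value) messages
        self.entries = {}   # materialized records (leaves only)
        self.children = []  # (upper_bound_key, child) pairs (interior only)

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_SIZE:
            self.flush()

    def flush(self):
        if self.leaf:
            self.entries.update(self.buffer)   # one batched "disk write"
        else:
            for key, value in self.buffer:     # push messages down one level
                self.child_for(key).insert(key, value)
        self.buffer = []

    def child_for(self, key):
        for upper_bound, child in self.children:
            if key < upper_bound:
                return child
        return self.children[-1][1]

# Build a root with two leaf children splitting the key space.
left, right = Node(), Node()
root = Node(leaf=False)
root.children = [("128.0.0.0", left), ("255.255.255.255", right)]
for i in range(8):
    root.insert(f"10.0.0.{i}", i)  # all keys sort below "128.0.0.0"
```

Here eight inserts trigger only two flushes at each level rather than eight separate root-to-leaf traversals, which is the amortization the Bε-tree relies on.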

METHODOLOGY
Network flows do not contain packet payload information, and do not provide packet-level granularity for fields such as the number of bytes per packet or TCP flags. Instead, flows contain aggregate totals of the number of packets, bytes, flags used in any packet, and more. Despite this loss of specifics, flows still contain sufficient information to identify network intrusions [18]. This section describes existing and potential network flow analyses to be improved or enabled by network flow databases.

Current methods and research -Blacklist approach
At the most basic level, NfSen and Flowscan (performance discussed in subsection 1.4) provide for analysis of flow statistics and traffic. They create statistical summaries and graphical displays of data in addition to providing the ability to filter results by a variety of fields. NfSen additionally provides the ability to define alerts. These alerts can act upon a filtered subset of the overall traffic, trigger on up to 6 chained conditions, and take a set of actions [19]. Similarly, the Bro-IDS provides the ability to trigger actions, such as blocking traffic and creating alerts, based upon the content of the packets collected by the tap [20]. This is often accomplished using IP blacklists and checking packet contents against common malware patterns.
-Zero-day attacks Statistical analysis and machine learning are employed for detecting and classifying a wide variety of network traffic patterns. These approaches commonly share the goals of reducing the false positives generated by systems like those discussed above and detecting zero-day attacks. There are many examples of using network flows to detect and classify actions taken by network participants. Moustafa et al. present an ensemble-based technique for detecting exploits of IoT systems, particularly botnets, using statistical summaries provided by the Bro-IDS [2]. MalClassifier, a tool developed by researchers at Oxford [21], uses the network flow behavior of malware to classify it into various malware families without requiring sandbox execution [22]. MalClassifier additionally has the ability to determine that a sample does not fit previously established malware families, allowing security operators to propose new families. Finally, Rodriguez et al. present work using time series databases to study historical patterns in order to predict future behavior and detect anomalies. They state that the more data used in the time series, the more accurate the predictions will be.

Potential application
This portion of the research paper addresses the possible uses of a fully operational Elasticsearch or Diventi database. -Lateral network movement Lateral network movement is a process by which an attacker takes advantage of access to one machine in the network to gain access to another machine. This is done for the purpose of reconnaissance to find future targets, to reach an objective, or to gain a higher level of access to the network. Detecting how a bad actor or piece of malware has moved within one's network is essential for a proper response to an intrusion. Otherwise, malware may remain within the network, continuing to cause damage after action has been taken to address the compromise. Additionally, this type of monitoring may allow security operators to detect anomalies; a chain of ssh logins, for example, may be indicative of an attack. In order to identify this movement, it is necessary to hold a large amount of network data [23]. This requirement necessitates the use of a proper network flow database. Additionally, sensors that monitor local traffic are required. The more network visibility the system is given the better: if the system has no view of the connection between computers A and B, then it cannot detect lateral movement between them. Tracking lateral movement was considered for this research paper but was ultimately forgone as a result of limited network visibility.

-Machine learning, human interaction and verification
The current work presented in subsection 2.2 is useful in detecting anomalous network behavior and zero-day attacks. However, for complicated use cases, a human security operator will likely have to interact with the machine learning algorithm to verify that the correct actions were taken or to interpret results. To facilitate this, it may become necessary for the user to look into the history of IP addresses which the algorithm has flagged. These queries by the human operator need to complete quickly and have access to a large amount of data in order to facilitate the interaction and save valuable analyst time.

IP flow analysis at supercomputing
To begin to evaluate the performance and use of a network flow database, we developed an analytic to produce flow metrics in real time at SC19 (Supercomputing 2019). Diventi indexed bro-conn logs from a tap used by the SCinet Security Team over the course of the event. At the time the analytic was run, Diventi had indexed a quarter of a billion events, corresponding to half a billion logs. What follows is a description and evaluation of the analytic.

Description
The analytic is designed with the goal of gathering basic flow statistics about an IP address. We collected the total number of connections, the number of packets in and out, the number of bytes in and out, source and destination port numbers, and neighbors. Diventi records the magnitude of the packet and byte counts by storing log2 x rather than the exact number x. For example, records with byte counts between 1 and 3 are recorded using magnitude 1, and records with byte counts from 4 to 7 are recorded with magnitude 2. The number of packets and bytes is therefore already bucketed so as to obscure small differences and highlight large ones [24]. The information was collected in real time and then processed to see what basic conclusions we could reach from the data. We classify each IP address as follows: -Active if it has more than 100 connections within our network and inactive if it has fewer than 20.
-High degree (number of neighbors) if the degree is greater than 30 and a small degree if less than 5.
-Receiver if it receives twice the number of packets it sends, and a sender if the opposite is true.
-Elephant if the average number of bytes sent or received per connection is greater than 10,000 bytes, a mouse if both are less than 1,000 but greater than 80, and a gnat if both are less than 80.
These categories were defined somewhat arbitrarily with the goal of demonstrating the ability to quickly categorize the behavior of an IP address. As discussed earlier, deep inspection into the history of IPs or subnets is the purpose of this work.
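The magnitude bucketing and classification rules above can be sketched as follows; the thresholds mirror the text, while the `magnitude` encoding and the `stats` record are assumptions consistent with the stated buckets (byte counts 1-3 → 1, 4-7 → 2):

```python
# Sketch of the magnitude bucketing and classification rules described in
# the text; the magnitude encoding is an assumption consistent with the
# stated buckets (byte counts 1-3 -> 1, 4-7 -> 2, 8-15 -> 3, ...).
def magnitude(x: int) -> int:
    return max(1, x.bit_length() - 1)

def classify(stats: dict) -> list:
    labels = []
    if stats["connections"] > 100:
        labels.append("active")
    elif stats["connections"] < 20:
        labels.append("inactive")
    if stats["neighbors"] > 30:
        labels.append("high degree")
    elif stats["neighbors"] < 5:
        labels.append("small degree")
    if stats["packets_in"] >= 2 * stats["packets_out"]:
        labels.append("receiver")
    elif stats["packets_out"] >= 2 * stats["packets_in"]:
        labels.append("sender")
    # Average bytes per connection, each direction (assumes connections > 0).
    avg_in = stats["bytes_in"] / stats["connections"]
    avg_out = stats["bytes_out"] / stats["connections"]
    if max(avg_in, avg_out) > 10_000:
        labels.append("elephant")
    elif 80 < avg_in < 1_000 and 80 < avg_out < 1_000:
        labels.append("mouse")
    elif avg_in < 80 and avg_out < 80:
        labels.append("gnat")
    return labels

# Hypothetical stats record for illustration: 603 connections, 4 neighbors,
# balanced packet counts, roughly 500 bytes per connection each way.
stats = {"connections": 603, "neighbors": 4, "packets_in": 100,
         "packets_out": 100, "bytes_in": 603 * 500, "bytes_out": 603 * 500}
labels = classify(stats)  # active, small degree, mouse
```

An IP with these hypothetical statistics would be classified as active, of small degree, neither a sender nor a receiver, and a mouse.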

Query performance
We first discovered that the performance of queries across a range of logs remained fairly constant even as the size of the database grew much larger. We posit that this is because the majority of a query workload is composed of lateral scans through the leaves of the tree; therefore, increases in the smaller cost to traverse down the tree and find the first matching key are relatively insignificant, especially as the cost of traversal increases only with the log of the number of records. We show this trend in Table 1 [9], which reports server-side latency averaged over three queries.
One million logs were queried at intervals as the database grew from 1 million to 1 billion stored logs. Query latency increased by 20% while the size of the database increased by nearly 1000% over the same period, showing the relatively flat latency. In order to ensure uniform results, the query time is an average of three queries; the query processing and optimization pipeline is shown in Figure 4 [12]. The server was shut down between each query to prevent the results from being cached. A Dell PowerEdge R520 with 165 GB of RAM and 32 cores was used for this test; however, the size of the RAM was not a significant factor, as Diventi does not preemptively load data from the underlying storage into its cache.
Table 2 [11] shows the performance of the metrics analytic running upon Diventi at Supercomputing, measured on the client's end. The Unix utility time was used to measure the total amount of time required to create the metrics on the client's end. To establish the performance and generate data, we queried a single IP address, a 255.255.255.0 subnet mask, a 255.255.0.0 subnet mask, and the entire database [25]. Real refers to the total time from the start of the program to completion, User to the amount of time the process was executing on the CPU (central processing unit), and System to the amount of time the process was executing system calls. We see an increasing amount of time spent off the CPU in Table 2, likely because of the increasing size and complexity of the data stored on the client end, causing IO blocking when performing analysis. The analytic was able to very quickly establish the statistical history of an IP address or subnet by taking advantage of the Bε-tree's structure. This provides evidence that network flow databases will allow complex analysis to complete rapidly.

Figure 4. Query processing and optimization with a high-level language: a query passes through three levels of optimization, i.e., the parser and translator, the query optimizer, and the query evaluation engine, to generate real-time results of the optimized query from the data in the database [12]

RESULTS
Figures 5-9 show graphical representations of the metrics collected for a single IP address. This IP had 603 connections; the source port was scattered, but the destination port was always 13568. From this information, the IP address is classified as active, of small degree, neither a sender nor a receiver, and a mouse. This IP address was likely receiving and sending data to a small number of other IP addresses from a process running on port 13568. The neighbor histogram indicates that the behavior of this IP address is most dependent upon neighbor 4.
At the time the analytic was run, 1,480,024 IP addresses were stored within the database. Based upon the statistical summary returned when the analytic was run across the entire database, each IP address was matched against the classifications described in subsection 2.4. The number of IP addresses that match each category is shown in Table 3, along with the percentage of the total IPs which matched. Using this table, it is observed that a vast majority of IPs were classified as inactive, small degree, senders, or gnats. Based upon this information, we posit that a vast amount of the traffic collected was composed of simple interactions which did not require much data to be transferred back to the receiver; therefore, most packets were sent by the originator. Some examples of this type of traffic include DNS (domain name service) lookups, ICMP (internet control message protocol) messages, and SYN scanning.

DISCUSSION
First, it should be noted that the analytic provides no ability to determine the portion of IPs which matched multiple categories. Each set of classifications (inactive and active, small degree and high degree, etc.) is calculated in isolation. A simple improvement to the analytic would be to add this capability, allowing the user to zero in on particularly rare behaviors. Diventi was granted only limited access to the Supercomputing network. As a consequence of this limited network visibility, some results may be incomplete. It is important that any organization seeking to employ network monitoring carefully consider the implications of the visibility provided by their network taps.

CONCLUSIONS
Using databases designed for the big data challenges associated with logging and querying network flows is necessary in order to provide network operators with a larger, more efficient window into the network traffic of both past and present. It is imperative for the development of automated analytics and for effective post-intrusion investigations that these methods are adopted when logging network flows. This research paper shows that Diventi was able to ingest over a billion logs while providing rapid query responses with relatively flat latency. These queries are capable of quickly collecting information regarding IP addresses in order to classify them into various categories. Running this analysis at Supercomputing 2019 revealed that a majority of the traffic collected was composed of simple interactions such as DNS (domain name service) lookups, ICMP (internet control message protocol) messages, and SYN scanning. In conjunction with machine learning and other cutting-edge techniques, these databases allow security personnel to use their time to efficiently identify threats and respond to alerts instead of waiting for information.