A Survey of Recent Prefetching Techniques for Processor Caches

As the trends of process scaling make memory systems an even more crucial bottleneck, the importance of latency-hiding techniques such as prefetching grows further. However, naively using prefetching can harm performance and energy efficiency and, hence, several factors and parameters need to be taken into account to fully realize its potential. In this article, we survey several recent techniques that aim to improve the implementation and effectiveness of prefetching. We characterize the techniques on several parameters to highlight their similarities and differences. The aim of this survey is to provide researchers with insights into the working of prefetching techniques and to spark interesting future work for improving the performance advantages of prefetching even further.


INTRODUCTION
As the on-chip core count increases at a much faster rate than the memory bandwidth, the memory system becomes an increasingly crucial bottleneck in modern processor design. This has forced researchers to pursue aggressive approaches to hide memory latency, for example, the use of large caches, multithreading (MT), and prefetching. Of these, prefetching offers unique advantages. Large caches incur an energy penalty and consume precious chip area that may be better used for additional cores [Mittal 2014]. MT can improve the performance of parallel applications only. By comparison, prefetching does not incur a large area/energy penalty and can boost even serial applications. In the optimal case, prefetching can bring the performance close to that of a perfect cache by removing nearly all the cache misses [Ferdman et al. 2011; Annavaram et al. 2001b]. Thus, due to its advantages, prefetching is now used in nearly all high-performance commercial processors, such as AMD Opteron, IBM Power8, Intel Xeon, and Oracle SPARC M7. However, realizing the full potential of prefetching requires careful management and addressing several key challenges. Unlike MT, prefetching requires prediction of future access patterns, which is non-trivial in most cases. Complex access patterns demand sophisticated prefetchers that have huge metadata and latency overheads. Naive prefetchers may bring in useless lines that consume cache space and may degrade performance by displacing useful lines [Srinath et al. 2007]. Further, prefetching may interfere with other processor management policies (e.g., the cache replacement policy) and cause bandwidth (BW) contention. Clearly, although promising, prefetching by itself is not a panacea for improving performance. To address these challenges, several techniques have been recently proposed.
Contributions: In this article, we present a survey of prefetching techniques for processor caches. Figure 1 shows the organization of this article. Section 2 provides a background on and classification of prefetching techniques and then discusses the key challenges related to implementation and effectiveness of prefetching. Section 3 discusses techniques for hardware, data, and core-side prefetching, and Section 4 discusses techniques for software, instruction, and memory-side prefetching. Section 5 discusses techniques evaluated using real systems and analytical models. Section 6 presents several techniques for reducing overhead of prefetchers and improving their effectiveness. Section 7 concludes this article with an outlook towards future work.
Scope of the article: To strike a balance between breadth and brevity, we focus on recent research works that present innovations or insights focused on prefetching in caches and not on other techniques or processor components. We only discuss prefetching in central processing unit (CPU) and not in graphics processing unit (GPU). We present key ideas of research works and do not include their quantitative results, since they use different evaluation platforms and methodologies. To bring out the similarities

[Figure panels: "Original code"; "Code with software prefetching instructions"]

Data and Instruction Prefetching
Compared to instruction access patterns, data access patterns show higher sensitivity to the input dataset and less regularity, which makes data prefetching more challenging. The large instruction working set size of commercial workloads can lead to misses at the L1 and L2 caches, which underscores the need for instruction prefetching for them. However, for applications with a negligible I-cache miss rate (e.g., scientific applications), instruction prefetching is not required.

Core-Side and Memory-Side Prefetching
In core-side (or processor-side) prefetching, the prefetch requests are issued by an engine in the cache hierarchy, while in memory-side prefetching, such an engine resides in the main memory subsystem (after any memory bus). Memory-side prefetching can save precious chip space by storing metadata off-chip and can also perform optimizations at the main memory side [Yedlapalli et al. 2013]. By comparison, core-side prefetching can exploit more accurate knowledge of memory reference patterns and can perform cache-level optimizations, such as avoiding cache pollution [Srinath et al. 2007].

A Classification Based on Pattern and Complexity
Prefetchers can also be classified based on the (ir)regularity or complexity of the miss/access pattern they target. The "Next-K" line prefetcher brings the next K lines after the current miss. The stride prefetcher brings lines showing a strided pattern relative to the current miss [Chen and Baer 1995] (refer to Figure 2(b)). For example, if the past sequence of addresses accessed by loads has been A, A+Q, A+2Q, and A+3Q, then the data at address A+3Q+Q can be prefetched, since this sequence has a stride of Q. For Q = 1, this is referred to as stream prefetching. For many applications, however, the access patterns are not perfectly strided, and these are termed irregular patterns (refer to Figure 2(c)). Correlation prefetching tracks the past reference sequence or miss addresses to detect some correlation and uses it to guess future miss addresses, which are then prefetched. Spatial prefetchers assume spatial locality and, thus, bring lines in the vicinity of a current miss [Somogyi et al. 2006]. Temporal prefetchers assume that recently seen address streams are expected to recur and, hence, they prefetch based on temporal streams from recent miss history [Wenisch et al. 2005].
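As a minimal illustration of the Next-K and stride patterns described above, the following Python sketch detects a constant stride Q in a short access history and proposes the next address(es). The function names, the two-equal-deltas rule, and the `degree` parameter are assumptions made for this example; they are not taken from any of the surveyed designs.

```python
from typing import List, Optional


def next_k_lines(miss_line: int, k: int) -> List[int]:
    """Next-K prefetching: the K cache lines that follow the missing line."""
    return [miss_line + i for i in range(1, k + 1)]


def detect_stride(history: List[int], min_repeats: int = 2) -> Optional[int]:
    """Return the stride Q if the last `min_repeats` deltas are equal and non-zero."""
    if len(history) < min_repeats + 1:
        return None
    recent = history[-(min_repeats + 1):]
    deltas = {b - a for a, b in zip(recent, recent[1:])}
    if len(deltas) == 1:
        stride = deltas.pop()
        return stride if stride != 0 else None
    return None


def stride_prefetch(history: List[int], degree: int = 1) -> List[int]:
    """Addresses to prefetch after the latest access, or [] if no stride is established."""
    stride = detect_stride(history)
    if stride is None:
        return []
    return [history[-1] + i * stride for i in range(1, degree + 1)]


if __name__ == "__main__":
    A, Q = 0x1000, 64
    regular = [A, A + Q, A + 2 * Q, A + 3 * Q]
    irregular = [A, A + Q, A + 5 * Q]
    print([hex(a) for a in stride_prefetch(regular)])    # candidate is A + 3Q + Q
    print(stride_prefetch(irregular))                    # no stable stride, no prefetch
    print(next_k_lines(10, 3))
```

With Q = 1 (in units of cache lines), the same logic behaves as a simple stream prefetcher; real stride prefetchers typically index such state by the PC of the load, which this sketch omits.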

A Classification Based on Objective and Cache Level
Before moving to the detailed discussion and classification of prefetching techniques in Sections 3 through 6, we first classify the prefetching techniques based on their optimization goal. Table I shows this classification and, from this, it is clear that prefetching can provide versatile optimizations. The first- and last-level caches have different properties (e.g., locality of the access stream, characteristics such as private/shared, acceptable latency/storage overhead, etc.), which dictate the choice of prefetching technique/parameters for them (e.g., Mehta et al. [2014]). For this reason, Table I also classifies the works based on the cache level where prefetching is used.

Challenges in Prefetching
Several challenges need to be addressed to realize the full potential of prefetching.
2.7.1. Implementation Overheads. While simple prefetchers (e.g., next-line prefetching) have limited coverage and accuracy, sophisticated prefetchers require a large amount of metadata (e.g., tens of MBs [Lai et al. 2001; Ferdman and Falsafi 2007]). Storing the metadata off-chip requires frequent and costly communication of data on and off chip, while storing it on-chip is possible only for small structures, and even these consume precious chip resources. Also, passing information about the miss address, program counter (PC), and so on, to prefetchers (especially those of lower-level caches) introduces nontrivial changes (such as wire routing) to the chip design. To avoid redundant prefetches, the cache needs to be probed, which requires an extra port or sequential checking [Reinman et al. 1999].
2.7.2. Performance Tradeoffs. Due to the features of modern processors, such as out-of-order execution, the reduction in misses brought by prefetching may not directly translate into performance improvement. Issuing prefetches in a timely manner requires estimating cache miss latencies and other timing information [Srinath et al. 2007; Marathe and Mueller 2008; Zhu et al. 2010], and this is especially challenging for software (SW)-based prefetchers.
2.7.3. Cache Pollution and Resource Contention. Prefetching can be seen as a complementary approach to bypassing, and this is illustrated in Figure 4. As the volume of data fetched into the cache increases, the hit rate increases due to better storage utilization; however, thrashing begins as soon as the working set exceeds the cache capacity. Prefetched blocks start evicting useful demand-fetched blocks or those brought into a shared cache by the prefetchers of other cores. The resultant cache pollution generates further misses, which may trigger more prefetches. With increasing core count, inter-core interference escalates and, due to reduced per-core BW, contention from prefetch requests also increases.
Some techniques place prefetched blocks in an additional buffer (Section 6.1). However, accessing them sequentially or in parallel with cache causes energy/latency overhead. Also, they require reorganization of chip architecture and preclude the possibility of cache space sharing between demand-fetched and prefetched data. It is clear that achieving the optimum balance (peak point in Figure 4) requires a careful moderation of prefetching parameters and aggressiveness.
2.7.4. Reliability Challenges. Prefetching techniques can increase the soft error rate by increasing the residency time of data in the cache. Also, by inducing extra writes, they can cause hard errors and reduce device lifetime in nonvolatile memories that have limited write endurance.
The techniques discussed in subsequent sections seek to address these challenges.

HARDWARE PREFETCHING, DATA PREFETCHING, AND CORE-SIDE PREFETCHING
In this section, we discuss hardware (HW), data, and core-side prefetching techniques, which form the most prominent prefetching approaches. In this and the following sections, we discuss prefetching techniques by roughly organizing them into several groups. Although some of the techniques belong to multiple groups, we discuss each of them under a single group only.
Jouppi [1990] presents a prefetching scheme using stream buffers. On a cache miss, sequential cache blocks are prefetched into a separate stream buffer until it is filled. Thus, the prefetched blocks are not placed into the L1 cache, which avoids polluting it. The stream buffer is organized as a first-in first-out (FIFO) buffer. The next time the L1 cache sees a miss, the first entry of the stream buffer is checked and, on a hit, the block is brought into the L1 cache. As prefetched data are used, more prefetches are issued, which keeps the buffer sufficiently ahead of the instruction stream of the processor so that the entire latency can be hidden. Jouppi also explored the use of multiple stream buffers in parallel that can prefetch multiple intertwined reference streams. This is useful in several cases, for example, when applications access multiple arrays inside a loop.
Joseph and Grunwald [1997] present a prefetching approach based on the assumption that the miss address stream can be approximated by a Markov transition diagram. In this diagram, a weight is given to every transition from node P to node Q, which shows the fraction of accesses to P after which the next access is to Q. Assuming that the execution pattern is repetitive, this Markov diagram can be used to predict the miss address that follows the current miss address. Since programs show time-varying behavior and the degree of a node may become very large, constructing and storing a full Markov diagram is infeasible. To address this issue, they limit the number of nodes in the transition diagram along with their out-degrees. This allows the diagram to be stored in a table. When a miss address matches an entry in the table, the predicted next addresses can be prefetched. Higher priority is given to those "next addresses" that have a higher probability of transition from the current address. If the prefetch request queue is full, then low-priority requests are discarded.
Sherwood et al. [2000] propose a scheme where the stream buffer follows the stream predicted by an address predictor, instead of a fixed-stride predictor. This allows the use of different predictors that can prefetch more effectively than the fixed-stride predictor. They demonstrate the use of a stride-filtered Markov predictor that uses a two-delta stride table before the Markov predictor table (a two-delta stride predictor is one where a new stride can replace the predicted stride only after it has been observed twice consecutively). In the write-back stage, the PC of a missed load is used to index the stride table. If the stride computed as (present miss address - last address) differs from the two-delta stride or the last stride, then the address cannot be predicted by the stride predictor and, hence, it is stored in the Markov predictor. The Markov table is queried using the last miss address to find the next prefetch address. In case of a hit, the Markov address is used for prefetching and, in case of a miss, the next prefetch address is computed from the last address using the stride. They show that their predictor allows accurate prefetching for pointer-based and complex array-access-based applications.
analyze load miss streams to obtain insights that can improve prefetcher effectiveness. Based on the miss access patterns, they classify the load miss streams into four categories, viz. stride, next-line, same-object, and pointer-based miss streams. Same-object misses are further misses occurring on a heap object that has been accessed recently. These misses can be avoided by prefetching the entire object or, minimally, the blocks of the object to be accessed in the near future. Pointer-based misses happen when a pointer is dereferenced to access an object. Since these misses are the most challenging to eliminate using prefetching, they discuss two metrics to analyze them and design a strategy for alleviating them. "Pointer variability" shows how many pointer transitions change frequently and how many are stable (a load is called a pointer transition if it loads a pointer). Pointer variability quantifies the number of times a pointer transition for an address does not load the same pointer as the one loaded previously from that address. "Object fan out" shows how many pointers are transitioned and frequently miss in the cache. Thus, programs with high variability and fan out are more difficult to prefetch than those with small variability and fan out. They also show that the classification of misses can be accurately done in HW and, based on this classification, prefetching can be efficiently performed.

Stream and Stride Prefetching
Iacobovici et al. [2004] note that unit-stride and single non-unit-stride prefetching techniques do not work for patterns that frequently appear in scientific applications. They present a multi-stride prefetcher that can detect and prefetch for a stream consisting of a maximum of two steady-state and two transitional strides, that is, four distinct strides. An example of a multi-stride miss address stream is A, A+1, A+2, A+4, A+5, A+6, A+8, ..., which is composed of two stride components, viz. a unit stride that appears twice and a stride of two that occurs once. In the training phase, the prefetcher detects the stride patterns. If successive misses in a stream show an identical stride, then the prefetcher takes them as belonging to a single stride stream, and the prefetcher remains in state 1. If a miss displays a different stride, then the prefetcher transitions to state 2. The stride from this miss sets the transitional stride (stride12). A future miss that deviates from the second stride switches the state back to state 1, and this stride sets the transitional stride (stride21). After training, if, for a missing load, the recorded stride is the same as the predicted one, then a prefetch is triggered.
Zhu et al. [2010] note that, in a stream, the timing of data accesses is predictable; for example, in a constant-stride prefetcher, adjacent accesses in a stream are expected to occur at nearly equal time intervals. Based on this, their prefetching technique stores both the addresses referenced and the timing information to predict when a miss will happen, in contrast with conventional stream-based prefetchers that store only the addresses. The timing information is measured in terms of the number of misses. Their technique classifies miss addresses into different streams based on whether they are from the same memory region or the same instruction. By virtue of avoiding untimely prefetches, their technique mitigates cache pollution and memory BW wastage.
present a technique that aims to identify all potential stride streams, including those detected by PC-based and delta-correlation-based prefetchers. Their technique checks whether the last miss and a previous miss form a stream in which a fixed stride separates more than two miss addresses. Since the last miss may be part of different streams having dissimilar strides, tracking multiple streams is essential for choosing the best stream among them. In such a case, their technique chooses the longest stream, since it is likely to cover a larger number of misses and be more accurate than a short stream. This addresses the issue of overlapping streams. On detecting a stream, future accesses in that stream can be identified. The number of streams that can be tracked is limited by the available storage space. Still, since their technique tracks multiple streams, it can continue prefetching by skipping a few of them.
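The idea of pairing the last miss with earlier misses and keeping the longest fixed-stride stream, as described for the last technique above, can be sketched as follows. This is only an illustrative approximation under stated assumptions: the function names, the unbounded miss history, and the fact that the order of earlier misses is ignored are simplifications chosen here, not details of the surveyed design.

```python
from typing import List, Optional, Tuple


def longest_stride_stream(misses: List[int], min_len: int = 3) -> Optional[Tuple[int, int]]:
    """Return (stride, length) of the longest fixed-stride stream ending at the last miss.

    Every earlier miss is paired with the last miss to propose a candidate stride.
    The candidate stream is then extended backwards for as long as the expected
    addresses are present in the recorded miss history.
    """
    if len(misses) < min_len:
        return None
    last = misses[-1]
    seen = set(misses[:-1])
    best: Optional[Tuple[int, int]] = None
    for prev in seen:
        stride = last - prev
        if stride == 0:
            continue
        length = 2                      # `prev` and `last` are the first two members
        addr = prev - stride
        while addr in seen:
            length += 1
            addr -= stride
        if length >= min_len and (best is None or length > best[1]):
            best = (stride, length)
    return best


def predict_next(misses: List[int], degree: int = 2) -> List[int]:
    """Prefetch candidates that continue the longest stream through the last miss."""
    found = longest_stride_stream(misses)
    if found is None:
        return []
    stride, _ = found
    return [misses[-1] + i * stride for i in range(1, degree + 1)]


if __name__ == "__main__":
    # A stride-4 stream (100, 104, ..., 116) mixed with unrelated misses.
    # The stride-8 stream 100, 108, 116 also ends at the last miss but is shorter.
    history = [100, 500, 104, 108, 7, 112, 116]
    print(longest_stride_stream(history))
    print(predict_next(history))
```

In the example, two overlapping candidate streams end at the last miss, and the longer one determines the prefetch addresses, which mirrors the preference for long streams stated in the text.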

Correlation Prefetching
As shown in Table II, many works propose correlation prefetching (refer to Section 2.5) techniques. We now discuss a few of them.

Table II (rows recovered from the source; the label of the first row is missing)
[Ferdman et al. 2008; Solihin et al. 2003; Lai et al. 2001; Hu et al. 2003; Ferdman and Falsafi 2007; Chou 2007; Diaz and Cintra 2009; Manikantan et al. 2011; Dang et al. 2012; Liu et al. 2012; Jain and Lin 2013; Somogyi et al. 2006; Burcea et al. 2008; Somogyi et al. 2009; Srinath et al. 2007; Ferdman et al. 2011; Huang et al. 2012; Roth et al. 1998]
Tag-based correlation prefetching: [Hu et al. 2003; Sharma et al. 2005]
Pre-execution based prefetching: [Zilles and Sohi 2001; Burtscher 2005, 2006; Zhang et al. 2007; Huang et al. 2012; Annavaram et al. 2001b; Collins et al. 2001b; Lu et al. 2005; Luk 2001; Balasubramonian et al. 2001; Collins et al. 2002; Rabbah et al. 2004; Aamodt et al. 2002]

Lai et al. [2001] propose a dead-block prediction-based prefetching technique. Their technique records the memory reference trace to estimate when an L1-D cache block sees its last access. From this time until a cache miss replaces the block, the block is dead [Mittal 2014]. They note that dead times are typically large and exceed the time required for fetching data from the next level of cache. Their technique also uses address correlation (CoR) to predict the block that will be referenced soon and prefetches it to eliminate the miss and improve performance. This block can be stored in place of the dead block in the L1 cache itself. They show that their technique provides timely and effective data prefetching. However, their technique requires storage proportional to the application working set size. CoR data need to be stored across long recurring application phases and, since the information about the last reference is computed on each L1 access, it needs to be stored on-chip to achieve high BW. Hence, to provide reasonable coverage, their technique may require impractically large storage space [Ferdman and Falsafi 2007].
note that conventional prefetch methods store miss address streams in a table, which provides fast lookup, but the table reserves a fixed space, and the entries in a table quickly get stale, which may trigger useless prefetches. They present an alternative organization for storing prefetch history. Their method decouples the matching of the prefetch key from the storage of the history required for prefetching. First, the prefetch key is used to access the "index table" to obtain a pointer into a global history buffer (GHB). In every GHB record, a global miss address and a link pointer are stored. Using link pointers, GHB entries are chained into address lists, each of which is a chronological list of addresses with an identical "index table" prefetch key. By using different keys, different history-based prefetching techniques are realized; for example, stride prefetching can use the PC of the load instruction, while Markov prefetching [Joseph and Grunwald 1997] can use PC-independent global miss addresses. The GHB can be sized depending on the length of the history required to be tracked, which leads to better storage efficiency than conventional tables. The GHB is organized as a FIFO buffer and, thus, stale entries are automatically removed from it. Thus, accurate reconstruction of the access pattern allows them to implement sophisticated prefetch techniques that exploit complex access patterns.
Chou [2007] notes that as off-chip latency increases, the latency of on-chip computation that separates overlapping off-chip accesses becomes negligible, and very little useful work can be performed during this time. Thus, application execution can be logically divided into "epochs," such that each epoch has on-chip computation periods and off-chip accesses. Eliminating the off-chip accesses of an epoch removes the epoch, and decreasing the number of epochs translates into performance improvement. Conventional CoR prefetching techniques avoid single misses, and epochs are eliminated only as a secondary effect. However, avoiding single misses may not remove an entire epoch and, thus, may not improve performance. Instead of eliminating individual misses, the proposed technique prefetches all the off-chip misses in an epoch to remove it entirely. The technique does not attempt to eliminate cache misses that overlap with another miss that triggered the epoch, and this helps in reducing the size of the CoR table. The table of the correlating prefetcher is stored in main memory and its access latency is hidden by exploiting memory-level parallelism, that is, the table is accessed during the time an off-chip access stalls the processing core. Thus, without wasting on-chip space, prefetches can still be issued in a timely manner.
Liu et al. [2012] present a miss CoR-based prefetching technique. In their technique, CoR between a miss and a previous miss is ascertained when they happen closely in space and time, where space CoR means that these misses lie within a specified address range. After dynamically capturing these miss correlations, their technique uses compression to save them along with the data block content. Thus, along with the demand data, prefetch metadata are brought in with minimal overhead. This allows a very large CoR history. Compression is used to save the metadata of each block within the original block size in dynamic random access memory (DRAM). Based on the miss correlations, accurate prefetches can be issued for improving performance.
Roth et al. [1998] present a dependence-based prefetching technique that works by identifying a program kernel that computes the addresses of linked data structure (LDS) elements. Assuming that, in the near future, the program will follow similar steps to traverse the structure, a prefetching engine speculatively executes this kernel together with the main program. As an address is loaded, the loads consuming that address are predicted, and prefetches for those loads are immediately issued; in this way, dependence information is utilized. By virtue of executing only those loads required for traversing the LDS, the prefetching engine can run far ahead of the main program and thus perform prefetches in a timely manner. Using this, their technique can cover long LDS access latencies.
Wenisch et al. [2005] present a temporal memory streaming (MS) technique to eliminate coherent read misses in shared-memory multiprocessors. They note that shared addresses are likely to be accessed together and in the same sequence. Also, recently accessed address streams tend to recur. Such temporal correlation is found in accesses to general data structures such as arrays and LDSs (e.g., trees and lists). By comparison, spatial or stride locality, found only in array-based data structures, relies on the memory layout of the data structure. Based on temporal correlation, they extract temporal streams from the miss history of recent sharers and move data to a subsequent sharer before the data are requested.
Somogyi et al. [2006] note that in commercial workloads, memory accesses show repetitive layouts spanning large (e.g., several kilobytes) memory regions. Also, the repetitive pattern of these accesses can be predicted by code-based correlation. Since these patterns may be non-contiguous, using a larger cache block size to capture such spatial correlation wastes bandwidth. They present a technique that detects code-correlated spatial access patterns and brings such blocks into the cache before demand misses. On the first reference to a spatial region, their technique predicts the cache blocks that will be referenced in that region during a monitoring interval. The monitoring interval is the time from the first access to the region until any block accessed in the interval is invalidated or evicted from the cache. By virtue of exploiting the correlation between code and access patterns, their technique achieves much higher prediction coverage than address-based predictors, since there are far fewer distinct code sequences than data addresses.
Somogyi et al. [2009] propose spatio-temporal memory streaming (STeMS) to synergistically integrate spatial and temporal streaming. Temporal MS tracks past miss sequences to predict subsequent chains of dependent misses, while spatial MS tracks recurring data layout patterns in memory regions of fixed size to predict future misses. While temporal MS fails to predict compulsory misses and achieves low accuracy due to its inability to detect where streams terminate, spatial MS cannot establish an order between predictions and is also limited by its use of fixed-size regions. Noting that the spatial access sequence recurs in a single region and across regions, STeMS tracks the temporal sequence of region accesses. Also, the spatial relationships in every region are used to predict the complete miss sequence. Based on it, cache blocks are prefetched to the requesting processor. Unlike a naive combination of spatial and temporal MS, STeMS avoids interference between the predictors and, thus, achieves higher prefetch accuracy. They also show that STeMS achieves comparable or better prefetch coverage and performance than using either spatial or temporal MS alone.
Panda and Balachandran [2014] note that in parallel applications, data and code are shared and communicated between cores. Also, demand misses seen in a core repeat in other cores after a large time interval (e.g., an average of tens of thousands of cycles). These miss streams, referred to as cross-core miss streams, cannot be eliminated by core-local stream prefetchers. They propose a cross-core spatial streaming technique in which the cross-core spatial streams and the cores involved in them are detected. Then, the spatial streams from a private mid-level cache (MLC) prefetcher are transmitted to the MLC prefetchers of the associated cores well in advance to allow them to prefetch data and eliminate cross-core misses for improving performance.
Cantin et al. [2006] present a technique, called stealth prefetching, for broadcast-based shared-memory multiprocessor systems. In such systems, memory latency values are high and early prefetches may cause state downgrades or invalidations in remote nodes. They define a region to encompass a power-of-two number of cache lines and identify non-shared regions using a coarse-grained coherence tracking scheme [Cantin et al. 2006]. They note that the majority of memory accesses happen in memory regions that are not shared when the access happens, and the majority of lines in such regions are accessed. Based on this, when the number of lines accessed in a region exceeds a threshold, their technique prefetches a certain number of lines in the region from DRAM and dispatches them to the requesting processor. To improve prefetching accuracy, lines in a region that were previously accessed are tracked. Prefetched lines are stored with a no-permission coherence state and are not kept individually coherent. If another processor obtains exclusive access to the region or sends a memory request that makes the prefetched lines stale, then the prefetched lines in the original processor are invalidated. The prefetch requests are not broadcast to other processors, which can still get exclusive copies of lines and, thus, the prefetching is stealthy. Also, since multiple lines can be prefetched in a single request, the prefetching is aggressive and efficient.
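The region-threshold behavior of stealth prefetching described above can be illustrated with a small Python model. It is a deliberately simplified sketch: the class name, the parameter values, the `region_shared` flag (standing in for the coarse-grained coherence tracker), and the choice to prefetch the lowest-numbered untouched lines are all assumptions made here, and the tracking of previously accessed lines is omitted.

```python
from typing import Dict, List, Set


class StealthPrefetchSketch:
    """Toy model: prefetch several lines of a non-shared region past a threshold."""

    def __init__(self, lines_per_region: int = 8, threshold: int = 2,
                 max_prefetch: int = 4) -> None:
        # Regions contain a power-of-two number of cache lines.
        assert lines_per_region > 0 and lines_per_region & (lines_per_region - 1) == 0
        self.lines_per_region = lines_per_region
        self.threshold = threshold
        self.max_prefetch = max_prefetch
        self.accessed: Dict[int, Set[int]] = {}    # region -> demand-accessed lines
        self.prefetched: Dict[int, Set[int]] = {}  # region -> lines held without permission

    def access(self, line: int, region_shared: bool = False) -> List[int]:
        """Record a demand access; return the lines fetched by one prefetch request."""
        region = line // self.lines_per_region
        accessed = self.accessed.setdefault(region, set())
        accessed.add(line)
        if region_shared or len(accessed) <= self.threshold:
            return []
        held = self.prefetched.setdefault(region, set())
        base = region * self.lines_per_region
        candidates = [l for l in range(base, base + self.lines_per_region)
                      if l not in accessed and l not in held]
        batch = candidates[:self.max_prefetch]     # several lines in a single request
        held.update(batch)
        return batch

    def remote_exclusive(self, region: int) -> List[int]:
        """Another processor gained exclusive access: invalidate the prefetched lines."""
        return sorted(self.prefetched.pop(region, set()))


if __name__ == "__main__":
    sp = StealthPrefetchSketch()
    for line in (16, 17, 19):
        print(line, "->", sp.access(line))
    print("invalidated:", sp.remote_exclusive(2))
```

Because the prefetched lines are modeled as a per-region set that is dropped as a whole, the sketch reflects the point that such lines are not kept individually coherent.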

Approaches for Improving Prefetcher Training
The training stream presented to a prefetcher decides the "repeating pattern" observed by the prefetcher and, hence, it has a significant impact on the efficacy of the prefetcher. We now discuss a few techniques to improve prefetcher training (e.g., Ferdman et al. [2011], Manikantan et al. [2011], Jain and Lin [2013], and Guttman et al. [2015]).
Manikantan et al. [2011] study extending the training stream stored by CoR prefetchers to improve their performance. They denote a primary miss as one that initiates a request to the next level of cache/memory and a secondary miss as a request to a block for which the data requested by a primary miss have not yet arrived in the cache. They note that presenting only the primary miss address stream to train a prefetcher precludes the opportunity of exploiting the information provided by secondary misses and cache hits. They propose including secondary misses and cache hits in the training stream to improve the regularity seen by the prefetcher. The improvement in regularity is confirmed by a reduction in the measured entropy. While other techniques trigger a prefetch only on a primary miss, they suggest triggering prefetches on secondary misses as well to improve the performance of prefetchers. While requiring minimal HW modifications, their technique reduces cache misses.
Jain and Lin [2013] present a prefetcher, called an irregular stream buffer, for targeting irregular streams of memory accesses that are temporally correlated. Their technique translates groups of correlated physical addresses into contiguous addresses in a new address space by using an extra level of indirection. Based on this, their technique organizes prefetching metadata such that it is simultaneously spatially and temporally ordered. This reduces the problem of irregular stream prefetching to sequential prefetching in the new address space. This remapping also improves accuracy and coverage since, based on the PC of the loading instruction, the prefetcher input can be segregated into several streams. Further, storing most of its metadata on-chip allows the use of the last-level cache (LLC) access stream (and not the LLC miss stream) to train the prefetcher, which leads to a significant improvement in reference stream predictability.
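The address-remapping idea behind the irregular stream buffer can be sketched in a few lines of Python. The sketch below is an assumption-laden simplification rather than the published design: the class and attribute names, the chunk size, and the rule that an address is mapped only once (with no confidence counters or remapping) are choices made here for brevity.

```python
from typing import Dict, List

STREAM_CHUNK = 16  # structural addresses reserved per new stream (assumed value)


class IrregularStreamBufferSketch:
    """Maps temporally correlated physical addresses to consecutive structural ones."""

    def __init__(self) -> None:
        self.phys_to_struct: Dict[int, int] = {}
        self.struct_to_phys: Dict[int, int] = {}
        self.last_phys_by_pc: Dict[int, int] = {}   # per-PC training state
        self.next_free = 0

    def _map(self, phys: int, struct: int) -> None:
        self.phys_to_struct[phys] = struct
        self.struct_to_phys[struct] = phys

    def access(self, pc: int, phys: int, degree: int = 1) -> List[int]:
        """Train on one (pc, physical address) pair and return prefetch candidates."""
        prev = self.last_phys_by_pc.get(pc)
        self.last_phys_by_pc[pc] = phys
        if prev is not None and phys not in self.phys_to_struct:
            if prev not in self.phys_to_struct:
                self._map(prev, self.next_free)          # start a new stream
                self.next_free += STREAM_CHUNK
            slot = self.phys_to_struct[prev] + 1
            # Place `phys` right after `prev` unless the slot is taken or
            # would spill into the chunk of another stream.
            if slot % STREAM_CHUNK != 0 and slot not in self.struct_to_phys:
                self._map(phys, slot)
        struct = self.phys_to_struct.get(phys)
        if struct is None:
            return []
        # Sequential prefetching in the structural address space.
        out = []
        for i in range(1, degree + 1):
            target = self.struct_to_phys.get(struct + i)
            if target is not None:
                out.append(target)
        return out


if __name__ == "__main__":
    isb = IrregularStreamBufferSketch()
    chain = [0x9000, 0x1200, 0x7740, 0x3380]      # irregular, pointer-chasing-like
    for addr in chain:                             # first traversal: training only
        isb.access(0x40, addr)
    for addr in chain:                             # second traversal: predictions
        print(hex(addr), "->", [hex(a) for a in isb.access(0x40, addr)])
```

After one traversal, the irregular chain occupies consecutive structural addresses, so predicting the next access amounts to looking up the next structural address and translating it back.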

Tag-Based Correlation Prefetching
Since applications reference a large number of addresses, the CoR prefetchers that work by tracking addresses incur large metadata overhead. To address this, tag-based correlation prefetching (TCP) techniques have been proposed that utilize the observation that, due to address locality, the tags formed by high-order address bits also show locality. Since one tag sequence may correspond to multiple address sequences, a tag-based CoR table requires a much smaller number of entries. We now discuss a few TCP techniques. Hu et al. [2003] note that while a memory address always maps to a fixed cache set, a tag can appear in different cache sets, which happens when multiple addresses have the same tag but different set indices. Hence, tag sequences are highly repetitive both in a single set and across the sets. Based on this, they propose a TCP technique that has the same accuracy as an address-based CoR prefetching scheme but requires an order of magnitude less storage. Their technique monitors per-cache-set tag sequences and makes predictions based on recurring tag correlation sequences. They show that their technique provides better performance than a prefetching approach based on correlations of both PC traces and addresses. Sharma et al. [2005] present a prefetching technique that works by partitioning the memory address space into tag concentration zones (TCzones). If the addresses of two misses have the same lower-order bits, then they are in the same TCzone. The prefetcher tracks the L2 cache miss stream. The number of miss events after which a miss stream shows the same data item again is termed the recurring distance. On detecting multiple misses with identical recurring distance, a pattern is inferred. At this point, the prefetcher starts recording to make future predictions. The prefetcher can work in one of two modes, viz. absolute and differential. In absolute mode, it looks for value locality by monitoring the pattern of tags within every TCzone, and in differential mode, it looks for stride value locality by monitoring the stride (i.e., delta between subsequent tags) pattern in every TCzone. Depending on which of the two patterns is dominant in the miss stream, the prefetcher can switch to it to improve the effectiveness of prefetching.
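To make the tag-correlation idea concrete, the following Python sketch keeps a short tag history per cache set but shares one correlation table across all sets, so a tag sequence learned in one set can drive a prediction in another. It is only an illustration and not the design of Hu et al. [2003]: the two-tag history depth, the 64-byte block size, the 1,024 sets, and all names (split, join, TagCorrelationPrefetcher) are assumptions.

```python
from collections import defaultdict

BLOCK_BITS = 6   # assumed 64-byte cache blocks
SET_BITS = 10    # assumed 1,024 cache sets


def split(addr):
    """Split a byte address into (tag, set_index)."""
    block = addr >> BLOCK_BITS
    return block >> SET_BITS, block & ((1 << SET_BITS) - 1)


def join(tag, set_index):
    """Rebuild the block-aligned byte address from (tag, set_index)."""
    return ((tag << SET_BITS) | set_index) << BLOCK_BITS


class TagCorrelationPrefetcher:
    def __init__(self):
        self.history = defaultdict(tuple)  # set index -> last two miss tags
        self.table = {}                    # (tag_a, tag_b) -> next tag, shared by all sets

    def on_miss(self, addr):
        """Train on a miss and return a prefetch address, or None."""
        tag, set_index = split(addr)
        prev = self.history[set_index]
        if len(prev) == 2:
            self.table[prev] = tag         # learn: this tag pair was followed by `tag`
        prev = (prev + (tag,))[-2:]
        self.history[set_index] = prev
        predicted = self.table.get(prev) if len(prev) == 2 else None
        return None if predicted is None else join(predicted, set_index)
```

Because the table is indexed by tags alone, the sequence 1, 2, 3 observed in set 0 lets the sketch predict tag 3 after tags 1, 2 appear in set 5, which is the storage saving that TCP exploits.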

Pre-Execution or Helper-Thread-Based Prefetching
In processors with multithreading, while the main thread executes the program, another thread can redundantly execute a full or reduced version of the program to speculatively generate data addresses for performing prefetching. Such a thread is known as a speculative slice, pre-execution (or pre-computation) thread, helper thread, or future thread. These threads progress ahead of the main thread and run nearly the same code. Thus, they actually compute load addresses instead of predicting them and allow restricting prefetching to probable control-flow paths instead of all possible paths. Figure 5 shows some examples of helper-thread prefetching (based on Luk [2001]). We now discuss a few of these techniques (refer to Table II). Annavaram et al. [2001b] present a precomputation-based approach to predict prefetch addresses. For an instruction fetched into the instruction fetch queue (IFQ) from the I-cache, their technique determines the dependencies and stores them in the IFQ as pointers along with the instruction. Using profiling, their technique identifies addresses referenced by the load/store instructions that may lead to the majority (90%) of D-cache misses. When a load/store instruction that may cause a cache miss enters the IFQ, their technique tracks the dependence pointers that are stored in the IFQ for generating a dependence graph of instructions that await execution. The dependence graphs are executed by a separate precomputation engine (PE) to produce load/store instructions early for prefetching. The PE executes in a speculative manner and, thus, it does not affect the processor state, and it makes progress faster than the main execution by avoiding delays in the reorder buffer and fetch queue that may be seen by the main execution. Their technique achieves performance reasonably close to that of a perfect D-cache. Luk [2001] presents a software-controlled pre-execution scheme to accelerate programs with irregular access patterns. 
Their technique runs the original program itself and, thus, does not require program shortening. When the main thread is stalled, more resources can be given to the pre-execution thread to enhance overall performance. They suggest pre-execution schemes for dealing with different irregular access patterns. For the pointer chasing problem, one helper thread is spawned for pre-executing each pointer chain. Similarly, helper threads can be used to execute different procedure calls or traverse different control-flow paths. On determining the correct execution path, their technique cancels the wrong-path pre-executions. Thus, their technique performs effective prefetching under complex data access patterns and control flows. Collins et al. [2001b] present a technique where idle HW contexts are used to spawn speculative threads that aim to hide miss latency by triggering upcoming cache miss events well in advance of access by main thread. Since speculative threads cause contention for processor resources (e.g., fetch and memory bandwidth), such pre-execution is done only for those static loads, called delinquent loads, which lead to the majority of stalls in the main thread. For example, fewer than 10 static loads may lead to more than 80% of L1D cache misses. Speculative threads compute the address referenced by a future delinquent load to prefetch the corresponding data. They show that, compared to the case when only the main thread starts speculative threads, allowing the speculative thread to start additional speculative threads provides higher performance improvement due to aggressive speculation. Collins et al. [2002] propose using a pointer cache (PtC), a dedicated cache that stores pointer transitions in the application. Given the effective address of a pointer and assuming the pointer points to an object, PtC provides this object's base address. When a load shows miss in L1 cache but hit in PtC, the first two cache blocks of the pointed-to object are prefetched. 
They also explore use of PtC with speculative precomputation. With pointer-traversing codes, a speculative thread in other techniques cannot progress faster than the main thread. In their technique, with help from PtC, a speculative thread can progress farther ahead by traversing data structures under pointer-transition induced cache misses. PtC helps in avoiding serial accesses to recurrent loads and, thus, by using PtC, speculative threads' control instructions can go along the right object traversal path, which leads to accurate prefetching for the main thread. As pointer transitions change, PtC is also updated. For a fixed transistor budget, using a PtC with an L3 cache provides higher performance benefits than using a larger-sized L3 cache alone. Aamodt et al. [2002] present a helper-thread-based prefetching technique. Using profiling, they identify code regions with many I-cache misses. The instructions immediately preceding the basic blocks showing I-cache misses are identified and are called target points. Further, trigger points are identified where a helper thread can be initiated for prefetching instructions after the target point. Then, a helper thread is generated and is attached to the main thread. At runtime, on encountering a trigger point in the main thread, a helper thread is spawned in the spare context. The helper thread speculatively executes instructions related to control flow that will be later encountered by the main thread, and this achieves a prefetching effect. Ganusov and Burtscher [2006] present an event-driven helper threading approach for software emulation of simple or complex HW prefetchers. Based on the cache miss observed in the main thread, the helper thread predicts the data that will be accessed next and prefetches it. For making a prediction, the helper thread can use any simple or complex prefetching algorithm. After that, the helper thread again stalls, waiting for a miss in the main thread. To enable efficient inter-thread communication without causing contention on shared cache ports, they use a FIFO-style event buffer. Using this, the helper thread can receive information about cache misses from the reorder buffer (ROB) of the core running the main thread. When the main thread accesses a prefetched line, the corresponding load instruction is marked as a consumer of the prefetched data. When this instruction is committed, a prefetch trigger is transmitted to the helper thread as if the cache miss was shown by this instruction. In this manner, their technique emulates conventional HW prefetching, and the helper thread can progress faster than the consumption of data by the main thread. Their prefetching mechanism allows flexibly turning the prefetcher on/off when desired without affecting the operation of the main thread.
Table III. Classification based on HW or SW prefetching and use of dynamic optimization framework
SW prefetching: [... Solihin et al. 2003; Kandemir et al. 2009; Wang et al. 2003; Lu et al. 2003, 2005; Luk 2001; Chilimbi and Hirzel 2002; Zhang et al. 2006; Mehta et al. 2014; Rabbah et al. 2004; Collins et al. 2001b; Mowry et al. 1992; Fuchs et al. 2014; Guo et al. 2011; Chen et al. 2007; Zhang et al. 2007]
HW prefetching: Almost all others
Comparison/coordination of HW-SW prefetching: [Verma and Koppelman 2012; Guttman et al. 2015; Mehta et al. 2014; ...]
Use of dynamic optimization framework: [Lu et al. 2003, 2005; Zhang et al. 2006, 2007]
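The event-driven helper threading of Ganusov and Burtscher [2006] can be pictured with a small single-threaded simulation in Python. This is a hedged sketch under several assumptions, not their implementation: the next-line predictor, the buffer capacity of eight events, and the names HelperThread, post_event, and main_thread_load are all hypothetical, and the toy model has no demand cache, so every load is either a miss or a hit on a prefetched line.

```python
from collections import deque


def next_line_predictor(miss_addr, block=64):
    """Assumed stand-in algorithm: predict the next sequential block."""
    return miss_addr + block


class HelperThread:
    def __init__(self, predictor, capacity=8):
        self.events = deque(maxlen=capacity)  # FIFO event buffer; oldest event dropped when full
        self.predictor = predictor            # any simple or complex prediction function
        self.prefetched = set()
        self.enabled = True                   # the prefetcher can be switched off at will

    def post_event(self, addr):
        """Deliver a trigger (a miss, or a first use of a prefetched line) to the helper."""
        if self.enabled:
            self.events.append(addr)

    def run(self):
        """Drain the buffer, issue one prefetch per event, then return to waiting."""
        issued = []
        while self.events:
            target = self.predictor(self.events.popleft())
            if target not in self.prefetched:
                self.prefetched.add(target)
                issued.append(target)
        return issued


def main_thread_load(helper, addr):
    """Model one committed load of the main thread; returns True on a prefetch hit."""
    hit = addr in helper.prefetched
    helper.post_event(addr)  # both misses and consumers of prefetched data act as triggers
    return hit
```

Treating a hit on a prefetched line as a trigger is what lets the helper keep running ahead: after the first miss on address 0, the load to address 64 hits and immediately causes a prefetch of address 128.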

SOFTWARE PREFETCHING, INSTRUCTION PREFETCHING, AND MEMORY-SIDE PREFETCHING
This section complements Section 3 by discussing techniques for SW, instruction, and memory-side prefetching. Table III classifies the research works based on HW or SW prefetching and use of dynamic optimization framework.

Software Prefetching
Chilimbi and Hirzel [2002] present a prefetching technique that is especially useful for pointer codes developed in weakly typed languages such as C/C++. First, a temporal data reference profile is collected from the application being executed. Then, profiling is disabled, and hot (e.g., frequently repeating) data streams are extracted from the reference profile using an analysis algorithm. Only reasonably long streams are extracted to amortize the prefetching overhead. Then, instructions are inserted into the application for detecting and prefetching the hot data streams. Then, analysis is disabled, and the application runs with the prefetch instructions. Afterwards, the inserted checks and prefetch instructions are removed. After this, the entire cycle of profiling step, analysis step, optimization step, and so on, begins again. Their technique improves performance while keeping the overhead low by focusing on few hot data streams. Marathe and Mueller [2008] present a SW-only prefetching technique to reduce L1 cache misses. They first run the application with a training data set to extract annotated memory access trace. Using this, memory addresses generated by loads and stores and their contents are monitored in an offline manner to see whether a load miss address may be predicted from previous instructions (called predictor instructions). Based on this, prefetch predictors are produced of which timely, accurate, and nonredundant predictors are selected. To see timeliness of a predictor, their technique checks whether a load miss is too close (in terms of processor cycles) to a predictor instruction, since, in such a case, prefetching will not be useful. When this distance is smaller than a threshold, the predictor is not used. Also, if a large fraction of prefetched data of a predictor is redundant, it is eliminated. Based on the selected predictors, SW prefetching instructions are placed in an application's assembly code directly. 
Working at instruction level enables their technique to have a broad view of memory access patterns spanning over boundaries of functions, modules, and libraries. Their technique integrates and generalizes multiple prefetching approaches such as self-stride, next-line and intra-iteration stride, same-object, and other approaches for pointer-intensive and function call-intensive programs. Wang et al. [2003] present a HW-SW co-operative prefetching technique. Their technique uses compiler analysis to generate load hints, such as the spatial region and its size (number of lines) to prefetch, the pointer in the load's cache line to follow for prefetching, and the pointer data structure to recursively prefetch. Specifying the size allows prefetching a variable-size region based on loop bounds. Based on these hints and triggered by L2 cache misses, prefetches are generated in HW at runtime. Unlike other SW techniques, in their technique, individual prefetch addresses are generated by HW and not by SW, which allows timely prefetching of data. Also, use of compiler hints allows reduced storage overhead and accurate prefetching even for complex access patterns, which is challenging for a HW-only approach. They show that while generating significantly less prefetch traffic, their technique still provides performance improvement similar to that of a HW-only prefetching technique.
Rabbah et al. [2004] present a compiler directed prefetching technique that uses speculative execution. The portion of program dependence graph relevant to the address computation of a long latency memory operation is termed as load dependence chain (LDC). The LDCs are identified by the compiler and precomputations are statically embedded in the program instruction stream with prefetch instructions. Using speculative execution of LDCs, future memory addresses are precomputed, which are utilized for performing prefetching. For pointer-based applications, LDCs contain instructions that may miss in D-caches, and, hence, generating prefetch addresses for such applications causes large instruction overhead. To address this, their technique provisions that if a load in the LDC sees a miss, successive precomputation instructions are bypassed. Khan et al. [2014] present a SW prefetching technique based on runtime sampling and fast cache modeling. Their technique randomly samples memory instructions with low sampling ratio (e.g., 1 in 100,000). The blocks in D-cache that are accessed by sampled instructions are monitored for data reuse. Further, whenever the sampled instructions are re-executed, a stride sample is recorded. Based on data reuse samples recorded over the whole execution, a statistical cache model estimates per-instruction cache performance for given cache sizes. Based on it, delinquent loads are identified and stride samples for each delinquent load are analyzed to detect regular stride patterns. On detecting a dominant stride pattern for a delinquent load, their technique computes suitable prefetch distance and inserts a prefetch instruction for such a load. Based on cache modeling, their technique also identifies whether a data block will not be reused in MLC/LLC. For such data blocks, their technique uses a special prefetch instruction that prefetches data in L1 cache without polluting MLC/LLC. 
On eviction, this cache block is directly written back to main memory. Thus, by using cache bypassing, their technique reduces cache pollution. For single-thread applications, their technique provides performance comparable to HW prefetching, but, with increasing core-count and resource-contention, the advantage of their technique improves.
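The stride-analysis step described for Khan et al. [2014] can be sketched as follows in Python. The sketch is illustrative only: the 70% dominance threshold, the function names, and the distance rule (miss latency divided by the time per loop iteration, rounded up, which is a common textbook rule and not necessarily the formula used by the authors) are assumptions.

```python
from collections import Counter
from math import ceil


def dominant_stride(stride_samples, threshold=0.7):
    """Return the most common stride if it is non-zero and covers `threshold` of the samples."""
    if not stride_samples:
        return None
    stride, count = Counter(stride_samples).most_common(1)[0]
    if stride != 0 and count / len(stride_samples) >= threshold:
        return stride
    return None


def prefetch_distance(miss_latency_cycles, cycles_per_iteration):
    """Iterations to run ahead so that a prefetch completes before the demand load."""
    return ceil(miss_latency_cycles / cycles_per_iteration)


def plan_prefetch(stride_samples, miss_latency_cycles, cycles_per_iteration):
    """Return the byte offset to prefetch ahead of the current address, or None."""
    stride = dominant_stride(stride_samples)
    if stride is None:
        return None  # no regular pattern: do not insert a prefetch for this load
    return stride * prefetch_distance(miss_latency_cycles, cycles_per_iteration)
```

For a delinquent load whose samples are dominated by a 64-byte stride, with an assumed 200-cycle miss latency and 30 cycles per iteration, the sketch runs seven iterations ahead and would place the prefetch 448 bytes beyond the current address.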

Use of Dynamic Optimization Framework
Several input- and microarchitecture-dependent events cannot be predicted at compile time, which limits the effectiveness of SW prefetching. Dynamic optimization frameworks can address these limitations by allowing performance monitoring and addition/removal of optimizations at runtime. Several prefetching techniques that use these frameworks are discussed next. Lu et al. [2003] use a user-mode dynamic optimization system, named ADaptive Object code REoptimization (ADORE), for implementing D-cache prefetching. Using ADORE, they monitor the performance of binary execution and ascertain performance-critical traces/loops that show frequent D-cache misses. Prefetch instructions are inserted only in these loops/traces, and the binary is patched to redirect further execution to the optimized traces. Their approach provides higher performance than static prefetching, while incurring only small overhead. Lu et al. [2005] note that SMT and CMP processors present different tradeoffs for helper-thread-based prefetching. In SMTs, several processor resources, for example, issue queues and L1 caches, may be shared or partitioned. Both main and helper threads run on the same core, which enables fast synchronization. By comparison, in CMPs with private L1 cache and shared L2 cache, the main thread cannot easily start helper-thread execution for a specific L2 cache miss. Also, communication of register values between main and helper threads is not straightforward. To address this, they use the ADORE framework to implement helper-thread-based prefetching in CMPs. They bind the main thread to one core, while the helper code, runtime optimizer, and runtime performance monitoring codes are executed on another core. This minimizes the negative influence of helper threads on main threads and precludes the need of starting multiple thread slices. Performance monitoring code detects program regions with delinquent loads and the helper code for these regions prefetches for delinquent loads. 
The main thread initiates the helper thread and communicates with it using a mailbox in shared memory. Zhang et al. [2006] present a prefetching technique that uses the Trident [Zhang et al. 2006] dynamic optimization framework to collect frequently executed basic instruction blocks that form hot traces. By analysis of hot traces, delinquent loads and a suitable prefetch distance for them are identified, and then prefetch instructions are inserted into the hot trace. To adapt to dynamically changing workload behavior, they use HW monitoring to adjust the prefetch distance for each load operation or even remove prefetch instructions. This allows their technique to achieve higher performance by utilizing runtime information. Zhang et al. [2007] use the Trident framework to improve the effectiveness of helper-thread-based prefetching. Trident monitors the program's behavior and triggers compiler optimizations in a separate thread to adapt to that behavior. The hot execution traces of the main thread are stored in the code cache of the dynamic optimization system and are used to create p-slices. Generation of p-slices happens in the Trident framework, which allows adaptation based on program input, HW configuration (e.g., cache architecture, available HW contexts, etc.), and runtime behavior (e.g., control-flow execution). To accelerate a p-slice ahead of the main thread's execution, they predict HW load stride to speculatively specialize the p-slice for reducing its overhead. Further, based on tracking the effectiveness of prefetching, they adapt the runahead distance of p-slices so memory access is fully covered. Control-flow hazards are mitigated by the streamlined nature of the hot traces. Also, prefetching addresses are monitored to detect and prevent a p-slice from diverging from the main thread.  study the advantages and disadvantages of HW and SW prefetching and their interaction. 
As for advantages of SW prefetching, HW prefetchers fail for very short streams and require complex structures for detecting irregular access patterns. Also, when the number of streams present in the application exceeds the HW resources, HW prefetchers may not clearly distinguish them, whereas SW prefetchers can insert prefetch instructions for each stream individually. Further, the HW prefetchers in commercial processors place data in lower-level (L2 or L3) cache only, whereas SW prefetchers can place data directly into the right cache level. Also, SW prefetchers can easily ascertain loop bounds and avoid prefetching outside the bounds that (especially aggressive) HW prefetchers fail to do. Based on this, SW prefetching can be used for short streams, irregular access patterns, and L1 cache miss avoidance. As for advantages of HW prefetchers, they can account for runtime behavior and input variations and can adapt their aggressiveness. Also, SW prefetchers can greatly increase the instruction count (e.g., up to 100%), which wastes fetch and execution BW, although applications that use prefetching are generally memory bound, and, hence, additional instructions may not increase their completion time. By using HW and SW prefetchers together, prefetching can be performed for a larger variety of streams, and SW prefetch requests can be used to train the HW prefetcher. However, these prefetchers may also interact negatively, for example, when SW prefetches inhibit the ability of HW to detect streams properly or when the harmful prefetches brought by SW cause cache pollution and BW contention. Mehta et al. [2014] note that different architectural features on different processors present unique tradeoffs and demand specific prefetching strategies for them. For example, in SandyBridge, the MLC streamer prefetcher and L1 SW prefetcher can coordinate to bring data to L1 cache, overcoming the limitations of the L1 HW prefetcher. 
By comparison, on Xeon Phi, no prefetching is done by the L2 streamer prefetcher in the presence of L1 SW prefetch commands. Yet, to bring data into the L1 cache, coordination between SW prefetch instructions at multiple cache levels can be utilized, such that LLCs prefetch first and use a larger prefetch distance than the L1 cache, and then the L1 cache prefetches data from the LLC. They present coordinated multi-stage prefetching techniques for each of the two processors. Their technique prefetches array references that are direct-indexed streaming, direct-indexed strided, and indirect indexed. For such references, prefetch instructions are inserted only for the L1 cache for SandyBridge since other cache levels use HW prefetch. For Xeon Phi, prefetch instructions are inserted for all levels of cache. In a loop, prefetch instructions are introduced in a manner where computation is performed between them, which avoids the possibility of pipeline stall due to prefetches filling the miss status holding register (MSHR). Finally, their technique determines the prefetch distance, and this computation differs between Xeon Phi and SandyBridge since the former has in-order cores while the latter has out-of-order cores. They also discuss several processor-specific optimizations for improving the efficacy of prefetching.
Table IV shows several techniques for instruction prefetching (Section 2.3). We now discuss a few of them. Zhang et al. [2002] present a prefetching technique that works by correlating execution history with cache miss history. For each cache miss, a correlated instruction is ascertained that was fetched a fixed number of instructions before the miss, and this correlation is stored in a table. For instance, for an I-cache with 12-cycle miss latency, an instruction fetched 12 cycles before the miss is used as the prefetch trigger. When these correlated instructions are again encountered, a prefetch is performed. 
Since multiple execution paths may lead to one miss, multiple triggering instructions may be stored for that miss. This is useful for cache blocks that tend to show miss on only some of the execution paths. They associate neighboring cache misses with a single instruction to reduce CoR table size. To avoid redundant prefetches, they use a filtering mechanism that uses a confidence-counter scheme to retire ineffective correlations. This reduces unbeneficial prefetches and also obviates the need of probing the I-cache before prefetching. Spracklen et al. [2005] analyze I-cache miss behavior of commercial workloads that exhibit high miss rates in both L1 and L2 I-caches. They show that eliminating misses due to discontinuous accesses is as important as removing those due to sequential accesses. They propose a discontinuity prefetching technique that can be used together with sequential prefetching for removing both types of misses. Control transfer instructions such as functional calls and branches lead to discontinuity in instruction fetch by transitioning to a non-sequential address. When such a transition leads to an I-cache miss, their technique inserts this discontinuity information in a table. The prefetcher moves ahead of the demand fetch stream, and when it finds a valid table entry, it prefetches the target of discontinuity. Ferdman et al. [2008] note that as repeated traversals of program data structure lead to repetitive data-cache miss sequences, repeated traversals of CFG lead to repetitive sequences of I-cache misses. Also, almost all I-cache misses can be associated with these recurring sequences, termed temporal instruction streams. Their technique dynamically tracks temporal instruction streams themselves and records them in L2 cache. Afterwards, their technique predicts when these streams would repeat and, based on this, prefetches instructions before the demand requests are made. The techniques (e.g., Spracklen et al. [2005] and Reinman et al. 
[1999]) that use a branch predictor to traverse a program's CFG to predict discontinuous control flow have a limited lookahead distance since they use a branch predictor and work on basic-block sequence. By comparison, their technique works directly on I-cache misses and does not explore program's CFG, and, hence, it achieves higher lookahead distance, BW efficiency, and prefetching accuracy. Ferdman et al. [2011] note that control-flow variations are amplified by microarchitectural components and these, along with random HW interrupts, lead to non-repetitive instruction history that degrades the effectiveness of conventional prefetching techniques. For example, control-flow variations may disturb the branch predictor state and L1 I-cache replacement sequence, leading to randomness of instruction stream, different miss sequences, and execution of wrong-path instructions. To address this, they propose using a correct-path, retire-order instruction sequence to track an accurate sequence of instruction-fetch. This provides a near-ideal recurring instruction stream and eliminates the randomness introduced due to branch predictor, cache, and interrupts. On encountering a recorded address, a prefetch is triggered for subsequent requests based on replaying a recorded sequence beginning with the most recent position of the recurring address in the sequence. Further, instead of storing individual accesses, recording temporally and spatially correlated groups of accesses enables compact storage of history. Thus, only one address is stored per spatial region (e.g., a function), and storage of multiple iterations of tight loops is avoided. Their approach improves the prefetching coverage and accuracy and enables the L1 I-cache to achieve nearly 100% hit rate. Kolli et al. [2013] note that the current call stack reflects the execution path of the program traversed to reach an execution point and the program context captured by it has strong correlation with L1 I-cache misses. 
Further, the return address stack (RAS) concisely summarizes the program context. On any call or return operation, their technique saves the RAS state into a signature consisting of current call stack along with direction/destination of call or return operation. These signatures are strongly correlated with L1 I-cache misses, and, hence, a sequence of misses seen for any signature is associated with that signature. Signatures constructed with fine-grained context information allow prefetching of the traversals deep into the call graph and within large functions. Further, they show that RAS signatures have high predictability and, thus, prediction of subsequent signatures can be accurately done based on current signatures, which provides higher prefetch lookahead. Their technique reduces storage and energy overhead of accurate HW instruction prefetching (incurred in Ferdman et al. [2011]), while still maintaining high prefetcher coverage and program performance.
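The record-and-replay principle behind the temporal instruction streaming techniques of Ferdman et al. [2008, 2011] discussed in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the history is an unbounded list rather than a storage-limited structure, addresses are plain block numbers, spatial grouping is omitted, and the class name and the lookahead parameter are hypothetical.

```python
class TemporalStreamPrefetcher:
    def __init__(self, lookahead=3):
        self.history = []   # recorded sequence of instruction block addresses
        self.index = {}     # block address -> most recent position in the history
        self.lookahead = lookahead

    def access(self, block):
        """Record `block` and return the blocks that followed its previous occurrence."""
        pos = self.index.get(block)
        if pos is None:
            predicted = []  # first time this block is seen: nothing to replay
        else:
            predicted = self.history[pos + 1 : pos + 1 + self.lookahead]
        self.index[block] = len(self.history)
        self.history.append(block)
        return predicted
```

Replay always starts from the most recent position of the recurring address, so after the sequence 10, 20, 30 has been recorded, a second access to block 10 returns blocks 20 and 30 as prefetch candidates (with a lookahead of two).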

Use of Branch Prediction or History Information
We now discuss some techniques that use information from branch prediction to speculate on program control flow for determining the data to be prefetched. Srinivasan et al. [2001] present an instruction prefetching scheme that correlates branch instruction execution with misses in I-cache, based on the fact that control-flow alterations due to branches cause I-cache misses. Based on this, branch instructions are used to trigger prefetching of instructions that appear in the execution after a fixed number K (for example, K = 4) of branches. For instance, a candidate basic block (BB1) will be associated with a branch instruction (R1) if an I-cache miss to BB1 happens exactly K branches after the execution of R1 occurs. On future execution of R1, BB1 will be prefetched. Thus, their technique avoids the need of a branch predictor to estimate the result of K + 1 branch operations. Also, since another basic block starts at the branch instruction target address and may last several cache lines, their technique also stores the length of prefetch candidate blocks along with their addresses to prefetch the entire blocks in a timely manner. Zilles and Sohi [2001] note that a few frequently executed static instructions (called problem instructions) cause a majority of branch mispredictions and cache misses. Their technique creates a code portion, called a speculative slice, that mimics the computation including the problem instruction and includes only those operations that are necessary to compute the outcome of the problem instruction. By forking such slices well before the problem instruction, data prefetching can be done to avoid the penalty of misses.

Memory Side Prefetching
Solihin et al. [2003] present a technique where CoR prefetching is performed by a user thread running in main memory, and the prefetched data are sent to the L2 cache. L2 cache misses are tracked and recorded in a CoR table. Afterwards, for each miss, the CoR table is looked up and a prefetch of several lines is triggered for the L2 cache. The CoR table is stored in main memory and, thus, changes to L2 cache are minimal. They show that by combining their technique with a core-side sequential prefetcher, the performance improvement can be increased further. Also, the prefetch algorithm used by the thread can be adapted on a per-application basis. Yedlapalli et al. [2013] present a memory-side prefetcher (MSP) that fetches data on-chip from memory but, unlike in Solihin et al. [2003], does not push the data to the caches and, thus, avoids resource contention. They use a next-line prefetching scheme and prefetch when a row buffer hit occurs such that, first, the demand request is served and then the prefetch request is served. Successive requests to lines in that row then turn into prefetch hits. Data are maintained in a separate buffer at each memory controller and, thus, access to prefetched data does not use the memory channel or bank, which lowers the queuing delays. The advantage of their MSP is that it can utilize knowledge of memory state to reduce row-buffer conflicts for reducing the miss latency itself, in addition to reducing the number of memory accesses. Also, unlike a core-side prefetcher, an MSP can easily adapt to the available memory BW. To achieve larger performance improvement, MSP can be integrated with a core-side prefetcher, whereby the former brings data on-chip and the latter leverages core request predictability to access the data brought by MSP without issuing an off-chip request.
Hughes and Adve [2005] present a prefetching technique where the prefetch engine resides on the processor-in-memory. A local or remote processor provides a summary of LDSs and likely traversals. Based on it, the prefetcher performs traversal independently and sends the data to the requesting processor. By virtue of its proximity to memory, their prefetcher provides faster service than a processor-side prefetcher, and this allows their prefetcher to run ahead of the processor, thus bringing data in advance of the processor access and pipelining data transfer over the network. They compare their prefetcher to a processor-side prefetcher by using programs with significant LDS memory stall time. They observe that both their prefetcher and the processor-side prefetcher provide better performance on different applications that highlight the need to combine them for optimal performance.
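The row-buffer-triggered next-line policy described for the MSP of Yedlapalli et al. [2013] can be illustrated with the toy Python model below. It is a sketch under assumptions rather than their design: a single bank with one open row, 128 lines per row, a 32-entry prefetch buffer with oldest-first replacement, and the name MemorySidePrefetcher are all hypothetical choices.

```python
from collections import OrderedDict

LINES_PER_ROW = 128  # assumed: 8 KB DRAM row with 64-byte lines


class MemorySidePrefetcher:
    def __init__(self, buffer_entries=32):
        self.open_row = None
        self.buffer = OrderedDict()      # prefetched line numbers, oldest first
        self.buffer_entries = buffer_entries

    def request(self, line):
        """Serve a demand request for a line number; return how it was served."""
        if line in self.buffer:
            del self.buffer[line]
            return "prefetch_hit"        # served from the buffer, no channel or bank access
        row = line // LINES_PER_ROW
        if row != self.open_row:
            self.open_row = row
            return "row_miss"            # no prefetch is issued on a row miss
        nxt = line + 1                   # demand served first, then the next-line prefetch
        if nxt // LINES_PER_ROW == row:  # never prefetch outside the open row
            self.buffer[nxt] = True
            if len(self.buffer) > self.buffer_entries:
                self.buffer.popitem(last=False)
        return "row_hit"
```

In this model, a sequential sweep over lines 0 to 4 is served as a row miss, a row hit, a prefetch hit, a row hit, and a prefetch hit, showing how successive requests to the open row turn into prefetch hits.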

EVALUATION USING REAL SYSTEMS AND ANALYTICAL MODELS
Different experimentation approaches/platforms offer complementary insights; for example, real hardware allows quick and realistic evaluation, whereas simulators offer high flexibility to compare with different configuration/parameter choices that may even be unrealizable on real systems. Analytical models provide insights independent of a particular platform or program. Clearly, it is revealing to note the evaluation approach of a study, and, hence, Table V classifies the works based on this feature. We now discuss some works that use real processors or theoretical models for evaluation.
Table V. Classification based on evaluation approach
Real system: [... Khan et al. 2014; Marathe and Mueller 2008; Kang and Wong 2013; Huang et al. 2012; Jiménez et al. 2012; Lu et al. 2003, 2005; Chilimbi and Hirzel 2002; Mehta et al. 2014; Guttman et al. 2015]
Analytical model: [Liu and Solihin 2011; Chen and Aamodt 2008]
Simulator: Almost all others
 propose a technique that dynamically turns on/off the prefetcher, based on the LLC interference caused by it. They perform experiments on an Intel Core i7 processor, where, of the four different HW prefetchers, two prefetchers, viz. MLC spatial and MLC streamer prefetchers, can be externally controlled using SW-accessible state registers. The samples for application LLC miss counts are collected using Intel's Precise Event Based Sampling (PEBS) capability. Their technique works in two phases. In the profiling phase, the time taken to see a fixed number of LLC misses is observed with prefetchers turned on and off, respectively. If this time is higher when prefetchers are turned off, then it indicates that misses happen less frequently in time without the prefetchers, and, hence, prefetchers are turned off in the run phase; otherwise, the prefetchers are turned on. For a small implementation overhead, their technique effectively mitigates prefetching-induced LLC interference.
Jiménez et al. [2012] study prefetching on an IBM POWER7 processor. 
They first examine the performance and power consumption of a prefetcher for its different parameter settings using microbenchmarks and standard benchmarks. These settings include prefetch degree, stride-N (whether streams with a fixed stride greater than one cache line are prefetched), whether prefetching is done on store operations and whether the prefetcher is enabled/disabled. Since the optimal prefetcher setting varies between and within benchmarks, they propose an adaptive prefetch technique that dynamically configures the prefetcher parameters based on benchmark characteristics. Their technique works in two phases. In the exploration phase, different prefetcher parameter settings are tried, and the one that provides the largest IPC is used in the running phase. Since phase changes may also lead to a change of IPC, instead of comparing with individual measurements, they use a moving average buffer to record most recent Q IPC values for each setting and then compare different settings by using average values in the buffer. To avoid the impact of inefficient settings on performance, their technique does not try them for K exploration phases, where K is determined by the slowdown introduced by those settings. Thus, inefficient settings are penalized by being dropped from exploration. Kang and Wong [2013] note that prefetching leads to contention for shared cache and bandwidth, and virtualization also complicates shared cache management by consolidating multiple VMs on a single CMP. Hence, a study of their interaction can allow effective LLC sharing among VMs and deciding whether prefetching should be enabled. They study HW prefetching in virtualized environments and account for various virtualization factors, such as number of vCPUs and VMs, interference between VMs and vCPU-core binding, and so on. 
They show that, for most configurations, the negative influence of prefetching is small, and prefetching degrades overall performance significantly only in few configurations. To avoid destructive interaction between the two approaches, they propose a prefetching-aware vCPU-core binding technique. Based on the cache access pattern (including both demand and prefetch requests) of every VM's workload, VMs are categorized into groups showing different cache sharing constraints. For example, a VM with a very low miss rate has the lowest constraint, while one with a high miss rate and showing negative self-prefetching influence (self-prefetching influence shows whether prefetching benefits or harms performance on consolidating it with the same application in a different VM) has the highest constraint. The higher the constraint, the higher the priority of allocating to a dedicated cache to avoid the negative impact of prefetching. VMs with lower constraint are given more shared caches, since they do not cause contention. Thus, based on the cache sharing constraints, scheduling of the vCPUs of every VM on suitable cores is done to improve performance. Guttman et al. [2015] study HW and SW prefetching and their interaction on the Xeon Phi system. They show that when HW and SW prefetchers work together, SW prefetch misses train HW prefetchers directly, and the demand misses generated by SW prefetching train HW prefetchers indirectly. Further, when the SW prefetcher is effective in removing the majority of L2 demand misses, the HW prefetcher is not triggered frequently, and, thus, it is throttled. In this way, the HW prefetcher effectively adapts itself to the SW prefetcher, and the coordinated HW+SW prefetching provides the best performance by virtue of prefetching for a wide variety of access patterns. For some applications, compiler-inserted prefetch instructions are ineffective, and, hence, those inserted manually by the programmer alone are effective for prefetching. 
For other applications, compiler-inserted instructions are sufficient, and, hence, programmer effort is not worthwhile. When prefetching improves performance, the resulting energy savings offset the energy overhead due to extra instruction/metadata and memory operations, and, thus, the overall effect of prefetching on energy is positive. Liu and Solihin [2011] study the interaction of prefetching and bandwidth partitioning (BPT) and their impact on system performance. Based on CPI model and queuing theory, they develop an analytical model that takes as input several key system parameters (e.g., frequency, available BW, and cache block size) and application cache behavior indices (e.g., prefetching accuracy and coverage and prefetching frequency). Their model provides a composite prefetching metric that determines conditions in which prefetching improves performance. This metric is shown to be more effective than traditional metrics such as coverage and accuracy. Their BPT model accounts for prefetching and determines BW partitions for each core that lead to optimal performance. Based on their model, they derive several important insights and conclusions. Use of prefetching reduces the available BW, which enhances the role of BPT in improving performance. In BW-constrained systems, performance loss due to prefetching cannot be alleviated by BPT, and, hence, the decision of (de)activating prefetchers needs to be made before using BPT. Also, when the prefetcher of every core is activated, naively providing large BW to a core that performs accurate prefetching does not lead to highest performance (weighted speedup). Instead, BW partition of this core should be more constrained; this is because, by virtue of utilizing the BW efficiently, this core can donate some BW to other cores for improving overall performance. 
Chen and Aamodt [2008] present an analytical model to evaluate the impact of HW prefetching, pending cache hits, and limited MSHR resources on superscalar processor performance with long latency memory systems. A CPI stack divides the application CPI into CPI from useful computation and that from miss events, for example, cache misses, branch mispredictions, and so on, and, of these, their technique models CPI from D-cache misses. Under no prefetching, the instruction trace produced by a cache simulator is analyzed to see misses in the cache. Under prefetching, several loads that might have seen misses become hits or pending hits (a memory reference where the cache block has been requested but the data have not arrived), depending on whether prefetching can fully hide the memory latency. A pending hit may be due to a demand miss or a prefetch miss. For each pending hit, they identify a previous instruction that brought the current instruction's required data into cache. The latency of the current instruction that can be hidden is computed as the number of instructions between the current and previous instructions divided by the processor's issue width. Then the actual latency of the current instruction is the delta between memory access latency and hidden latency; if memory latency is fully hidden, then the actual latency is zero. Using this approach, CPI due to D-cache misses under prefetching is estimated.
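As a rough illustration of the pending-hit latency computation described above, the following Python sketch computes the exposed latency of one pending hit and aggregates it into a CPI term. The function names, the representation of pending hits as pairs of instruction indexes, and the plain summation into a CPI contribution are assumptions made for illustration only; the model of Chen and Aamodt [2008] accounts for more effects (e.g., limited MSHRs) than this sketch does.

```python
def exposed_latency(cur_idx, prev_idx, issue_width, mem_latency):
    """Latency (in cycles) of a pending hit that prefetching could not hide.

    cur_idx     -- index of the instruction that sees the pending hit
    prev_idx    -- index of the earlier instruction that requested the block
    issue_width -- instructions issued per cycle
    mem_latency -- memory access latency in cycles
    """
    # Cycles that elapse between the two instructions hide part of the latency.
    hidden = (cur_idx - prev_idx) / issue_width
    # If the latency is fully hidden, the exposed latency is zero.
    return max(0.0, mem_latency - hidden)


def dcache_miss_cpi(pending_hits, num_instructions, issue_width, mem_latency):
    """Hypothetical aggregation: exposed cycles per instruction.

    pending_hits is a list of (cur_idx, prev_idx) pairs taken from a trace.
    """
    total = sum(
        exposed_latency(cur, prev, issue_width, mem_latency)
        for cur, prev in pending_hits
    )
    return total / num_instructions
```

For example, with an issue width of 4 and a memory latency of 200 cycles, a pending hit 80 instructions after the request hides 20 cycles and exposes 180, while one that is 1000 instructions after the request is fully hidden.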

REDUCING IMPLEMENTATION AND PERFORMANCE OVERHEAD OF PREFETCHING
As discussed in Section 2.7, a careful choice of design parameters is required to reduce area/latency overheads of prefetching and its negative impact on performance. We first summarize some approaches proposed for this (Section 6.1) and then discuss several of these approaches (Sections 6.2 to 6.7).

An Overview of Approaches
(1) Controlling negative impact of prefetching:
- To avoid cache pollution, prefetched data can be stored in separate buffer(s) [Hur and Lin 2006; Yedlapalli et al. 2013; Jouppi 1990; Falcón et al. 2005; Zhang et al. 2002; Cantin et al. 2006; Somogyi et al. 2009; Joseph and Grunwald 1997; Roth et al. 1998] or in place of dead blocks [Lai et al. 2001].
- Some techniques temporarily disable the prefetcher (globally or for certain loads) [Kandemir et al. 2009; Yu and Liu 2014; Jiménez et al. 2012; Alameldeen and Wood 2007; Kadjo et al. 2014], while others adapt its aggressiveness [Srinath et al. 2007; Jiménez et al. 2012; Zhang et al. 2007; Mehta et al. 2014; Ebrahimi et al. 2009; Hur and Lin 2006, 2009; Wang et al. 2003; Zhang et al. 2006; Albericio et al. 2012; Yu and Liu 2014]; for example, the prefetch degree/distance may be adapted. Some works use multiple prefetchers [Marathe and Mueller 2008; Guo et al. 2011] or switch between different prefetching modes [Sharma et al. 2005]. Some works start actual prefetching only after the prefetcher accuracy has been confirmed [Pugsley et al. 2014].
- Prefetch accuracy can be ascertained by seeing whether the prefetched or evicted block is accessed first [Alameldeen and Wood 2007; Dang et al. 2013; Kandemir et al. 2009].
- The position of prefetched blocks in the LRU stack can be controlled [Lin et al. 2001a; Srinath et al. 2007; Wang et al. 2003; Lin et al. 2001b]; for example, they can be placed near LRU to avoid pollution.
- Redundant prefetches can be avoided [Spracklen et al. 2005; Zhang et al. 2002; Marathe and Mueller 2008; Guo et al. 2011; Reinman et al. 1999]; for example, a prefetch request can be canceled if it has been recently demand fetched or if it matches an upcoming demand fetch request [Spracklen et al. 2005]. Similarly, in parallel applications where per-core prefetchers may redundantly prefetch shared data, only the longest stream that is beneficial for shared data should be prefetched.
- Prefetch filtering can be driven by the number of cache misses [Albericio et al. 2012] or IPC [Jiménez et al. 2012].
- Data brought by certain cores can be pinned in cache if they are frequently the victim of harmful prefetches [Kandemir et al. 2009].
- Strategies for addressing inter-core interference can be used [Ebrahimi et al. 2009; Yu and Liu 2014; Kandemir et al. 2009; Albericio et al. 2012] (refer to Section 6.3).
(2) Improving effectiveness of prefetching:
- Main memory policies can be made prefetch aware [Lee et al. 2008; Yedlapalli et al. 2013; Hur and Lin 2006; Lin et al. 2001a, 2001b; Ebrahimi et al. 2011]; for example, a DRAM controller can give different priorities to useful and useless prefetches and demand requests.
- Several works classify the miss stream into different categories to drive the prefetching algorithm or to gain insights [Iacobovici et al. 2004; Spracklen et al. 2005; Zhu et al. 2010].
- Since prefetching is only useful if a helper thread runs ahead of the main thread, helper threading may be applied only for those loops where it can prefetch delinquent loads on a loop's critical path [Lu et al. 2005]. The helper thread can be terminated once the main thread makes equal progress [Aamodt et al. 2002].
- For higher effectiveness, prefetching can be triggered by dead-block prediction (and not cache miss) [Lai et al. 2001] or branch instructions [Srinivasan et al. 2001].
- Correlated accesses or misses can be stored together and not separately [Zhang et al. 2002; Ferdman et al. 2011].
- Instead of address-based correlation, tag-based correlation can be used [Hu et al. 2003; Sharma et al. 2005].
- Timing information can be stored in terms of a miss counter instead of CPU cycles [Zhu et al. 2010], which also provides a more accurate and stable measure of time.
- Access history can be shared between cores [Kaynak et al. 2013], and resource sharing can be used in helper-thread prefetching [Lu et al. 2005].
Lin et al. [2001b] present a technique to filter useless prefetches by predicting spatial locality in a memory region. A bit vector, called a density vector (DV), is used to record which cache blocks in a memory region were fetched during an epoch. A region's epoch ends, and its next epoch begins, when a miss occurs to a block whose bit is already set in the density vector. The number of bits in a DV that are "1" (i.e., set) shows the available spatial locality, and the longest consecutive string of set bits shows whether the access pattern is dense. The CoR between two DVs is defined as the fraction of identical bits between them, and the local-CoR is defined as the CoR between the two most recent epochs in a region. They note that, for most programs, local-CoR is strong, which indicates that access patterns in a region do not change much over time. Using this, they design a filter that tracks current DVs and also stores previous DVs. It exploits local-CoR between DVs to filter out prefetch requests that it predicts will not be useful for a given region and epoch. They show that their technique can eliminate a large fraction of useless prefetches without harming performance.
Hur and Lin [2006] present an adaptive stream detection (ASD) technique that modulates the aggressiveness of the prefetch policy based on the spatial locality present in the workload. The prefetcher runs in the memory controller and brings data into a prefetch buffer.
On detecting access to k successive cache lines, a stream prefetcher begins prefetching from (k + 1)th line onwards, until it detects a useless prefetch. Generally, k is chosen at design time. However, for short streams, a stream prefetcher becomes ineffective. For example, for applications where every stream has a length of 2, using k = 1 leads to one useless prefetch after every useful prefetch. Their technique associates every memory access with a suitable stream length to generate a stream length histogram (SLH). As an example, if most memory requests occur in streams of length 2, then their technique only prefetches the second and not the third line of a stream. This adaptive stream detection approach avoids useless prefetches. By dynamically adapting the histogram, changes in application behavior can be accounted for. Their technique extends the scope of a stream to include even those having only two cache lines, which allows several commercial applications to be viewed as stream based and, thus, enable the use of a low-overhead stream-based prefetcher with them. Hur and Lin [2009] present three approaches to further improve the effectiveness of the ASD stream buffer [Hur and Lin 2006]. To improve the stream detection mechanism of the stream buffer, they use a length-based stream detection approach, whereby, with increasing stream length, the time duration for which the stream filter waits for the next element of a stream is reduced, for example, for streams of lengths 1, 2 and 3, the wait time can be T , T /2, and T /4. If the next element does not arrive by this time, then the stream filter becomes available for allocation to a new stream, since the loss from not capturing the last element in the stream is higher for smaller streams than in long streams. This approach gives a greater chance for shorter streams to be fully prefetched, which increases the number of streams that can be gainfully prefetched in irregular applications. 
Since SLHs vary over time, their second approach uses an adaptive epoch length for the SLH feedback mechanism, based on whether SLHs observed in consecutive epochs are similar. The third approach uses information from SLH to prefetch a variable number of blocks at a time. When N consecutive memory requests are likely to appear in a burst, and the memory queue is not too busy, a maximum of N consecutive prefetch requests are generated, which helps in avoiding late prefetches. Srinath et al. [2007] present a technique to dynamically adapt the aggressiveness of a prefetcher. In each interval, they monitor three metrics, viz. prefetcher accuracy, lateness, and the cache pollution due to prefetching. Prefetch accuracy is estimated by tracking the fraction of prefetched blocks that lead to hit of demand requests. A late prefetch is identified when the data at a prefetched address are requested by the core but the data have not arrived. Cache pollution is measured by using a BF-based predictor that estimates the demand-fetched L2 cache blocks evicted due to prefetched data. Each of the three metrics are classified in different ranges (e.g., high and low) by using individual thresholds for them. Based on them, prefetching parameters viz. prefetch degree and prefetch distance are adjusted. For example, if prefetches are accurate and do not cause pollution, then both the parameters are increased to amplify the aggressiveness of prefetching. Similarly, if prefetching causes large amounts of pollution, then, in the next interval, prefetched blocks are inserted near the LRU position, while in the case of low pollution, they are inserted into the midway position in the LRU chain. Albericio et al. [2012] present ABS, an adaptive controller design that uses a hillclimbing approach for modulating the aggressiveness of prefetchers in a banked shared LLC. In their design, each LLC bank has a prefetch engine and an ABS controller that collects bank-local statistics. 
Prefetchers and ABS controllers of different banks are independent, which avoids the need for communication between them and requires looking up only a local bank for reducing useless prefetches. In every epoch, the aggressiveness of only one core's prefetcher is changed, and the miss ratio of the bank (computed as demand misses divided by demand requests of all the cores) is computed. The miss ratio in the current epoch is compared to that in a reference epoch. If the miss ratio in the current epoch is higher, then the change in the prefetcher aggressiveness is undone; otherwise, it is confirmed. Also, the accuracy of a core's prefetcher is computed as the number of hits to prefetched blocks from the core being probed divided by the total number of prefetches issued by this core. If this accuracy value is lower than a threshold, then the aggressiveness of the prefetcher is decreased. Thus, each core's prefetcher can have dissimilar aggressiveness values in different LLC banks. They show that their technique improves performance and fairness.
Another technique prefetches based on delta correlations monitored within memory zones. The memory address space is divided into concentration zones (CZones) of equal size. Within each CZone, patterns in miss address deltas (differences between successive addresses) are detected using a GHB. A prefetch is triggered when an access pattern is detected within a CZone. This technique does not require the PC of the load instructions that lead to misses. Since programs show different phases, the values of the prefetcher parameters (viz. CZone size and prefetch degree) that provide the best performance also change over the program execution. Hence, an adaptive version of the technique is also presented. This version begins in the UNSTABLE state with certain values of the parameters.
If, after an interval of a fixed number of instructions, a phase change is detected, then the algorithm stays in the UNSTABLE state; otherwise, it switches to the TUNING state. In this state, the algorithm tries several parameter values (and "no prefetching"), and, at the end, the values providing the highest performance are selected, and the algorithm switches to the STABLE state. A phase change switches the algorithm back to the UNSTABLE state and resets the configuration. Adaptivity also allows prefetching to be turned off in case it harms performance.
Kandemir et al. [2009] propose two techniques for filtering harmful prefetches, which cause inter-core interference and replace useful data in a multi-core processor. They note that, in any execution phase, only a small set of cores issues, or is affected by, harmful prefetches, and these patterns change across execution phases. Their first technique selectively suppresses prefetches from certain cores. The harmful prefetches from each core are recorded by tracking the block replaced by a prefetched block and checking whether the prefetched or the discarded block is accessed first in later execution. At the end of each phase, if the ratio of the harmful prefetches issued by a core to the total harmful prefetches exceeds a threshold, then prefetches from that core are suppressed for a single (next) phase. The second technique records the cache misses seen by a core due to harmful prefetches. If, in a phase, the ratio of such misses seen by a core to the total such misses seen by all cores exceeds a threshold, then the data blocks fetched by that core into the cache are marked as non-removable for a single (next) phase. Instead of the blocks from this core, the least recently used blocks from other cores are selected for replacement. Thus, this technique aims to mitigate the negative effect of prefetching on the cores that are harmed the most by it.
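The first technique of Kandemir et al. [2009] can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the class name, the method names, the list of open prefetch-victim pairs, and the threshold value used in the usage note are hypothetical and do not come from the original work.

```python
from collections import Counter


class HarmfulPrefetchMonitor:
    """Tracks prefetch-victim pairs and picks cores to suppress per phase."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical ratio threshold
        self.open_pairs = []        # (prefetched_addr, victim_addr, core)
        self.harmful = Counter()    # harmful prefetches issued, per core

    def on_prefetch_fill(self, core, prefetched_addr, victim_addr):
        # Remember which block the prefetched block displaced.
        self.open_pairs.append((prefetched_addr, victim_addr, core))

    def on_demand_access(self, addr):
        remaining = []
        for prefetched, victim, core in self.open_pairs:
            if addr == victim:
                # The discarded block was needed first: harmful prefetch.
                self.harmful[core] += 1
            elif addr == prefetched:
                # The prefetched block was used first: not harmful.
                pass
            else:
                remaining.append((prefetched, victim, core))
        self.open_pairs = remaining

    def end_phase(self):
        """Return the cores whose prefetches are suppressed in the next phase."""
        total = sum(self.harmful.values())
        suppressed = {
            core for core, count in self.harmful.items()
            if count / total > self.threshold
        }
        self.harmful.clear()
        return suppressed
```

With a threshold of 0.5, a core responsible for two of three harmful prefetches in a phase would be suppressed for the next phase, while the other cores would continue prefetching.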
By removing the effect of harmful prefetches, their techniques enable leveraging the benefit of SW prefetching, even for large core-counts. Yu and Liu [2014] note that in a multicore system running multi-threaded application, different threads share data and, hence, if prefetched data blocks replace demand-fetched data blocks, all the sharers of that cache block need to invalidate their local copy. Thus, prefetching can lead to inter-thread invalidations and cause contention in shared resources such as LLC and main memory. A prefetching request that causes demand miss in an L1 cache due to invalidation of data block in its sharer's L1 caches is termed an attacking prefetch request. For each thread, they utilize a prefetcher that can prefetch both sequential and chained stream patterns and they use filtering mechanisms for each of the two patterns. For sequential streams, the data of all attacking prefetches are ignored, and only the address is stored in the pattern table for later use. For chained patterns, if L1 prefetching misses are found to be attacking prefetches, then they are not immediately ignored. Instead, they are issued and the linked pattern streams are maintained based on the return value. In other words, return value is used to compute the next node address in the linked stream, and then it is ignored (i.e., not moved to cache). Further, based on the runtime feedback about memory requirement and prefetching effectiveness of each thread, their technique adapts the mode and aggressiveness of its prefetcher. For example, for applications with high memory intensity but low prefetching intensity, the prefetcher can be temporarily shut down to reduce contention on shared resources. For applications showing high accuracy and intensity of prefetching, the aggressiveness of the prefetcher is increased to maximally improve performance. By contrast, the aggressiveness of a prefetcher showing low accuracy is reduced.

Reducing Latency Overhead by Using a Bloom Filter (BF)
A Bloom filter [Bloom 1970] is a probabilistic data structure that uses multiple hash functions to quickly check whether an item is certainly a non-member or may be a member of a set. By sacrificing a small amount of accuracy (false positives are possible, but false negatives are not), a BF significantly speeds up the membership check. We now discuss prefetching techniques that use BF.
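For readers unfamiliar with the structure, a minimal Bloom filter can be sketched in Python as below. The class name, the default sizes, and the use of seeded SHA-256 digests as hash functions are choices made for illustration; hardware implementations typically use much cheaper hash functions over address bits.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _indexes(self, item):
        # Derive several hash values by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item):
        # False means "certainly not a member"; True means "may be a member".
        return all(self.bits[index] for index in self._indexes(item))
```

A lookup that returns False is definitive, which is the property that the miss predictor and the sandbox discussed in this section rely on.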
Peir et al. [2002] present a BF-based predictor that predicts whether an access is a miss or may be a hit. By making this decision early in the pipeline, data can be prefetched into L1 in a timely and accurate manner. Among their designs, the first, called partitioned-address, splits the line address into M partitions. If one or more address partitions of a requested line's address do not match the corresponding address partition of any cache line, then a cache miss is ascertained. In the second design, called partial-address, a bit array is indexed using the least-significant bits of the line address. Every bit shows whether a match is found between the partial address and the corresponding partial address of any cache line. If no match occurs, then a miss is identified. Once a miss is identified in the L1 cache, a miss request can be issued to the L2 cache to prefetch data into the L1 cache. They also use their technique for speculatively scheduling dependent instructions to boost performance. They show that the accuracy of their predictor is very close to 100% and that their technique achieves significant performance improvement.
Pugsley et al. [2014] present a technique to enable the use of aggressive prefetchers while avoiding their limitations, such as BW wastage. Their technique tracks the prefetch requests generated by a candidate prefetch pattern but does not actually issue them. The addresses of all the cache blocks that would have been prefetched by a prefetch pattern are stored in a "sandbox" that is implemented using a BF. By comparing subsequent cache accesses against the addresses in the sandbox, both the accuracy of the prefetcher and the existence of prefetchable streams are ascertained. Actual prefetches are performed only when the accuracy of a prefetcher exceeds a threshold. By virtue of prefetching only after confirmation, their technique avoids harmful prefetches and can confidently issue many prefetches along that pattern, which improves performance by avoiding late prefetches.
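The sandbox idea can be illustrated with the following Python sketch, which evaluates a single candidate pattern (a fixed line offset). The class name, the evaluation period, the accuracy threshold, and the use of a Python set as a stand-in for the Bloom filter are all assumptions made to keep the example short; they are not taken from Pugsley et al. [2014].

```python
class SandboxEvaluator:
    """Evaluates one candidate offset without issuing real prefetches."""

    def __init__(self, offset, period=256, threshold=0.5):
        self.offset = offset        # candidate pattern: line_addr + offset
        self.period = period        # accesses per evaluation round (assumed)
        self.threshold = threshold  # accuracy needed to enable prefetching
        self.sandbox = set()        # stand-in for the Bloom filter
        self.accesses = 0
        self.hits = 0
        self.enabled = False

    def on_access(self, line_addr):
        """Observe one cache access; return the lines to actually prefetch."""
        # A hit in the sandbox means the candidate would have prefetched this.
        if line_addr in self.sandbox:
            self.hits += 1
        # Record the would-be prefetch instead of issuing it.
        self.sandbox.add(line_addr + self.offset)
        self.accesses += 1
        if self.accesses == self.period:
            # End of round: enable real prefetching only if accuracy is high.
            self.enabled = self.hits / self.period > self.threshold
            self.sandbox.clear()
            self.accesses = 0
            self.hits = 0
        return [line_addr + self.offset] if self.enabled else []
```

On a sequential access stream with an offset of 1, the evaluator issues no prefetches during the first round and begins prefetching once the round confirms the pattern; on a stream that the offset does not match, it never issues any.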
Guo et al. [2011] present techniques to offset energy overheads of prefetching due to extra cache lookups (to avoid redundant prefetching) and prefetch-HW. Their first technique uses compiler to identify memory accesses such as scalar accesses, which are not advantageous for prefetching. Then, only accesses such as those to LDSs and arrays are passed to the prefetcher to reduce prefetch-HW lookups. The second technique annotates pointer and array accesses using a compiler. Based on them, at runtime, a suitable prefetching scheme (such as pointer prefetcher or stride prefetcher) is applied to optimize performance. The third technique reduces prefetch-HW lookups for access patterns with very small strides. For such patterns, a single lookup is performed for their multiple occurrences. A fourth technique reduces prefetching-induced cache tag lookups. It stores the most recently prefetched cache tags in a separate buffer. Each prefetching address is compared with this buffer, and, on a match, the prefetching operation is canceled. Otherwise, a cache tag lookup is performed. Dang et al. [2013] present a filtering technique for improving the energy efficiency of prefetching. When insertion of prefetched data leads to eviction of an existing data block, their technique records the prefetch-victim address pair. As execution proceeds, either (1) the prefetch address or (2) the victim address may be accessed or (3) the prefetched block may be replaced before being accessed. The first case indicates a useful prefetch while the last two cases indicate a useless prefetch. Thus, based on which address in the pair is accessed first, their technique collects a utilization count of both issued and filtered prefetches. Access to a filtered address indicates incorrect filtering of a useful prefetch. In such a case, its address is recorded to develop a feedback mechanism for avoiding filtering of future useful prefetches. 
Based on these, the decision about issuing or filtering the prefetch request is taken. Burcea et al. [2008] present predictor virtualization (PV) as a technique to use memory hierarchy for emulating large predictor tables and show its application by virtualizing a spatial memory prefetcher [Somogyi et al. 2006]. Their technique reserves a part of physical memory space for storing the predictor table (PVTable). Another structure, called PVProxy, preserves the interface between the actual optimization technique and the predictor table. It stores a few predictor entries in a small on-chip structure, called PVCache. If an entry is not found in the PVCache, then a memory request is generated to bring the entry. This memory request does not differ from those issued by L1 caches. In this way, PVProxy is transparent to the memory hierarchy. The benefits of virtualizing the predictor are that it avoids wasting a large on-chip space for storing the predictor table, and if the optimization technique is turned off, then the entries of its predictor in L2 cache are soon replaced. Spatial memory streaming [Somogyi et al. 2006] uses two HW structures, viz. a pattern history table (PHT) and an active generation table (AGT). The spatial patterns found by AGT from active memory regions are stored in PHT. In their design, PHT requires 86KB storage, while AGT requires less than 1KB storage. Hence, Burcea et al. apply virtualization to PHT only and bring its on-chip storage requirement to less than 1KB. For this, PHT is itself stored in main memory (PVTable), and a few sets from it are stored in PVCache, which are properly delivered to the prefetching engine by PVProxy. Ferdman and Falsafi [2007] note that, given the large-sized CoR tables, storing them on-chip forces limited size and reduced coverage, while storing them off-chip reduces prediction lookahead and increases prediction latency. 
To bring the best of the two together, CoR data are recorded off-chip in the order in which they will be used and are streamed to a small on-chip table. Long-recurring sequences of consecutively used last-touch signatures are stored off-chip, and, for each sequence, only one head signature is stored on-chip. When an access sequence repeats, its associated last-touch signature sequence is streamed from off-chip to the on-chip table. Thus, their technique facilitates timely signature retrieval from off-chip storage while keeping the on-chip table small. They show that the last-touch order of blocks can be approximated by a sequence of block evictions. They show that their technique reduces L1D cache misses with minimal on-chip storage overhead and off-chip traffic increase.
Kaynak et al. [2013] note that, for processors with many simple (lean) cores, the storage overhead of stream-based prefetching techniques (e.g., Ferdman et al. [2008, 2011]) becomes very high. For homogeneous server workloads that execute similar requests on all cores, the instruction access sequences generated on the cores have high (e.g., 90%) similarity. They propose that the commonality and recurrence of instruction-level behavior across the cores can be leveraged to produce a single instruction history shared among all cores running the workload. This amortizes the area overhead of sophisticated instruction prefetchers. One randomly chosen core, termed the history generator core, generates the instruction fetch stream history. This is stored in the shared history buffer and read by the stream address buffer, which is private to each core and issues prefetches in coordination with I-cache misses. Since keeping a separate history buffer introduces several overheads (such as dedicated storage and logic), they propose embedding the history buffer in the LLC using the PV approach [Burcea et al. 2008].
The large capacity of the LLC also makes it easy to support workload consolidation, where one history buffer can be allocated for each workload to maintain per-workload history.
Alameldeen and Wood [2007] study the interaction of prefetching with cache and BW compression and show that these approaches work synergistically to provide large performance improvement. This is because prefetching partially hides the decompression latency, and compression reduces the BW contention caused by prefetching. Also, compression increases the effective cache capacity, which facilitates bringing more blocks into the cache using prefetching. They also propose an adaptive prefetching mechanism to throttle prefetching when it hurts performance. For each of the private L1 caches and the shared L2 cache, they use a saturating counter, which is decremented on a useless or harmful prefetch and incremented on a useful prefetch. Initially, the counters have their largest value. When a counter reaches zero, prefetching is disabled for that cache. They use a "prefetch bit" for each cache block, which is set when a prefetched line is inserted in the cache. On the first access to this block, the bit is reset and the counter is incremented. If an evicted line still has its prefetch bit set, then the prefetch is considered useless and the counter is decremented. For detecting harmful prefetches, they utilize the additional tags employed for cache compression and store the addresses of replaced blocks in them. On a cache miss, each of the invalid tags in the cache set is examined in LRU stack order. If a match is found, then the missing line was earlier replaced by a currently cached line; if any valid line in the set has its prefetch bit set, then a harmful prefetch is assumed to have evicted the line, and the counter is decremented.
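The counter-based throttling of Alameldeen and Wood [2007] described above can be sketched in Python as follows. The class and method names, the counter width (a maximum of 15), and the dictionary used to hold per-line prefetch bits are assumptions for illustration; harmful-prefetch detection through the extra compression tags is abstracted into a single callback.

```python
class AdaptivePrefetchThrottle:
    """Saturating counter that disables prefetching for one cache at zero."""

    def __init__(self, max_value=15):
        self.max_value = max_value  # assumed saturation limit
        self.counter = max_value    # counters start at their largest value
        self.prefetch_bit = {}      # line address -> prefetch bit

    @property
    def prefetch_enabled(self):
        return self.counter > 0

    def on_prefetch_fill(self, addr):
        # The prefetch bit is set when a prefetched line is inserted.
        self.prefetch_bit[addr] = True

    def on_demand_fill(self, addr):
        self.prefetch_bit[addr] = False

    def on_access(self, addr):
        # First access to a prefetched line: useful prefetch.
        if self.prefetch_bit.get(addr):
            self.prefetch_bit[addr] = False
            self.counter = min(self.max_value, self.counter + 1)

    def on_evict(self, addr):
        # Evicted with the bit still set: useless prefetch.
        if self.prefetch_bit.pop(addr, False):
            self.counter = max(0, self.counter - 1)

    def on_harmful_prefetch(self):
        # Called when the extra tags reveal that a prefetch evicted a needed line.
        self.counter = max(0, self.counter - 1)
```

In this sketch, a run of prefetched lines that are evicted untouched drives the counter to zero, at which point `prefetch_enabled` becomes false for that cache.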
note that conventional cache replacement schemes give identical treatment to prefetch and demand requests, and, hence, in the presence of prefetching, they may not provide expected performance improvement. Their cache management approach dynamically predicts and mitigates the cache interference due to prefetching by altering the insertion and hit promotion schemes of the cache such that they handle prefetch and demand requests differently. They demonstrate their approach by utilizing prefetch/demand request information in improving re-reference predictions generated by the Dynamic Re-Reference Interval Prediction (DRRIP) replacement policy. Thus, their technique synergistically integrates prefetching with intelligent cache replacement schemes. Also, for multicore multiprogrammed workloads, their technique eliminates inter-core and intra-core prefetch-induced interference in shared LLCs.
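The idea of treating prefetch and demand requests differently at insertion and on hits can be sketched as follows. This is a simplified illustration under stated assumptions, not the technique as evaluated: it uses a static RRIP-style set with a 2-bit re-reference prediction value (RRPV) rather than full DRRIP with set dueling, and the specific choices (prefetches inserted at the most distant RRPV, prefetch hits not promoted) as well as the name `PrefetchAwareRRIPSet` are hypothetical.

```python
class PrefetchAwareRRIPSet:
    """One cache set with RRIP-style replacement that distinguishes
    prefetch requests from demand requests (illustrative policy only)."""

    MAX_RRPV = 3  # 2-bit RRPV: 0 = re-reference expected soon, 3 = distant

    def __init__(self, ways=4):
        self.ways = ways
        self.rrpv = {}  # addr -> RRPV

    def _victim(self):
        # Pick a line predicted for distant re-reference; age all lines
        # until one reaches MAX_RRPV.
        while True:
            for addr, value in self.rrpv.items():
                if value == self.MAX_RRPV:
                    return addr
            for addr in self.rrpv:
                self.rrpv[addr] += 1

    def request(self, addr, is_prefetch):
        """Handle a demand or prefetch request; returns True on a hit."""
        if addr in self.rrpv:
            if not is_prefetch:
                self.rrpv[addr] = 0     # only demand hits promote the line
            return True
        if len(self.rrpv) >= self.ways:
            del self.rrpv[self._victim()]
        # Assumed policy: prefetched lines are inserted as the first eviction
        # candidates, demand lines one step closer to re-reference.
        self.rrpv[addr] = self.MAX_RRPV if is_prefetch else self.MAX_RRPV - 1
        return False
```

With this policy, a burst of prefetches tends to replace other prefetched lines before it displaces lines that have seen demand hits, which is one way to limit prefetch-induced interference in a shared cache.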

CONCLUSION AND FUTURE OUTLOOK
As key applications become even more data intensive and power and thermal budgets reach a plateau, effective latency hiding mechanisms such as prefetching are becoming more attractive than costly alternatives such as increases in cache size. However, several challenges remain to be addressed for fully realizing the potential of prefetching in next-generation computing systems. We now discuss a few of them.
In recent years, researchers have explored alternative memory technologies (e.g., non-volatile memories that are write-agnostic, gigabyte-size DRAM caches that may employ cache line sizes of a few KBs) and fabrication approaches (e.g., 3D stacking, which demands intelligent data placement and thermal management). These trends present several new constraints and optimization opportunities for prefetching. Also, while conventional prefetching techniques only optimize performance, higher-level objectives such as quality of service (QoS) and relative application priorities also become important with the rising number of cores. These factors call for re-evaluation of traditional techniques and design of novel techniques for ensuring effective prefetching, and this will be an interesting direction for researchers in the coming years.
Exascale systems seek to achieve 10^18 computations per second within a power budget of 20 MW. It is clear that no single approach can bridge the performance and power efficiency gap between current and future systems. Hence, achieving synergistic integration of prefetching with other approaches, such as data compression, near-threshold voltage computing, and dynamic voltage/frequency scaling (DVFS), will be vital and presents a key challenge for computer architects.
In this article, we presented a survey of recent prefetching techniques for caches. We identified tradeoffs in the use of prefetching and the challenges that merit further investigation. We classified the works along several dimensions to highlight major research directions. It is hoped that this article will help the readers see the prefetching techniques in synthesis and understand their potential in improving performance of future processors.