Learning Models: An Assessment of Progress, Challenges and New Developments

Learning models extend the traditional discrete choice framework by postulating that consumers have incomplete information about product attributes, and that they learn about these attributes over time. In this survey we describe the literature on learning models that has developed over the past 20 years, using the model of Erdem and Keane (1996) as a unifying framework. We describe how subsequent work has extended their modeling framework, and applied learning models to a wide range of different products and markets. We argue that learning models have contributed greatly to our understanding of consumer behavior, in particular in enhancing our understanding of brand loyalty and long run advertising effects. We also discuss the limitations of existing learning models and discuss potential extensions. One key challenge is to disentangle learning as a source of dynamics from other key mechanisms that may generate choice dynamics (inventories, habit persistence, etc.). Another is to enhance identification of learning models by collecting and utilizing direct measures of signals, perceptions and expectations.


Introduction
In the field of discrete choice, the most widely used models are clearly the multinomial logit and probit. 1 Of course, there has been substantial effort over the past 20 years to generalize these workhorse models to allow richer structures of consumer taste heterogeneity, serial correlation in preferences, dynamics, endogenous regressors, etc. However, with few exceptions, work within the traditional random utility framework maintains the strong assumption that consumers know the attributes of their choice options perfectly.
Learning models extend the traditional discrete choice framework by postulating that consumers may have incomplete information about product attributes. Thus, they make choices based on perceived attributes. Over time, consumers receive information signals that enable them to learn more about products. It is this inherent temporal aspect of learning models that distinguishes them from static choice under uncertainty models.
Within this general framework, different types of learning models can be distinguished along four key dimensions. One is whether consumers behave in a forward-looking manner. If attributes are uncertain, and consumers are myopic, they choose the alternative with highest expected current utility. But forward-looking consumers may (i) make trial purchases to enhance their information sets, or (ii) actively search for information about products via other sources.
A second key distinction is whether utility is linear in attributes or whether consumers exhibit risk aversion. In the linear case, forward-looking consumers are willing to pay a premium for unfamiliar products, as they receive not only the expected utility of consumption but also the value of the information acquired by trial. 2 But with risk aversion, consumers are willing to pay a premium for a more familiar product. This can generate "brand equity" for well-known brands.
A third distinction involves sources of information. In the simplest learning models trial is the only information source. In more sophisticated models consumers can learn from a range of sources, such as advertising, word-of-mouth, price signals, salespeople, product ratings, social networks, newspapers, etc. A consumer must decide how much to use each available source. In particular, consumers may engage in "passive search," using only information sources that arrive without active effort. Erdem and Keane (1996) found that the functional form for state dependence implied by Bayesian learning fit the data better than popular reduced form specifications, such as Guadagni and Little (1983)'s exponentially weighted average of past purchases specification (the so called "loyalty variable"). 6 Erdem and Keane (1996) also found strong evidence that advertising has important long run effects on demand (via the total stock of advertising), but that short run effects of recent advertising were negligible.
The Erdem and Keane (1996) paper was influential because: (1) it provided a practical method for estimating complex learning models, (2) it showed that, far from imposing a "straitjacket" on the data, the Bayesian learning structure led to insights about the functional form for state dependence that improved model fit, (3) it generated interesting results about long- vs. short-run effects of advertising, and (4) it gave an economic rationale for the "brand loyalty" observed in scanner panel data. 7 These results generated new interest in structural learning models (and dynamic structural models more generally) within the fields of marketing and economics.
Nevertheless, there was a time lag of roughly five years from Erdem and Keane (1996) to the publication of many additional papers on learning models. But, starting in the early 2000s, there has been an explosion of new work in marketing and economics applying learning models to brand choice and many other problems. Other interesting applications include: (i) demand for new products, (ii) choice of TV shows and movies, (iii) prescription drugs, (iv) durable goods, (v) insurance products, (vi) choice of tariffs (i.e., price/usage plans), (vii) fishing locations, (viii) career options, (ix) service quality, (x) childcare options, and (xi) medical procedures.
Some of these applications are based rather closely on the Erdem and Keane (1996) framework with forward-looking Bayesian consumers, while other papers depart from or extend that framework in important ways (often along one of the four key dimensions noted above).
The outline of the survey is as follows: In Section 2 we describe the learning model of Erdem and Keane (1996) in some detail. We will treat their model as a unifying framework to discuss the rest of the literature. In general, later developments can be viewed as extending the Erdem-Keane model along certain dimensions (while typically restricting it on others to make those extensions feasible), or applying it in different contexts. Section 3 reviews the subsequent literature on learning models.

6 In most applications, imposing structure involves sacrificing fit to some extent (i.e., not surprisingly, structural models usually fit worse than flexible reduced form or statistical/descriptive models). The payoff of imposing the structure is (1) greater interpretability of parameter estimates and (2) the ability to do policy experiments. Erdem and Keane (1996) was a rare instance where a structural model actually fit better than popular competing reduced form models.

7 A key insight of Erdem and Keane (1996) was that uncertainty about quality combined with risk aversion could lead to brand loyal behavior (i.e., persistence in brand choice over time). Loyalty emerges as consumers stick with familiar products (whose attributes are precisely known) to avoid risk. Given equal prices, a familiar brand may be chosen over a less familiar brand even if it has lower expected quality, provided consumers are sufficiently risk averse. In this framework, "loyalty" is the price premium that consumers are willing to pay for greater familiarity (lower risk). Keller (2002) refers to the general framework laid out in Erdem and Keane (1996), and elucidated further in Erdem and Swait (1998), as "the canonical economic model of brand equity." (Of course, there are also a number of psychology-based models; see Keller (2002) for an overview.)
It is divided into subsections that cover: (i) more sophisticated learning models with myopic consumers, (ii) more sophisticated learning models with forward-looking consumers, (iii) models for product level/market share data, and (iv) new or novel applications of learning models. Section 4 describes what we consider the key challenges for future research. In Section 5 we summarize and conclude.

The General Structure of the Erdem and Keane (1996) Model
As we noted in the introduction, the papers by Roberts and Urban (1988) and Eckstein, Horsky and Raban (1988) were the first applications of learning models to marketing problems.
The model of Erdem and Keane (1996), henceforth "EK," nests those models in a more general framework. Thus, in this section we describe the EK model in some detail. Readers interested in more detail about the earlier models can refer to a detailed description in online Appendix A.

A Simple Dynamic Learning Model with Gains from Trial Information
Of course, the key feature of learning models is that consumers do not know the attributes of brands with certainty. While this may be true of many attributes, most papers, including EK, have focused on learning about brand quality. In their model, consumers receive signals about quality through both use experience and ad signals. But prior to receiving any information, consumers have a normal prior on brand quality:

(1)  Q_j ~ N(Q_j1, σ²_j1)
This says that, prior to receiving any information, consumers perceive that the true quality of brand j, denoted Q_j, is distributed normally with mean Q_j1 and variance σ²_j1. So in the first period, the consumer's information set is just I_1 = {Q_j1, σ²_j1}. The values of Q_j1 and σ²_j1 may be influenced by many factors, such as reputation of the manufacturer, pre-launch advertising, etc.
Use experience does not fully reveal quality because of "inherent product variability." This has multiple interpretations. First, the quality of different units of a product may vary.
Second, a consumer's experience of a product may vary across use occasions. For instance, a cleaning product may be effective at removing the type of stains one faces on most occasions, but be ineffective on other occasions. Alternatively, there may be inherent randomness in psychophysical perception. E.g., the same cereal tastes better to me on some days than others.
Given inherent product variability, there is a distinction between "experienced quality" for brand j on purchase occasion t, which we denote Q^E_jt, and true quality Q_j. Let us assume the "experienced quality" delivered by use experience is a noisy signal of true quality, as in:

(2)  Q^E_jt = Q_j + δ_jt, where δ_jt ~ N(0, σ²_δ) for t=1,…,T.
Here σ²_δ is the variance of inherent product variability, which we often refer to as "experience variability." Of course experience signals are consumer i specific. But here and in later equations we will suppress the i subscript whenever possible to save on notation.
Note that we have conjugate priors and signals, as both the prior on quality in (1) and the noise in the quality signals in (2) are assumed to be normal. This structure gives simple formulas for updating perceptions as new information arrives, as we will see below. This is precisely why we assume priors and signals are normal. Few other reasonable distributions would give simple expressions. Also, as signals are typically unobserved by the researcher, it is not clear that more flexible distributions would be identified from choice data alone.
Thus, the posterior for perceived quality, given a single use experience signal (received after the first purchase of brand j), is given by the simple Bayesian updating formulas:

(3)  Q_j2 = [σ²_δ/(σ²_j1 + σ²_δ)]·Q_j1 + [σ²_j1/(σ²_j1 + σ²_δ)]·Q^E_j1

(4)  σ²_j2 = [1/σ²_j1 + 1/σ²_δ]^(-1)

Equation (3) describes how a consumer's prior on quality of brand j is updated as a result of the experience signal Q^E_j1. The extent of updating is greater the more accurate the signal (i.e., the smaller is σ²_δ). Equation (4) describes how a consumer's uncertainty declines when he/she receives the signal. The quantity σ²_jt is often referred to as the "perception error variance." Equations (3) and (4) generalize to multiple periods. Let N_j(t) denote the total number of use experience signals received prior to the purchase occasion at time t. Then we have:

(5)  Q_jt = σ²_jt·[Q_j1/σ²_j1 + Σ_{τ<t} d_jτ·Q^E_jτ/σ²_δ]

(6)  σ²_jt = [1/σ²_j1 + N_j(t)/σ²_δ]^(-1)

where d_jτ is an indicator for whether brand j is bought/consumed at time τ.
In (5), perceived quality of brand j at time t, Q_jt, is a weighted average of the prior Q_j1 and all quality signals Q^E_jτ received up until the beginning of time t. Crucially, this is a random variable across consumers, as some will, by chance, receive better quality signals than others.
Thus, the learning model endogenously generates heterogeneity across consumers in perceived quality of products (even starting from identical priors). This aspect of the model is appealing. It seems unlikely that people are born with brand preferences (as standard models of heterogeneity implicitly assume), but rather that they arrive at their views through heterogeneous experience.
Of course, as Equation (6) indicates, the variance of perceived quality around true quality declines as more signals are received, and in the limit perceived quality converges to true quality.
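This updating and convergence logic is straightforward to implement. The following sketch (variable names are ours, not EK's) updates a normal prior with a sequence of noisy experience signals:

```python
import random

def update(prior_mean, prior_var, signal, signal_var):
    """One step of Eqs. (3)-(4): the posterior mean is a precision-weighted
    average of the prior mean and the signal, and the posterior variance
    (the 'perception error variance') shrinks with every signal."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / signal_var)
    weight = prior_var / (prior_var + signal_var)  # weight placed on the signal
    post_mean = (1.0 - weight) * prior_mean + weight * signal
    return post_mean, post_var

# Simulate a consumer learning about a brand with true quality 1.0,
# starting from a diffuse prior centered at 0.
random.seed(0)
true_q, mean, var = 1.0, 0.0, 4.0
for t in range(50):
    experienced_q = true_q + random.gauss(0.0, 1.0)  # a signal as in Eq. (2)
    mean, var = update(mean, var, experienced_q, 1.0)
# After 50 signals the perception error variance is 1/(1/4 + 50), and the
# posterior mean has converged close to true quality.
```

Note that the variance path is deterministic given the number of signals, exactly as in (6), while the mean path is random: this randomness is the source of the endogenous heterogeneity in perceived quality across consumers.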
Still, heterogeneity in perceptions will persist over time, for several reasons: (i) both brands and consumers are finitely lived, (ii) there is a flow of new brands and consumers entering a market, and (iii) as people gather more information the value of trial diminishes, and the incentive to learn about unfamiliar products will become small. Intuitively, once a consumer is familiar with a substantial subset of brands, there is rarely much marginal benefit to learning about all the rest.
In general, learning models must be solved by dynamic programming (DP), because today's purchase affects tomorrow's information set, which affects future utility. The key idea of DP is that, at each time t, the value of choosing option j consists of an immediate payoff, plus the expected present value of the future payoff stream which arises in period t+1 onward. This "forward looking" term is conditional on the option j chosen at time t, as the choice at j alters the consumer's information set, which in turn affects the choices that he/she makes in the future.
In our notation, the information set is I_t, the value of choosing alternative j at time t is V(j,t|I_t), the current payoff will be a context-specific utility function, and the expected present value of future payoffs (conditional on I_t and j), or "future component," is denoted EV(t+1|I_t,j).
If choices convey not just utility but also information, it may not be optimal to choose the brand with the highest perceived quality in the current period. To see this, it is useful to consider the special case where the choice is between an old familiar brand (whose attributes are known with certainty) and a new brand. Denote these by j=o,n (for old and new). The information set is I_t = {Q_nt, σ²_nt}, where we suppress the values for the old brand, which are just Q_o and a perception error variance of zero.
Prices are given by P_jt for j=o,n. Then values of choosing each brand in the current period are:

(7)  V(n,t|I_t) = E[U(Q^E_nt)|I_t] - w_P·P_nt + e_nt + EV(t+1|I_t,n)

(8)  V(o,t|I_t) = E[U(Q^E_ot)|I_t] - w_P·P_ot + e_ot + EV(t+1|I_t,o)

To gain further insight, it is useful to consider the special case where utility is linear in experienced quality, as in Eckstein et al (1988), thus abstracting from risk aversion, and also linear in price. In that case, (7) and (8) simplify to:

(9)  V(n,t|I_t) = w_Q·Q_nt - w_P·P_nt + e_nt + EV(t+1|I_t,n)

(10) V(o,t|I_t) = w_Q·Q_o - w_P·P_ot + e_ot + EV(t+1|I_t,o)

where EV(t+1|I_t,j) is the expected present value of payoffs from t+1 onward, conditional on the information set and the brand chosen at time t.
Here the e_jt for j=o,n are stochastic terms in the utility function that represent purely idiosyncratic tastes for the two brands. These play the same role as the brand specific stochastic terms in traditional discrete choice models like logit and probit. 8 Now, a consumer will choose the new brand over the familiar brand if the value function in equation (9) exceeds that in (10). This means that V(n,t|I_t) - V(o,t|I_t) > 0, where:

(11)  V(n,t|I_t) - V(o,t|I_t) = w_Q·(Q_nt - Q_o) - w_P·(P_nt - P_ot) + (e_nt - e_ot) + G_t

and G_t = EV(t+1|I_t,n) - EV(t+1|I_t,o) is the difference in the expected present value of future payoffs ("future components") induced by trying n rather than o. We will refer to G_t as the "gain from trial." It is the increase in expected present value of utility from t+1 until the terminal period T that arises because the consumer obtains information by trying the new brand at time t.
Intuitively, the gain from trial comes from two sources. Most obviously, the consumer may learn that the new brand is better than the old brand. More subtly, suppose the evidence indicates the new brand is inferior to the familiar brand. Even then, there is some price differential large enough that the consumer will choose the new brand over the familiar brand, provided the new brand is cheaper by at least that amount. More precise information about the quality of the new brand enables the consumer to set this reservation price differential more accurately.
Here, we give a sketch of a proof that the future component EV(t+1|I_t,n) exceeds EV(t+1|I_t,o), and hence that G_t is positive. This is a very general result of information economics, but it is easiest to show in the linear case. It is also easiest to consider a finite horizon problem with terminal period T. As there is no future, the consumer at time T simply chooses the brand with highest expected utility.
Thus, the utility a consumer with incomplete information (i.e., Q_nT ≠ Q_n) receives at T is determined by the decision rule: choose n if w_Q·Q_nT - w_P·P_nT + e_nT > w_Q·Q_o - w_P·P_oT + e_oT. On the other hand, a consumer with complete information would use the rule: choose n if w_Q·Q_n - w_P·P_nT + e_nT > w_Q·Q_o - w_P·P_oT + e_oT. This depends on true quality Q_n, not on perceived quality Q_nT. Thus, a consumer with incomplete information is in effect making decisions at T using the "wrong" decision rule, so in general he/she will make suboptimal decisions. More formally, letting a* be a noisy measure of a, we have E{a·1[a* > b] + b·1[a* ≤ b]} ≤ E{max(a,b)}. This is the key intuition for why information is valuable. A complete proof involves two more steps. First, one needs to show that as Q_nT becomes more accurate the consumer's decisions become closer to optimal, so that expected realized utility at T is decreasing in the perception error variance σ²_nT. Second, by backwards induction it can be shown that this is true back to any period t.
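This inequality can be illustrated with a small Monte Carlo experiment (a stylized example of ours, not from EK): a consumer chooses between a familiar brand with known utility 0 and a new brand whose true quality is standard normal, based on a noisy perception of that quality. Average realized utility falls as the perception error grows:

```python
import random

def expected_realized_utility(perception_sd, n_sims=200_000, seed=1):
    """Average utility actually received when the consumer picks the new
    brand iff its *perceived* quality beats the known (zero) utility of
    the old brand. Perceived quality = true quality + perception error."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        q_true = rng.gauss(0.0, 1.0)                      # true quality
        q_perc = q_true + rng.gauss(0.0, perception_sd)   # noisy perception
        total += q_true if q_perc > 0.0 else 0.0          # old brand pays 0
    return total / n_sims

# With no perception error the consumer attains E[max(Q,0)]; realized
# utility falls monotonically as the perception error variance rises.
full_info = expected_realized_utility(0.0)
noisy = expected_realized_utility(1.0)
very_noisy = expected_realized_utility(3.0)
```

The gap between full_info and the noisy cases is exactly the loss from using the "wrong" decision rule, and it is this loss that trial information eliminates.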
Although G_t > 0, i.e., more information is better, it is notable that G_t is smaller if (i) the consumer has more information (σ²_nt smaller) or (ii) use experience signals are less accurate (σ²_δ larger). Both lower the value of trial. Notice that (11) can be rewritten as "Choose brand n if:"

w_Q·Q_nt + G_t - w_P·P_nt + e_nt > w_Q·Q_o - w_P·P_ot + e_ot

This shows that the trial value G_t augments the perceived value of the new brand w_Q·Q_nt.
Thus, ceteris paribus, the new brand can command a price premium over the old brand because it delivers valuable trial information. So the model with linear utility (i.e., no risk aversion) generates a "value of information" effect that is opposite to the conventional brand loyalty phenomenon.
[In online Appendix A we give some details on estimation of this model.]

Introducing Risk Aversion and Exogenous Signals
Next, we introduce two key features of the Erdem and Keane (1996) model that generalize the simple setting described above. First, we introduce exogenous signals of quality (e.g., advertising) as an additional source of information besides use experience. Second, we consider utility functions that exhibit risk aversion with respect to variation in brand attributes (focusing again on quality). We should note that both these features were already present in Roberts and Urban (1988), but in a static choice context.
There are numerous ways one can obtain information about a brand other than trial purchase. 9 Examples are advertising, word-of-mouth, magazine articles, dealer visits, etc. For simplicity we will often refer to these as "exogenous" signals, as we may think of them as arriving randomly from the outside environment. (Of course, a consumer may actively seek out such signals, an extension we discuss below). For frequently purchased goods the most important source of information is probably advertising, and this is the source that EK consider.
Let A_jt denote an exogenous signal (advertising, word of mouth, etc.) that a consumer receives about brand j at time t (prior to the time t purchase decision). We further assume that:

(14)  A_jt = Q_j + ε_jt, where ε_jt ~ N(0, σ²_A) for t=2,…,T

This says the signals A_jt provide unbiased but noisy information about brand quality, where the noise ε_jt has variance σ²_A. The noise is assumed normal, to maintain conjugacy with the prior in (1).
It is important to compare (14) with (2). The noise in trial experience is from inherent product variability, which is largely a feature of the product itself. The noise in a signal like advertising or word-of-mouth is, in contrast, largely a function of the medium. Presumably some media convey information more accurately than others, and no medium is as accurate as direct use experience. We also stress that the noise in (14) differs fundamentally from that in (2), as inherent product variability affects a consumer's experienced utility from consuming the product, while exogenous quality signals do not. Nevertheless, both types of signal enter the consumer's learning process in the same way. Given only the exogenous signal A_jt, we can rewrite (5)-(6) as:

(15)  Q_jt = σ²_jt·[Q_j1/σ²_j1 + Σ_{τ≤t} a_jτ·A_jτ/σ²_A]

(16)  σ²_jt = [1/σ²_j1 + K_j(t)/σ²_A]^(-1)

where a_jt is an indicator for whether a signal for brand j is received at time t, and K_j(t) is the total number of ad signals received for brand j up through and including time t.
It is simple to extend the Bayesian updating rules in (5)-(6) and (15)-(16) to allow for two types of signals, i.e., both use experience and exogenous signals. Our timing convention is that time t ad signals are received before the time t purchase decision, while time t experience signals are received afterwards. For example, perceived quality prior to the purchase decision at t=2 is Q_j2 = σ²_j2·[Q_j1/σ²_j1 + d_j1·Q^E_j1/σ²_δ + a_j2·A_j2/σ²_A]. Generalizing to multiple periods we have:

(17)  Q_jt = σ²_jt·[Q_j1/σ²_j1 + Σ_{τ<t} d_jτ·Q^E_jτ/σ²_δ + Σ_{τ≤t} a_jτ·A_jτ/σ²_A]

(18)  σ²_jt = [1/σ²_j1 + N_j(t)/σ²_δ + K_j(t)/σ²_A]^(-1)

where N_j(t) and K_j(t) are the signal counts defined above. Note that the recency of signals does not matter: In (17)-(18) only the total stock of signals determines a consumer's state. Furthermore, receiving N signals with variance σ² affects the perception variance in the same way as receiving one signal with variance σ²/N.
As we will see, these properties are important for simplifying the solution to consumers' dynamic optimization problem. This is because the consumer's level of uncertainty, as captured by the perception error variances {σ²_jt}, depends only on the number of signals received, not the order or timing with which they were received. One could imagine scenarios where more recent signals are more salient, or, conversely, where first impressions are most important. These are important potential extensions of the model, but they would make computation much more difficult.
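Both properties are easy to verify numerically. In this sketch (notation ours), updating sequentially with N signals of variance σ² gives the same posterior as a single update using the signal mean with variance σ²/N, and the order of the signals is irrelevant:

```python
def posterior(prior_mean, prior_var, signals, signal_var):
    """Sequentially update a normal prior with i.i.d. normal signals,
    as in Eqs. (3)-(4)."""
    mean, var = prior_mean, prior_var
    for s in signals:
        new_var = 1.0 / (1.0 / var + 1.0 / signal_var)
        mean = new_var * (mean / var + s / signal_var)
        var = new_var
    return mean, var

signals = [1.2, 0.7, 1.9, 0.4]
seq = posterior(0.0, 4.0, signals, 1.0)                 # one signal at a time
rev = posterior(0.0, 4.0, signals[::-1], 1.0)           # reversed order
avg = sum(signals) / len(signals)
batch = posterior(0.0, 4.0, [avg], 1.0 / len(signals))  # one pooled signal
# seq, rev and batch all produce the identical posterior mean and variance.
```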
In order to progress further and develop a model that can be taken to the data, one must specify a particular functional form for the utility function. Of course, many functions are possible. Erdem and Keane (1996) assumed a utility function of the form:

(19)  U_jt = w_Q·Q^E_jt - r·(Q^E_jt)² + w_P·C_t + e_jt

Here utility is quadratic in the experienced quality of brand j at time t, and linear in consumption of the composite outside good C_t = X - P_jt, where X is income. The parameter w_Q is the weight on quality, r is the risk coefficient, w_P is the marginal utility of the outside good, and e_jt is an idiosyncratic brand and time specific error term. 10 Note that, as choices only depend on utility differences, and as income is the same regardless of which brand is chosen, income drops out of the model. So we can simply think of w_P as the price coefficient.
Given (19), combined with (2) and (18), expected utility is given by:

(20)  E[U_jt|I_t] = w_Q·Q_jt - r·(Q²_jt + σ²_jt + σ²_δ) + w_P·(X - P_jt) + e_jt

Also, as the Erdem-Keane model was meant to be applied to weekly data, and as consumers may not buy in every week, a utility of the no purchase option must also be specified. EK wrote this as U_0t = w_0 + w_1·t + e_0t. The time trend captures in a simple way the possibility of changing value of substitutes for the category in question.
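The expected utility expression follows from a standard second-moment decomposition. Using the quadratic utility described above (in our notation), with Q^E_jt = Q_j + δ_jt and Q_j | I_t ~ N(Q_jt, σ²_jt):

```latex
E[Q^E_{jt} \mid I_t] = Q_{jt}, \qquad
E[(Q^E_{jt})^2 \mid I_t]
  = \underbrace{Q_{jt}^2}_{\text{mean}^2}
  + \underbrace{\sigma_{jt}^2}_{\text{perception error}}
  + \underbrace{\sigma_{\delta}^2}_{\text{experience variability}},
\]
so that
\[
E[U_{jt} \mid I_t]
  = w_Q Q_{jt} - r\!\left(Q_{jt}^2 + \sigma_{jt}^2 + \sigma_{\delta}^2\right)
  + w_P (X - P_{jt}) + e_{jt}.
```

Note that risk aversion penalizes both remaining uncertainty about true quality (σ²_jt) and inherent product variability (σ²_δ); only the former can be reduced by learning.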
We have now specified the complete EK model, and we are in a position to formally state the consumer's problem. Consumers are assumed to be forward-looking, making choices to maximize value functions of the form:

(21)  V(j,t|I_t) = E[U_jt|I_t] + β·E[max_k V(k,t+1|I_t+1) | I_t, j]

where β is the discount factor and the consumer's information set is given by:

(22)  I_t = {Q_jt, σ²_jt : j=1,…,J}

A key point is that a consumer's information about a brand may be updated between t and t+1 for two reasons: (i) the consumer buys the brand, or (ii) the consumer receives an exogenous signal about the brand. Henceforth we simply refer to these as "ad signals." In forming I_t+1 we allow for both sources of information. We describe the process in detail in the next section.
With the introduction of risk aversion, the EK model can capture both gains from trial and brand loyalty phenomena. As we discussed earlier, the G_t terms in a dynamic learning model capture the gain from trial information. These are greater for less familiar brands, where the gain from trial is greater. At the same time, the risk terms (which depend on the perception error variances σ²_jt) are also greater for such brands. These two forces work against each other, and which dominates determines whether a consumer is more or less likely to try a new unfamiliar brand vs. a familiar brand. In categories where risk aversion dominates, we would expect to see a high degree of brand loyalty (i.e., persistence in choice behavior). In categories where the gains from trial dominate, we would expect to see a high degree of brand switching (due to experimentation). 11

Solving the Dynamic Optimization (DP) Problem
Here we show how to solve a consumer's dynamic optimization problem. Solving the DP problem is computationally difficult for two reasons: (i) The expected value functions in (21) are high dimensional integrals, and (ii) these integrals must be evaluated at many state points.
The expected value functions in (21) have the form:

(23)  E[max_k V(k,t+1|I_t+1) | I_t, j]

That is, the consumer at time t knows that, at time t+1, he/she will choose from among the J options the one with the highest value function. The consumer can form the expected maximum over these value functions, because his/her information set and decision at time t (i.e., the (I_t, j)) generate a distribution of I_t+1, in the manner described earlier.
However, it is not immediately obvious how (23) helps us to solve the consumers' optimization problem. The Vs on the right hand side of (23) themselves contain expected value functions dated at t+2, that is, functions of the form E[max_k V(k,t+2|I_t+2) | I_t+1, j]. So it seems we have only pushed the problem one period ahead. One key insight for solving a dynamic programming problem is to assume there exists a terminal period T beyond which a consumer does not plan. At T, the consumer will simply choose the option with highest expected utility. Thus, we have that:

(24)  V(j,T|I_T) = E[U_jT|I_T]

(25)  E[max_j V(j,T|I_T) | I_T-1, j]

The integral in (25) is feasible to evaluate. Suppose that, hypothetically, the I_T and P_T were known at T-1. Then, as we see from (20), the only unknowns appearing in (25) would be the logistic errors {e_0T,…, e_JT}. In that case (25) would have a simple closed form given by the well-known nested logit "inclusive value" formula (see Rust (1994)). This illustrates the point that estimating a finite-horizon dynamic model is very much like estimating a nested logit model, if one thinks of moving down the nesting structure as a process that plays out over time.
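The "inclusive value" formula referred to here states that if the errors e_j are i.i.d. type-I extreme value, then E[max_j (v_j + e_j)] = γ + log Σ_j exp(v_j), where γ ≈ 0.5772 is Euler's constant. This is easy to verify by simulation (a sketch of ours, not EK's code):

```python
import math
import random

def emax_simulated(v, n_sims=200_000, seed=2):
    """Monte Carlo estimate of E[max_j (v_j + e_j)] with i.i.d. type-I
    extreme value errors, drawn by inverting the Gumbel CDF."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        total += max(vj - math.log(-math.log(rng.random())) for vj in v)
    return total / n_sims

def emax_closed_form(v):
    """The closed-form 'inclusive value': log-sum-exp plus Euler's constant."""
    return 0.5772156649 + math.log(sum(math.exp(vj) for vj in v))

v = [0.0, 1.0, -0.5]
# emax_simulated(v) and emax_closed_form(v) agree to Monte Carlo accuracy.
```

It is this closed form that makes the innermost expectation cheap once the deterministic parts of the value functions are known.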
Of course, evaluating (25) in the EK model is more difficult, because the I_T and P_T are not known at T-1. Both experience signals and ad signals may arrive between T-1 and T, causing the consumer to update his/her information set. The expectation in (25) must be taken over the possible I_T that may arise as a result of these signals. Specifically, we must: (i) update σ²_jT-1 to σ²_jT using (18) to account for additional use experience, (ii) integrate over possible values of the use experience signal in (2) to take the expectation over possible realizations of Q_jT, (iii) integrate over possible ad exposures that may arrive between T-1 and T (i.e., over realizations of a_jT for j=1,…,J) to account for ad induced changes in the {σ²_jT}, and (iv) integrate over possible values of the ad signals in (14), as these will lead to different values of the {Q_jT}. 12 Clearly the integrals in (25) are high dimensional, and simulation methods are needed.
That is, we integrate by simulation over draws from the distributions of the signal processes. The computational burden increases if consumers learn about multiple brands, and/or have more than one source of information. Memory is also an issue, as all the expected value functions must be saved.
Having calculated the values of (25) for every possible (I_T-1, j) and saved the results (a point we return to below), we can move back to time T-1, where (21) becomes:

(26)  V(j,T-1|I_T-1) = E[U_jT-1|I_T-1] + β·E[max_k V(k,T|I_T) | I_T-1, j]

Note that (26) is just like (24), except for the expected value terms that are appended. But we have already solved for these and saved them in memory, so they are just numbers. So we can now construct the V(j,T-1|I_T-1). This enables us to proceed backwards and calculate the time T-1 version of (25), and obtain the E[max_j V(j,T-1|I_T-1) | I_T-2, j]. Then we can work back again and obtain the V(j,T-2|I_T-2). This backwards induction process is repeated until we have solved the entire dynamic programming problem back to t=1. Detailed descriptions of this process, known as "backsolving," are contained in many sources. See, for instance, Keane et al (2011).
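To make the backsolving recursion concrete, here is a deliberately stripped-down sketch (our own simplification, not the EK model): a single uncertain new brand vs. a familiar brand with known utility, linear utility, no ad signals, and no taste shocks. The state is (n, m), the number of past trial signals and the current posterior mean; the perception error variance depends only on n:

```python
import math

Q_OLD = 0.0                 # known flow utility of the familiar brand
PRIOR_VAR = 1.0             # prior variance on the new brand's quality
SIGNAL_VAR = 1.0            # inherent product variability
BETA, T = 0.95, 10          # discount factor and terminal period
GRID = [i * 0.1 for i in range(-40, 41)]      # grid over the posterior mean m
QUANTILE_Z = [-1.28, -0.52, 0.0, 0.52, 1.28]  # crude 5-point normal quadrature

def post_var(n):
    """Perception error variance after n trial signals (cf. Eq. (6))."""
    return 1.0 / (1.0 / PRIOR_VAR + n / SIGNAL_VAR)

def interp(row, m):
    """Piecewise-linear interpolation of a value-function row over GRID."""
    m = min(max(m, GRID[0]), GRID[-1])
    i = min(int((m - GRID[0]) / 0.1), len(GRID) - 2)
    w = (m - GRID[i]) / 0.1
    return (1.0 - w) * row[i] + w * row[i + 1]

def solve():
    """Backward induction: ev[n][k] is the expected value of behaving
    optimally from the current period onward in state (n, GRID[k]).
    Rows with n > T are truncated to zero; such states are unreachable."""
    ev_next = [[0.0] * len(GRID) for _ in range(T + 2)]  # period T+1: zeros
    for t in range(T, 0, -1):
        ev_now = [[0.0] * len(GRID) for _ in range(T + 2)]
        for n in range(T + 1):
            pv = post_var(n)
            # The next posterior mean is m plus a shock with this std. dev.
            sd_m = pv / math.sqrt(pv + SIGNAL_VAR)
            for k, m in enumerate(GRID):
                v_old = Q_OLD + BETA * ev_next[n][k]            # no learning
                cont = sum(interp(ev_next[n + 1], m + z * sd_m)
                           for z in QUANTILE_Z) / len(QUANTILE_Z)
                v_new = m + BETA * cont                         # trial + learning
                ev_now[n][k] = max(v_old, v_new)
        ev_next = ev_now
    return ev_next

ev1 = solve()
# At m = 0 (equal expected quality; GRID[40] = 0) the period-1 value is
# strictly positive: it comes entirely from the option to learn, i.e., the
# gain from trial G_t. It shrinks as n grows and beliefs become precise.
```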
In practice, T is generally chosen to be some time beyond the end of the sample period.
This can be chosen far enough out so that results are not sensitive to the exact value of T.
Unfortunately, the above description is oversimplified, as it assumes it is feasible to calculate the value of (25) for every possible (I_T-1, j). But note that the number of variables that characterize the state of an agent in (22) is 2·J. Solving a dynamic programming problem exactly requires that one solve the expected value function integrals at every point in the state space, and this is clearly not feasible here, because there are too many state variables. Of course, as the state variables in (22) are continuous, it would be literally impossible to solve for the expected value functions at every state point (as the number of points is infinite). A common approach is to discretize continuous state variables using a fairly fine grid. Say we use G grid points for each state variable. 13 As we have 2·J state variables, this gives G^(2·J) grid points, which is impractically large even for modest G and J. This is known as the "curse of dimensionality." A number of ways to deal with this problem have been proposed. To solve the optimization problem in their model (that is, to construct the expected value functions in (21)), Erdem and Keane (1996) used an approximate solution method developed in Keane and Wolpin (1994). The idea is to evaluate the expected value function integrals at a randomly selected subset of state points (where this set is relatively small compared to the size of the total state space). 14 The expected value functions are then constructed at other points via interpolation.
For instance, one can run a regression of the value functions on the state variables (at the random subset of state points), and use the regression to predict the value functions at other points. We give more detail on how to apply this method to estimate learning models in online Appendix B.
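The following sketch illustrates the idea (with a made-up, cheap-to-evaluate stand-in for the expensive Emax integral; all names are ours): evaluate Emax at a small random subset of state points, regress it on functions of the state variables, and use the fitted regression to predict Emax everywhere else.

```python
import random

def true_emax(m, v):
    """Stand-in for the expensive expected-value-function integral at a
    state (m = posterior mean, v = perception error variance)."""
    return max(m, 0.0) + 0.5 * v

def solve_linear(a, b):
    """Solve a small linear system a x = b by Gaussian elimination
    with partial pivoting (mutates a and b)."""
    n = len(b)
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= f * a[i][c]
            b[r] -= f * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(a[i][c] * x[c] for c in range(i + 1, n))) / a[i][i]
    return x

rng = random.Random(3)
# Step 1: evaluate Emax at a small random subset of state points.
states = [(rng.uniform(-2.0, 2.0), rng.uniform(0.1, 1.0)) for _ in range(40)]
X = [[1.0, m, max(m, 0.0), v] for m, v in states]  # regression basis
y = [true_emax(m, v) for m, v in states]
# Step 2: fit the interpolating regression by OLS (normal equations).
k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
coef = solve_linear(xtx, xty)

def emax_hat(m, v):
    """Step 3: predict Emax at any other state from the fitted regression."""
    return sum(c * f for c, f in zip(coef, [1.0, m, max(m, 0.0), v]))
```

In practice the choice of regression basis matters a great deal; Keane and Wolpin (1994) discuss diagnostics for interpolation accuracy.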
The Keane-Wolpin approximation method, or variants on its basic idea, has become widely used in both economics and marketing in the past 15 years to solve many types of dynamic models. This has greatly increased the richness and complexity of the dynamic models that it is feasible to estimate. We will not give details of these computational methods here, but refer the reader to surveys by Keane, Todd and Wolpin (2011), Aguirregabiria and Mira (2010), Geweke and Keane (2001), and Rust (1994), among others.
Finally, a common question is how we can solve the DP problem when we do not know the true parameter values, either for the utility function or the stochastic processes that generate signals. The answer is that the DP problem must be solved at each trial parameter value that is considered during the search process for the maximum of the likelihood function. In other words, the DP solution is nested within the likelihood evaluation. We consider the construction of the likelihood function in the next section.

Evaluating the Likelihood Function
In this section we discuss how to form the likelihood function for the EK learning model.
Let θ = {w_Q, w_P, r, {Q_j1, σ²_j1}, σ²_δ, σ²_A} denote the entire vector of model parameters. Combining Eqs (20) and (21), we have the choice specific value functions:

(27)  V(j,t|I_t) = w_Q·Q_jt - r·(Q²_jt + σ²_jt + σ²_δ) + w_P·(X - P_jt) + β·E[max_k V(k,t+1|I_t+1) | I_t, j] + e_jt

Erdem and Keane assume that the idiosyncratic brand and time specific error terms e_jt in (27) are iid extreme value. In this case, the choice probabilities have a simple multinomial logit form:

(28)  P(j|I_t) = exp(v_jt) / Σ_k exp(v_kt), where v_jt = V(j,t|I_t) - e_jt

Evaluating (28) requires the expected value functions. If these have been saved for every state they can simply be looked up; if one used an interpolating method rather than saving every value, the appropriate values may be constructed as needed using the interpolating function. Erdem and Keane (1996) use the latter procedure, because the number of possible states in their model is so large.
To proceed in constructing the likelihood we need some definitions. Let j(t) denote the choice actually made at time t (we continue to suppress the i subscripts to conserve on notation). Let D_t-1 denote the history of purchase decisions up through time t-1, and A_t the history of ad exposures up through time t. (Recall that D_t-1 and A_t are observed by the econometrician, while the signal contents, i.e., the experienced qualities Q^E and the ad messages Ǎ, are not.)
Finally, we define P(j(t) | I_1, D_t-1, A_t, Q^E_t-1, Ǎ_t) as the probability of a person's choice at time t given his/her initial state (I_1), the history of use experience prior to time t, and advertising exposures up through time t, as well as the content of those signals. It is worth emphasizing the timing convention that the ads at time t are observed before the time t choice is made.
Unfortunately, we cannot observe the actual content of ad and experience signals. Thus, we must integrate over that content to obtain unconditional probabilities.
Thus, the probability of a choice history for an individual takes the form of an integral of the conditional history probability over the distribution of signal contents, given in (30). In (30) we integrate over all experience and advertising signals that the consumer may have received from t=1,…,T. That is, we integrate over the joint distribution of the experience and ad signal contents.
Clearly, the required order of integration is substantial. 15 To deal with this problem, Erdem-Keane used simulated maximum likelihood (see, e.g., Keane (1993, 1994)). Specifically, draw D sets of signals for d=1,…,D, using the distributions defined in (2) and (14). Then form the simulated probability in (31) as the average over the D draws of the conditional history probability. Finally, sum the logs of these probabilities across individuals i=1,…,N.
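The logic of this simulation step can be sketched as follows. This is a toy illustration only: the signal process and the mapping from signals to choice-specific values below are hypothetical placeholders, not the EK specifications.

```python
import math
import random

def logit_prob(values, j):
    """Multinomial logit probability of alternative j."""
    m = max(values)
    expv = [math.exp(v - m) for v in values]
    return expv[j] / sum(expv)

def simulated_likelihood(choices, utility_given_signals, draw_signals, D=50, seed=0):
    """Simulated probability of one consumer's choice history.

    choices: observed choices j(1),...,j(T)
    draw_signals: draws one full sequence of latent signal contents
    utility_given_signals: maps (signals, t) -> choice-specific values
    Averages the conditional history probability over D signal draws.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(D):
        signals = draw_signals(rng)              # one simulated signal history
        p_hist = 1.0
        for t, j in enumerate(choices):
            values = utility_given_signals(signals, t)
            p_hist *= logit_prob(values, j)      # conditional choice probability
        total += p_hist
    return total / D                             # unbiased simulator of P(history)

# Toy usage: two alternatives; alternative 0's value is a noisy quality signal
draw = lambda rng: [rng.gauss(1.0, 0.5) for _ in range(3)]
util = lambda signals, t: [signals[t], 0.0]
p = simulated_likelihood([0, 0, 1], util, draw, D=200)
```

In estimation, this simulated probability is computed for each consumer at each trial parameter value, with the DP solution nested inside the evaluation as described above.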
A key complication is that consumer purchase histories and ad exposures are not usually observed prior to the start of the sample period. This creates an "initial conditions problem." A consumer who likes a particular brand will have bought it often before the sample period starts.
Thus, brand preference is correlated with the information set at t=1. The usual consequence is to exaggerate the impact of lagged purchases on current choices. An exact solution to the initial conditions problem requires integrating over all possible initial conditions when forming the likelihood, but in most cases this is not computationally feasible. Thus, a number of approximate (ad hoc) approaches have been proposed. For example, EK had scanner data for three years, but they used the first two years to estimate the initial conditions for each consumer at the start of the third year, and then used only the third year in estimation.

Identification
The discussion of identification can be confusing, as the word has multiple meanings. It can mean showing the parameters of a model are identified given the assumed model structure.
This may involve both formal proof as well as intuitive discussion of what data patterns drive the estimates. We discuss identification in this "narrow" sense in section 2.5.A.
Identification can also mean analysis of what assumptions are necessary to estimate a model, or just convenient. 16 For example, can assumptions like Bayesian updating or normal signals be relaxed? Even more generally, can one distinguish the learning model from other plausible models that also generate state dependence? How can we tell if consumers are forward looking? We discuss identification in this "broad" sense in section 2.5.B.
Finally, some parameters may be formally identified but difficult to pin down in finite samples. We discuss this issue in Section 2.6, when we discuss the estimates of the EK model.

2.5.A. Identification of Learning Model Parameters (Given the Model Structure)
Some key points about identification become apparent from examining (27) in the complete information case, given in (32). Here, we have set the quality perception variance to zero and Q_jt = Q_j because there is no uncertainty about quality.
Obviously, we cannot identify β, the signal variances σ_ε and σ_A, or the priors {Q_j1, σ_j1}, as they drop out of the model. We also cannot identify any utility component that is constant across alternatives j=1,…,J. 17 And careful inspection of (32) reveals that r is not identified either, as it cannot be disentangled from the scaling of Q_j. (Obviously, if Q_j had a known scale this would not be a problem). Thus, r, β, σ_ε, σ_A, and the priors {Q_j1, σ_j1} only affect choice probabilities through the EV(I_t+1|I_t, j) terms.
So, in an environment of complete information, all that can be identified are the price coefficient w_P, the products w_Q Q_j, and the terms that enter the value of the no purchase option. 18 Furthermore, as only utility differences matter for choice, we need a normalization to establish a reference alternative. EK set Q_j = 1 for one brand, so the qualities of all other brands are measured relative to that brand. 19 Alternatively, one could fix w_Q.
16 This is known as "non-parametric" identification analysis. Unfortunately, this literature has been misinterpreted by many researchers as suggesting it may be possible to obtain "model free evidence" about behavior. In fact, the approach of the non-parametric identification literature is to make a priori assumptions about certain parts of a model, and then show that some other part (e.g., the functional form of utility or an error distribution) is identified without further assumptions. Thus, what is non-parametrically identified is just a part of the model, not all of it. For instance, Matzkin (2007) says the "ideal" of non-parametric estimation is to start with a structural model and then impose only restrictions implied by theory (e.g., continuity, monotonicity, homogeneity, equilibrium conditions). One then uses the data to identify functional forms and distributions that are not pinned down by theory. A related point is that observing data patterns that seem consistent or inconsistent with a model can make that model seem more or less plausible, given our priors. But they can never provide non-parametric evidence that a model is correct. In Section 4.5 we give two examples to illustrate these points.
17 Note that this constant does not enter the value of the no purchase option. However, any shift in it can be undone by an offsetting shift in the no purchase value, leaving utility differences unchanged.
To summarize, by observing consumers with essentially complete information (i.e., those with a great deal of experience with all brands), we can identify w_Q and the {Q_j}, given a normalization, as well as w_P and the no purchase option parameters.
The identification of the parameters β, σ_ε, σ_A, and r, as well as the priors {Q_j1, σ_j1}, requires that incomplete information actually exist. In that case, variation in EV(I_t+1|I_t, j) and in perceived quality across consumers is generated by variation in the information sets I_it. Intuitively, the parameters β, σ_ε, σ_A, r and {Q_j1, σ_j1} are identified by the extent to which, ceteris paribus, consumers with different information sets are observed to have different choice probabilities. For instance, by comparing (27a) and (32) we can clearly see that variation in the perceived quality variance across consumers, arising from variation in use experience and ad exposures, enables us to identify r. (This is because w_Q is already identified from consumers with complete information, as we noted earlier).
Similarly, variation of I_it within consumers over time is also relevant. The learning parameters σ_ε and σ_A and the priors {Q_j1, σ_j1} determine how the arrival of ad and use experience signals changes perceived quality and the EV(I_t+1|I_t, j). Thus, these parameters are pinned down by the extent to which the arrival of signals alters behavior over time. For instance, if behavior is greatly altered by the arrival of one use experience signal, it implies that the prior uncertainty σ_j1 is large and the experience signal variance σ_ε is small.
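This intuition follows directly from the normal-normal updating formulas, where the weight on a new signal is increasing in prior variance and decreasing in signal variance. A minimal sketch with illustrative numbers (not EK's estimates):

```python
def bayes_update(prior_mean, prior_var, signal, signal_var):
    """One normal-normal Bayesian update of perceived quality."""
    k = prior_var / (prior_var + signal_var)        # weight on the new signal
    post_mean = prior_mean + k * (signal - prior_mean)
    post_var = prior_var * signal_var / (prior_var + signal_var)
    return post_mean, post_var

# Diffuse prior, precise signal: one experience moves beliefs a lot
m1, v1 = bayes_update(0.0, 4.0, 1.0, 0.25)   # m1 ≈ 0.94
# Precise prior, noisy signal: the same signal barely moves beliefs
m2, v2 = bayes_update(0.0, 0.25, 1.0, 4.0)   # m2 ≈ 0.06
```

In both cases the posterior variance falls, but the behavioral response to a single signal is large only when prior uncertainty is high relative to the signal noise.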
It is worth stressing that this argument for identification based on comparing behavior of consumers with different amounts of information applies in both dynamic and static models.
Indeed, this is the source of identification of the learning related parameters in the static Bayesian learning model of Roberts and Urban (1988). It is also worth stressing that variation in ad exposures and in prices are plausibly exogenous sources of variation in the I it .
Now we turn to dynamic considerations. Recall from Section 2.1 that consumers will only engage in strategic trial if β>0. But in the typical scanner data set we cannot observe whether a purchase is a "trial." Thus, aside from the functional forms of utility and the EV functions, the discount factor β is pinned down by variables that affect the EV(I_t+1|I_t, j) but do not affect current utility. In the EK model there are two exogenous variables that play this role: the brand specific price variances and advertising frequencies. There is no reason for these variables to affect behavior in a static model: in a static model one only cares about the current price and the current stock of information, not the likelihood of future deals or future information arrival.
18 The scale normalization on utility is imposed by assuming the scale parameter of the extreme value errors is one.
19 It is worth noting that an alternative normalization would be to set Q_j = 0 for one brand. However, this would not let one disentangle the w_Q Q_j products. Thus, EK instead set Q_j = 1 for one brand. Also, with quadratic utility, it is desirable to constrain the largest Q_j to fall in the region of increasing utility. EK impose this constraint in estimation by updating the level of Q_j at each step (while keeping relative Q values fixed).

2.5.B. Identification in the "General" or "Non-Parametric" Sense
As EK discuss in some detail, the Bayesian learning model implies a particular form of state dependence and serial correlation in the errors. This can be seen from careful examination of Equations (17)-(18) and (27). A frequently asked question is how learning behavior can be distinguished from other forms of state dependence/serial dependence.
In his fundamentally important paper on panel data, Chamberlain (1984) defined the relationship between two variables y t and x t as "static" conditional on a latent variable c if (i) y t is independent of lagged x conditional on x t and c, and (ii) x t is strictly exogenous with respect to y conditional on c (i.e., y t does not cause future x). As Chamberlain shows, this "static" condition is actually stronger than the condition that there is no structural state dependence (see Heckman, 1981), as the latter does not require strict exogeneity. 20 Rather remarkably, Chamberlain shows that in nonlinear models (like discrete choice models), one can always find a distribution of the latent variable c such that the relationship between y t and x t is static. In simple terms, one can always find a sufficiently flexible/complex heterogeneity distribution such that state dependence is not needed to explain the data. The key implication is that one cannot construct a non-parametric test of whether state dependence exists. Chamberlain's result is an instance of the Cowles Foundation view that one cannot deduce interesting economic relationships from the data alone. One needs a priori identifying assumptions, regardless of what sort of idealized variation is present in the data. 21 Thus, our interpretation of data will always be subjective, as it is contingent on our model. To be concrete, both the extent and nature of any state dependence we find in discrete choice data will depend on the assumed functional forms for state dependence and heterogeneity (see, e.g., Keane (1997)).
As we described in Section 2.5.A, in the parameterized EK learning model, we identify parameters that describe dynamics from variation in choice behavior across consumers with different information sets (I it ), and within consumers as their information sets change over time.
This variation in I it arises from different histories of use experience, ad exposures and prices.
However, Chamberlain's results imply that differences in behavior due to differences in history (i.e., state dependence) cannot be distinguished non-parametrically from differences in behavior due to a completely general form of heterogeneity. Nor can learning behavior be distinguished non-parametrically from other mechanisms that may induce state dependence.
Thus, functional forms of both state dependence and heterogeneity must be constrained for the learning model to be identified. But this is true of any non-linear dynamic model. For a structural econometrician this is not a limitation: a model that simply specifies very general forms of state dependence and/or heterogeneity so as to obtain a good fit to the data is merely a statistical model with no structural/behavioral interpretation. Such a model cannot be used for policy experiments. Furthermore, Occam's razor suggests that we do not wish to work with such general models. What we seek are parsimonious models that fit well, that give useful insights into the data and that can be used for policy experiments.
Recognizing the impossibility of completely non-parametric identification of learning effects, we can still give some contingent answers to the questions we asked at the start of Section 2.5. First, note that the Bayesian updating and normal signaling assumptions can be relaxed. We discuss some papers that do this in Section 3.1.1.
Second, in principle one can distinguish learning from other plausible mechanisms that may generate state dependence (like inventories or switching costs), but only if one is willing to specify parametric forms for the competing models. This is consistent with the Bayesian decision theory view that "one needs a model to beat a model." We return to this point in Section 4.
21 As Koopmans, Rubin and Leipnik (1950) state: "Suppose … B is faced with the problem of identifying … the structural equations that alone reflect specified laws of economic behavior ... Statistical observation will in favorable circumstances permit him to estimate … the probability distribution of the variables. Under no circumstances whatever will passive statistical observation permit him to distinguish between different mathematically equivalent ways of writing down that distribution … The only way in which he can hope to identify and measure individual structural equations … is with the help of a priori specifications of the form of each structural equation."
Third, the questions of whether we can identify the discount factor and whether we can test if consumers are forward-looking are obviously closely related. Interestingly, however, Ching, Erdem and Keane (2012) show that, in the learning model, one can identify whether consumers are forward-looking using only (i) the laws of motion of the state variables and (ii) the form of current utility. But identification of the discount rate requires assumptions about the full structure (i.e., expectation formation), so that one can construct the expected value functions.
To see this, suppose we adopt the Geweke and Keane (2000) method to estimate dynamic models without the need to solve agents' DP problem, and without imposing the full structure of the model. To implement their method we take the value function in (21),

V_j(I_t) = U_j(I_t) + βEV(I_t+1|I_t, j),   (21')

for j=0,…,J, and replace it by the equation

V_j(I_t) = U_j(I_t) + F(I_t, j | π_t).   (33)

Here F is a polynomial in the state variables that approximates the "future component" of the value function. And π_t is a vector of reduced form parameters that characterize the future component. The idea of the Geweke-Keane (GK) method is to estimate the π_t jointly with the structural parameters that enter the current period expected utility function.
Notice that, as F is just a flexible function of the state variables, all that is assumed is that consumers understand the laws of motion of the state variables (i.e., how I_t+1 is formed from I_t and the choice j).
They need not form expectations based on the true model. The approach is also agnostic about whether consumers use Bayesian updating or some other method. In general, identification of π_t requires exclusion restrictions such that some variable enters F but not U. 22 We see from (33) that, when the full structure is not imposed, one cost is that we lose identification of the discount factor. The β is subsumed as a scaling factor for the parameters π_t of the F function. 23 On the other hand, we can test whether π_t = 0, which is a test for forward-looking behavior. Although the test makes weak assumptions about F, it is not non-parametric, as a functional form must be chosen for the current payoff function. As Ching, Erdem and Keane (2012) show, given the current payoff function, the π_t are identified in the learning model because different current choices lead to different values of next period's state variables.
22 Geweke and Keane (2000) point out that in the absence of exclusion restrictions, one must observe current payoffs (at least partially) in order to identify F. In labor economics, researchers may argue that wages capture much of the current payoff (e.g., Houser, 2003). Or, researchers can control current payoffs in a lab experiment (e.g., Houser, Keane and McCabe, 2004). Recently, Yao et al. (2012) proposed another strategy to identify the discount factor. They argue that if a data set consists of two regimes, a static environment and a dynamic environment, one can first estimate the parameters of the current payoff function using the static environment data, and then hold them fixed when estimating the discount factor using the dynamic environment data. Their approach requires the assumption that the current payoff function remains unchanged across regimes.
23 Recently, several papers have explored using exclusion restrictions to estimate the discount factor. Chevalier and Goolsbee (2009) and Ishihara and Ching (2012) use the resale value of a used good as an exclusion restriction in estimating dynamic demand models for new and used goods. In a dynamic store choice model, Ching, Imai, Ishihara and Jain (2012) use cumulative points earned via a reward program as an exclusion restriction. In a study of sales person productivity, Chung, Steenburgh and Sudhir (2013) use cumulative sales as an exclusion restriction. The ideas in Ching et al. (2012) and Chung et al. (2013) are similar: cumulative points (or sales) do not affect current payoffs until they reach certain cutoffs so that customers (sales reps) can receive a bonus. Fang and Wang (2010) show that even parameters of quasi-hyperbolic discounting can be identified if a dynamic model has exclusion restrictions and one has panel data with at least three periods.

Key Substantive Results of Erdem-Keane (1996)
Some key issues that arose in estimating the model of Erdem and Keane (1996) are worth discussing, as they are common across many applications of dynamic learning models. First, EK had difficulty obtaining a precise estimate of the weekly discount factor, and so pegged it at 0.995. 24 Identification of the discount factor is often a practical problem in dynamic models, even when it is formally identified. (We discuss this further in Section 4.3). Second, EK also found it difficult to pin down the prior mean of quality. Hence, they constrained it to equal the average true quality level across all brands. This implies people's priors are correct on average. They also constrained the prior uncertainty to be equal across brands, σ_j1 = σ_1, as allowing it to differ did not significantly improve the fit.
24 In trying to estimate the weekly discount factor, they obtained 1.001 with a standard error of 0.02. This standard error implies a large range of annual discount factors. It is also worth noting that Erdem and Keane set the terminal period for the DP problem at T=100, which is 50 weeks past the end of the data set.
Aside from the dynamic learning model, EK estimated two other models for comparison.
These are a myopic learning model (β = 0), and a reduced form model similar to Guadagni and Little (1983), henceforth GL. The latter is a multinomial logit with an exponentially smoothed weighted average of past purchases (the "loyalty" variable), a similar variable for ad exposures, a price coefficient, brand intercepts, and trends for values of no purchase and small brands.
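The GL "loyalty" variable is a simple exponential smooth of past purchase indicators. A minimal sketch follows; the smoothing constant 0.875 is the value commonly associated with GL, and the initial value of 0.5 is an arbitrary illustrative choice:

```python
def loyalty_series(purchases, lam=0.875, init=0.5):
    """Exponentially smoothed brand "loyalty" in the spirit of GL.

    purchases: 1/0 indicators for buying the focal brand at each occasion.
    Update: L_t = lam * L_{t-1} + (1 - lam) * purchase_{t-1}.
    init is an arbitrary starting value used here only for illustration.
    """
    series = [init]
    for bought in purchases:
        series.append(lam * series[-1] + (1 - lam) * bought)
    return series

# Repeat purchases push loyalty up; switching away decays it geometrically
print(loyalty_series([1, 1, 1, 0, 0]))
```

Note that, unlike the Bayesian updating formulas in (17) and (18), this smooth depends on the timing of past purchases, not just their total number.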
Strikingly, EK found that both structural learning models fit substantially better than the GL model. 25 This is surprising, as GL specifies flexible (albeit ad hoc) functional forms for effects of past usage and ad exposures on current choice probabilities, while the Bayes learning models impose a very special structure. Specifically, as we saw in (17) and (18), only the sum of past experience or ad exposures matter in the Bayesian models, not the timing of signals.
Another striking result is that advertising is not significant in the GL model, implying advertising has no effect on brand choice. In the EK model there is no one coefficient to capture the effect of advertising. The parameter r is significant and positive, so consumers are risk averse with respect to quality, while the estimates of σ_1, σ_ε and σ_A imply: (i) consumers have rather precise priors about new brands in the detergent category, and (ii) experience signals are much more accurate than ad signals. But the effect of advertising can only be assessed via simulations. EK used their model to simulate an increase in ad frequency for Surf from 23% to 70%. 26 The simulation was also done for a hypothetical new brand with the characteristics of Surf. The results imply that an increase in advertising has little effect on market share for about 4 months, but the impact is substantial after about 7 or 8 months. Thus, the model implies advertising has little impact in the short run, but sustained advertising is important in the long run. As expected, the impact of advertising is much greater for a new brand (as there is more scope for learning). 27
25 The dynamic learning model had 16 parameters while the other two models both had 15. EK obtained BIC values of 7531, 7384 and 7378 for GL and the myopic and forward-looking learning models, respectively.
26 "Ad frequency" is the weekly probability of a household seeing an ad for a brand. In the data this was 23% for Surf.
27 We have noticed that Figure 1 in Erdem and Keane (1996) contains a typo. The scale on the y-axis in Figure 1, which reports results for the new brand with the myopic model, is incorrectly labeled. It should be labeled in the same way as Figure 5. This does not affect any of the results we discuss here.
The advertising simulation results are not surprising in light of the parameter estimates. As consumers have rather precise priors about brands in the detergent category, and as ad signals are imprecise, it takes sustained advertising over a long period to move priors and/or reduce
perceived risk of a brand to a significant degree. 28 A clear prediction is that the higher is prior uncertainty, and the more precise are ad signals, the larger will be advertising effects and the quicker they will become noticeable. Thus, an important agenda for the literature on learning models is to catalogue the magnitudes of prior uncertainty and signal variances across categories.
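The mechanism behind these simulation results can be illustrated with a toy Bayesian updating exercise (the parameter values below are illustrative, not EK's estimates): with a tight prior and very noisy weekly ad signals, the posterior mean of perceived quality barely moves in any single week, but can drift substantially over a year of sustained exposure.

```python
import random

def simulate_prior_path(prior_mean, prior_var, true_q, sig_var, weeks, seed=1):
    """Posterior mean of perceived quality under one noisy ad signal per week."""
    rng = random.Random(seed)
    m, v = prior_mean, prior_var
    path = [m]
    for _ in range(weeks):
        signal = rng.gauss(true_q, sig_var ** 0.5)   # noisy ad signal
        k = v / (v + sig_var)                        # weight on this week's signal
        m = m + k * (signal - m)
        v = v * sig_var / (v + sig_var)              # posterior variance shrinks
        path.append(m)
    return path

# Tight prior (variance 0.1) and very noisy ad signals (variance 5.0)
path = simulate_prior_path(prior_mean=0.0, prior_var=0.1, true_q=1.0,
                           sig_var=5.0, weeks=52)
```

With these illustrative values the per-week signal weight starts below 0.02, so a single ad has a negligible effect, while the cumulative effect of a year of signals is far larger.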

A Review of the Recent Literature on Learning Models
Here we review developments in learning models subsequent to the foundational work discussed in Section 2. Almost all this work is post-2000, but it already forms a large literature.
We divide the review into (i) more complex learning models with myopic agents; (ii) more complex learning models with forward-looking consumers; (iii) learning models for product level/market share data; and (iv) new applications of learning models (beyond brand choice). We should note that our survey focuses on empirical structural learning models where agents are uncertain about product attributes. [There is a literature on dynamic games where agents learn how to play equilibrium strategies, or learn how to coordinate in multiple equilibria settings, including social learning environments. This literature is beyond the scope of our survey.]

Models with Myopic Agents
One stream of literature has focused on extending learning models by allowing for more complex learning mechanisms. To make such extensions feasible, it is often necessary to assume that consumers are myopic. We consider such models in the next two sub-sections that cover: (i) models with more complex learning mechanisms and (ii) models with correlated learning.

More Complex Learning Mechanisms
Mehta, Rajiv and Srinivasan (2004) extend the Bayesian model to account for forgetting.
Consumers imperfectly recall prior brand experiences, and the extent of forgetting increases with time. Then, a consumer's state depends on the timing of signals, not just the total number (as in Equation (6)). Thus, it is necessary to assume myopia to make modeling forgetting feasible. Deighton (1984) proposed that advertising has a "transformative" effect whereby it alters consumer assessment of the consumption experience. Mehta, Chen and Narasimhan (2008) include this effect in a learning model. They allow information signals from advertising to be biased, and this bias can change how consumers interpret their consumption experience. The identification of such a model is very challenging. Mehta et al. (2008) can achieve identification because their data set includes consumers who hardly watch TV commercials. Choices of these consumers allow one to identify true brand quality levels because their experience signals are not "contaminated" by the transformative effect of advertising. After controlling for the true mean brand qualities, the choices of the consumers who do watch TV commercials allow them to identify the bias of the advertising signals and the transformative effects of advertising.
Camacho, Donkers and Stremersch (2011) also model perception biases, but in a simpler framework. They argue that some types of experience may be more salient in certain contexts.
For example, a physician may pay special attention to feedback from patients who have just switched treatment. They modify the standard Bayesian model by introducing a salience parameter to capture the extra weight physicians may attach to signals in that case. Using data on asthma drugs, they find evidence that feedback from switching patients receives 7-10 times more weight in physician learning than feedback from other patients. Zhao, Zhao and Helsen (2011) allow for consumer uncertainty about the precision of quality signals. Consumers update their perception of this precision over time. In particular, consumers who receive a very negative experience signal may change their perception of signal variance. They estimate the model using scanner data that spans the period of a product-harm crisis affecting Kraft Australia's peanut butter division in June 1996. Their model fits the data better than a standard learning model, which assumes consumers know the true signal variance.

Models of Correlated Learning
Another stream of literature models information spillover across brands, or "correlated learning." By this we mean learning about a brand in one category by using the same brand in another category, and/or learning about one attribute (e.g., drug potency) from another (e.g., side effects). This occurs if priors and/or signals are correlated across products or attributes. Erdem (1998) considers a model where priors are correlated across "umbrella brands" (i.e., a brand that operates in multiple categories). She finds evidence that consumers learn via experience across umbrella brands in the toothpaste and toothbrush categories. She shows that brand dilution can occur if a brand in the "parent" category (toothbrush) is extended to a new product in a different category (toothpaste) and the new product is not well-received. This framework has been extended to study decisions about fishing locations (Marcoul and Weninger, 2008), and adoption of organic food products (Sridhar, Bezawada and Trivedi, 2012). 29 Other papers have extended learning models to multi-attribute settings where consumers use experience of one attribute to draw inferences about other attributes. Prescription drugs are a good example: Coscelli and Shum (2004) estimate a diffusion model for Omeprazole, an anti-ulcer drug. It can treat: (i) heartburn, (ii) hypersecretory conditions, (iii) peptic ulcer, and provide (iv) maintenance therapy. In the model, physicians know how signals are correlated across the four conditions. In each patient-physician encounter, a physician only observes a signal of the condition being treated, but he/she uses it to update his/her multi-dimensional prior belief.
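The mechanics of such correlated learning are those of the multivariate normal update: a signal on one attribute shifts beliefs about any correlated attribute. A two-attribute sketch with illustrative numbers (not taken from any of the papers above):

```python
def correlated_update(mean, cov, signal, sig_var, dim=0):
    """Update a 2-D normal prior after a noisy signal on one attribute.

    mean: [m0, m1]; cov: 2x2 prior covariance; signal observed on attribute
    `dim` with variance sig_var. Standard conditional-normal (Kalman) formulas.
    """
    v = cov[dim][dim] + sig_var
    gain = [cov[0][dim] / v, cov[1][dim] / v]      # per-attribute signal weights
    err = signal - mean[dim]
    new_mean = [mean[0] + gain[0] * err, mean[1] + gain[1] * err]
    new_cov = [[cov[i][j] - gain[i] * cov[dim][j] for j in range(2)]
               for i in range(2)]
    return new_mean, new_cov

# Positive prior correlation: good news on attribute 0 raises beliefs about 1
m, C = correlated_update([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]],
                         signal=1.0, sig_var=1.0)
```

With zero prior correlation the off-diagonal gain is zero and the model collapses to independent attribute-by-attribute learning.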
Chan, Narasimhan and Xie (2013) also apply a multi-attribute learning model to the drug market. They assume experience signals are correlated on the two dimensions of side-effects and effectiveness. They achieve identification by supplementing revealed preference data with data on self-reported reasons for switching: side-effects or ineffectiveness. Interestingly, they find detailing visits are more effective in reducing uncertainty about effectiveness than side-effects.

More Sophisticated Learning Models with Forward-looking Consumers
Following Erdem and Keane (1996), several papers have made significant contributions in the area of learning models with forward-looking consumers. We discuss these in turn. Ackerberg (2003) deviates from Erdem-Keane in several dimensions. Most notably, he models both informative and persuasive effects of advertising. 30 The persuasive effect is modeled as advertising intensity shifting consumer utility directly. The informational effect is modeled by allowing consumers to draw inferences about brand quality based on advertising intensity. 31 This is quite different from the information mechanism in Erdem-Keane, where ad content provides noisy signals of quality. Other differences are: (i) he is primarily interested in learning about a new product, and his model allows for heterogeneity in consumers' match value with the new product, and (ii) he assumes it takes only one trial for consumers to learn the true match value. Estimating the model on scanner data for yogurt, Ackerberg (2003) finds a strong, positive informational effect of advertising. But the persuasive effect is not significant.
29 Hendricks and Sorensen (2009) use a similar idea of information spillover to explain the skewness of music CD sales. They find evidence that a successful new album release by an artist increases the likelihood that consumers purchase older albums of the same artist.
30 The separate identification of informative and persuasive effects of advertising relies on this qualitative implication of learning models: as consumers gather more information over time, the marginal benefits of informative advertising must fall; therefore, if advertising has any impact on brand choice in the long run, it is due to persuasive advertising. To our knowledge, Leffler (1981) is the first paper that proposes this identification strategy. He implements it in a reduced form model using product level sales data for new and old prescription drugs. Narayanan et al. (2005) make use of the same identification argument when estimating their structural model using product level data. Recently, Ching and Ishihara propose a new identification strategy to attack this problem: they argue that informative advertising should affect all products that share the same features/ingredients equally, but persuasive advertising should be brand specific. Ching and Ishihara implement their identification strategy in a prescription drug market where some drugs are made of the same chemical, but with different brand-names.
31 That is, ad frequency itself signals brand quality, as in the theoretical literature on "advertising as burning money" (which only high quality brands can afford to do) (Kihlstrom and Riordan, 1984).
The key innovation of Crawford and Shum (2005) is to allow for multi-attribute learning.
In an application to prescription drugs, they argue that panel data allow them to identify two effects: (i) symptomatic effects, which impact a patient's per period utility via symptom relief, and (ii) curative effects, which alter the probability of recovery. They allow physicians/patients to have uncertainty along both dimensions (although they abstract from correlated learning). They also endogenize length of treatment by allowing patients to recover. Their estimates imply substantial patient heterogeneity in drug efficacy. They go on to study the welfare cost of uncertainty relative to the first-best environment with no uncertainty. Welfare questions cannot be addressed without a structural model. However, after estimating their model, Crawford and Shum can simulate removal of uncertainty by setting the initial prior variance to zero, and setting each consumer's prior match value to be the true match value. By conducting this experiment, they find that consumer learning allows consumers to dramatically reduce the costs of uncertainty. Erdem, Keane and Sun (2008) was the first paper to model the quality signaling role of price in the context of frequently purchased goods. They also allow both advertising frequency and advertising content to signal quality (combining features of Ackerberg (2003) and Erdem and Keane (1996)). And they allow use experience to signal quality, so that consumers may engage in strategic sampling. Thus, this is the only paper that allows for these four key sources of information simultaneously. In the ketchup category they find that use experience provides the most precise information, followed by price, then advertising. The direct information provided by ad signals is found to be more precise than the indirect information provided by ad frequency.
The main finding of Erdem, Keane and Sun (2008), obtained via simulation of their model, is that, when price signals quality, frequent price promotions can erode brand equity in the long run. As they note, there is a striking similarity between the effect of price cuts in their model and in an inventory model. In each case, frequent price cuts reduce consumer willingness to pay for a product; in the signaling case by reducing perceived quality, in the inventory case by making it optimal to wait for discounts. We return to this issue in Section 4.
Osborne (2011) is the first paper to allow for both learning and switching costs as sources of state dependence in a forward-looking learning model. This is important because learning is the only source of brand loyalty in Erdem and Keane (1996), so it is possible that they found learning to be important only because switching costs were omitted. However, Osborne finds evidence that both learning and switching costs are present in the laundry detergent category. When learning is ignored, cross elasticities are underestimated by up to 45%. 32

Erdem, Keane, Öncü and Strebel (2005) represents a significant extension of previous learning models, as it is the first paper where consumers actively decide how much effort to devote to search before buying a durable. This contrasts with Roberts and Urban (1988), where word-of-mouth (WOM) signals are assumed to arrive exogenously, and Erdem and Keane (1996), where ad signals arrive exogenously. Another novel feature of the paper is that there are several information sources to choose from (WOM, advertisements, magazine articles, etc.) and, in each period, consumers decide how many of these sources to utilize. 33 In their application, Erdem et al. (2005) study how consumers learn about personal computers prior to purchase.

Another way for consumers to learn is by observing other consumers' choices, instead of their opinions, i.e., observational learning (Banerjee, 1992). To capture this idea, Zhang (2010) develops a structural model of observational learning.

32 Osborne (2011) allows for a continuous distribution of consumer types. Of course, it is literally impossible to solve the DP problem for each type (which is why the DP literature usually assumes discrete types). Thus, some approximation is necessary. Osborne is able to estimate his model by adapting the MCMC algorithm developed by Imai, Jain and Ching (2009) and extended by Norets (2009).

Modeling Consumer Learning using Product Level Data
The estimation technique developed by Berry, Levinsohn and Pakes (1995) (BLP) led to a large body of demand analysis that applies static discrete choice models primarily to product level or market share data. 37 In general, however, BLP cannot be used to estimate the demand systems generated by consumer learning models. Learning models are always dynamic in that current sales affect future demand, regardless of whether consumers are forward-looking or myopic. Demand for one brand in a learning model depends on the whole distribution (across consumers) of perceived quality for all brands. We are skeptical about whether individual heterogeneity distributions can be credibly identified from aggregate (i.e., product level) data.

34 Cai et al. (2009) provide interesting evidence for observational learning by studying a natural experiment in which customers of a restaurant are given a ranking of some popular dishes. 35 Learning about preferences may appear to be different from learning about attributes, but the two are equivalent as long as one assumes the utility function is linear in attributes and preference weights. 36 To estimate his model, Dickstein uses the Gittins index approach (Gittins and Jones, 1979). But, as in Eckstein et al. (1988), who also use that approach, he needs to assume consumers are risk neutral. It is worth noting that Ferreyra and Kosenok (2011) use a method similar to the Gittins index to estimate a simpler dynamic learning problem.
Furthermore, it is difficult to combine such a complex demand system with a supply side model.
If one wants to estimate a dynamic demand system with consumer learning, and only market share data is available, it is clear that one has to abstract from the endogenous consumer heterogeneity generated by individual level purchase histories (see Section 2.1, Equation (5)). To address this issue, Narayanan, Manchanda and Chintagunta (2005) and Ching (2000, 2010a) propose two related modifications of the EK framework. Narayanan et al. (2005) assume every agent has an identical purchase history, i.e., for each brand, the quantity purchased is equally distributed across agents in each period. Ching (2000, 2010a) assumes consumers can learn from each other's experiences via social networks or information gathering institutions (e.g., physician networks). As a result, consumers use the same set of experience signals to update their beliefs, and all consumers share a common belief at any point in time. 38 This assumption eliminates the distribution of consumers across different states as a state variable for firms. Both Narayanan et al. (2005) and Ching (2000, 2010a) capture consumer learning in a parsimonious way, so that, when combined with an oligopolistic supply side, the size of the state space is manageable. Both papers apply their frameworks to study the demand for prescription drugs. 39,40

37 Ackerberg et al. (2007) provide an excellent survey of this area. Note that product or market level data are more readily available than scanner data for many industries. 38 An interesting paper that studies both across and within consumer learning is Chintagunta, Jing and Jin (2009). They apply their model to doctors' prescribing decisions for Cox-2 inhibitors, a new class of painkillers, and find that both types of learning are important. 39 Chen et al. (2012) and Moretti (2011) use a similar framework to study the impact of WOM on movie sales. 40 The BLP estimation method can be applied to the demand model of Narayanan et al. (2005), but it cannot be applied to the model of Ching (2000, 2010a). Due to space constraints, we will not discuss the details here; interested readers may refer to Ching (2010a) for a detailed discussion.

As a demand system with consumer learning is always dynamic, we would expect firms to be forward-looking when choosing their marketing mix. As a first attempt to address this issue, Ching (2010b) extends Ching (2010a) by combining a social learning demand model with a dynamic oligopolistic supply side model. As far as we know, this is the first empirical paper to combine a dynamic demand system with forward-looking firms. In the model, both consumers and firms are uncertain about the quality of generic drugs, but they rely on the same information set to update their beliefs over time, and hence perceived quality and variance are common to consumers and firms. Equilibrium is Markov-perfect Nash, as in Maskin and Tirole (1988) and Ericson and Pakes (1995). The model is tailored to study the competition between brand-name and generic drugs. Ching applies his model to the market for clonidine. Simulations of the model show that it can rationalize two important stylized facts: (i) the slow diffusion of generic drugs, and (ii) the fact that brand-name firms slowly raise their prices after generic entry. 41

Ching and Ishihara (2010) (CI) study how the effectiveness of detailing changes when new clinical trial outcomes are released; in their data, detailing can become more effective after new information arrives. Standard Bayesian learning models like Erdem and Keane (1996) cannot generate this pattern, as they imply the marginal impact of information signals falls over time (see Equations (17)-(18)).
Thus, CI deviate from the EK framework by introducing three new features: (i) both consumers and firms are uncertain about the quality of the product; (ii) social learning takes place via an intermediary (opinion leader, consumer watch group, etc.), who updates the information set for each brand; (iii) the purpose of detailing is to build up a stock of physicians who are familiar with the most recent information set of the promoted brand. 42 The model generates heterogeneity in information sets, as the fraction of physicians with the most up-to-date information about a brand is a function of its cumulative detailing. Using this framework, CI are able to quantify how effectiveness of detailing changes when a new clinical trial outcome is released.
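The declining-gain property of the standard model that CI must relax can be seen in a minimal normal-normal updating sketch (all parameter values below are illustrative assumptions, not estimates from any paper discussed here):

```python
# Minimal sketch of normal-normal Bayesian quality learning, in the spirit of
# Equations (17)-(18): each signal shifts the perceived quality mean by an
# amount proportional to the Kalman gain, which shrinks as signals accumulate.
# All numbers below are illustrative, not taken from any estimated model.

def update(prior_mean, prior_var, signal, signal_var):
    """One Bayesian update of perceived quality given a noisy signal."""
    gain = prior_var / (prior_var + signal_var)   # weight on the new signal
    post_mean = prior_mean + gain * (signal - prior_mean)
    post_var = (1.0 - gain) * prior_var           # uncertainty always falls
    return post_mean, post_var, gain

mean, var = 0.0, 1.0          # diffuse prior about brand quality
true_quality, sig_var = 0.8, 0.5
gains = []
for t in range(6):
    # a noiseless signal stream keeps the example deterministic
    mean, var, gain = update(mean, var, true_quality, sig_var)
    gains.append(gain)

# The marginal impact of each successive signal (the gain) strictly falls,
# which is why the standard model cannot make later signals MORE influential.
assert all(g1 > g2 for g1, g2 in zip(gains, gains[1:]))
```

Because perceived variance declines monotonically, the weight on any new signal declines as well; making a signal (such as detailing after a new clinical trial) more influential later in time requires a mechanism outside this updating rule.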
Lim and Ching (2012) extend the CI framework to a multi-dimensional learning model with correlated beliefs, and apply it to study demand for the major class of anti-cholesterol drugs, statins. The CI model may be applicable to settings besides drugs where the interaction between news and informative advertising is of first-order importance. The model is parsimonious and it could, in principle, be combined with an oligopolistic supply side. But this has not yet been done.
An interesting paper by Hitsch (2006) studies firm learning about the demand for a new product. He abstracts from consumer learning by using a reduced form demand model. That is, unlike in the other papers we have discussed, consumers have no uncertainty about new products, but the firm needs to learn the true demand parameters. Hitsch considers a one-sided learning equilibrium model, and he also abstracts from competition. These simplifications significantly reduce the computational burden of estimation, and yet the model delivers important new insights into the product launch and exit problem. Finally, it is worth noting that, due to computational burden, no paper has yet estimated a model with both forward-looking consumers and forward-looking firms.

Other Applications of Learning Models: Services, Insurance, Media, Tariffs, Etc.
Learning models have been applied to many problems other than choice among different brands of a product. In particular, Bayesian learning models have been applied to choice among services, insurance plans, media, tariffs, etc. Here we discuss these types of applications.
Israel (2005) uses a learning model to study the customer-firm relationship in the auto insurance market. This environment is well suited for studying learning because opportunities to learn arrive exogenously when an accident happens. Presumably, when consumers file a claim, they learn something about the customer service of the insurance company. If a consumer leaves the company after filing a claim, it may indicate that they had a negative experience.

Chan and Hamilton (2006) use a learning model to study randomized clinical trials of AIDS treatments, in which patients learn about a treatment's efficacy and side-effects. Efficacy is measured by CD4 cell counts, where a lower count implies a weaker immune system. A treatment that is less effective in terms of CD4 counts may still be preferable because it has fewer side-effects. Fernandez (2013) extends Chan and Hamilton by allowing patients to be uncertain about whether they are assigned to a treatment or control group.
Tariff choice is another area where Bayesian learning models have been useful. It is widely believed that consumers are irrational when they choose between flat-rate and per-use plans, as several studies have found that many consumers could save by switching to a per-use option. But these papers tend to look at behavior over a short period. By using a longer period, Miravete (2003) finds strong evidence to contradict the irrational consumer view. Using data from the 1986 Kentucky tariff experiment, he provides evidence that consumers learn their actual usage rates over time, and switch plans in order to minimize their monthly bills.
Narayanan, Chintagunta and Miravete (2007) interpret the same data using a Bayesian learning model with myopic consumers. To explain why consumers make mistakes in choosing an initial plan, they assume consumers are uncertain about their actual usage. The structural approach allows them to quantify changes in consumer welfare under different counterfactual experiments. Iyengar, Ansari and Gupta (2007) develop a closely related myopic learning model, but they also allow consumers to be uncertain about the quality of service.
Goettler and Clay (2011) use a Bayesian learning model to infer switching costs for tariff plans. They do not observe consumers switching plans in their data. Identification of switching costs is achieved by assuming consumers are forward-looking, have rational expectations about their own match value, and make plan choice decisions every period after their initial enrollment.
The implied cost to rationalize no switching is quite high ($208 per month). Grubb and Osborne (2012) argue that an alternative explanation for infrequent plan switching is that consumers do not consider plan choice every period (as consideration is costly and/or time consuming). They model the consideration decision using the "Price Consideration Model" of Ching, Erdem and Keane (2009), and the switching decision using a Bayesian learning model. Their rich data set allows them to investigate prior mean bias, projection bias and overconfidence.
Finally, learning models have also been extended to study the value of certification systems (Chernew et al., 2008; Xiao, 2010).

Limitations of the Existing Literature and Directions for Future Work
In this section we discuss limitations of existing learning models and directions for future research. In our view, the four main limitations are: (1) it is difficult to identify complex models with rich specifications of consumer behavior, (2) it is difficult to disentangle different sources of dynamics, (3) there is no clear consensus on forward-looking vs. myopic consumers, and (4) more work is needed on how to estimate equilibrium models with consumer learning.

Identification in Behaviorally Rich Specifications
In Section 2.5.A we discussed the formal identification of learning models. This topic is also addressed in a number of other papers, such as Erdem, Keane and Sun (2008). By formal identification we refer to a proof that a parameter is identified, given the structure of the model, as well as a discussion of any normalizations that are needed to achieve identification. However, an important point (see Section 2.6) is that it is common in complex models for a parameter to be formally identified and yet: (i) the intuition for what patterns in the data actually pin it down is not clear, and/or (ii) the likelihood is so close to flat in the parameter that it cannot be estimated precisely in practice (what Keane (1992) called "fragile identification"). These problems are not at all special to dynamic learning models, but they deserve further attention in this context. A particularly important issue is that, given only revealed preference (RP) data on purchase decisions and signal exposures, it may be hard to identify models with complex learning mechanisms, or to distinguish among alternative learning mechanisms (i.e., multiple mechanisms may fit the data about equally well).

A promising solution to this problem is to combine RP data with stated preference (SP) data that attempts to directly measure the learning process (also known as process data). For example, consider the paper by Erdem, Keane, Öncü and Strebel (2005) on how consumers learn about computers. In addition to RP data, they also had data on how people rated each brand in each period leading up to purchase. They treated the SP data as providing noisy measures of consumers' perceptions. This enabled them to identify the variances of different information sources. Intuitively, if people's ratings tend to move a lot after seeing an information source (and their perceived uncertainty tends to fall a lot), it implies that information source is perceived as accurate. Another paper that combines RP and SP data to aid in identification is Shin, Misra and Horsky (2012), which attempts to disentangle preference heterogeneity from learning.
An alternative approach is to combine choice data with direct measures of information signals. Roberts and Urban (1988) did this in their original paper. Ching and Ishihara (2010) and Lim and Ching (2012) use results of clinical trials to measure the content of signals received by physicians, and incorporate this into their structural learning models. In a reduced form study, Ching et al. (2011) use data on media coverage of prescription drugs and find evidence that when patients learn that anti-cholesterol drugs can reduce heart disease risk, they become more likely to adopt them. Kalra et al. (2011) attempt to pin down the content of information signals by examining news articles. Chintagunta et al. (2009), in a study of doctors' prescribing decisions for Cox-2 inhibitors, use patient diary data that record actual use experience. 43 In sum, there has been some work in this area, but there is obviously much room for further progress. 44

Distinguishing Among Different Sources of Dynamics
Learning is one of many mechanisms that may cause structural state dependence. Other potential sources of state dependence include inertia, switching costs, habit persistence and inventories. In this section we discuss attempts to distinguish among these sources of dynamics.
We particularly emphasize the problem of distinguishing between learning and inventories, because most dynamic structural models have assumed one of these mechanisms as the source of dynamics. Furthermore, and perhaps surprisingly, the behavioral patterns generated by learning can be quite similar to those generated by inventories. Thus it can be very difficult to identify which mechanism generates the state dependence we see in the data.
Learning and inventory models generate dynamics in very different ways. Learning models generate persistence in choices (brand loyalty) as risk aversion leads consumers to stay with "familiar" brands. This familiarity arises endogenously, via information signals that cause consumers to gravitate toward particular brands early in the choice process. Inventory models, in contrast, do not generate persistence in brand choices. Rather they must assume the existence of a priori consumer taste heterogeneity to generate loyalty. Obviously, a great appeal of learning models is they provide a behavioral explanation for the emergence of brand loyalty.
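The mechanism can be illustrated with a toy simulation (our own construction for illustration, not the EK model itself): two brands with identical true quality, a risk-averse consumer who values each brand at its perceived mean minus a penalty on perceived variance, and noisy use-experience signals.

```python
import random

# Toy illustration of how risk aversion plus experience signals generate brand
# loyalty endogenously. The certainty-equivalent valuation (mean - r * variance)
# and all parameter values are illustrative assumptions of this sketch.

random.seed(0)
TRUE_Q, SIGNAL_SD, RISK_AVERSION = 1.0, 0.4, 1.0

def simulate_consumer(periods=20):
    mean = [0.0, 0.0]          # prior means for brands A and B
    var = [1.0, 1.0]           # prior variances (both brands unfamiliar)
    choices = []
    for t in range(periods):
        # risk aversion penalizes unfamiliarity, so familiar brands look better
        ce = [mean[j] - RISK_AVERSION * var[j] for j in range(2)]
        j = 0 if ce[0] >= ce[1] else 1
        choices.append(j)
        signal = random.gauss(TRUE_Q, SIGNAL_SD)   # use-experience signal
        k = var[j] / (var[j] + SIGNAL_SD ** 2)     # Bayesian updating weight
        mean[j] += k * (signal - mean[j])
        var[j] *= (1 - k)                          # chosen brand grows familiar
    return choices

# Across many consumers, late-period choices concentrate on one brand per
# consumer, even though the two brands are objectively identical.
loyalty = []
for _ in range(500):
    c = simulate_consumer()
    late = c[10:]
    loyalty.append(max(late.count(0), late.count(1)) / len(late))
print(sum(loyalty) / len(loyalty))
```

In this sketch loyalty emerges with no a priori taste heterogeneity at all: whichever brand is sampled early becomes the low-variance, high certainty-equivalent option, and the consumer locks in.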
However, once we introduce unobserved taste heterogeneity, the dynamics generated by learning and inventory models are rather hard to distinguish empirically. The similarity of the two models is discussed extensively by Erdem, Keane and Sun (2008). They fit a learning model to essentially the same data used in the inventory model of Erdem, Imai and Keane (2003), and find that both models fit the data about equally well, and make very similar predictions about choice dynamics. For instance, both models predict that, in response to a price cut, much of the increase in a brand's sales is due to purchase acceleration rather than brand switching.
The similarity of the two models is even greater if we allow for price as a signal of quality. Then, both models predict that frequent price promotion will reduce consumer willingness to pay for a product; in the signaling case by reducing perceived quality, in the inventory case by changing price expectations and making it optimal to wait for discounts.
Obviously an important avenue for future research is to determine whether learning or inventory effects are of primary importance for explaining consumer choice behavior, or, indeed, whether both mechanisms are important. But unfortunately, computational limitations make it infeasible to estimate models with both learning and inventories. There are simply too many state variables (levels of perceived quality and uncertainty for all brands, inventories of all brands, current and lagged prices of all brands) to make solution and estimation feasible. This makes it impossible to nest learning and inventory models and assess the quantitative importance of each mechanism.
Presumably, advances in computation will remove this barrier in the future. Erdem, Katz and Sun (2010) propose a simple test of the relative importance of learning vs. inventories. They consider the learning mechanism where consumers use price as a signal of quality. They also exploit the fact that inventory models generate "reference" price effects (i.e., choices are based on the current price relative to the reference price of a brand). Their test relies on the interaction between a use experience term and the reference price (operationalized as an average of past prices). In a learning model, higher use experience should be associated with less use of price as a quality signal. Based on this test, they find evidence for both learning and inventory (i.e., reference price) effects for two frequently purchased goods (ketchup and diapers).
As an alternative to nesting learning with other models of dynamics, a simple idea is to estimate a structural learning model and include a lagged choice variable in the payoff functions to capture any "left-over" state dependence in a reduced-form way. This model is identified, as lagged choice does not enter the EK learning model (only cumulative choices matter). However, it is difficult to interpret the lag coefficient. Osborne (2011) adopted this approach and called the lag coefficient "switching costs." But there are many possible explanations, including inertia, inventories, habit persistence and recency effects in learning. Suppose the standard Bayesian model is not literally true, and consumers put extra weight on recent signals. Then a lagged purchase variable may just absorb the misspecification of the learning process. In general, we are skeptical that including non-structural elements in a learning model can be informative about the importance of learning vs. other mechanisms that generate dynamics. We believe that nesting of learning and other mechanisms, and incorporation of process data, are needed to make progress.

Forward-Looking vs. Myopic Consumers
As we discussed in Section 2, the key distinction between forward-looking and myopic models is whether consumers engage in strategic trial. But the evidence on whether consumers are forward-looking is mixed. Indeed, in many applications, researchers have found it difficult to identify the discount factor, because the likelihood is rather flat in this parameter. For instance, in the detergent category, Erdem and Keane (1996) found that increasing the discount factor from 0 (a myopic model) to 0.995 improved the likelihood by only 6 points. That was significant, but if the likelihood is so flat in the discount factor, it is hard to discern forward-looking behavior. 45 As forward-looking models may not provide substantial fit improvements, and as they are much harder to estimate, it is not surprising that many researchers have adopted myopic models, as we saw in Section 3.1. But before taking this path, it is important to emphasize that strategic trial is the distinguishing feature of forward-looking models. In a mature category, consumers may have nearly complete information, leaving little to gain from trial purchase. Then a forward-looking consumer will behave much like a myopic one; it is impossible to tell the two types apart, and the discount factor is not identified.
Given this observation, the small likelihood improvement that EK found may not be surprising; subjective prior uncertainty is fairly low for detergent, so perceived gains from trial are small. In contrast, in a market with significant uncertainty about product attributes, the rate of trial would be higher. 46 Forward-looking models may provide a superior fit in such markets.
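The point can be made with a back-of-envelope calculation (a risk-neutral sketch of our own, assuming a single noiseless trial fully reveals the new brand's quality): the option value of a trial purchase is roughly proportional to prior uncertainty, so with low prior variance there is almost nothing for the discount factor to act on.

```python
import random

# Sketch of why the discount factor is weakly identified in mature categories:
# the expected future gain from a trial purchase (the option value of learning)
# vanishes as prior uncertainty about the unfamiliar brand goes to zero.
# Risk neutrality and a fully revealing trial are simplifying assumptions of
# this sketch, not features of any estimated model.

random.seed(1)
Q_KNOWN = 0.5   # quality of the familiar incumbent brand

def option_value_of_trial(prior_sd, draws=100_000):
    """Expected next-period gain from knowing the new brand's quality,
    relative to never trying it and keeping the incumbent."""
    total = 0.0
    for _ in range(draws):
        q_new = random.gauss(Q_KNOWN, prior_sd)  # prior centered on incumbent
        total += max(q_new, Q_KNOWN) - Q_KNOWN   # informed choice tomorrow
    return total / draws

values = {sd: option_value_of_trial(sd) for sd in (1.0, 0.5, 0.1, 0.01)}
for sd, v in values.items():
    print(f"prior sd = {sd}: option value of trial = {v:.4f}")
# The analytical value is prior_sd / sqrt(2*pi), i.e., about 0.4 * prior_sd:
# the gain from strategic trial scales directly with prior uncertainty.
```

Because the entire future payoff to strategic trial scales with prior uncertainty, the likelihood contribution of the discount factor is tiny whenever uncertainty is small, regardless of whether consumers are truly forward-looking.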
Thus, we think it would be a mistake to infer from results on relatively mature categories that forward-looking models are unnecessary. This decision should be made on a case-by-case basis given the characteristics of the category under consideration. A key agenda item for the literature on learning models is to compare the degree of prior uncertainty across categories, to determine when forward-looking behavior is most important.
Furthermore, we believe developing simple tests for forward-looking behavior (such as the test in the Ching, Erdem and Keane (2012) quasi-structural model) should be an important topic for future research. However, we also believe such tests must be derived explicitly from a theoretical model. Recently, there has been a trend whereby researchers seek to develop "model-free" tests for the assumptions of a structural model. We comment on this trend in Section 4.5.

Integration of Learning Models with Supply Side Models
As we discussed in Section 3.3, there has been significant progress in developing dynamic demand systems with consumer learning that can be estimated using product level data.
Nevertheless, there is clearly a large discontinuity between these models and those that are applied to individual level data. All the models in Section 3.3 abstract from self-learning and, for reasons of tractability, assume the existence of an information aggregator. 47 In our opinion, self-learning is still an important source of information for frequently purchased goods, despite the advance of social networks. Moreover, none of the models in Section 3.3 allow for forward-looking consumers. Therefore, we believe that developing a richer aggregate dynamic demand system with learning remains a challenging and important area for future research.
Most of the demand analyses that use product level data are motivated by the ultimate goal of combining demand with firms' problems, in order to build an equilibrium model to study the long-term outcomes of certain policy changes (e.g., advertising regulations, anti-competitive pricing regulation, merger analysis). The key challenge is how to model the firms' problem when facing such a complex dynamic demand system. The demand system generated by the EK framework (or similar models) is so complex that it is very difficult to analyze even in a monopoly situation. It is not clear if the fully rational approach to modeling firms' decisions is possible given the computational costs of keeping track of such a complicated state space.
Thus, an important avenue for future research is to develop a dynamic demand system with learning that is not too costly for firms to use, yet can capture the potential forward-looking and strategic trial behavior of consumers. Hendel and Nevo (2011) take this research direction, but in the context of storable goods and not experience goods. Their demand model is motivated by the dynamic stockpiling models in Erdem, Imai and Keane (2003) and Hendel and Nevo (2006), but is much simpler to estimate, and tractable to combine with forward-looking firms.
They show how their model can be used to study intertemporal price discrimination empirically.

Model-Free Evidence on the Validity of Structural Models?
Structural models in general, and learning models in particular, are often criticized on the grounds that they make a large number of assumptions (e.g., about how consumers learn and form expectations, the functional form of utility, etc.). Identification of these models relies on these functional form assumptions. Critics of structural models often argue that we should prefer "simple methods" and/or "model free" evidence. The debate on this topic is extensive, and beyond the scope of this survey. For further discussion we refer the reader to articles such as Heckman (1997), Keane (2010a,b) and Rust (2010). These authors argue that drawing inferences from data always relies on some set of maintained assumptions. They argue that simple reduced form or statistical models typically rely on just as many assumptions as structural models, the main difference being that the simple models leave many assumptions implicit. Here, instead of repeating their general arguments, we illustrate our point by discussing two example papers that use such "simple" approaches to test for learning behavior. 48

Chintagunta, Goettler and Kim (2012) present reduced-form evidence of forward-looking behavior by physicians. More specifically, when a new drug is just introduced, they focus on the set of physicians who have not yet been exposed to detailing. They run a logit model to predict whether a physician will prescribe the new drug to a patient. The key point is that they include future detailing as a regressor. Say there is some risk involved in experimenting with the drug now, but future detailing is an opportunity to learn without risk. Hence, they argue, if physicians are forward-looking, then the higher is future detailing, the less likely they are to prescribe the drug now. So a negative coefficient on future detailing suggests physicians are forward-looking.
However, this "model-free" test implicitly assumes there is no physician heterogeneity in receptivity to detailing. But it is plausible that some physicians are more skeptical about sales rep presentations, so they require more detailing to be convinced. This could cause sales reps to spend more time with less receptive physicians. Then, the coefficient on future detailing may be negative even if physicians are myopic. More generally, including a future variable in a regression is a Sims strict exogeneity test (Sims 1972). It may just be that a current prescription reduces future detailing. Thus, while the test result is interesting, it is difficult to interpret.
We now turn to our second example. In an attempt to distinguish learning from other sources of state dependence such as switching costs, inertia or habit persistence, Dubé, Hitsch and Rossi (2010) estimate a simple discrete choice model of (roughly) the form:

U_ijt = α_j + γ_0 d_j,t-1 + γ_1 N_j(t) + γ_2 d_j,t-1 N_j(t) + β X_jt + ε_ijt,   (34)

where d_j,t-1 is lagged choice, N_j(t) is cumulative use experience, and X_jt contains price and other marketing-mix controls.
Note that (34) contains lagged choice, cumulative use experience N_j(t), and their interaction. So it could be viewed as a linear approximation to the more complex nonlinear form implied by the learning model (see Equations (5), (6) and (27)). Now, suppose the learning model is correct. Dubé et al. argue that, for experienced consumers who have complete information, lagged choice should not be a predictor of current choice. 49 This is because, when cumulative experience is large, the additional impact of more experience on the perceived variance of a brand is trivial (see Equation (6)). 50 More generally, the fact that use experience N_j(t) reduces the effect of lagged purchase implies the interaction coefficient γ_2 should be negative. However, using data on margarine and frozen orange juice, Dubé et al. find that more experience does not reduce the lagged choice effect (γ_2 ≈ 0). They interpret this as evidence against consumer learning.

48 For another example, see Ching (2013) for a critique of Moretti (2011).
It is tempting to treat this as a "model-free" test, as it does not impose the functional form assumptions required to estimate a fully specified learning model. But this interpretation is not correct. First, the test fails to account for a key feature of the Bayesian learning model: when N_j(t) is large, so a consumer knows almost everything about brand j, any further increase in N_j(t) has a negligible impact on utility. However, Equation (34) does not allow for this possibility, as the marginal effect of experience, γ_1 + γ_2 d_j,t-1, is independent of N_j(t). Second, the linear form also restricts how the impact of d_j,t-1 varies when N_j(t) is small. Given these points, it is possible that γ_2 may be close to zero, even if there is consumer learning in the data.
So again, the test result is interesting, but it is difficult to interpret.
We believe that searching for data patterns that are potentially consistent or inconsistent with a structural model is a useful exercise. It can often provide valuable insights, and can be a useful part of the process of building, validating and improving structural models. However, we do not believe that "simple models" and/or "model free" evidence can ever replace structural models or the key role of theory in empirical work more generally.
It is important to remember that truly "model free" evidence cannot exist. The "simple" empirical work that promises to deliver such evidence always relies on some assumptions. But these assumptions are often left implicit, due to failure to present an explicit model. Often these implicit assumptions are (i) not obvious, (ii) hard to understand and (iii) very strong. One of the main contributions of structural learning models to both marketing science and economics has been to generate much more interest in the structural paradigm. We hope this will be a long-term trend, regardless of future evaluations of the usefulness of the learning model per se.

49 Given controls for taste heterogeneity (α_j), the lagged choice variable d_j,t-1 can matter for several reasons, such as inertia, switching costs, habit persistence, inventories or learning. So finding that lagged choice is significant for consumers with complete information may simply mean that sources of dynamics besides learning are also present. 50 It is important to note that not all learning models imply that choice behavior becomes stationary given sufficient use experience. For instance, as we discussed in Section 3.1.1, Mehta, Rajiv and Srinivasan (2004) extend the basic model to allow forgetting. It is also possible that product attributes change over time. Thus, it is conceptually straightforward to construct learning models where recent experience is more salient for a variety of reasons.

Summary and Conclusion
In this survey we laid out the basic Bayesian learning model of brand choice, pioneered by Eckstein et al. (1988), Roberts and Urban (1988) and Erdem and Keane (1996). We described how subsequent work has extended the model in important ways. For instance, we now have models where consumers learn about multiple product attributes, and/or use multiple information sources, and even learn from others via social networks. The model has also been applied to many interesting topics well beyond the case of brand choice, such as how consumers learn about different services, tariffs, forms of entertainment, medical treatments and drugs.
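At the core of these models is normal-normal Bayesian updating of a perceived attribute: each use experience delivers a noisy signal, and the consumer's posterior mean moves toward the signal with a Kalman-style gain while the posterior variance shrinks. A minimal sketch of this updating step (all parameter values and variable names are illustrative, not drawn from any estimated model in the literature):

```python
import random

# Illustrative sketch of normal-normal Bayesian updating of perceived
# quality, in the spirit of the basic learning model. Values are assumed.
random.seed(0)

true_quality = 2.0     # unknown attribute the consumer learns about
sigma2_nu = 0.5        # variance of use-experience signals (assumed)
mean, var = 0.0, 4.0   # prior perception: mean and variance (assumed)

for t in range(20):
    signal = random.gauss(true_quality, sigma2_nu ** 0.5)  # noisy signal
    k = var / (var + sigma2_nu)        # Kalman gain
    mean += k * (signal - mean)        # posterior mean moves toward signal
    var = (1 - k) * var                # posterior variance shrinks

print(round(mean, 2), round(var, 3))   # mean near 2, variance near 0.025
```

After twenty signals the posterior variance is essentially pinned down by the deterministic recursion (1/σ²_0 + T/σ²_ν)⁻¹, which is why choice behavior in these models stabilizes with sufficient use experience.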
We also identified some limitations of the existing literature. Clearly an important avenue for future research is to develop richer models of learning behavior. For instance, it would be desirable to develop models that allow for consumer forgetting, changes in product attributes over time, a greater variety of information sources, and so on. But such extensions present both computational problems and problems of identification. We suggest it would be desirable to augment RP data with direct measures of consumer perceptions and direct measures of signal content to help resolve these identification problems.
One clear limitation of the existing literature has been the difficulty of precisely estimating the discount factor in dynamic learning models. This makes it difficult to distinguish forward-looking and myopic behavior. We discussed the search for exclusion restrictions (i.e., variables that affect future but not current payoffs) to help resolve this issue.
Another key challenge for future research is to develop models that combine learning with other potentially important sources of dynamics, such as inventories or habit persistence.
We noted it has not been possible to build inventories into dynamic learning models due to computational limitations. However, this line of research is important, because the dynamics generated by inventories can be quite similar to those generated by learning. Thus, it is important to try to distinguish between the two mechanisms. The identification of different sources of dynamics is also a challenge, and we again conclude that progress would be aided by the combination of RP and SP data.
Finally, we point out that integrating learning models of demand with supply side models remains under-explored and should be another important area for future research.
In summary, it is clear that learning models have contributed greatly to our understanding of consumer behavior over the past 20 years. Two of the best examples still come from the original Erdem and Keane (1996) paper: First, that when viewed through the lens of a simple Bayesian learning model the data are consistent with strong long-run advertising effects. Second, that a Bayesian learning model can do an excellent job of capturing observed patterns of brand loyalty. Future work will reveal whether such key findings are robust to the extension of these models to include multiple sources of dynamics and behaviorally richer models of learning behavior.