Asymptotic Behavior of Sequence Models

In this paper we study the limiting dynamics of a sequential process that generalizes Pólya's urn. This process has also been studied in the context of language generation, discrete choice, repeat consumption, and models for the web graph. The process generates future items by copying from past items. It is parameterized by a sequence of weights describing how much to prefer copying from recent versus more distant locations. We show that, if the weight sequence follows a power law with exponent α ∈ [0, 1), then the frequency of each token of the alphabet in the generated sequence converges to a limit. Moreover, in the case α > 2, we show that a single token is chosen infinitely often, while every other token is chosen only a constant number of times.


INTRODUCTION
In this paper we are concerned with a randomized process that produces sequences over a fixed alphabet $\{1, \ldots, T\}$ of tokens. The process begins with some finite initial history of tokens, and then proceeds by randomly selecting a previous location in the history to copy from, in order to produce the next output. The most recent location has a preference weight $w_1$, the second most recent location has weight $w_2$, and so forth; more recent locations are preferred. The location from $i$ steps ago is copied from with probability proportional to $w_i$. This simple process occurs in many settings:
• For the case with all weights uniform, $w_i = 1$, the process is exactly the classical Pólya's urn [11], with the initial contents of the urns given by the counts of each token in the history.
• If new tokens are introduced with constant probability, but the weights remain uniform, the process is exactly Simon's 1955 copying model [12], used to explain word frequencies in human language.
• If the tokens correspond to graph vertices, and new tokens are again introduced with constant probability, the model is an important special case of graph copying models [3,9].
• If the sequence of weights $w_i$ in the model is learned, the resulting model has been used to explain repeat consumption behavior in multiple domains [1,4,8].
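The copying process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `power_law_weights` and `extend_by_copying`, the choice of power-law weights $w_k = k^{-\alpha}$, and the seeded random generator are all assumptions made for the example.

```python
import random


def power_law_weights(n, alpha):
    """Return [w_1, ..., w_n] with w_k = k^(-alpha) (assumed weight family)."""
    return [k ** (-alpha) for k in range(1, n + 1)]


def extend_by_copying(history, steps, alpha, rng):
    """Extend a non-empty `history` by `steps` tokens copied from the past.

    The position k steps back is chosen with probability proportional to w_k.
    """
    seq = list(history)
    weights = power_law_weights(len(seq) + steps, alpha)
    for _ in range(steps):
        n = len(seq)
        # Offsets 1..n are available; offset 1 is the most recent position.
        k = rng.choices(range(1, n + 1), weights=weights[:n])[0]
        seq.append(seq[n - k])
    return seq


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    print(extend_by_copying([1, 2, 3], steps=20, alpha=0.5, rng=rng))
```

Setting `alpha=0` gives uniform weights, which corresponds to the Pólya urn case in the first bullet. The sketch costs time linear in the sequence length per step, which is adequate for short illustrations.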
Perhaps the most fundamental question about this model is: what happens when it runs? In this paper, we wish to understand the limiting behaviors of sequences produced by models like this, especially the following central questions: (1) As the length of the generated sequence grows, does it reach a limiting distribution under some definition? (2) When the limiting distribution exists, does it have positive support over the entire token set, or do certain tokens disappear forever? (3) What can be said about the relationship between the attention weights and the limiting distribution?

Our results
We assume that our model is given some fixed prefix or history $x_1 x_2 \ldots x_h$, and then repeatedly predicts $x_i$ given $x_1 \ldots x_{i-1}$. For $i > h$, every element $x_i$ copies directly or indirectly from some position in the history. To study limiting distributions, we fix some arbitrary subset of positions of interest in the history, and state our results in terms of the long-term occurrences of tokens copied from these positions. First, some definitions. For $i > h$, let $X_i$ be an indicator that is 1 if $x_i$ is copied, directly or indirectly, from a "position of interest" and 0 otherwise, and let $Z_i = \frac{1}{i-h} \sum_{k=h+1}^{i} X_k$ be the fraction of positions after the history, up to $i$, that copy from a position of interest. Let $Z^* = \lim_{i \to \infty} Z_i$. We want to know when $Z^*$ exists, and what its properties are.
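The following two results are already known: (1) For $w_i = 1$, $Z^*$ exists and is beta-distributed (Pólya urn). (2) For $w_i = 2^{-i}$, a single token eventually wins and is copied forever, so $Z^*$ exists and takes a value in $\{0, 1\}$ (Anderson et al. [1]).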
To obtain convergence results of this kind, our assumption that the weights follow a power law turns out to be important. We show non-convergence results (omitted in this version) for weights that do not obey a power law but satisfy weaker analytic properties such as monotonicity or convexity.

RELATED WORK
Urn processes have been studied by classical mathematicians for hundreds of years, but the standard Pólya's urn formulation was developed and analyzed by Eggenberger and Pólya in 1923 [5]. Since then, a number of variations have been proposed, introducing complex rules for modifying the state of balls in urns in response to each draw. However, these processes are designed to be characterized by the state of the urns, so the numerous extensions do not typically consider rules that depend on the order of past draws. Herbert Simon [12] introduced a sequential process to study the emergence of power laws in language, in partial response to work of Zipf [14] six years earlier. In Simon's model, during each timestep, the next token will be copied with some probability from a uniformly-selected past location, and with remaining probability, will be a previously-unseen character. The continued introduction of neologisms into the vocabulary matches observations of natural language text. The process is known to converge in the limit to a distribution over token frequencies that matches a power law, again corresponding to natural language text.
Related to Simon's model, a number of authors [3,9] developed sequential models for the evolution of graphs, intended to reproduce the power law in-degree distribution observed for the web graph [10]. These models also produced new links by selecting existing links to copy from, using a uniform distribution.
Both Simon's copying model and the models of graph evolution relied on the introduction of new vocabulary as the model evolved, and their analysis was fundamentally structured around a growing set of tokens; hence, while the models are similar, the analytical techniques are not applicable to our domain.
More recently, Anderson et al. [1] employed copying models in the context of reconsumption of items: a user might listen to the same song many times, or eat at the same restaurant, and these decisions were shown to be well-modeled by a process that selects an item to re-consume by copying a previous consumption from the past. In their examples, and in follow-on work [4], the preference weights $w_i$ for the $i$th previous item were learned through maximum likelihood estimation. In this domain, as in natural language, the particular form of the resulting weights is approximated well by a power law.
Other than Pólya's urn, the only work of which we are aware that studies the limiting behavior of processes of this form is the work of Anderson et al. [1], who show that the weight sequence $w_i = 2^{-i}$ leads to a limiting distribution in which all tokens but one eventually disappear, leaving a single "winner" who will then be copied forever. In practice, we are not aware of real-world datasets in which the weights decay exponentially, hence our interest in extending this result to power law distributions, beyond the $\alpha = 0$ case of Pólya's urn.

BACKGROUND
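Let $\mathcal{T} = \{1, \ldots, T\}$ be the alphabet of tokens, let $w_1, w_2, \ldots$ be the weights, and let $x_1, \ldots, x_h \in \mathcal{T}$ be the history. Given the weights and history, the model generates an infinite sequence from $\mathcal{T}$ according to the following rule: for $i > h$ and $t \in \mathcal{T}$,
\[ \Pr[x_i = t \mid x_1, \ldots, x_{i-1}] = \frac{\sum_{j=1}^{i-1} w_{i-j} \cdot \mathbb{1}[x_j = t]}{\sum_{j=1}^{i-1} w_j}, \]
where $\mathbb{1}[\cdot]$ is the binary indicator function. The model thus captures the process of extending the sequence by randomly choosing a position from the past according to the weights and copying the token in that position. Since the weights $w_i$ are monotonically non-increasing in $i$, tokens from the more recent past have a higher chance of being copied than tokens from the distant past.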
We say that $h$ is the end of the history. Let $i > h$. By definition of the process, position $i$ copies from a random position less than $i$, chosen according to the weights. We use $c(i)$ to denote the position that $i$ copies from, and we write $c^{(1)}(i) = c(i)$ and $c^{(j+1)}(i) = c(c^{(j)}(i))$ for iterated copies. Let $C(i)$ be the set of positions larger than $h$ among $i, c(i), c^{(2)}(i), \ldots$, and let $f(i) = \min C(i)$. In other words, if we treat the sequence of copies that ended in position $i$ as a chain, then $C(i)$ is the set of positions along this chain and $f(i)$ is the final position outside the history in the chain starting from $i$. An equivalent interpretation is to consider a walk starting at $i$ that jumps to position $c(i)$, then jumps to position $c^{(2)}(i)$, and so on. We use the following random variable to denote the collision event:
\[ \xi_{i,j} = \mathbb{1}[f(i) = f(j)]. \]
We will be using Chebyshev's inequality and the Chernoff bound.
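The chain notation above can be made concrete with a short Python sketch. It assumes 1-indexed positions, power-law weights $w_k = k^{-\alpha}$, and that the chain from $i$ includes $i$ itself; the helper names (`sample_copy_pointers`, `chain`, `final_position`, `collide`) are hypothetical and chosen only for this illustration.

```python
import random


def sample_copy_pointers(h, n, alpha, rng):
    """Sample c(i) for every position i in h+1..n (positions are 1-indexed).

    Position i copies from i - k with probability proportional to k^(-alpha).
    """
    weights = [k ** (-alpha) for k in range(1, n)]
    c = {}
    for i in range(h + 1, n + 1):
        k = rng.choices(range(1, i), weights=weights[: i - 1])[0]
        c[i] = i - k
    return c


def chain(i, c, h):
    """Return (visited, source): the positions > h on the walk from i, in
    visiting order, and the history position (<= h) where the walk ends."""
    visited = []
    while i > h:
        visited.append(i)
        i = c[i]
    return visited, i


def final_position(i, c, h):
    """f(i): the last position outside the history on the chain from i > h."""
    return chain(i, c, h)[0][-1]


def collide(i, j, c, h):
    """Indicator of the collision event f(i) = f(j)."""
    return final_position(i, c, h) == final_position(j, c, h)
```

For example, with `h = 2` and pointers `{3: 1, 4: 3, 5: 4, 6: 2}`, the walks from 4 and 5 both leave the non-history region through position 3, so they collide, while the walk from 6 enters the history directly and collides with neither.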
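Theorem 3.2 (Chebyshev's inequality). Let $X$ be a random variable with finite expectation and variance. Then, for each $c > 0$,
\[ \Pr[|X - \mathbb{E}[X]| \ge c] \le \frac{\mathrm{Var}[X]}{c^2}. \]
Theorem 3.3 (Chernoff bound). Let $X_1, \ldots, X_n$ be iid 0/1 random variables. Let $X = \sum_{i=1}^{n} X_i$. Then, for $0 < \delta < 1$,
\[ \Pr[X \le (1 - \delta)\,\mathbb{E}[X]] \le \exp\left(-\delta^2\,\mathbb{E}[X] / 2\right). \]

In this section we show our main result: convergence to a limit. We will show this in two steps: (i) the probability of the collision event $\xi_{i,j}$ vanishes as $h$ increases, assuming $w_i = i^{-\alpha}$ for some $0 \le \alpha < 1$, and (ii) a delicate covariance analysis uses the vanishing collisions to establish convergence.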

Vanishing collisions
In this section we show that $\Pr[\xi_{i,j}] \le o_h(1)$, i.e., the collisions vanish with history size. To simplify the exposition, we first introduce the following notation. Let $\xi_i$ be the event that the jump from $i$ ends in some position in $\{1, 2, \ldots, \lfloor i/2 \rfloor\}$, i.e., $\xi_i = \{c(i) \le \lfloor i/2 \rfloor\}$. We prove something slightly more general.
We use this to prove that, if $\alpha < 1$, then the number of steps from a generic position $i$ to a position in the history, i.e., the length $|C(i)|$ of the chain from $i$, is at most $O(\log i)$ with high probability.

Theorem 4.2. Suppose that $0 < \alpha < 1$, and fix $h \ge 1$. Then there exists a constant $c = c(\alpha)$ such that, if we let $n_j = n_j(i)$, for $j \le i$, be the number of positions in the range $\{h+1, h+2, \ldots, j\}$ in the chain starting at $i$, we have that
\[ \Pr[\forall j \in \{h+1, \ldots, i\} : n_j \le c \ln(j+1)] \ge 1 - o_h(1). \]

Proof. In each position $j$, the probability that the event $\xi_j$ happens is at least $p = p(\alpha) = \Theta(1)$. Therefore, by the Chernoff bound, the probability that the event happens fewer than $\lg j$ times in $\frac{200}{p} \ln j$ trials is at most $j^{-11}$. Since each occurrence of the event at least halves the current position, the probability that the chain does not reach the history after having visited $\frac{200}{p} \ln j$ positions is at most $j^{-11}$. By the union bound, the probability that there exists some $j \ge h+1$ that does not reach the history after $\frac{200}{p} \ln j$ steps is at most
\[ \sum_{j \ge h+1} j^{-11} \le \int_h^\infty x^{-11}\,dx = \frac{1}{10\,h^{10}} = o_h(1). \]
□

Using these, we obtain the desired result on vanishing collisions.

Theorem 4.3. Let $\alpha \in [0, 1)$ and $h < i < i'$ be given. Then, $\Pr[\xi_{i,i'}] \le o_h(1)$.

Proof. By Theorem 4.2, with probability $1 - o_h(1)$, for each $j \ge h$, the walk from $i$ (called the $i$-walk) will visit at most $O(\ln(j+1))$ positions in the range $\{h+1, \ldots, j\}$. Now, consider the walk from $i'$ (called the $i'$-walk) and let $j > h$ be a generic position visited by the $i'$-walk. Assuming that $j$ has not also been visited by the $i$-walk, we ask: what is the probability $q$ that the position that the $i'$-walk jumps to from $j$ has not been visited by the $i$-walk? Observe that, with probability $1 - o(1)$, the $i$-walk will have visited at most $O(\ln(j+1))$ positions in the range $\{1, \ldots, j-1\}$. Under the conditioning, each fixed position is chosen by $j$ with probability at most $w_1 / \sum_{k=1}^{j-1} w_k = O(j^{\alpha-1})$, so the probability that $j$ jumps onto some position visited by the $i$-walk is at most
\[ 1 - q \le O\left(\frac{\ln(j+1)}{j^{1-\alpha}}\right). \]
Since, by Theorem 4.2, with probability $1 - o_h(1)$, the $i'$-walk will visit at most $O(\ln(j+1))$ positions in the range $\{j, \ldots, 2j\}$, a union bound over the ranges $\{2^k h, \ldots, 2^{k+1} h\}$, $k \ge 0$, gives
\[ \Pr[\xi_{i,i'}] \le o_h(1) + \sum_{k \ge 0} O\left(\frac{\ln^2(2^{k+1} h)}{(2^k h)^{1-\alpha}}\right) = o_h(1). \]
□
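The logarithmic chain length discussed in this section can be explored empirically. The sketch below is only an illustration under stated assumptions: it uses weights $w_k = k^{-\alpha}$ and 1-indexed positions, and the function names `chain_lengths` and `max_ratio_to_log` are hypothetical. It computes $|C(i)|$ for every position by dynamic programming over the sampled copy pointers.

```python
import math
import random


def chain_lengths(h, n, alpha, rng):
    """Return {i: |C(i)|} for i in h+1..n under weights w_k = k^(-alpha)."""
    weights = [k ** (-alpha) for k in range(1, n)]
    length = {}
    for i in range(h + 1, n + 1):
        k = rng.choices(range(1, i), weights=weights[: i - 1])[0]
        parent = i - k
        # A parent inside the history (<= h) contributes no further steps.
        length[i] = 1 + length.get(parent, 0)
    return length


def max_ratio_to_log(lengths):
    """Largest observed |C(i)| / ln(i) over all sampled positions."""
    return max(steps / math.log(i) for i, steps in lengths.items())


if __name__ == "__main__":
    lengths = chain_lengths(h=10, n=2000, alpha=0.5, rng=random.Random(0))
    print(max_ratio_to_log(lengths))
```

If the chain lengths grow logarithmically, the reported ratio should stay bounded as `n` increases; a single run is of course only suggestive and does not replace the proof.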

Convergence via bounded covariance
In this section we establish the convergence to the limit for $w_i = i^{-\alpha}$, for $\alpha \in [0, 1)$.¹ Fix arbitrarily a set $H \subseteq [h]$. For $k \in \{h+1, h+2, \ldots, t\}$, let $X_k$ be 1 if position $k$ ultimately copies from some position in $H$, and 0 if position $k$ ultimately copies from some position in $[h] \setminus H$. Later we will choose $H$ to be the set of positions in the history that contain a given token.
The idea behind the analysis is to bound the variance of the sum of the $X_i$'s. This is not immediate since the $X_i$'s are not pairwise independent. To handle this, we focus on the correlation between $X_i$ and $X_j$ and show that the covariance is vanishing. Once we establish this, the variance bound is relatively easy and the convergence result follows by appealing to Chebyshev's inequality.
First we analyze the covariance of $X_i$ and $X_j$.
¹ We point out that the result in this section does not require the weights to follow a power law; it only requires the vanishing collisions property of the chosen weights. We have proved this property for $w_i = i^{-\alpha}$, $\alpha < 1$, in the previous section.

Proof. We have that $\mathrm{Cov}[X_i, X_j] = \mathbb{E}[X_i X_j] - \mathbb{E}[X_i]\,\mathbb{E}[X_j]$.
We aim to split the probability space into $\xi \oplus \bar{\xi}$, where $\xi = \xi_{i,j}$ is the event "$f(i) = f(j)$" and $\bar{\xi}$ is its complement. We apply the law of total covariance, which yields one term for $\xi$ and one term for $\bar{\xi}$. The first term is at most $\Pr[\xi]$, since $0 \le X_i, X_j \le 1$ and probabilities are at most 1, and $\Pr[\xi] \le o_h(1)$ by Theorem 4.3. It remains to bound the second term.
Observe that, if $\bar{\xi}$ happens, i.e., if $f(i) \ne f(j)$, then the walks from $i$ and from $j$ will not meet in any position larger than $h$. Now, under the conditioning $f(i) = i'$ and $f(j) = j'$ (with $i' \ne j'$), we first claim that $X_i$ and $X_j$ are independent. Indeed, under that conditioning, $X_i$ is 1 if and only if $c(f(i)) \in H$, and $X_j$ is 1 if and only if $c(f(j)) \in H$. Since $f(i) \ne f(j)$, the random position that $f(i)$ copies from (i.e., $c(f(i))$) is independent of the random position that $f(j)$ copies from (i.e., $c(f(j))$); this shows that $X_i$ and $X_j$ are independent. It follows that the covariance of $X_i$ and $X_j$ conditioned on $f(i) = i'$ and $f(j) = j'$ is zero. Thus, the covariance of $X_i$ and $X_j$ is vanishing in $h$. □
Consider the sum $\sum_{k=h+1}^{i} X_k$, i.e., the number of positions in the range $h+1, \ldots, i$ that ultimately copy from some position in $H$. We now bound the variance of this random variable using the bound on the covariance that we just established.
Proof. Let $P = \{h+1, h+2, \ldots, i\}$. By linearity of expectation,
\[ \mathrm{Var}\Big[\sum_{k \in P} X_k\Big] = \sum_{k \in P} \mathrm{Var}[X_k] + \sum_{k, l \in P,\ k \ne l} \mathrm{Cov}[X_k, X_l], \]
and each covariance term is bounded as established above. □
With a bound on the variance, we apply Chebyshev's inequality to get the convergence result.
Theorem 4.6. It holds that the fraction of positions in $\{h+1, \ldots, i\}$ that ultimately copy from $H$ is concentrated around its expectation.
For a large enough $h$, we can then apply the union bound over each of the (constantly many) tokens, to get the following.
Corollary 4.7. Suppose that there are $T$ tokens (with $T = O(1)$). Condition on the sequence of the history to be some $\sigma \in [T]^h$, and let $Z_i = (Z_i(1), Z_i(2), \ldots, Z_i(T))$ be the vector containing in its $t$th position the fraction of occurrences of token $t$ at time $i > h$. Then, after having run the process for $h$ steps, the vector of occurrences $Z_i$ at any later step will be concentrated around its expectation.

Simulations
We run the process up until a position $h$, making up the history. Then, keeping the resulting history fixed, we repeatedly and independently run the process from that history up until time $10h$, keeping track of the final fraction of occurrences of a given token. In Figure 2, we plot the empirical distribution of the fraction of occurrences of that token, for $h = 100$ and $h = 1000$. While the expectation is random (it strongly depends on what happens in the $h$ steps of history), once we condition on the first $h$ steps, the final distribution is more concentrated as $h$ becomes larger. Moreover, in Figure 3, we plot the empirical probabilities of reaching a specific position in a history of length $h = 100$, for various values of $\alpha$, and from various starting points.
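An experiment of the kind described above could be set up roughly as follows. This is a sketch under assumptions that the text does not specify: the history is grown from a two-token seed `[0, 1]`, the fraction is measured over the whole sequence (history included), and the helper names are hypothetical. Sampling uses prefix sums and binary search so that each step costs logarithmic time.

```python
import bisect
import random
from itertools import accumulate


def cumulative_weights(n, alpha):
    """Prefix sums of w_k = k^(-alpha) for k = 1..n."""
    return list(accumulate(k ** (-alpha) for k in range(1, n + 1)))


def extend(seq, target_len, cum, rng):
    """Append copied tokens to `seq` in place until it has `target_len` items."""
    while len(seq) < target_len:
        n = len(seq)
        u = rng.random() * cum[n - 1]
        # Smallest offset k (1-based) whose prefix sum exceeds u; the min()
        # only guards against floating-point edge cases.
        k = min(bisect.bisect_right(cum, u, 0, n), n - 1) + 1
        seq.append(seq[n - k])
    return seq


def final_fractions(h, alpha, runs, seed, token=1, factor=10):
    """Fix one history of length h >= 2, then rerun the process up to length
    factor * h independently `runs` times; return the fractions of `token`."""
    rng = random.Random(seed)
    cum = cumulative_weights(factor * h, alpha)
    history = extend([0, 1], h, cum, rng)  # assumed two-token seed
    fractions = []
    for _ in range(runs):
        seq = extend(list(history), factor * h, cum, rng)
        fractions.append(seq.count(token) / len(seq))
    return fractions
```

Calling `final_fractions(100, 0.5, runs=200, seed=0)` and `final_fractions(1000, 0.5, runs=200, seed=0)` and comparing the spread of the two lists (for instance with `statistics.pstdev`) gives a rough analogue of the comparison between the two history lengths.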

SINGLE WINNER: α > 2
In this section we show that if the weights $w_i$ follow a power law with exponent greater than 2, then with probability 1 the process will converge to a "single-winner" limit.
Proof. As in [1], we study the probability that, starting from a given position $i + 1$, all positions copy from some position greater than or equal to $i$. If this happens, then all the positions greater than $i$ will end up copying from position $i$, and therefore all positions greater than or equal to $i$ will end up containing the same token.
We use the same approach as in [1], but generalized to $w_i = i^{-\alpha}$, as follows:
(i) Let the process go on for some number of steps $i$.
(ii) Fix $j \ge 1$. Then, the probability that position $i + j$ copies from some position in $\{i, i+1, \ldots, i+j-1\}$ is at least
\[ p_j = \frac{1}{\zeta(\alpha)} \sum_{k=1}^{j} k^{-\alpha}, \]
where $\zeta(\cdot)$ is the Riemann zeta function.
(iii) Therefore, the probability that for all $j \ge 1$, position $i + j$ copies from some position in $\{i, \ldots, i+j-1\}$ is at least
\[ \prod_{j \ge 1} p_j \ge \Big(\prod_{j=1}^{d} p_j\Big) \Big(1 - \sum_{j > d} (1 - p_j)\Big), \]
where $d$ can be chosen arbitrarily. In fact one can show that, for each $\alpha > 2$, it is possible to choose $d$ so that this probability is at least a constant $c(\alpha) > 0$.
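The role of the threshold $\alpha > 2$ in step (iii) can be checked numerically. The sketch below assumes the per-position bound $p_j = \zeta(\alpha)^{-1} \sum_{k \le j} k^{-\alpha}$, which follows from bounding the normalizing sum of the weights by $\zeta(\alpha)$; the function names and the truncation parameters are hypothetical. It multiplies the bounds $p_1 p_2 \cdots p_J$ for a finite horizon $J$.

```python
import math


def zeta_approx(alpha, terms=100_000):
    """Approximate the Riemann zeta function for alpha > 1.

    Uses a partial sum plus an Euler-Maclaurin style tail correction.
    """
    partial = sum(k ** (-alpha) for k in range(1, terms + 1))
    tail = terms ** (1 - alpha) / (alpha - 1) - 0.5 * terms ** (-alpha)
    return partial + tail


def streak_lower_bound(alpha, horizon):
    """Product of p_j = (sum_{k<=j} k^-alpha) / zeta(alpha), j = 1..horizon."""
    z = zeta_approx(alpha)
    partial = 0.0
    product = 1.0
    for j in range(1, horizon + 1):
        partial += j ** (-alpha)
        product *= min(1.0, partial / z)  # clamp against rounding error
    return product


if __name__ == "__main__":
    for alpha in (1.5, 2.5, 3.0):
        print(alpha, streak_lower_bound(alpha, horizon=2000))
```

For $\alpha > 2$ the terms $1 - p_j$ are summable, so the truncated product stabilizes at a positive value as the horizon grows. For $1 < \alpha \le 2$ this particular lower bound tends to zero; that shows only that the bound becomes vacuous there, not what the process itself does.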
In other words, with constant probability (bounded away from 0 and 1), all positions greater than $i$ will end up with the same token that is in position $i$. Consider the following procedure, which we call a phase:
(i) Let $j = 1$.
(ii) While true:
• Flip an independent coin with head probability at least $p_j$.
• If it is heads, let $j = j + 1$; otherwise, break.
Observe that once a phase begins, it has constant probability of never ending. Moreover, there is a simple coupling from this process to the original one: we try to begin a streak at $i$ when a phase begins; if the $j$th coin is heads, then position $i + j$ copies from some position in $\{i, i+1, \ldots, i+j-1\}$. Therefore, if all coins in a phase are heads (which happens with constant probability), our process will have converged. If not, our process might or might not have converged on the token at $i$. In any case, we run another phase at a later position $i'$, where the process will converge on the token at $i'$ with at least constant probability. Since each phase converges with constant probability, our process will eventually converge to a single winning token with probability 1. □
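The phase argument in this proof can be mimicked with a small simulation. The sketch below is illustrative only: a phase that "never ends" is approximated by a cap `max_flips`, and the head probabilities in the usage note are a hypothetical choice, not the bounds $p_j$ derived in the proof.

```python
import random


def run_phase(head_prob, max_flips, rng):
    """Run one phase: flip coin j with head probability head_prob(j) until
    the first tails. Returns the number of heads, capped at max_flips."""
    j = 1
    while j <= max_flips:
        if rng.random() < head_prob(j):
            j += 1
        else:
            break
    return j - 1


def phases_until_long_streak(head_prob, max_flips, rng):
    """Count phases until one produces max_flips consecutive heads.

    Terminates with probability 1 only if such a streak has positive
    probability under head_prob.
    """
    phases = 1
    while run_phase(head_prob, max_flips, rng) < max_flips:
        phases += 1
    return phases
```

As a usage example, take the hypothetical probabilities `head_prob = lambda j: 1 - 1 / (j + 1) ** 2`. The probability that the first $m$ flips are all heads telescopes to $(m+2) / (2(m+1))$, which tends to $1/2$, so `phases_until_long_streak(head_prob, 1000, random.Random(0))` behaves roughly like a geometric random variable with success probability close to one half.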

FUTURE DIRECTIONS
There are a number of open questions about our model as stated:
(1) Can our results be extended to $\alpha \in [1, 2]$?
(2) Does $Z^*$ always exist for any vector of weights $\vec{w}$?
Additionally, there is a set of models with more complex dynamics that show some connections to our simpler model:
• Modern sequence models based on attention [2] incorporate more features of the input, and more interactions among tokens of the history; the model we study represents a very special case of an attention-weighted ML sequence model.
• There are also models that directly capture the copying of tokens from the input to the output, such as CopyNet [7] and Neural Turing Machines [6].
These more complex models differ from ours in multiple respects, raising a number of questions:
(1) Can our results cover settings in which attention weights are only indirectly coupled to final probabilities of tokens? Such models may be fundamentally different, as a token may support the appearance of a different token.
(2) Can our results be extended to weights that depend on item embeddings?
(3) Can our results cover softmax normalization, rather than the normalization we use? It is easy to see that the results would be different, even for the classical Pólya case (i.e., uniform weights). With softmax and uniform weights, there seems to be a single winner with non-zero probability, which is in contrast with the classical case.
(4) Can our results extend to multiple attention heads [13]?
(5) Can our results extend to learned attention weights that depend on additional elements such as the context and the features of a particular attention position?