Pseudolikelihood Estimation for Social Networks

Interest in log-linear modeling for social-network data has grown steadily since Holland and Leinhardt (1981) proposed their p, model. That model was designed for a single binary relationship (directed graph) representing interactions between individuals. It assumed that interactions between pairs of individuals are mutuajly independent. Subsequent work has extended the model in various ways, including block-modeling and the case of dependence between pairs of individuals. In empirical work it would often be desirable to fit a wide variety of these models, as the differences in predictions or goodness of fit are likely to provide insights into the data. This has not been common practice, however, because estimation for some of the models has been difficult, and the maximum likelihood schemes developed for others involve different computer programs not always available in standard packages. The focus of this article is on a general estimation technique that maximizes the pseudolikelihood, the product of the probabilities of the binary variables, with each probability conditional on the rest of the data. The method is shown to be equivalent to a weighted least squares procedure and thus can be carried out with standard computer packages. In cases where true maximum likelihood estimation is available for comparison the two methods seem to work about equally well. The pseudolikelihood estimation is used in an example where the fits of a large number of different models are compared. Some of these models, such as various Markov block models, have not previously been proposed. In this example (as in others considered) it appears that the p,-type models are overparameterized, and that much more parsimonious models give tolerable fits.


INTRODUCTION
During the last few years the modeling of social-network data has become increasingly popular in statistics. In its simplest form a social network is a square array, or graph, G of binary random variables yij, with the event yij = 1 denoting an arc, or tie, between individuals i and j. In different contexts this might indicate that i knows j, i does business with j, i reports to j, and so on. Numerous probability models for such data have been developed in sociology and other disciplines [see Strauss and Freeman (1989) for a review]. Holland and Leinhardt (1981) were the first to develop a. log-linear model for network data. Their proposed p, model for the graph distribution of G assumes independence of the dyads (yij, yji) but includes parameters for density and reciprocity of arcs and for the individuals' tendencies to emit and attract arcs. Since then numerous generalizations of the model have been developed. These include the case of polytomous data (e.g., Wasserman and Iacobucci 1986), the treatment of multiple relationships (Fienberg, Meyer, andWasserman 1981, 1985;, stochastic block models (Holland, Laskey, and Leinhardt 1983;Wang and Wong 1987;Wasserman and Anderson 1987), and Markov models admitting dyad dependence (Frank and Strauss 1986). The article by Wasserman and Iacobucci contains a useful summary.
Parameter estimation for the models assuming dyad independence can be performed with maximum likelihood. As noted by Holland and Leinhardt (1981), however, the relaxation of that assumption makes estimation difficult, and this has inhibited the development of dyad-dependent models. The problem arises because of an intractable normalizing function in the likelihood, which generally makes maximum likelihood estimation impossible. In this article we propose an estimation method that we call maximum * David Strauss is Professor, Department of Statistics, University of California, Riverside, CA 92521. Michael Ikeda is Mathematical Statistician, Statistical Research Division, U.S. Bureau of the Census, Washington, DC 20233. The authors thank Barry Arnold for some helpful discussions, and two referees and an associate editor for detailed comments that have substantially improved the article. pseudolikelihood estimation. The pseudolikelihood function is simply the product of the probabilities of the yij, with each probability conditional on the rest of the data. The method avoids the technical difficulty inherent in the maximum likelihood approach and can be performed with standard statistical packages.
The number of log-linear models available has become large enough that a classification scheme is useful. Here, we consider a classification based on three factors: (a) whether the model assumes dyad independence; (b) whether it is suitable for symmetric arrays (undirected graphs), with yij = yj,, or for asymmetric arrays (digraphs); (c) whether it is a block model, with parameters corresponding to an a priori grouping of individuals. This leads to the eightfold classification shown in Table 1. Models for most of the categories are already familiar, and we introduce some new models for the other cases. All of the models in Table 1can conveniently be fitted with the pseudolikelihood method.
In most of this article we restrict attention to models for a single relationship, represented by a binary array {yij). In Section 2 we begin by specifying the models in terms of the conditional probabilities of the yij, given the rest of the data. This conditional form proves convenient for our classification of the models; it is also the natural way to express the models when pseudolikelihood estimation is used. In Section 3 we develop the method and show its relationship to both a logistic regression procedure and maximum likelihood. It turns out that the first two are equivalent, and they are equivalent to maximum likelihood in the symmetric dyad-independence case. In Section 4 we analyze some well-known data of Sampson (1968); by fitting a variety of models. In data analysis it is often advisable to consider a wide range of possible models, and our aim here is to point out and illustrate several issues that arise in practice.
Strauss a n d Ikeda: Pseudolikelihood Estimation for Social Networks

2.
MODELS Let G be a realization of a g x g random binary array (graph); that is, G = {ylj:i # j; 1 5 i, j 5 g). The dyad Dij is the ordered pair (yi,, y,,). We write G; for the realization G with yij set to 1, G; for the realization with yil set to 0, and Cij (complement) for a specification of {y,: (r, s) # (i, j)). When no confusion arises the subscripts on C will be suppressed. We are concerned with log-linear models of the form where 8 is a vector of parameters and x(G) a corresponding vector of graph statistics. For a symmetric graph the components of x might be the number of lines 2 y,,, the number of triads C C I: yijyjkyk,, and so on. The normalizing function Z(8) is the sum of exp{Qtx(G)) over all 2g(g-l) possible graphs. For graphs with g I6, say, it is feasible to compute Z explicitly, but for large graphs in the dyad-dependent case the Z function is intractable. As a result, maximum likelihood estimation for the dyaddependent models is not available. From (2.1), however, it follows that Pr(y,, = 1 1 C) = Pr(G-)/{Pr(G+) + Pr(G-)), a form that does not involve Z. With the notation logit(t) = log{t/(lt)) we have, more compactly, where Ax,, = x(G+) -x(G-) is the vector of changes in x(G) when yil changes from 1 to 0. We refer to a valid specification (2.2) as a logit model.
The conditional probabilities (2.2) are not necessarily compatible in the sense of being consistent with some joint probability distribution Pr(G). Arnold and Press (1989) gave sufficient conditions for compatibility. In the present context, however, one would normally know Pr(G) in advance or be able to deduce it from inspection of (2.2). It is worth noting that the logit models are identifiable in the sense that the distribution Pr(G) corresponding to (2.2) is unique. This too follows from a result of Arnold and Press (1989). Their method of proof, in our context, is to construct a Markov chain of graphs G, with the conditional probabilities from (2.2) as transition probabilities. The state space is the set of all possible graphs, and the transitions are the possible changes of a single Y,~, either to y,, or 1 -y,,. The pairs ij are taken cyclically In some fixed order. The chain is evidently aperiodic and irreducible, and thus has a long-run distribution that must be the unique distribution consistent with the logit model.
One might ask why the models are expressed in terms of conditional probabilities of the variables y,, rather than the dyads D,, since the latter have been the traditional modeling unit. The reason is that if we define a pseudolikelihood as the product of the conditional likelihoods of dyads rather than yij's, the maximization would no longer be equivalent to a regression procedure that can be performed with standard computing packages. Much of the advantage of the pseudolikelihood approach would thus be lost. The choice between the two conditional specifications is in any event only a matter of notation: As we have seen, there is a one-to-one correspondence between graph models (2.1) and compatible logit models of form (2.2), and a similar result holds for compatible specifications of dyad-conditional probabilities. Thus each logit model is a dyad model, and vice versa.
We now consider the model classification of Table 1and express various cases in logit form. Holland and Leinhardt (1981) defined their p, model by Here rn = 4 2,2, yi1y,,, the number of mutual arcs, y + + is the total number of arcs, and so on. The parameters y , p, {a,), and {p,) relate to overall mutuality, density, expansiveness, and attractiveness, respectively. (Holland and Leinhardt used the symbols p and 8 in place of our y and p.) Side conditions 2 ai = C = 0 ensure that the model To obtain a symmetric form, appropriate for an undirected graph, note that in this case m = y ++ and yi+ = y+,. This suggests that in (2.3) y + p be replaced by a new parameter p, and a, + pibe replaced by ai.The resulting logit model is A side condition for the a, is required for identifiability; the natural choice is C a, = 0. Wasserman and Galaskiewicz (1984), Wang and Wong (1987), and others proposed a directed-graph model that retains the dyad independence but introduces parameters 206 Journal of the American Statistical Association, March 1990 associated with a block structure. In logit form, the model may be written as = P + VYjl + ffi + bj + 2 2 Aklbij.kl, (2.6) k I where the Akl are the block parameters and bij,,, = 1 if (i, j) belongs to block k, 1 = 0 otherwise. (2.7) A block-diagonal form of (2.6) arises when Akl = 0 for k # I. Wang and Wong considered a single block-parameter model, where Akl = 1 if k = 1 and 0 otherwise. A version of (2.6) for undirected graphs is easily constructed as before. It is We now turn to models that do not assume dyad independence. The general class is too complicated to be useful for data analysis, and we restrict ourselves to Markov models (Frank and Strauss 1986). For such models Pr(yjj = 1 I C) depends solely on those y, such that at least one of r and s is equal to i or j. One of the simplest, called the (p, a , z) model by Frank and Strauss, is the undirected-graph model Here R is the number of arcs (cases where y,, = I), S is the number of two-stars (i.e., distinct subscripts i, j, k such that yjjy, = I), and T is the number of triads (i, j, k) such that yijyjkyki = 1. AS before, p is a parameter related to overall density; a and z correspond to clustering and to transitivity of arcs. The logit form is logit Pr( y,, = 1 I C) = p + GAS + TAT, (2.10) where AS is the change in the number of two-stars when yij changes from 1 to 0, for example. This form conveniently displays how the odds of an arc between i and j depend on the neighbors.
Frank and Strauss considered directed Markov graphs as well. The range of possible models then becomes very broad; for example, two-stars may be "outstars" of form {ij, ik) (both arcs originating from the same vertex i), "instars" of form {ji, ki), or of mixed form {ij, jk). The number of each type could be taken as a graph statistic, each with its own parametet. Similarly, there are several triad-count statistics (Frank and Strauss 1986;Holland and Leinhardt 1976) that may be included. In practice one would usually require a model with rather few parameters. We do not attempt to list the possible cases, but instead give some examples in Section 4. Of course, it is possible to create a model combining the Markov property with the expansiveness and attractiveness parameters of the p1model. One simple version that encompasses both (2.4) and (2.10) is where we might take S to be the number of two-stars of form {ij, ik) and T to be the number of {ij, jk, ki) triads, for example. This may be called a ( p , , a , t ) model. A natural question to consider in practice is whether (2.11) gives a substantially better fit than (2.4) or (2.10).
Finally, we propose a class of Markov block models. These may be defined in a way that parallels the block models (2.6). For example, consider a symmetric array {yi,) partitioned into blocks BkI. One simple Markov blockdiagonal model is where the number of two-stars within block k, and Tck)= YhiYijYhj, (2.14) h,i,j the number of transitive triads within block k. Of course, one could make the parameters a k (or the parameters zk) equal, include parameters for off-diagonal blocks, and include pl-type parameters ai, p,, and so on. Equation (2.12) specifies one example of a Markov block model for undirected graphs. It is straightforward to define corresponding Markov block models for directed graphs as well. We see some of these models in the example of Section 4. The logit form of (2.12) is where AS$;) is the increase in the number of two-stars within block k resulting from a change of y,, from 0 to 1, for example. The increase will be 0 unless i and j are both in block k. In that case AS! ) is just (yih + yjh), with the sum over h in block k. For some parameter values the Markov models without blocks are unsuitable for large data sets because of the possibility of degeneracy (Strauss 1986). To see this, consider for simplicity the (p, a , z) model (2.9) with a = 0. It can be shown that if z is positive, then as the number of vertices g tends to infinity the probability that an arbitrarily large proportion of lines are present tends to 1. Loosely speaking, a sufficiently large graph is sure to be almost complete, whatever the value of p, if z is positive.
An analogous situation obtains for the case a > 0, r = 0. This means that for large graphs the model is unlikely to generate data sets displaying more than one cluster of arcs. Nevertheless, if the data display clustering (or cliquing) and if it seems legitimate to take the corresponding block structure as given, then the Markov block models may be a plausible choice. As we shall see, they are no harder to Strauss and Ikeda: Pseudolikelihood Estimation for Social Netwc fit than the dyad-independence block models, and which are preferable for a given data set is largely an empirical question.

ESTIMATION
We have noted that for dyad-dependence models the awkward normalizing constant Z(8) in the likelihood (2.1) generally makes maximum likelihood estimation impossible. The dyad-dependence models are in many respects analogous to interactive models on the rectangular lattice, such as the Ising model; there too an intractable normalizing constant rules out the possibility of maximum likelihood estimation (Strauss 1986). Our proposed estimation method is related to procedures suggested by Besag (1974Besag ( , 1975 for the lattice case. We define the pseudolikelihood function to be and a maximum pseudolikelihood estimator (MPE) to be a value of 8 that maximizes (3.1). Since no conditional probability in (3.1) involves Z(8), PL(B) should be much easier to maximize than the true likelihood (2.1). The MPE generally differs from the maximum likelihood estimator (MLE) except when the conditional probabilities in (3.1) are independent of C . This occurs in the symmetric block model (2.8) and its special case, the symmetric pl mode1 (2.5).
We now show that maximization of (3.1) can be performed by a logistic regression. More formally, we have the following theorem.
Theorem. For a given logit model of form (2.2), maximization of (3.1) is equivalent to a maximum likelihood fit of logistic regression to the mode1 (2.2) for independent observations y,,. It can be implemented as an iteratively reweighted Gauss-Newton least squares procedure.
Proof. It is convenient to replace the indexes (i, j) with a single index r. Let Pr denote the conditional probability of y,, given Cr [as defined by (2.2)], and set Qr = 1 -Pr. Denote the components of 8 by Ok (k = 1, 2, . . .). This can be written as where wr = l/(PrQr), provided that the w, are treated as constants in the differentiation. But this is iteratively reweighted least squares, with the weights wr recomputed at each step from the current value of 8.
This result is (almost) a special case of a more general result on maximum likelihood for the generalized linear model (e.g., see McCullagh and Nelder 1983, sec. 2.5). It shows that maximization of (3.1) can be performed by a standard logistic-regression computer routine, even though the variables y,, in (2.2) are not conditionally independent. We have used the version in the BMDP package throughout our work. To implement it one constructs a set of g(g -1) observations, each consisting of a binary "dependent" variable y,, and a vector of "independent" variables Axij. The latter are given by x(G+)x(G-), as in (2.1), and are easily constructed from the data matrix G. We note that logistic regression had previously been suggested as an ad hoc procedure by Frank and Strauss (1986) in connection with the (p, 0,z) model (2.10).
Maximum likelihood estimation is feasible for all of the dyad-independent models discussed here; the methods of Darroch andRatcliff (1972) andFeinberg et al. (1985), among others, have been used to implement it. Nevertheless, if one wishes to fit both dyad-dependent and dyadindependent models to the same data set, it seems desirable to do so with a single computer package (as can be done with the MPE). In addition, much of the supplementary information automatically supplied by the program will be useful, even though the true model is not a standard logistic regression. For example, the number of yij correctly classified by the regression function with various cut points will usually be of interest. The information on which explanatory variables in the logistic regression contribute most of the predictions of the yij is sometimes useful as well. One caution should be noted, however: The quoted standard errors of the estimated parameters do not apply, because the g(g -1) observations in the regression are certainly not independent.
Since it is difficult to compare the MLE and MPE analytically, we offer some experimental comparisons. For Markovian models MLE is only really feasible when there is a single parameter, in which case a graphical method is available (Strauss 1986). Using a Metropolis-type simulation method given in that article, we generated realizations of the (p, a , z) model with p = z = 0 and various values of a . Results are shown in Table 2. For each value of a there were five replicates, with the mean and root mean squared error computed for the MPE and the MLE. The two methods appear to give estimators that are about equally good. A similar conclusion applies to results in Table 3, which corresponds to p = a = 0 and various values of z. In addition, we performed some experiments with smaller graphs, of size g = 15 and 20. As expected, both methods gave more variable estimates than those for the larger graphs, but the root mean squared errors for the two methods were again quite close in all cases.
For the multiparameter Markov models it becomes difficult to perform systematic comparisons because the MLE is not available. In the case of the dyad-independence models, however, the MPE and MLE may be readily compared, and they have consistently given similar results. A typical case is the Sampson data of Section 4. We estimated the parameters of the p, model with both methods and derived the two sets of unconditional fitted probabilities. For this 18 x 18 data set there were 18 x 17 = 306 pairs. In 254 cases the absolute difference of the fitted probabilities was less than .05, and in 299 cases less than .lo. A similar procedure with the p, block-diagonal model (2.6) gave even closer agreement, with corresponding counts of 292 and 304. The correlation between the two sets of predictors was .935 in the first case and .997 in the second. We found similar agreement between the two sets of parameter estimates (see Sec. 4). The consistency of the MPE and MLE deserves some comment. Consistency is not a meaningful term unless the parameter set remains fixed as the number of vertices g tends to infinity. This condition is not satisfied by the general p, model, for example. The (p, o, z) model (2.9) does satisfy it, but as we have seen, that model is sometimes degenerate. In certain simple cases matters can be resolved satisfactorily. For example, consider, the pl model where all the parameters cui are constrained to be equal and the parameters / I , are constrained to be equal. Then, the common values may both be taken to be 0, and the counts of mutual, asymmetric and null dyads follow a trinomial distribution. It is easy to show that the maximum likelihood estimates of V/ and p in (2.3) are then almost surely consistent. It can be verified directly that the MPE estimates are identical to the MLE's for this case, even though the likelihood and pseudolikelihood functions are not equal. Similar conclusions hold for the block-model extension (2.6) of this simple case.
If we have a random sample of n realizations from a g x g graph model, we can compute an MPE by maximiz- ing the product of the n pseudolikelihoods. The behavior of this estimator as n tends to infinity follows from general results of Arnold and Strauss (1988): The estimator is consistent and asymptotically normal, with asymptotic efficiency given by certain information quantities. Although these results are encouraging for the use of the MPE, note that the case of one realization from a "large" graph is more common than a large random sample of n graphs of a fixed order g. We noted earlier that interactive lattice models share many features of the dyad-dependence models. Thus it is worth remarking that Geman and Graffigne (1987) proved the consistency of the MPE as the lattice becomes large. Parameter identifiability can be an issue in block modeling. For example, suppose that we have a partition of integers I , . . . , g into disjoint subsets, and let B be the set of pairs (ij) such that i and j belong to the same subset. Given a symmetric graph G, let be the number of transitive triads, and let TBbe the number of such triads with ij, jk, and ik belonging to B. Consider the Markov block model which is equivalent to The parameters in (3.5) are certainly identifiable. Nevertheless, suppose that we observe a graph with y,, = 0 whenever (ij) does not belong to B. Then, the columns ATB and AT in (3.6) are identical, and the logistic regression is ill-conditioned. A similar issue would arise in maximum likelihood estimation of (3.5). In applications there should be no problem, because one would not contemplate a model such as (3.5) unless the block structure were already known, and in that case the redundancy of the term zT should be apparent by inspection of the data.

AN EXAMPLE
To illustrate the methods and some special points that arise in applications, we consider the well-known monastery data of Sampson (1968). In Table 4 the 1s indicate some degree of liking between two monks. The partition lines correspond to a by-now traditional block structure for the 18 monks; for example, see Wasserman and Anderson (1987). Table 5 summarizes the fitting of the directed-graph models of Section 2; some explanation of the symbols for the models is given in the footnote. The MPE has been used in all cases. The table shows various summary statistics associated with the models. The maximized pseudolikelihood is one criterion for model comparisons. Even for the Markovian models this can be interpreted as a true likelihood for the logistic regression model (3.2), so [following Holland and Leinhardt (1981) and Wang and Wong (1987)l we might refer twice the difference of the logpseudolikelihood to the x2distribution as an informal test 209 18 Strauss a n d Ikeda: Pseudolikelihood Estimation for Social Networks Table 4. Sampson's (1968) statistic. The sum of absolute residuals 2 ly,, -pi,/is a useful measure of fit as well. The weighted sum of squared residuals, which is the quantity minimized by MPE, is not shown because it seems to provide little additional information. The final column shows the maximum number of correct predictions of the value of y,,, based on the logistic regression with known C,' s and an optimal cut point. This is a standard feature in discriminant analysis, and it is an option with logistic regression in most computer packages. The total number of predictions is 18 x 17 = 306. Predicted probabilities derived from MPE are generally conditional on the C,'s and thus may not agree closely with the unconditional predictions from MLE (except, of course, when the methods are equivalent). Note that the comparisons discussed in the following are based on conditional prediction. Model 1, the one-parameter Bernoulli model, is included to provide a baseline for comparisons; its fit is poor.
Model 2 is a p , model with all a, and j3,set to 0. This leaves just the mutuality and density parameters I, U and p in (2.3). Model 3 is the p,, and according to the informal pseudolikelihood ratio test its improvement in fit over Model 2 is insufficient to justify the additional 34 parameters. (The same conclusion could be reached by comparison of the true log-likelihoods, -133.9 and -118.5.) The inclusion of an additional parameter zc, corresponding to a count of cyclic triads with y,,yjkyki = 1, gives Model 4, which fits only slightly better.
We now turn to the block models, taking the block structure of Table 4 as given a priori. Model 5 (denoted by p l , A) is a one-parameter extension to pl [see (2.6)]. It was previously fitted to this data by Wang and Wong (1987). The relatively high log-pseudolikelihood of -79.9 is close to the true log-likelihood of -82.1. For comparison, the values of {a;) obtained by MLE and MPE are plotted in Figure l a ; the sets of estimators are in good agreement.  The same is true for the estimates of {/I,),shown in Figure yi,yjk.]The ( y , p, A) model, 7, has 34 fewer parameters lb. than Model 5, yet its fit is not much worse. As before, the Model 5 fits the data substantially better than the others informal ,y2 test indicates that inclusion of the {ai) and {/I,) considered so far. An almost identical fit is provided by is unnecessary. the Markov block model, Model 6. [See (2.12); the prime The remaining cases in the table are Markov block modon the clustering parameter a indicates a count within els. The number of potential models is very large, and we blocks only, and the suffix M denotes mixed pairs of form have reported only a few of them. Model 8 is the best-