Functional Data Analysis

With the advance of modern technology, more and more data are being recorded continuously during a time interval or intermittently at several discrete time points. They are both examples of “functional data”, which have become a commonly encountered type of data. Functional Data Analysis (FDA) encompasses the statistical methodology for such data. Broadly interpreted, FDA deals with the analysis and theory of data that are in the form of functions. This paper provides an overview of FDA, starting with simple statistical notions such as mean and covariance functions, then covering some core techniques, the most popular of which is Functional Principal Component Analysis (FPCA). FPCA is an important dimension reduction tool and in sparse data situations can be used to impute functional data that are sparsely observed. Other dimension reduction approaches are also discussed. In addition, we review another core technique, functional linear regression, as well as clustering and classiﬁcation of functional data. Beyond linear and single or multiple index methods we touch upon a few nonlinear approaches that are promising for certain applications. They include additive and other nonlinear functional regression models, and models that feature time warping, manifold learning, and empirical diﬀeren-tial equations. The paper concludes with a brief discussion of future directions.


Introduction
Functional data analysis (FDA) deals with the analysis and theory of data that are in the form of functions, images and shapes, or more general objects.The atom of functional data is a function, where for each subject in a random sample one or several functions are recorded.While the term "functional data analysis" was coined by Ramsay (1982) and Ramsay & Dalzell (1991), the history of this area is much older and dates back to Grenander (1950) and Rao (1958).Functional data are intrinsically infinite dimensional.The high intrinsic dimensionality of these data poses challenges both for theory and computation, where these challenges vary with how the functional data were sampled.On the other hand, the high or infinite dimensional structure of the data is a rich source of information, which brings many opportunities for research and data analysis.
First generation functional data typically consist of a random sample of independent real-valued functions, X1(t), . . ., Xn(t), on a compact interval I = [0, T ] on the real line.Such data have also been termed curve data (Gasser et al. 1984;Rice & Silverman 1991;Gasser & Kneip 1995).These real-valued functions can be viewed as the realizations of a one-dimensional stochastic process, often assumed to be in a Hilbert space, such as L 2 (I).Here a stochastic process X(t) is said to be an L 2 process if and only if it satisfies E( I X 2 (t)dt) < ∞.While it is possible to model functional data with parametric approaches, usually mixed effects nonlinear models, the massive information contained in the infinite dimensional data and the need for a large degree of flexibility, combined with a natural ordering (in time) within a curve datum facilitate non-and semi-parametric approaches, which are the prevailing methods in the literature as well as the focus of this paper.Smoothness of individual random functions (realizations of a stochastic process), such as existence of continuous second derivatives, is often imposed for regularization, and is especially useful if nonparametric smoothing techniques are employed, as is prevalent in functional data analysis In this paper, we focus on first generation functional data with brief a discussion of next generation functional data in Section 6.Here next generation functional data refers to functional data that are part of complex data objects, and possibly are multivariate, correlated, or involve images or shapes.Examples of next generation func-tional data include brain and neuroimaging data.A separate entry on functional data approaches for neuroimaging data is available in this issue of the Annual Reviews (Link to John Aston's contribution).For a brief discussion of next generation functional data, see page 23 of a report (http://www.worldofstatistics.org/wos/pdfs/Statistics&Science-TheLondonWorkshopReport.pdf) of the London workshop on the Future of Statistical Sciences held in November 2013.
Although scientific interest is in the underlying stochastic process and its properties, in reality this process is often latent and cannot be observed directly, as data can only be collected discretely over time, either on a fixed or random time grid.The time grid can be dense, sparse, or neither; and may vary from subject to subject.Originally, functional data were regarded as samples of fully observed trajectories.A slightly more general assumption is that functional data are recorded on the same dense time grid of ordered times t1, . . ., tp for all n subjects.If the recording is done by an instrument, such as an EEG or fMRI recording device, the time grid is usually equally spaced, that is tj − tj−1 = tj+1 − tj for all j.In asymptotic analysis, the spacing tj+1 − tj is assumed to approach zero as n tends to infinity, hence p = pn is a sequence that tends to infinity.While large p leads to a high-dimensional problem, it also means more data are available, so here this is a blessing rather than a curse.This blessing is realized by imposing a smoothness assumption on the L 2 processes, so that information from measurements at neighboring time points can be pooled to overcome the curse of dimensionality.Thus, smoothing serves as a tool for regularization.
While there is no formal definition of "dense" functional data, the convention has been to declare functional data as densely (as opposed to sparsely) sampled when pn converges to infinity fast enough to allow the corresponding estimate for the mean function µ(t) = EX(t), where X is the underlying process, to attain the parametric √ n convergence rate for standard metrics, such as the L 2 norm.Sparse functional data arise in longitudinal studies where subjects are measured at different time points and the number of measurements ni for subject i may be bounded away from infinity, i.e., sup 1≤i≤n ni < C < ∞ for some constant C. A rigorous definition of the types of functional data based on their sampling plans is still lacking.See Zhang & Wang (2014) for a possible approach, with further details in Section 2 below.
In reality, many observed data are contaminated by random noise, referred to as measurement errors, which are often assumed to be independent across and within subjects.Measurement errors can be viewed as random fluctuations around a smooth trajectory, or as actual errors in the measurements.A strength of FDA is that it accommodates measurement errors easily, as for each subject one observes repeated measurements.An interesting, but not surprising, phenomenon in FDA is that the methodology and theory, such as convergence rates, varies with the measurement schedule (sampling plan).
Intriguingly, sparse and irregularly sampled functional data that we synonymously refer to as longitudinal data, such as the CD4 count data for which a Spaghetti plot is shown in Figure 1 for 25 subjects, typically require more effort in theory and methodology as compared to densely sampled functional data, such as the traffic data in Figure 5, which are recorded continuously.For the CD4 count data, a total of 369 subjects were included with the number of measurements ranging from 1 to 12, with median (mean) number of measurements 6 (6.44).This is an example of sparse functional data measured at an irregular and different time schedule for each individual.Functional data that are observed continuously without errors are the easiest type to handle as theory for stochastic processes, Spaghetti plot for sparsely recorded CD4 count data for 25 subjects such as functional laws of large numbers and functional central limit theorems, are readily applicable.A comparison of the various approaches will be presented in Section 2. One of the challenges in functional data analysis is the inverse nature of functional regression and most functional correlation measures.This is triggered by the compactness of the covariance operator, which leads to unbounded inverse operators.This challenge will be discussed further in Section 3, where extensions of classical linear and generalized linear models to functional linear and generalized functional linear models will be reviewed.Since functional data are intrinsically infinite dimensional, dimension reduction is key for data modeling and analysis.The principal component approach will be explored in Section 2 while several approaches for dimension reduction in functional regression will be discussed in Section 3.
Clustering and classification of functional data are useful and important tools in FDA with wide ranging applications.Methods include extensions of classical k-means and hierarchical clustering, Bayesian and model-based approaches to clustering, as well as classification via functional regression based and functional discriminant analysis.These topics will be explored in Section 4. The classical methods for functional data analysis have been predominantly linear, such as functional principal components or the functional linear model.As more and more functional data are being generated, it has emerged that many such data have inherent nonlinear features that make linear methods less effective.Sections 5 reviews some nonlinear approaches to FDA, including time warping, non-linear manifold modeling, and nonlinear differential equations to model the underlying empirical dynamics.
A well-known and well-studied nonlinear effect is time warping, where in addition to the common amplitude variation one also considers time variation.This creates a basic nonidentifiability problem.Section 5.1 will provide a discussion of these foundational issues.A more general approach to model nonlinearity in functional data that extends beyond time warping and includes many other nonlinear features that may be present in longitudinal data is to assume that the functional data lie on a nonlinear (Hilbert) manifold.The starting point for such models is the choice of a suitable distance and ISOMAP (Tenenbaum, De Silva & Langford 2000) or related methods can then be employed to uncover the manifold structure and define functional manifold means and components.These approaches will be described in Section 5.2.Modeling of time-dynamic systems with differential equations that are learned from many realizations of the trajectories of the underlying stochastic process and the learning of nonlinear empirical dynamics such as dynamic regression to the mean or dynamic explosivity is briefly reviewed in Section 5.3.Section 6 concludes this review with a brief outlook on the future of functional data analysis.
Research tools that are useful for handing functional data include various smoothing methods, notably kernel, local least squares and spline smoothing for which various excellent reference books exist (Wand & Jones 1995;Fan & Gijbels 1996;Eubank 1999;de Boor 2001), functional analysis (Conway 1994;Hsing & Eubank 2015) and stochastic processes (Ash & Gardner 1975).Several software packages are publicly available to analyze functional data, including software at the Functional Data Analysis website of James Ramsay (http://www.psych.mcgill.ca/misc/fda/), the fda package on the CRAN project of R (R Core Team 2013) (http://cran.r-project.org/web/packages/fda/fda.pdf), the Matlab package PACE on the website of the Statistics Department of the University of California, Davis (http://www.stat.ucdavis.edu/PACE/),and the R package refund on functional regression (http://cran.r-project.org/web/packages/refund/index.html).
This review is based on a subjective selection of topics in FDA that the authors have worked on or find of particular interest.We do not attempt to provide an objective or comprehensive review of this fast moving field and apologize in advance for any omissions of relevant work.Interested readers can explore the various aspects of this field through several monographs (Bosq 2000;Ramsay & Silverman 2005;Ferraty & Vieu 2006;Wu & Zhang 2006;Ramsay, Hooker & Graves 2009;Horvath & Kokoszka 2012;Hsing & Eubank 2015) and review articles (Rice 2004;Zhao, Marron & Wells 2004;Müller 2005Müller , 2008;;Ferraty & Vieu 2006).Several special journal issues were devoted to FDA including a 2004 issue of Statistica Sinica (issue 3), a 2007 issue in Computational Statistics and Data Analysis (issue 3), and a 2010 issue in Journal of Multivariate analysis (issue 2).

Mean and Covariance Function, and Functional Principal Component Analysis
In this section, we focus on first generation functional data that are realizations of a stochastic process X(•) that is in L 2 and defined on the interval I with mean function µ(t) = E(X(t)) and covariance function Σ(s, t) = cov(X(s), X(t)).The functional framework can also be extended to L 2 processes with multivariate arguments.The realization of the process for the ith subject is Xi = Xi(•), and the sample consists of n independent subjects.For generality, we allow the sampling schedules to vary across subjects and denote the sampling schedule for subject i as ti1, . . ., tin i and the corresponding observations as Xi = (Xi1, . . ., Xin i ), where Xij = Xi(tij).In addition, we allow the measurement of Xij to be contaminated by random noise eij with E(eij) = 0 and var(eij) = σ 2 ij , so the actual observed value is Yij = Xij + eij, where eij are independent across i and j and often termed "measurement errors".
It is often assumed that the errors are homoscedastic with σ 2 ij = σ 2 , but this is is not strictly necessary, as long as σ 2 ij = var(e(tij)) can be regarded as the discretization of a smooth variance function σ 2 (t).We observe that measurement errors are realized only at those time points tij where measurements are being taken.Hence these errors do not form a stochastic process e(t) but rather should be treated as discretized data eij.However, in order to estimate the variance σ 2 ij of eij it is often convenient to assume that there is a latent smooth function σ(t) such that σij = σ 2 (tij).
Estimation of Mean and Covariance Functions.When subjects are sampled at the same time schedule, i.e., tij = tj and ni = m for all i, the observed data are mdimensional multivariate data, so the mean and covariance can be estimated empirically at the measurement times by the sample mean and sample covariance, μ(tj) = 1 n n i=1 Yij, and , for k = l.Data that are missing completely at random (for further details on missingness see Little & Rubin 2014) can be handled easily by adjusting the available sample size at each time point tj for the mean estimate or by adjusting the sample sizes of available pairs at (t k , t l ) for the covariance estimate.An estimate of the mean and covariance functions on the entire interval I can then be obtained by smooth interpolation of the corresponding sample estimates or by mildly smoothing over the grid points.Once we have a smoothed estimate Σ of the covariance function Σ, the variance of the measurement error at time tj can be estimated as σ2 (tj When the sampling schedule of subjects differs, the above sample estimates cannot be obtained.However, one can borrow information from neighboring data and across all subjects to estimate the mean function, provided the sampling design combining all subjects, i.e. {tij : 1 ≤ i ≤ n, 1 ≤ j ≤ ni}, is a dense subset of the interval I. Then a nonparametric smoother, such as a local polynomial estimate (Fan & Gijbels 1996), can be applied to the scatter plot {(tij, Yij) : i = 1, . . ., n, and j = 1, . . ., ni} to smooth Yij over time; this will yield consistent estimates of µ(t) for all t. Figure 2 shows the scatter plot of the pooled CD4 counts for all 369 AIDS patients, together with the estimated mean function based a local linear smoother with bandwidth 0.3 year.The shape of the mean function reveals that CD4 counts were stable around 1,000 six months before seroconversion (time 0) but decline sharply six months before and after seroconversion, and then stabilize again after one year of seroconversion.
Likewise, the covariance can be estimated on I × I by a two-dimensional scatter plot smoother {(t ik , t il ), u ikl : i = 1, . . ., n; k, l = 1, . . ., ni, k = l} to smooth u ikl against (t ik , t il ), where u ikl = (Y ik − μ(t ik ))(Y il − μ(t il )) are the "raw" covariances.We note that the diagonal raw covariances where k = l are removed from the 2D scatter plot prior to the smoothing step because these include an additional term that is due to the variance of the measurement errors in the observed Yij.Indeed, once an estimate Σ for Σ is obtained, the variance σ 2 (t) of the measurement errors can be obtained by smoothing Yij − μ(tij) 2 − Σ(tij) against tij across time.Figure 3 displays the scatter plot of the raw covariances and the smoothed estimate of the covariance surface Σ(•, •) using a local linear bivariate smoother with bandwidth of 1.4 years together with the smoothed estimate of var(Y (t)).The estimated variance of σ 2 (t) is the distance between the estimated var(Y (t)) and the estimated covariance surface.Another estimate for σ 2 under the homoscedasticity assumption is discussed in Yao, Müller & Wang (2005a).
The above smoothing approach is based on a scatter plot smoother which assigns equal weights to each observation, therefore subjects with a larger number of repeated observations receive more total weight, and hence contribute more toward the estimates of the mean and covariance functions.An alternative approach employed in Li & Hsing (2010) is to assign equal weights to each subject.Both approaches are sensible.A question is Pooled CD4 count data for 369 subjects and estimated mean function which one would be preferred for a particular design and whether there is a unified way to deal with these two methods and their theory.These issues were recently explored in a manuscript (Zhang & Wang 2014), employing a general weight function and providing a comprehensive analysis of the asymptotic properties on a unified platform for three types of asymptotics, L 2 and L ∞ (uniform) convergence as well as asymptotic normality of the general weighted estimates.Functional data sampling designs are further partitioned into three categories, non-dense (designs where one cannot attain the √ n rate), dense (where one can attain the √ n rate but with a non-neglible asymptotic bias), and ultra-dense (where one can attain the √ n rate without asymptotic bias).Sparse sampling scenarios where ni is uniformly bounded by a finite constant are a special case of non-dense data and lead to the slowest convergence rates.These designs are also referred to as longitudinal designs.The differences in the convergence rates also have ramifications for the construction of simultaneous confidence bands.For ultra dense or some dense functional data, the weighing scheme that assigns equal weights to subjects is generally more efficient than the scheme that assigns equal weight per observation, but the opposite holds for many other sampling plans, including sparse functional data.
Hypothesis Testing and Simultaneous Confidence Bands for Mean and Covariance Functions.Hypothesis testing for the comparison of mean functions µ is of obvious interest.Fan & Lin (1998) proposed a two-sample test and ANOVA test for the mean functions, with further work by Cuevas, Febrero & Fraiman (2004) and Zhang (2013).Other two sample tests have been proposed for distributions of functional data (Hall & Van Keilegom 2007) and for covariance functions (Panaretos, Kraus & Maddocks 2010;Boente, Rodriguez & Sued 2011).
Another inference problem that has been explored is the construction of simultaneous confidence bands for dense (Degras 2008(Degras , 2011;;Wang & Yang 2009;Cao, Yang & Todem 2012) and sparse (Ma, Yang & Carroll 2012) functional data.However, the problem has Raw covariances (dots) and fitted smooth covariance surface, obtained by omitting the data on the diagonal, where the diagonal forms a ridge due to the measurement errors in the data.The variance of the measurement error σ 2 (t) at time t is the vertical distance between the top of the diagonal ridge and the smoothed covariance surface at time t.
not been completely resolved for functional data, due to two main obstacles: The infinite dimensionality of the data and the nonparametric nature of the target function.For the mean function µ, an interesting "phase transition" phenomenon emerges: For ultra-dense data the estimated mean process √ n(μ(t)−µ(t)) converges to a mean zero Gaussian process W (t), for t ∈ I, so standard continuous mapping leads to a construction of a simultaneous confidence band based on the distribution of sup t W (t). When the functional data are dense but not ultra dense, the process √ n(μ(t) − µ(t)) can still converge to a Gaussian process W (t) with a proper choice of smoothing parameter but W is no longer centered at zero due to the existence of asymptotic bias as discussed in Section 1.
This resembles the classical situation of estimating a regression function, say m(t), based on independent scalar response data, where there is a trade off between the bias and variance, so that optimally smoothed estimates of the regression function will have an asymptotic bias.The conventional approach to construct a pointwise confidence interval is based on the distribution of rn( m(t) − E( m(t))) , where m(t) is an estimate of m(t) that converges at the optimal rate rn.This means that the asymptotic confidence interval derived from it is targeting E( m(t)) rather than the true target m(t) and therefore is not really viable for inference for m(t).
In summary, the construction of simultaneous confidence bands for functional data requires different methods for ultra-dense, dense, and sparse functional data, where in the latter case one does not have tightness and the rescaling approach of Bickel & Rosenblatt (1973) may be applied.The divide between the various sampling designs is perhaps not unexpected since ultra dense functional data essentially follow the paradigm of parametric inference, where the √ n rate of convergence is attained with no asymptotic bias, while dense functional data attains the parametric rate of √ n convergence albeit with an asymptotic bias, which leads to challenges even in the construction of pointwise confidence intervals.Unless the bias is estimated separately, removed from the limiting distribution, and proper asymptotic theory is established, which usually requires regularity conditions for which the estimators are not efficient, the resulting confidence intervals need to be taken with a grain of salt.This issue is specific to the bias-variance trade off that is inherited from nonparametric smoothing.Sparse functional data follow a very different paradigm as they allow no more than nonparametric convergence rates, which are slower than √ n, and the rates depend on the design of the measurement schedule and properties of mean and covariance function as well as the smoother (Zhang & Wang 2014).The phenomenon of nonparametric versus parametric convergence rates as designs get more regular and denser have been characterized as a "phase transition" (Hall, Müller & Wang 2006;Cai & Yuan 2011).
Functional Principal Component Analysis (FPCA).Principal component analysis (Jolliffe 2002) is a key dimension reduction tool for multivariate data that has been extended to functional data and termed functional principal component analysis (FPCA).Although the basic ideas were conceived in Grenander (1950); Karhunen (1946); Loève (1946) and Rao (1958), a more comprehensive framework for statistical inference for FPCA was first developed in a joint Ph.D. thesis of Dauxois and Pousse (1976) at the University of Toulouse (Dauxois, Pousse & Romain 1982).Since then, this approach has taken off to become the most prevalent tool in FDA.This is partly because FPCA facilitates the conversion of inherently infinite-dimensional functional data to a finite-dimensional vector of random scores.Under mild assumptions, the underlying stochastic process can be expressed as a countable sequence of uncorrelated random variables, the functional principal components (FPCs) or scores, which in many practical applications are truncated to a finite vector.Then the tools of multivariate data analysis can be readily applied to the resulting random vector of scores, thus accomplishing the goal of dimension reduction.
Specifically, dimension reduction is achieved through an expansion of the underlying but often not fully observed random trajectories Xi(t) in a functional basis that consists of the eigenfunctions of the (auto)-covariance operator of the process X.With a slight abuse of notation we define the covariance operator as Σ(g) = I Σ(s, t)g(s)ds, for any function g ∈ L 2 , using the same notation for the covariance operator and covariance function.Because of the integral form, the (linear) covariance operator is a trace class and hence compact Hilbert-Schmidt operator (Conway 1994).It has real-valued nonnegative eigenvalues λj, because it is symmetric and non-negative definite.Under mild assumptions, Mercer's theorem implies that the spectral decomposition of Σ leads to Σ(s, t) = ∞ k=1 λ k φ k (s)φ k (t), where the convergence holds uniformly for s and t ∈ I, λ k are the eigenvalues in descending order and φ k the corresponding orthogonal eigenfunctions.Karhunen and Loève (Karhunen 1946;Loève 1946) independently discovered the FPCA expansion where A ik = I (Xi(t) − µ(t))φ k (t)dt are the functional principal components (FPCs) of Xi, sometimes referred to as scores.The A ik are independent across i for a sample of independent trajectories and are uncorrelated across k with E(A ik ) = 0 and var(A ik ) = λ k .
The convergence of the sum in (1) holds uniformly in the sense that sup Expansion (1) facilitates dimension reduction as the first K terms for large enough K provide a good approximation to the infinite sum and therefore for Xi, so that the information contained in Xi is essentially contained in the K-dimensional vector Ai = (Ai1, . . ., AiK ) and one works with the approximated processes (2) Analogous dimension reduction can be achieved by expanding the functional data into other function bases, such as spline, Fourier, or wavelet bases.What distinguishes FPCA is that among all basis expansions that use K components for a fixed K, the FPC expansion explains most of the variation in X in the L 2 sense.When choosing K in an estimation setting, there is a trade off between bias (which gets smaller as K increases, due to the smaller approximation error) and variance (which increases with K as more components must be estimated, adding random error).So a model selection procedure is needed, where typically K = Kn is considered to be a function of sample size n and Kn must tend to infinity to obtain consistency of the representation.This feature distinguishes the theory of FPCA from standard multivariate analysis theory.
The estimation of the eigencomponents (eigenfunctions and eigenvalues) in the FPCA framework is straightforward, once mean and covariance of the functional data have been estimated.To obtain the spectral decomposition of the covariance operator, which yields the eigencomponents, one simply approximates the estimated auto-covariance surface cov(X(s), X(t)) on a grid of time points, thus reducing the problem to the corresponding matrix spectral decomposition.The convergence of the estimated eigen-components is obtained by combining results on the convergence of the covariance estimates that are achieved under regularity conditions with perturbation theory (see Chapter VIII of Kato (1980)).
For situations where the covariance surface cannot be estimated at the √ n rate, the convergence of the estimated eigen-components is typically influenced by the smoothing method that is employed.Consider the sparse case, where the convergence rate of the covariance surface corresponds to the optimal rate at which a smooth two-dimensional surface can be estimated.Intuition suggests that the eigenfunction, which is a one-dimensional function, should be estimable at the one-dimensional optimal rate for smoothing methods.An affirmative answer is provided in Hall, Müller & Wang (2006), where eigenfunction estimates were shown to attain the better (one-dimensional) rate of convergence, if one is undersmoothing the covariance surface estimate.This phenomenon resembles a scenario encountered in semiparametric inference, e.g. for a partially linear model (Heckman 1986), where a √ n rate is attainable for the parametric component if one undersmooths the nonpararmetric component before estimating the parametric component.This undersmoothing can be avoided so that the same smoothing parameter can be employed for both the parametric and nonparametric component if a profile approach (Speckman 1988) is employed to estimate the parametric component.An interesting and still open question is how to construct such a profile approach so that the eigenfunction is the direct target of the estimation procedure, bypassing the estimation of the covariance function.
Another open question is the optimal choice of the number of components K needed for the approximation (2) of the full Karhunen-Loève expansion (1), which gives the best trade-off between bias and variance.There are several ad hoc procedures that are routinely applied in multivariate PCA, such as the scree plot or the fraction of variance explained by the first few PC components, which can be directly extended to the functional setting.Other approaches are pseudo-versions of AIC (Akaike information criterion) and BIC (Bayesian information criterion) (Yao, Müller & Wang 2005a), where typically in practice the latter selects fewer components.Cross-validation with one-curve-leave-out has also been investitgated (Rice & Silverman 1991), but tends to overfit functional data by selecting too large K in (2).A third open question is the optimal choice of the tuning parameters for the smoothing steps in the context of FDA.
The aforementioned FPCA methods are not robust against outliers because principal component analysis involves second order moments.Outliers for functional data have many different facets due to the high dimensionality of these data.They can appear as outlying measurements at a single or several time points, or as an outlying shape of an entire function.Current approaches to deal with outliers and contamination and more generally visual exploration of functional data include exploratory box plots (Hyndman & Shang 2010;Sun & Genton 2011) and robust versions of FPCA (Crambes, Delsol & Laksaci 2008;Gervini 2008;Bali et al. 2011;Kraus & Panaretos 2012;Boente & Salibián-Barrera 2014).Due to the practical importance of this topic, more research on outlier detection and robust FDA approaches is needed.
Applications of FPCA.The FPCA approach motivates the concept of modes of variation for functional data (Jones & Rice 1992), a most useful tool to visualize and describe the variation in the functional data that is contributed by each eigenfunction.The k−th mode of variation is the set of functions that are viewed simultaneously over the range of α, usually for A = 2, substituting estimates for the unknown quantities.Often the eigencomponents and associated modes of variation have compelling and sometimes striking interpretations, such as for the evolution of functional traits (Kirkpatrick & Heckman 1989) and in many other applications (Kneip The first eigenfunction explains 84.41% of the total variation of the data and the second one an additional 12.33% of the data.The remaining two eigenfunctions account for less than 4% of the total variation and are not important.Here the first eigenfunction is nearly constant in time implying that the largest variation between subjects is in the subject specific average magnitude of the CD4 counts, so the random intercept captures the largest variation of the data.The second eigenfunction shows a variation around a piecewise linear time trend with a break point near 2.5 year after seroconversion reflecting that the next largest variation between subjects is a scale difference between subjects along the direction of this piecewise linear function. FPCA also facilitates functional principal component regression by projecting functional predictors to their first few principal components, then employing regression models with vector predictors.Since FPCA is an essential dimension reduction tool, it is also useful for classification and clustering of functional data (see Section 4).
Last but not least, FPCA facilitates the construction of parametric models that will be more parsimonious.For instance, if the first two principal components explain over 90% of the variation of the data then one can approximate the original functional data with only two terms in the Karhunen-Loeve expansion (1).This can be illustrated with the CD4 counts data, for which a parametric mixed effects model with a piecewise linear time trend (constant before -0.5 year and after 1 year of seroconversion, and linear decline in between) for the fixed effects and a random intercept may suffice to capture the major trend of the data.If more precision is preferred, one could include a second random effect for the piecewise linear basis function with a breakpoint at 2.5 years.This underscores the advantages to use a nonparametric approach such as FDA prior to a model-based longitudinal data analysis for data exploration.The exploratory analysis then may suggest viable parametric models that are more parsimonious than FPCA.

Correlation and Regression: Inverse Problems and Dimension Reduction for Functional Data
As mentioned in Section 1, a major challenge in FDA is the inverse problem, which stems from the compactness of the covariance operator Σ that was defined in the previous section, Σ(g) = I Σ(s, t)g(s)ds, for any function g ∈ L 2 .If there are infinitely many nonzero, hence positive eigenvalues, then the covariance operator is a one-to one function and the inverse operator of Σ can be determined but it is an unbounded operator and the range space of the covariance operator is a compact set in L 2 .This creates a problem to define a bijection, as the inverse of Σ is not defined on the entire L 2 space.Therefore regularization is routinely adopted, for any procedure that involves the inverse on a compact operator.
Examples where inverse operators are central include regression and correlation measures for functional data, as Σ −1 appears in these methods.This inverse problem was for example addressed for functional canonical correlation in He, Müller & Wang (2000) and He, Müller & Wang (2003), where a solution was discussed under certain constraints on the decay rate of the eigenvalues and the cross covariance operator.

Functional Correlation
Different functional correlation measures have been discussed in the literature.Functional Canonical Correlation Analysis serves here to demonstrate some of the problems that one encounters in FDA as a consequence of the non-invertibility of compact operators.
Functional Canonical Correlation Analysis (FCCA).Let (X, Y ) be a pair of random functions in L 2 (IX ) and L 2 (IY ) respectively.The first functional canonical correlation coefficient ρ1 and its associated weight functions (u1, v1) are defined as follows, using the notation f1, f2 = I f1(t)f2(t)dt for any f1, f2 ∈ L 2 (I), subject to var( u, X ) = 1 and var( v, Y ) = 1.Analogously for the kth, k > 1, canonical correlation ρ k and its associated weight functions (u k , v k ), subject to var( u, X ) = 1, var( v, Y ) = 1, and that the pair is uncorrelated to all previous pairs (Uj, Vj) = ( uj, X , vj, Y ), for j = 1, . . ., k − 1.Thus, FCCA aims at finding projections in directions u k of X and v k of Y such that their linear combinations (inner products) U k and V k are maximally correlated, resulting in the series of functional canonical components (ρ k , u k , v k , U k , V k ), k ≥ 1, directly extending canonical correlations for multivariate data.Because of the flexibility in the direction u1, which is infinite dimensional, overfitting may occur if the number of sample curves is not large enough.Formally, this is due to the fact that FCCA is an ill-posed problem.Introducing the cross-covariance operator ΣXY : for v ∈ L 2 (IY ) and analogously the covariance operators for X, ΣXX , for Y , ΣY Y , and using cov( u, X , v, Y ) = u, ΣXY Y , the kth canonical component in (4) can be expressed as Then ( 6) is equivalent to an eigenanalysis of the operator Existence of the canonical components is guaranteed if the operator R is compact.However, the inverse of a covariance operator and the inverses of Σ Y Y are not bounded since a covariance operator is compact under the assumption that the covariance function is square integrable.A possible approach (He, Müller & Wang 2003) is to restrict the domain of the inverse to the range AX of Σ 1/2 XX so that the inverse of Σ 1/2 XX can be defined on AX and is a bijective mapping AX to BX , under some conditions (e.g., Conditions 4.1 and 4.5 in He, Müller & Wang (2003)) on the decay rates of the eigenvalues of ΣXX and ΣY Y and the cross-covariance.Under those assumptions the canonical correlations and weight functions are well defined and exist.
An alternative way to get around the above ill-posed problem is to restrict the maximization in ( 3) and ( 4) to discrete l 2 spaces that are restricted to a reproducing kernel Hilbert space instead of working within the entire L 2 space (Eubank & Hsing 2008).In addition to theoretical challenges to overcome the inverse problem, FCCA requires regularization in practical implementations, as only finitely many measurements are available for each subject.If left unregularized, the first canonical correlation will always be one.Unfortunately, the canonical correlations are highly sensitive to the regularization parameter and the first canonical correlation often tends to be too large as there is too much freedom to choose the weights u and v.This makes it difficult to interpret the meaning of the first canonical correlation.The overfitting problem can also be viewed as a consequence of the high-dimensionality of the weight function and was already illustrated in Leurgans, Moyeed & Silverman (1993), who were the first to explore penalized functional canonical correlation analysis.Despite the challenge with overfitting, FCCA can be employed to implement functional regression by using the canonical weight functions u k , and v k as bases to expand the regression coefficient function (He, Müller & Wang 2000;He et al. 2010).
Another difficulty with the versions of FCCA proposed so far is that it requires densely recorded functional data so the inner products in (4) can be evaluated with high accuracy.Although it is possible to impute sparsely observed functional data using the Karhunen-Loève expansion (1) before applying any of the canonical correlations, these imputations are not consistent and this leads to a biased correlation estimation.This bias may be small in practice but finding an effective FCCA for sparsely observed functional data is still of interest and remains an open problem.
Other Functional Correlation Measures.The regularization problems for FCCA have motivated the study of alternative notions of functional correlation.These include singular correlation and singular expansions of paired processes (X, Y ).While the first correlation coefficient in FCCA can be viewed as ρFCCA = sup u = v =1 corr( u, X , v, Y ), observing that it is the correlation that induces the inverse problem, one could simply replace the correlation by covariance, i.e., obtain project functions u1, v1 that attain sup u = v =1 cov( u, X , v, Y ).Functions u1, v1 turn out to be the first pair of the singular basis of the covariance operator of (X, Y ) (Yang, Müller & Stadtmüller 2011).This motivates to define a functional correlation as the first singular correlation Another natural approach that also avoids the inverse problem is to define functional correlation as the cosine of the angle between functions in L 2 .For this notion to be a meaningful measure of alignment of shapes, one first needs to subtract the integrals of the functions, i.e., their projections on the constant function 1, which corresponds to a "static part".Again considering pairs of processes (X, Y ) = (X1, X2) and denoting the projections on the constant function 1 by is the "dynamic part" for each random function.The cosine of the L 2 -angle between the dynamic parts then provides a correlation measure of functional shapes.These ideas can be formalized as follows (Dubin & Müller 2005).Defining standardized curves either by or alternatively by also removing , the cosine of the angle between the standardized functions is The resulting dynamic correlation and other notions of functional correlation can also be extended to obtain a precision matrix for functional data.This approach has been developed by Opgen-Rhein & Strimmer (2006) for the construction of a graphical model for gene time course data.

Functional Regression
Functional regression is an active area of research and the approach depends on whether the responses or covariates are functional or vector data and include combinations of (i) functional responses with functional covariates, (ii) vector responses with functional covariates, and (iii) functional responses with vector covariates.An approach for (i) was introduced by Ramsay & Dalzell (1991) who developed the functional linear model (FLM) (15) for this case, where the basic idea already appears in Grenander (1950), who derives this as the regression of one Gaussian process on another.This model can be viewed as an extension of the traditional multivariate linear model that associates vector responses with vector covariates.The topic that has been investigated most extensively in the literature is scenario (ii) for the case where the responses are scalars and the covariates are functions.Reviews of FLMs are Müller (2005Müller ( , 2011)); Morris (2015).Nonlinear functional regression models will be discussed in Section 5.In the following we give a brief review of the FLM and its variants.
Functional Regression Models with Scalar Response.The traditional linear model with scalar response Y ∈ R and vector covariate X ∈ R p can be expressed as using the inner product in Euclidean vector space, where β0 and β contain the regression coefficients and e is a zero mean finite variance random error (noise).Replacing the vector X in (8) and the coefficient vector β by a centered functional covariate X c = X(t) − µ(t) and coefficient function β = β(t), for t ∈ I, one arrives at the functional linear model which has been studied extensively (Cardot, Ferraty & Sarda 1999, 2003;Hall & Horowitz 2007;Hilgert, Mas & Verzelen 2013).
An ad hoc approach is to expand the covariate X and the coefficient function β in the same functional basis, such as the B-spline basis or eigenbasis in (1).Specifically, consider an orthonormal basis ϕ k , k ≥ 1, of the function space.Then expanding both X and β in this basis leads to and model ( 9) is seen to be equivalent to the traditional linear model ( 8) of the form where in implementations the sum on the r.h.s. is replaced by a finite sum that is truncated at the first K terms, in analogy to (2).
To obtain consistency for the estimation of the parameter function β(t), one selects a sequence K = Kn of eigenfunctions in (10) with Kn → ∞.For the theoretical analysis, the method of sieves (Grenander 1981) can be applied, where the Kth sieve space is defined to be the linear subspace spanned by the first K = Kn components.In addition to the basis-expansion approach, a penalized approach using either P-splines or smoothing splines has also been studied (Cardot, Ferraty & Sarda 2003).For the special case where the basis functions ϕ k are selected as the eigenfunctions φ k of X, the basis representation approach in (8) is equivalent to conducting a principal component regression, albeit with an increasing number of principal components.In this case, however, the basis functions are estimated rather than pre-specified, and this adds an additional twist to the theoretical analysis.
The simple functional linear model ( 9) can be extended to multiple functional covariates X1, . . ., Xp, also including additional vector covariates Z = (Z1, . . ., Zq), where Z1 = 1, by where Ij is the interval where Xj is defined.In theory, these intervals need not be the same.Although model ( 11) is a straightforward extension of (9), its inference is different due to the presence of the parametric component θ.A combined least squares method to estimate θ and βj simultaneously in a one step or profile approach (Hu, Wang & Carroll 2004), where one estimates θ by profiling out the nonparametric components βj, is generally preferred over an alternative back-fitting method.Once the parameter θ has been estimated, any approach that is suitable and consistent for fitting the functional linear model ( 9) can easily be extended to estimate the nonparametric components β k by applying it to the residuals Y − θ, Z .
Extending the linear setting with a single index I X c (t)β(t)dt to summarize each functional covariate, a nonlinear link function g can be added in (9) to create a functional generalized linear model (either within the exponential family or a quasi-likelihood framework and a suitable variance function) This Generalized Functional Linear Model has been considered when g is known (James 2002;Cardot, Ferraty & Sarda 2003;Cardot & Sarda 2005;Wang, Qian & Carroll 2010;Dou, Pollard & Zhou 2012) and when it is unknown (Müller & Stadtmüller 2005;Chen, Hall & Müller 2011).When g is unknown and the variance function plays no role, the special case of a single-index model has further been extended to multiple indices, the number of which is possibly unknown.Such "multiple functional index models" typically forgo the additive error structure imposed in ( 9) -( 12), where g is an unknown multivariate function on R p+1 .This line of research follows the paradigm of sufficient dimension reduction approaches, which was first proposed for vector covariates as an off-shoot of sliced inverse regression (SIR) (Duan & Li 1991;Li 1991), and has been extended to functional data in Ferré & Yao (2003); Ferré & Yao (2005); Cook, Forzani & Yao (2010) and to longitudinal data in Jiang, Yu & Wang (2014).
Functional Regression Models with Functional Response.For a function Y on IY and a single functional covariate X(t), s ∈ IX , two major models have been considered, and where β0(s) and α0(s) are non-random functions that play the role of functional intercepts, and β(s) and α(s, t) are non-random coefficient functions, the functional slopes.Model ( 14) implicitly assumes that IX = IY and is most often referred to as "varyingcoefficient" model.Given s, Y (s) and X(s) follow the traditional linear model, but the covariate effects may change with time s.This model assumes that the value of Y at time s depends only on the current value of X(s) and not the history {X(t) : t ≤ s} or future values, hence it is a "concurrent regression model".A simple and effective approach to estimate β is to first fit model ( 14) locally in a neighborhood of s using ordinary least squares to obtain an initial estimate β(s), and then to smooth these initial estimates β(s) across s to get the final estimate β (Fan & Zhang 1999).In addition to such a two-step procedure, one-step smoothing methods have been also studied (Hoover et al. 1998;Wu & Chiang 2000;Eggermont, Eubank & LaRiccia 2010;Huang, Wu & Zhou 2002), as well as hypothesis testing and confidence bands (Wu, Chiang & Hoover 1998;Huang, Wu & Zhou 2004), with review in Fan & Zhang (2008).More complex varying coefficient models include the nested model in Brumback & Rice (1998), the covariate adjusted model in S ¸entürk & Müller (2005), and the multivariate varying-coefficent model in Zhu, Fan & Kong (2014), among others.
Model ( 15) is generally referred to as functional linear model (FLM), and it differs in crucial aspects from the varying coefficient model (14): At any given time s, the value of Y (s) depends on the entire trajectory of X.It is a direct extension of traditional linear models with multivariate response and vector covariates by changing the inner product from the Euclidean vector space to L 2 .This model also is a direct extension of model ( 9) when the scalar Y is replaced by Y (s) and the coefficient function β varies with s, leading to a bivariate coefficient surface.It was first studied by Ramsay & Dalzell (1991), who proposed a penalized least squares method to estimate the regression coefficient surface α(s, t).When IX = IY , it is often reasonable to assume that only the history of X affects Y , i.e., that α(s, t) = 0 for s < t.This has been referred to as the "historical functional linear model" (Malfait & Ramsay 2003), because only the history of the covariate is used to model the response process.This model deserves more attention.
When X ∈ R p and Y ∈ R q are random vectors, the normal equation of the least squares regression of Y on X is cov(X, Y ) =cov(X, X)β, where β is a p × q matrix.Here a solution can be easily obtained if cov(X, X) is of full rank so its inverse exists.An extension of the normal equation to functional X and Y is straightforward by replacing covariance matrices by their corresponding covariance operators.However, an ill-posed problem emerges for the functional normal equations.Specifically, if for paired processes (X, Y ) the crosscovariance function is rXY (s, t) = cov(X(s), Y (t)) and rXX (s, t) = cov(X(s), X(t)) is the auto-covariance function of X, we define the linear operator, RXX : L 2 × L 2 → L 2 × L 2 by (RXX β)(s, t) = rXX (s, w)β(w, t)dw.Then a "functional normal equation" takes the form (He, Müller & Wang 2000) Since RXX is a compact operator in L 2 , its inverse is not bounded, leading to an ill-posed problem.Regularization is thus needed in analogy to the situation for FCCA described in Section 3.1 (He, Müller & Wang 2003).The functional linear model ( 9) is similarly ill-posed, however not the varying coefficient model ( 14), because the normal equation for the varying-coefficient model can be solved locally at each time point and does not involve inverting an operator.
Due to the ill-posed nature of the functional linear model, the asymptotic behavior of the regression estimators varies in the three design settings.For instance, a √ n rate is attainable under the varying-coefficient model ( 14) for completely observed functional data or dense functional data possibly contaminated with measurement errors, but not for the other two functional linear models ( 9) and ( 15) unless the functional data can be represented by a finite number of basis functions.The convergence rate for (9) depends on how fast the eigenvalues decay to zero and on regularity assumptions for β (Cai & Hall 2006;Hall & Horowitz 2007), even when functional data are observed continuously without error.An interesting phenomenon is that prediction for model (9) follows a different paradigm in which √ n convergence is attainable if the predictor X is sufficiently smooth and the eigenvalues of predictor processes are well behaved (Cai & Hall 2006).Estimation for β and asymptotic theory for model (15) were explored in Yao, Müller & Wang (2005b); He et al. (2010) for sparse functional data.
As with scalar responses, both the varying coefficient model ( 14) and functional linear model (15) can accommodate vector covariates and multiple functional covariates.Since each component of the vector covariate can be treated as a functional covariate with a constant value, we only discuss the extension to multiple functional covariates, X1, . . ., Xp, noting that interaction terms can be added as needed.The only change we need to make on the models is to replace the term β(s)X(s) in ( 14) by p j=1 βj(s)Xj(s) and the term β(s, t)X(t)dt in (15) by p j=1 I X j βj(s, t)Xj(t)dt, where IX j is the domain of Xj.If there are many predictors, a variable selection problem may be encountered, and when using basis expansions it is natural to employ a group lasso or similar constrained multiple variable selection method under sparsity or other suitable assumptions.
Generalized versions can be developed by adding a pre-specified link function g in models ( 14) and ( 15).For the case of the varying coefficient model and sparse functional data this has been investigated in S ¸entürk & Müller (2008) for the generalized varying coefficient model and for model ( 15) and dense functional data in James & Silverman (2005) for a finite number of expansion coefficients for each function.Jiang & Wang (2011) considered a setting where the link function may vary with time but the β in the index does not vary with time.The proposed dimension reduction approach in this paper expands the MAVE method by Xia et al. (2002) to functional data.
Random Effects Models.In addition to targeting fixed effects regression, the nonparametric modeling of random effects is also of interest.Here the random effects are contained in the stochastic part e(t) of ( 14) and ( 15).One approach is to extend the FPCA approach of Section 2 to incorporate covariates (Cardot & Sarda 2006;Jiang, Aston & Wang 2009;Jiang & Wang 2010).These approaches are aiming to incorporate low dimensional projections of covariates to alleviate the curse of dimensionality for nonparametric procedures.One scenario where it is easy to implement covariate adjusted FPCA is the case where one has functional responses and vector covariates.One could conduct a pooled FPCA combining all data as a first step and then to use the FPCA scores obtained from the first stage to model covariate effects through a single-index model at each FPCA component (Chiou, Müller & Wang 2003).At this time, such approaches require dense functional data, as for sparse data individual FPC scores cannot be estimated consistently.

Clustering and classification of functional data
Clustering and classification are useful tools for traditional multivariate data analysis and are equally important yet more challenging in functional data analysis.We take daily vehicle speed trajectories at a fixed location as realizations of random functions as a motivating example for illustrating clusters of vehicle speed patterns.The data were recorded by a dual loop detector station located near Shea-Shan tunnel on National Highway 5 in Taiwan for 76 days during July-September 2009.The vehicle speed measures (km/hour) were averaged over 5-minute intervals.Figure 5 displays the patterns of vehicle speed for two clusters obtained by the k-centers subspace projection method to be described below.As indicated by Figure 6, Cluster 1 characterizes holidays while Cluster 2 pinpoints weekdays, reflecting the traffic patterns at the location.
In the terminology of machine learning, functional data clustering is an unsupervised learning process while functional data classification is a supervised learning procedure.Cluster analysis aims to group a set of data such that data objects within clusters are more similar than across clusters with respect to a metric.In contrast, classification assigns a new data object to a pre-determined group by a discriminant function or classifier.Functional classification typically involves training data containing a functional predictor with an associated multi-class label for each data object.The discrimination procedure of functional classification is closely related to functional cluster analysis, even though the goals are different.When the structures or centers of clusters can be established in functional data clustering, the criteria used for identifying clusters can also be used for classification.Methodology for clustering and classification of functional data has advanced rapidly during the past decades, due to rising demand for such methods in data applications.Given the vast literature on functional clustering and classification, we focus in the following on only a few typical methods.

Clustering of functional data
For vector-valued multivariate data, hierarchical clustering and the k-means partitioning methods are two classical and popular approaches.Hierarchical clustering is an algorithmic approach, using either agglomerative or divisive strategies, that requires a dissimilarity measure between sets of observations, which informs which clusters should be combined or when a cluster should be split.In the k-means clustering method, the basic idea hinges on cluster centers, the means for the clusters.The cluster centers are established through algorithms aiming to partition the observations into k clusters such that the within-cluster sum of squared distances, centering around the means, is minimized.Classical clustering concepts for vector-valued multivariate data can typically be extended to functional data, where various additional considerations arise, such as discrete approximations of distance measures, and dimension reduction of the infinite-dimensional functional data objects.In particular, k-means type clustering algorithms have been widely applied to functional data, and are more popular than hierarchical clustering algorithms.It is natural to view cluster mean functions as the cluster centers in functional clustering.Specifically, for a sample of functional data {Xi(t); i = 1, . . ., n}, the k-means func-tional clustering aims to find a set of cluster centers {µ c ; c = 1, . . ., L}, assuming there are L clusters, by minimizing the sum of the squared distances between {Xi} and the cluster centers that are associated with their cluster labels {Ci; i = 1, . . ., n}, for a suitable functional distance d.That is, the n observations {Xi} are partitioned into L groups such that 1 is minimized over all possible sets of functions {µ c n ; c = 1, . . ., L}, where µ c n (t) = n i=1 Xi(t)1 {C i =c} /Nc, and Nc = n i=1 1 {C i =c} .The distance d is often chosen as the L 2 norm.Since functional data are discretely recorded, frequently contaminated with measurement errors, and can be sparsely or irregularly sampled, a common approach to minimize ( 16) is to project the infinite-dimensional functional data onto a low dimensional space of a set of basis functions, similarly to functional correlation and regression.
There is a vast amount of literature on functional data clustering during the past decade, including methodological development and a broad range of applications.Some selected approaches to be discussed below include k-means type clustering in Section 4.1.1,subspace projected clustering methods in Section 4.1.2and model-based functional clustering approaches in Section 4.1.3.4.1.1.Mean functions as cluster centers.The traditional k-means clustering for vectorvalued multivariate data has been extended to functional data using mean functions as cluster centers, and one can distinguish two typical approaches, as follows.
Functional Clustering via Functional Basis Expansion.As described in Section 2, given a set of pre-specified basis functions {ϕ1, ϕ2, . ..} of the function space, the first K projections {B ik } of the observed trajectories onto the space spanned by the set of basis functions can be used to represent the functional data, where B ik = X c i , ϕ k , k = 1, . . ., K. The distributional patterns of {B ik } then reflect the clusters in function space.Therefore, a typical functional clustering approach via functional basis expansion is to represent the functional data by the set of coefficients in the basis expansion, which requires a judicious choice of the basis functions, and then applying available clustering algorithms for multivariate data, such as the k-means algorithm, to partition the estimated sets of coefficients.When clustering the fitted sets of coefficients {B ik } with the k-means algorithm, one obtains cluster centers { Bc 1 , . . ., Bc K } on the projected space, and thus the set of cluster centers in the function space {μ c ; c = 1, . . ., L}, where μc (t) = K k=1 Bc k ϕ k (t).Such two stage clustering has been adopted in Abraham et al. (2003) using B-spline basis functions and Serban & Wasserman (2005) using Fourier basis functions coupled with the k-means algorithm, as well as Garcia-Escudero & Gordaliza (2005) using B-splines with a robust trimmed k-means method.Abraham et al. (2003) derived the strong consistency property of this clustering method, which has been implemented with various basis functions, such as P splines (Coffey, Hinde & Holian 2014), a Gaussian orthonormalized basis (Kayano, Dozono & Konishi 2010), and the wavelet basis (Giacofci et al. 2013).
Functional Clustering via FPCA.In contrast to the functional basis expansion approach that requires a pre-specified set of basis functions, the finite approximation FPCA approach using the FPCs as described in Section 2 employs data-adaptive basis functions that are determined by the covariance function of the functional data.Then distributions of the sets of FPC scores {A ik } (see ( 2)) indicate different cluster patterns, while the overall mean function µ(t) does not affect clustering, and the scores {A ik } play a similar role as the basis coefficients {B ik } for clustering.Peng & Müller (2008) used a k-means algorithm on the FPC scores, employing a special distance adapted to clustering sparse functional data, and Chiou & Li (2007) used a k-means algorithm on the FPC scores as an initial clustering step for the subspace projected k-centers functional clustering algorithm.When the mean functions as the cluster centers are sufficient to define the clusters, this step is sufficient.However, when covariance structures also play a role to distinguish clusters, taking mean functions as cluster centers is not adequate, as will be discussed in the next subsection.
4.1.2.Subspaces as cluster centers.Since functional data are realizations of random functions, it is natural to use differences in the stochastic structure of random functions for clustering.This idea is particularly sensible in functional data clustering, utilizing the Karhunen-Loève representation in (1).More specifically, the truncated representation (2) of random functions in addition to the mean includes the sum of a series of linear combinations of the eigenfunctions of the covariance operator with the FPC scores as the weights.The subspace spanned by the components of the expansion, the mean function and the set of the eigenfunctions, can be used to characterize clusters.Therefore, clusters of the data set are identified via subspace projection such that cluster centers hinge on the stochastic structure of the random functions, rather than the mean functions only.
The FPC subspace-projected k-centers functional clustering approach was considered in Chiou & Li (2007), using subspaces as cluster centers.Let C be the cluster membership variable, and the FPC subspace S c = {µ c , φ c 1 , . . ., φ c Kc }, c = 1, . . ., L, assuming that there are L clusters.The projected function of Xi onto the FPC subspace S c can be written as One aims to find the set of cluster centers {S c ; c = 1, . . ., L}, such that the best cluster membership of Xi, c * (Xi), is determined by minimizing the discrepancy between the projected function Xc i and the observation Xi, In contrast, k-means clustering aims to find the set of cluster sample means as the cluster centers, instead of the subspaces spanned by {S c ; c = 1, . . ., L}.The initial step of the subspace-projected clustering procedure uses only µ c , which reduces to the k-means functional clustering.In subsequent iteration steps, the mean function and the set of eigenfunctions for each cluster is updated and used to identify the set of cluster subspaces {S c }, until iterations converge.This functional clustering approach simultaneously identifies the structural components of the stochastic representation for each cluster.The idea of the k-centers function clustering via subspace projection was further developed to clustering functional data with similar shapes based on a shape function model with random scaling effects (Chiou & Li 2008).
More generally, in probabilistic clustering the cluster membership of Xi may be determined by maximizing the conditional cluster membership probability given Xi, P C|X (c | Xi), such that c * (Xi) = arg max c∈{1,...,L} This criterion requires modeling of the conditional probability P C|X (• | •).It can be achieved by a generative approach that requires a joint probability model or alternatively through a discriminative approach using, for example, a multi-class logit model (Chiou 2012).
For the k-means type or the k-centers functional clustering algorithms, the number of clusters is pre-specified.The number of clusters for subspace projected functional clustering can be determined by finding the maximum number of clusters while retaining significant differences between pairs of cluster subspaces.Li & Chiou (2011) developed the forward functional testing procedure to identify the total number of clusters under the framework of subspace projected functional data clustering.
4.1.3.Functional clustering with mixture models.Model-based clustering (Banfield & Raftery 1993) based on mixture models is widely used in clustering vector-valued multivariate data and has been extended to functional data clustering.In this approach, the mixture model determines the cluster centers.Similarly to the k-means type of functional data clustering, typical mixture model-based approaches to functional data clustering in a first step project the infinite dimensional functional data onto low-dimensional subspaces.An example is James & Sugar (2003), who applied functional clustering models based on Gaussian mixture distributions to the natural cubic spline basis coefficients, with emphasis on clustering sparsely sampled functional data.Similarly, Jacques & Preda (2013, 2014) applied the idea of Gaussian mixture modeling to FPCA scores.All these methods are based on truncated expansions as in (2).
Random effects modeling also provides a model-based clustering approach, using mixed effects models with B-splines or P-splines, for example to cluster time-course gene expression data (Coffey, Hinde & Holian 2014).For clustering longitudinal data, a linear mixed model for clustering using a penalized normal mixture as random effects distribution has been studied (Heinzl & Tutz 2014).Bayesian hierarchical clustering also plays an important role in the development of model-based functional clustering, typically assuming Gaussian mixture distributions on the sets of basis coefficients fitted to individual trajectories.Dirichlet processes are frequently used as prior for the mixture distributions and also to deal with the uncertainty in the cluster numbers (Angelini, De Canditiis & Pensky 2012;Rodriguez, Dunson & Gelfand 2009;Petrone, Guindani & Gelfand 2009;Heinzl & Tutz 2013).

Classification of functional data
While functional clustering aims at finding clusters by minimizing an objective function such as ( 16) and ( 18), or more generally, by maximizing the conditional probability as in ( 19), functional classification assigns a group membership to a new data object with a discriminant function or a classifier.Popular approaches for functional data classification are based on functional regression models that feature class labels as responses and the observed functional data and other covariates as predictors.This leads to regression based functional data classification methods, for example, functional generalized linear regression models and functional multiclass logit models.Similar to functional data clustering, most functional data classification methods apply a dimension reduction technique using a truncated expansion in a pre-specified function basis or in the data-adaptive eigenbasis.

Functional regression for classification.
For regression-based functional classification models, functional generalized linear models (James 2002;Müller 2005;Müller & Stadtmüller 2005) or more specifically, functional binary regression, such as functional logistic regression, are popular approaches.For a random sample {(Zi, Xi); i = 1, . . ., n}, where Zi represents a class label, Zi ∈ {1, . . ., L} for L classes, associated with functional observations Xi, a classification model for an observation X0 based on functional logistic regression is where γ 0k is an intercept term and γ 1k (t) the coefficient function of the predictor X0(t) and . This is a functional extension of the baseline odds model in multinomial regression (McCullagh & Nelder 1983).
Given a new observation X0, the model-based Bayes classification rule is to choose the class label Z0 with the maximal posterior probability among {P r(Z0 = k | X0); k = 1, . . ., L}.More generally, Leng & Müller (2006) used the generalized functional linear regression model based on the FPCA approach.When the logit link is used in the model, it becomes the functional logistic regression model, several variants of which have been studied (Araki et al. 2009;Matsui, Araki & Konishi 2011;Wang, Ray & Mallick 2007;Zhu, Vannucci & Cox 2010;Rincon & Ruiz-Medina 2012).

Functional discriminant analysis for classification.
In contrast to the regression-based functional classification approach, another popular approach is based on the classical linear discriminant analysis method.The basic idea is to classify according to the largest conditional probability of the class label variable given a new data object by applying the Bayes rule.Suppose that the kth class has prior probability π k , K k=1 π k = 1.Given the density of the kth class, f k , the posterior probability of a new data object X0 is given by the Bayes formula, Developments along these lines include a functional linear discriminant analysis approach to classify curves (James & Hastie 2001), a functional data-analytic approach to signal discrimination, using the FPCA method for dimension reduction (Hall, Poskitt & Presnell 2001) and kernel functional classification rules for nonparametric curve discrimination (Ferraty & Vieu 2003;Chang, Chen & Ogden 2014;Zhu, Brown & Morris 2012).Theoretical support and a notion of "perfect classification" standing for asymptotically vanishing misclassification probabilities has been introduced in Delaigle & Hall (2012) for linear and Delaigle & Hall (2013) for quadratic functional classification.

Nonlinear Methods for Functional Data
Due to the complexity of functional data analysis, which blends stochastic process theory, functional analysis, smoothing and multivariate techniques, most research at this point has focused on linear functional models, such as functional principal components and functional linear regression, which are reviewed in Sections 2 and 3. Perhaps owing to the success of these linear approaches, the development of nonlinear methods has been slower, even though in many situations linear methods are not fully adequate.A case in point is the presence of time variation or time warping that has been observed for many functional data (Kneip & Gasser 1992;Gasser & Kneip 1995).This means that observation time itself is randomly distorted and time variation may constitute the main source of variation for some functional data (Wang & Gasser 1997;Ramsay & Li 1998).Efficient models will then need to reflect the nonlinear features of the data.

Nonlinear Regression Models
The classical functional regression models are linear models, as described in Section 3.2, see equations ( 9), ( 9), ( 11), (15).Direct nonlinear extensions still contain a linear predictor, but combine it with a nonlinear link function, in a similar fashion as the generalized linear model (McCullagh & Nelder 1983).These are the generalized functional linear model and single index models ( 12) and ( 13).From a theoretical perspective, the presence of a nonlinear link functions complicates the analysis of these models, e.g., requiring to decompose such models into a series of p−dimensional approximation models with p → ∞ (Müller & Stadtmüller 2005).
There have been various developments towards fully nonparametric regression models for functional data (Ferraty & Vieu 2006), which lie at the other end of the spectrum in comparison to the functional linear model.These models extend the concept of nonparametric smoothing to the case of predictor functions, where for scalar responses Y one considers functional predictors X, aiming at E(Y | X) = g(X) for a smooth regression function g, for example extending kernel smoothing to this situation.The idea is to replace differences in the usual Euclidean predictor space by a projected pseudo-distance in a functional predictor space, so that the scaled kernel K( x−y h ) with a bandwidth h becomes K( d(x,y) h ), where d is a metric in the predictor space (Ferraty & Vieu 2006).Due to the infinite nature of the predictors, when choosing d as the L 2 distance, such models are subject to a serious form of "curse of dimensionality", as functional predictors are inherently infinite-dimensional.This "curse" is quantifiable in terms of unfavorable small ball probabilities in function space (Delaigle & Hall 2010).What this means is that an appropriate choice of the metric d that avoids the "curse" is essential, and whether such a choice is possible for a given functional data set and how to implement remains an open problem.In some cases, when data are clustered in lower-dimensional manifolds, the rates of convergence pertaining to the lower dimension will apply (Bickel & Li 2007), counteracting the "curse".
More generally, to bypass the "curse" and the metric selection problem, it is of interest to consider nonlinear models, which are subject to some structural constraints that do not overly infringe flexibility.One can aim at models that retain polynomial rates of convergence, while they are more flexible than, say, functional linear models.Such models are particularly useful when diagnostics for the functional linear model indicate lack of fit (Chiou & Müller 2007).An example are generalized functional linear models (12) as well as extensions to single index models (Chen, Hall & Müller 2011) that provide enhanced flexibility and structural stability while model fits converge at polynomial rates.
Additive Models for Functional Data.Beyond single index models, another powerful dimension reduction tool is the additive model (Stone 1985;Hastie & Tibshirani 1986), which also has been extended to functional data (Lin & Zhang 1999;You & Zhou 2007;Carroll et al. 2009;Lai, Huang & Lee 2012).In these models it is generally assumed that the time effect is additive, which is sometimes restrictive.Modeling additive components that are bivariate functions of time and a covariate (Zhang, Park & Wang 2013), this restriction can be avoided.The two-dimensional smoothing needed for the bivariate functions each component may be replaced by one-dimensional smoothing steps, if one further assumes that each of the additive components is the product of an unknown time effect and an unknown covariate effect (Zhang & Wang 2015), which leads to easy interpretation and implementation.
Alternatively, one can utilize the functional principal components A k , as defined in (1), for dimension reduction of the predictor process or processes X, and then assume that the regression relation is additive in these components.While the linear functional regression model with scalar response can be written as E(Y | X) = EY + ∞ k=1 A k β k with an infinite sequence of regression coefficients β k , the functional additive model is where the component functions are required to be smooth and to satisfy E(f k (A k )) = 0 (Müller & Yao 2008;Sood, James & Tellis 2009).This model can be characterized as frequency-additive.A key feature that makes this model not only easy to implement but also accessible to asymptotic analysis, is if the functional principal components A k are assumed to be independent, as would be the case for Gaussian predictor processes, where µY = EY .This implies that simple onedimensional smoothing of the responses against the FPCs leads to consistent estimates of the component functions f k (Müller & Yao 2008), so that the usual backfitting that is normally required for additive modeling is not needed.For the functional linear model ( 9), already uncorrelatedness of the FPCs of the predictor processes suffices for the representation E(Y − µY | A k ) = β k A k , motivating to decompose functional linear regression into a series of simple linear regressions (Chiou & Müller 2007;Müller et al. 2009).
Projections on a finite number of directions for each of potentially many predictor functions provide an alternative additive approach that is of practical interest when the projections are formed by taking into consideration the relation between X and Y , in contrast to other functional regression models, where the predictors are formed merely based on the auto-covariance structure of predictor processes X (James & Silverman 2005;Chen, Hall & Müller 2011;Fan et al. 2014).
Still other forms of additive models have been considered for functional data.While model ( 22) can be characterized as frequency-additive, as it is additive in the FPCs, one may ask the question whether there are time-additive models.It is immediately clear that since the number of time points on an interval domain is uncountable, an unrestricted timeadditive model E(Y | X) = t∈[0,T ] ft(X(t)) is not feasible.Addressing this conundrum by assuming that the functions ft are smoothly varying in t and considering a sequence of time-additive models on increasingly dense finite grids of size p leads to the sequence where fj(x) = g(tj, x) for a smooth bivariate function g.In the limit p → ∞ this becomes the continuously additive model (Müller, Wu & Yao 2013) (24) This model can be implemented with a bivariate spline representation of the function g; a very similar model was introduced under the name functional generalized additive model in McLean et al. (2014).Nonlinear or linear models, where individual predictor times are better predictors than functional principal components, i.e., regression models with time-based rather than frequency-based predictors, can be viewed as special cases of the continuously additive model ( 24), where only a few time points and their associated additive functions fj(X(tj)) are assumed to be predictive (Ferraty, Hall & Vieu 2010).
Optimization and Gradients With Functional Predictors.In some applications one aims to maximize the response E(Y | X) in terms of the predictor function X.An example is the evolution of reproductive trajectories X in medflies, measured in terms of daily egg-laying, where evolution may work to maximize a desirable outcome such as lifetime reproductive success Y , as this conveys an evolutionary advantage.While enhanced egg-laying at all ages enhances lifetime reproductive success, measured as total number of eggs produced during the lifetime of a fly, it also promotes mortality, through the "cost of reproduction".However, shorter lifespan implies reduced total number of eggs.The optimal egg-laying trajectory is therefore not simply the maximal egg-laying possible at all ages but a complex trade-off between maximizing daily egg-laying and the cost of reproduction in terms of mortality (Müller et al. 2001).
To address the corresponding functional maximization problem, gradients with respect to functional predictors X are of interest (Hall, Müller & Yao 2009).Extending the functional additive model ( 22), one can introduce additive gradient operators with arguments in L 2 at each predictor level X ≡ {A1, A2, . ..},

Γ
(1) These additive gradient operators serve to find directions in which responses increase, thus enabling a maximal descent algorithm in function space (Müller & Yao 2010a).
Polynomial Functional Regression.Finally, just as the common linear model can be embedded in a more general polynomial version, a polynomial functional model that extends the functional linear model has been proposed (Yao & Müller 2010), with quadratic functional regression as the most prominent special case.With centered predictor processes X c , this model can be written as and in addition to the parameter function β that it shares with the functional linear model it also features a parameter surface γ.The extension to higher order polynomials is obvious.These models can be equivalently represented as polynomials in the corresponding FPCs.
A natural question is whether the linear model is sufficient or needs to be extended to a model that includes a quadratic term.A corresponding test was developed by Horváth & Reeder (2013).

Time Warping, Dynamics and Manifold Learning for Functional Data
In addition to amplitude variation, many functional data are best described by assuming that additional time variation is present, i.e, the time axis is distorted by a smooth random process.A classical example are growth data.In human growth, the biological age of different children varies and this variation has a direct bearing on the growth rate that follows a general shape, but with subject-specific timing, which manifests itself for example in the subject-specific timing of the two major growth spurts, the pubertal and the prepubertal growth spurt (Gasser et al. 1984).
Time Variation and Curve Registration.If both amplitude and time variation are present in functional data, they cannot be separately identified, so additional assumptions that break the non-identifiability are crucial if one wishes to identify and separate these two components, which jointly generate the observed variation in the data.An example when time warping is identifiable is briefly discussed in the next section.In the presence of time warping, which is also known as curve registration or curve alignment problem, crosssectional mean functions are inefficient and uninterpretable, because if important features such as peak locations randomly vary from curve to curve, ignoring the differences in timing when taking a cross-sectional mean will distort these features.Then the mean curve will not resemble any of the sample curves and is not useful as a representative of the sample of curves (Ramsay & Li 1998).
Early approaches to time-warped functional data included dynamic time warping (Sakoe & Chiba 1978;Wang & Gasser 1997) for the registration of speech and self-modeling nonlinear regression (Lawton & Sylvestre 1971;Kneip & Gasser 1988), where in the simplest case one assumes that the observed random functions can be modeled as shift-scale family of an unknown template function, where shift and scale are subject-specific random variables.A traditional method for time-warped functional data is the landmark method.Special features such as peak locations in functions or derivatives are aligned to their average location and then smooth transformations from the average location to the location of the feature for a specific subject are introduced (Kneip & Gasser 1992;Gasser & Kneip 1995).If well-expressed features are present in all sample curves, the landmark method serves as a gold standard for curve alignment.A problem is that landmarks may be missing in some sample functions or may be hard to identify due to noise in the data.
The mapping of latent bivariate time warping and amplitude processes into random functions has been studied systematically, leading to the definition of the mean curve as the function that corresponds to the bivariate Fréchet mean of both time warping and amplitude processes (Liu & Müller 2004), which can be exemplified with relative area-under-the curve warping, where the latter has been shown to be particularly well suited for samples of random density functions (Kneip & Utikal 2001;Zhang & Müller 2011) Recent approaches include registration by minimizing a Fisher-Rao metric (Tucker, Wu & Srivastava 2013;Wu & Srivastava 2014), alignment of event data by dynamic time warping (Arribas-Gil & Müller 2014), and joint models for amplitude and time variation or for combinations of regression and time variation (Kneip & Ramsay 2008), where adopting a joint perspective leads to better interpretability of time warping models for spoken language (Hadjipantelis et al. 2015) or better performance of functional regression in the presence of warping (Gervini 2015).
Pairwise Warping.As a specific example of a warping approach, we discuss a pairwise warping approach that is based on the idea that all relevant information about time warping resides in pairwise comparisons and the resulting pairwise relative time warps (Tang & Müller 2008).Starting with a sample of n i.i.d.smooth observed curves Y1, Y2, ..., Yn (with suitable modifications for situations where the curves are not directly observed but only noisy measurements of the curves at a grid of discrete time points are available) we postulate that where the Xi are i.i.d.random functions that represent amplitude variation and the hi are the realizations of a time warping process h that yields warping functions that represent time variation, are strictly monotone and invertible and satisfy hi(0) = 0, hi(T ) = T.The time warping functions map time onto warped time.Traditionally, time is assumed to flow forward and therefore warping functions are strictly monotone increasing.However, a recent time warping approach that allows time to flow backwards has been shown to be useful for modeling declines in house prices as time reversals (Peng, Paul & Müller 2014).
To break the non-identifiability, which is a characteristic of time warping models as already mentioned, Tang & Müller (2008) make the assumptions that the overall curve variation is (at least asymptotically) dominated by time variation, i.e., Xi(t) = µ(t) + δZi(t), where δ vanishes for increasing sample size n, the Zi are realizations of a smooth square integrable process and E{h(t)} = t, for t ∈ [0, 1].Then warping functions may be represented in a suitable basis that ensures monotonicity and has associated random coefficients in the expansion, for example monotonically restricted piecewise linear functions.If curve Yi has the associated time warping function hi then the warping function g ik that transforms the time scale of curve Yi towards that of Y k is g ik (t) = hi{h −1 k (t)}, and analogously, the pairwise-warping function of curve Y k towards Yi is g ki (t) = h k {h −1 i (t)}.Because warping functions are assumed to have average identify, , and, as g ik (t) = hi{h −1 k (t)}, we find that h −1 k (t) = E{g ik (t) h k }, which motivates corresponding estimators by plugging in estimates of the pairwise warping functions.This shows that under certain regularity assumptions the relevant warping information is indeed contained in the pairwise time warpings.Functional Manifold Learning.A comprehensive approach to time warping and other nonlinear features of functional data such as scale or scale-shift families that simultaneously handles amplitude and time warping features is available through manifold learning.A motivation for the use of functional manifold models is that image data that are dominated by random domain shifts lie on a manifold (Donoho & Grimes 2005).Similar warping models where the warping component corresponds to a random time shift have been studied for functional data (Silverman 1995;Leng & Müller 2006).Such data have low-dimensional representations in a transformed space but are infinite-dimensional in the traditional functional basis expansion including the eigenbasis expansion (1).While these expansions will always converge in L 2 under minimal conditions, they are inefficient in comparison with representations that take advantage of the manifold structure.
When functional data include time warping or otherwise lie on a nonlinear lowdimensional manifold that is situated within the ambient infinite-dimensional functional Hilbert space, desirable low-dimensional representations can be obtained through manifold learning; the resulting nonlinear representations are particularly useful for subsequent statistical analysis.Once a map from an underlying low-dimensional Euclidean space into functional space has been determined, this gives the desired manifold representation.Among the various nonlinear dimension reduction methods that employ manifold learning (Roweis & Saul 2000), Isomap (Tenenbaum, De Silva & Langford 2000) can be easily implemented and has been shown to be a useful and versatile method for functional data analysis.Specifically, a modified Isomap learning algorithm that adds a penalty to the empirical geodesic distances to correct for noisy data, and employs kernel smoothing to map data from the manifold into functional space provides a flexible and broadly applicable approach to lowdimensional manifold modeling of time-warped functional data (Chen & Müller 2012).This approach targets "simple" functional manifolds M in L 2 that are "flat", i.e., isomorphic to a subspace of Euclidean space, such as a Hilbert space version of the "Swiss Roll".An essential input for Isomap is the distance between functional data.The default distance is the L 2 distance in function space, but this distance is not always feasible, for example when functional data are only sparsely sampled.In such cases, the L 2 distance needs to be replaced by a distance that adjusts to sparsity (Peng & Müller 2008).
The manifold M is characterized by a coordinate map ϕ : R d → M ⊂ L 2 , such that ϕ is bijective, and both ϕ, ϕ −1 are continuous and isometric.For a random function X the mean µ in the d-dimensional representation space and the manifold mean µ M in the functional L 2 space are characterized by µ = E{ϕ −1 (X)}, µ M = ϕ −1 (µ).
The isometry of the map ϕ implies that the manifold mean µ M is uniquely defined.
In addition to obtaining a mean, a second basic task in FDA is to quantify variation.In analogy to the modes of variation that are available through eigenfunctions and FPCA (Castro, Lawton & Sylvestre 1986;Jones & Rice 1992), one can define manifold modes of variation X M j,α = ϕ µ + α(λ M j ) 1 2 e M where •, • is the inner product in R d and ϑj are uncorrelated r.v.s with mean 0 and variance λ M j , the functional manifold components (Chen & Müller 2012).This representation is a genuine dimension reduction of the functional data to the finite dimension d, while for functional data that lie on a nonlinear manifold the Karhunen-Loève representation usually will require a large number of components to provide a good approximation.
In practical applications, the presence of functional manifolds is often evident when plotting the second functional principal components (FPCs) versus the first FPCs, where curved shapes such as "horseshoes" will appear and typically there are few data in the vicinity of the mean (which is always 0 for all principal components, which are uncorrelated).An example is in the right upper panel of Fig. 7, where the functional principal components falling near the mean have low density, while they cluster around the periphery.
This example is based on a small sample (n = 20) of house price index curves showing the boom-bust cycle 2000-2009.These data were previously discussed and modeled in Peng, Paul & Müller (2014) and are displayed in the left upper panel of Fig. 7.The curves show mostly amplitude variation with some variation in the timing of the peaks, many of which have a bimodal appearance, indicating that when the boom cycle peaked, there was a first possible in the case of sparsely sampled data, where direct differentiation of trajectories would not be possible (Liu & Müller 2009).
(29) This is a linear differential equation with a time-varying function β(t) and a drift process Z.Here Z is a Gaussian process such that Z(t), X(t) are independent at each t.If Z is relatively small, the equation is dominated by the linear part and the function β.Then the behavior of β characterizes different dynamics, where one can distinguish dynamic regression to the mean for those t where β(t) < 0 and explosive behavior for those t where β(t) > 0.
In the first case, deviations of X(t) from the mean function µ(t) will diminish, while in the second case they will increase: An individual with a value X(t) above the mean will tend to move even higher above the mean under the explosive regimen but will move closer to the mean under dynamic regression to the mean.Thus the function β that is estimated from the observed functional data can be used to characterize the underlying empirical dynamics.
A nonlinear version of dynamics learning can be developed for the case of non-Gaussian processes (Verzelen, Tao & Müller 2012).This is of interest whenever linear dynamics is not applicable, and is based on the fact that one always has a function f such that with E{Z(t) | X(t)} = 0 almost surely.Generally the function f will be unknown.It can be consistently estimated from the observed functional data by nonparametrically regressing derivatives X (1) against levels X and time t.This can be implemented with simple smoothing methods.The dynamics of the processes is then jointly determined by the function f and the drift process Z. Nonlinear dynamics learning is of interest to understand the characteristics of the underlying stochastic system and can also be used to determine whether individual trajectories are "on track", for example in applications to growth curves.

Outlook and Future Perspectives
FDA has widened its scope from a relatively narrow focus on the analysis of samples of fully observed functions to much wider applicability.An example is longitudinal data analysis, where FDA provides a rich nonparametric methodology for a field that has been dominated by parametric random effects models for a long time.Also of special interest are recent developments in the interface of high-dimensional and functional data.These include: Combining functional elements with high-dimensional covariates, such as modeling predictor times that exercise an individual predictor effect on an outcome that goes beyond the functional linear model (Kneip & Sarda 2011), predictor selection among high-dimensional functional principal component scores and baseline covariates in functional regression models (Kong et al. 2015), or converting high-dimensional data outright into functional data, where the latter has been referred to as Stringing (Wu & Müller 2010;Chen et al. 2011) and is based on a uni-or multi-dimensional scaling step to order predictors along locations on an interval or low-dimensional domain.The stringing method then assigns the value of the respective predictor to the location of the predictor on the interval, for all predictors.The distance of the predictor locations on the interval matches as closely as possible a distance measure between predictors that can be derived from correlations.Combining locations and predictor values and potentially also adding a smoothing step then converts the high-dimensional data for each subject or item to a random function.These functions can be summarized through their functional principal components, leading to an effective dimension reduction that is not based on sparsity and that works well for strongly correlated predictors.Many recent developments in FDA have not been covered in this review.These include functional designs and domain selection problems and also dependent functional data such as functional time series, with many recent interesting developments, e.g.Panaretos & Tavakoli (2013).Another area that has gained recent interest are multivariate functional data.Similarly, in some longitudinal studies one observes for each subject repeatedly observed and therefore dependent functional data rather than scalars.There is also recently rising interest in spatially indexed functional data.These problems pose novel challenges for data analysis (Horvath & Kokoszka 2012).
While this review has focused on concepts and not on applications, as for other growing statistical areas, a driving force of recent developments in FDA has been the appearance of new types of data that require adequate methodology for their analysis.This is leading to the current development of "second generation" functional data that include more complex features than the first generation functional data that have been the emphasis of this review.Examples of recent applications include continuous tracking and monitoring of movement and health data, data that are recorded continuously over time by arrays of sensors, such as traffic flow data, continuously recorded climate and weather data, transcription factor count modeling along the genome, and the analysis of auction data, volatility and other financial data with functional methods. Figure1 Figure 2 Figure 3

Figure 4
Figure 4First four eigenfunctions for CD4 data

Figure 5
Figure 5Observations superimposed on the estimated mean functions of daily vehicle speed recorded by a dual loop vehicle detector station for holidays (left panel for Cluster 1) and non-holidays (middle panel for Cluster 2), and the estimated cluster-specific mean functions (right panel) for comparison.