\ifnum\pageno=1 \input macros.tex \pageno=200\immediate\openout\inx=chap2.index\makecontents\draft{\date (Rob)} \fi %\proofmodefalse \def\gsmooth{{s}} \def\vgsmooth{{\vec s}} \def\symNN{N^S} \def\NN{N} \def\pepscat{1.1} \def\pepadd{1.2} \def\X{\bf X} \chapter 2 {Smoothing} \medskip \Section{What is a smoother?} A smoother is a tool for summarizing the dependence of a response measurement $Y$ on one or more predictor measurements $X_1,\ldots, X_p$. It produces an estimate of this dependence that is less variable than $Y$ itself; hence the name {\sl smoother}. An important property of a smoother is its {\sl nonparametric} nature; that is, it doesn't assume a rigid form for the dependence. For this reason, a smoother is often referred to as a tool for nonparametric regression. The running mean (or moving average) is a simple example of a smoother, while a regression line is not thought of as a smoother because of its rigid parametric form. \input spansizes.tex \par\figname{\allsmooths} We call the estimate produced by a smoother a {\sl smooth}. The most important example of smoothing involves a single predictor setting, usually referred to as {\sl scatterplot smoothing}. Smoothers have two main uses. The first use is for description. For example, a scatterplot smoother can be used to enhance the visual appearance of the scatterplot of $Y$ vs $X$, to help our eyes pick out the trend in the plot. Figure~\pepscat\ in Chapter~1 shows a plot of \name{log(C-peptide)} versus \name{age} for the diabetes data. It seems that \name{log(C-peptide)} has a strong dependence on \name{age} and a scatterplot smoother will be helpful in describing this relationship. Figure~\allsmooths\ shows a number of scatterplot smooths of these data; each described in turn in this chapter. The second use of a smoother is as an estimator of the dependence of $\ev(Y)$ on the predictors, and thus as a building block for the estimation of additive models, discussed in the remainder of the book. 
In this chapter we give a brief overview of some useful smoothers. Most of our discussion concerns scatterplot smoothers; at the end of the chapter we give a brief description of multiple-predictor smoothers. The simplest smoother occurs in the case of a {\sl categorical} predictor, for example, \name{sex} (male, female) or \name{colour} (red, blue, green, etc.). To smooth $Y$ we can simply average the values of $Y$ in each category. This satisfies our requirements for a scatterplot smooth: it captures the trend of $Y$ on $X$ and is smoother than the $Y$-values themselves. While the reader might not normally think of this as {\sl smoothing}, this simple averaging process is the conceptual basis for smoothing in the most general setting, that of an ordered (noncategorical) predictor. The problem here is often the lack of replicates at each predictor value. Most smoothers attempt to mimic category averaging through {\sl local averaging}, that is, averaging the $Y$-values of observations having predictor values close to a target value. The averaging is done in {\sl neighbourhoods} around the target value. There are two main decisions to be made in scatterplot smoothing:\smallskip {\parindent 20pt \item{(i)} how to average the response values in each neighbourhood, and \item{(ii)} how big to take the neighbourhoods. }\smallskip \input allsmooths.tex \par The question of how to average within a neighbourhood is really the question of which {\sl brand} of smoother to use, because smoothers differ mainly in their method of averaging. In this chapter we describe a number of different smoothers and compare them informally. Formal recommendations on how to choose among smoothers are difficult to make because few systematic comparisons have been made so far in the literature. The question of how big to make the neighbourhoods is discussed in the next chapter. The issue underlying this question is very important, however, so we briefly discuss it here.
The size of the neighbourhood is typically expressed in terms of an adjustable {\sl smoothing parameter}. Intuitively, large neighbourhoods will produce an estimate with low variance but potentially high bias, and conversely for small neighbourhoods (Fig.~\spansizes). Thus there is a {\sl fundamental tradeoff between bias and variance}, governed by the smoothing parameter. This issue is exactly analogous to the question of how many predictors to put in a regression equation. In the next chapter we discuss this important issue and the practical question of how to choose the smoothing parameter, based on the data, to trade bias against variance in an optimal way. We also discuss some other important topics, such as linear and nonlinear smoothers, the incorporation of observation weights into --- and inference for --- the fitted smooth. All the smoothers we shall describe can easily be {\em robustified} by replacing the averaging or least-squares operation by a more robust procedure. Details can be found in Chapters~3 and 9. Bibliographic notes on smoothers are given at the end of the next chapter. \Sectionskip \Section{Scatterplot smoothing: definition} \Mark{SCATTERPLOT SMOOTHING} Suppose we have response measurements $\vec y=(y_1,\ldots,y_n)^T$ at design points $\vec x=(x_1,\ldots, x_n)^T$. We assume that each of $\vec y$ and $\vec x$ represents measurements of variables $Y$ and $X$. In most cases it is useful to think of $Y$, and sometimes $X$, as having been generated by some random mechanism, but this is not necessary for the discussion here. In particular we don't need to assume that the pairs $(x_i,y_i)$ are a random sample from some joint distribution. For example, the $X$-values might be preset dose levels of a drug. Since $Y$ and $X$ are noncategorical we don't expect to find many replicates at any given value of $X$.
For convenience we assume that the data are sorted by $X$ and for the present discussion that there are no tied $X$-values, so that $x_1 < \cdots < x_n$. A trivial remedy when there are ties is to use weighted smoothers, which we discuss in Chapter~3. We denote a scatterplot smooth of $\vec y$ against $\vec x$ by $\gsmooth=\smooth(\vec y\given \vec x)$, where $\gsmooth=\gsmooth(x)$ is a function of $X$. Usually the recipe that defines $\gsmooth(x_0)$, which is the function $\smooth(\vec y\given \vec x)$ evaluated at $x_0$, will be defined for all $x_0$, but at other times it is defined only at $x_1,\ldots,x_n$, the sample values of $X$. In this latter case some kind of interpolation is necessary in order to obtain estimates at other $X$-values. \Sectionskip \Section{Parametric regression} A regression line, estimated for example by least-squares, provides an estimate of the dependence of $\ev(Y)$ on $X$. It does so, however, by assuming a rigid form for this dependence, and thus it may or may not be appropriate for a given set of data. If the dependence is linear or close to it, the regression line provides a concise and useful summary. For the data in Fig.~\allsmooths, however, the regression line is clearly inappropriate and creates a misleading impression. In a sense the regression line is an {\sl infinitely smooth} function, and not surprisingly many of the scatterplot smoothers that we discuss approach the linear regression line as one extreme. The other extreme is usually some function that interpolates the data. Other nonlocal parametric fits, such as polynomial regression estimates, share the same pros and cons as the regression line. They are useful if they are appropriate for the data at hand but potentially misleading otherwise. Although parametric fitting certainly isn't the solution to the scatterplot smoothing problem, we will still make use of it for comparative purposes and for possible summaries of smooths estimated by other methods.
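For reference, these parametric fits are one-line computations. The following sketch (assuming numpy, with hypothetical toy data standing in for a scatterplot) fits the least-squares line and a global cubic; neither is local, which is the point being made above:

```python
import numpy as np

# Hypothetical (x, y) data standing in for a scatterplot.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least-squares regression line: an "infinitely smooth" estimate of E(Y|X).
b, a = np.polyfit(x, y, 1)          # slope b, intercept a
line = a + b * x

# A global cubic polynomial: less rigid, but still a nonlocal parametric fit.
cubic = np.polyval(np.polyfit(x, y, 3), x)
```

Being nested models fit by least squares, the cubic can only decrease the residual sum of squares relative to the line; it cannot adapt locally the way the smoothers below do.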
Regression splines are a less rigid form of parametric fitting which are closer in spirit to a smoother; we discuss them later in this chapter. \Sectionskip \Section{Bin smoothers} A bin smoother mimics a categorical smoother by partitioning the predictor values into a number of disjoint and exhaustive regions, then averaging the response in each region. Formally, we choose cutpoints $c_0 < \cdots < c_K$ where $c_0=-\infty$ and $c_K=\infty$, and define $$R_i=\{j;c_{i-1}\leq x_j < c_i\};\qquad i=1,\ldots, K$$ the indices of the data points in each region. Then $\gsmooth=\smooth(\vec y\given \vec x)$ is given by $\gsmooth(x_0)=\ave_{j\in R_i} (y_j)$ if $x_0 \in R_i$. Typically one chooses five regions, for example, and picks the cutoff points so that there is approximately an equal number of points in each region. The bin smooth shown in Fig.~\allsmooths\ was constructed in this way and illustrates the limitation of the method. The estimate is not very smooth because it jumps at each cut point. Unless we have a prior reason to believe such discontinuities exist, this estimate is not very appealing. One way to make it more smooth is through the use of overlapping regions, as we'll see in the next section. \Sectionskip \Section{Running-mean and running-lines smoothers} Assume that our target value $x_0$ equals one of the $x_j$s, say $x_i$. If we had replicates at $x_i$, we could simply use the average of the $Y$-values at $x_i$ as our estimate $\gsmooth(x_i)$. Now we are assuming that we don't have replicates, so instead we can average $Y$-values corresponding to $X$-values close to $x_i$. How do we pick points that are {\sl close} to $x_i$? A simple way is to choose $x_i$ itself, as well as the $k$ points to the left and $k$ points to the right of $x_i$ that are closest in $X$-value to $x_i$. This is called a {\sl symmetric nearest neighbourhood} and the indices of these points will be denoted by $\symNN(x_i)$.
Then we could define the {\sl running mean} $$\gsmooth(x_i)=\ave_{j\in \symNN(x_i)} (y_j).\eqn{\runm}$$ If it is not possible to take $k$ points to the left or right of $x_i$, we take as many as we can. A formal definition of a symmetric nearest neighbourhood is $$\symNN(x_i)= \{ \max(i-k ,1),\ldots,i-1,i,i+1,\ldots,\min(i+ k ,n)\}.\eqn{\neigh}$$ It is not obvious how to define the symmetric nearest neighbours at target points $x_0$ other than the $x_i$ in the sample. We could simply interpolate linearly between the fit of the two values of $X$ in the sample adjacent to $x_0$. Alternatively we could ignore symmetry and take the $r$ closest points to $x_0$, regardless of which side they are on; this is called a {\sl nearest neighbourhood}. This handles arbitrary $x_0$ in a simple and clean way. The pros and cons of these two types of neighbourhoods are discussed later. \input rlsmooth.tex \par This simple smoother is also called a {\sl moving average}, and is popular for evenly-spaced time-series data. Although it is valuable for theoretical calculation because of its simplicity, in practice it does not work very well. It tends to be so wiggly that it hardly deserves the name {\sl smoother.} Apart from looking unpleasant, it tends to flatten out trends near the endpoints and hence can be severely biased. For example, Fig.~\allsmooths\ shows a running mean smooth with $2k+1=11$ or 25\% of the 43 observations. A simple generalization of the running mean alleviates the bias problem: we compute a least-squares line instead of a mean in each neighbourhood. The {\sl running lines smoother} is defined by $$\gsmooth(x_0)=\hat\alpha(x_0) +\hat \beta(x_0) x_0\eqn{\runl}$$ where $\hat\alpha(x_0)$ and $ \hat \beta(x_0) $ are the least-squares estimates for the data points in $\symNN(x_0)$. Figure~\rlsmooth\ shows the local line computed in two different neighbourhoods, one in the interior, the other near the boundary. 
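The running mean over symmetric nearest neighbourhoods can be sketched in a few lines (assuming numpy; the indices here are zero-based, so the neighbourhood of point $i$ runs from $\max(i-k,0)$ to $\min(i+k,n-1)$):

```python
import numpy as np

def running_mean(y, k):
    """Running mean: average the y-values in the symmetric nearest
    neighbourhood of each point, truncating the window at the endpoints.
    Assumes the data are already sorted by x."""
    y = np.asarray(y, float)
    n = len(y)
    return np.array([y[max(i - k, 0): min(i + k, n - 1) + 1].mean()
                     for i in range(n)])

smooth = running_mean([1.0, 2.0, 3.0, 4.0, 5.0], k=1)  # windows of up to 3 points
```

Note how the truncated windows at each end average fewer points, which is the source of the endpoint bias discussed below.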
As we would expect, the fit in the interior is dominated largely by the mean and the slope plays a small role, whereas near the boundary the slope is important for picking up the trend in the asymmetric neighbourhood. A running-lines smooth with a neighbourhood size of $2k+1=13$, or 30\%, for the diabetes data is shown in Fig.~\allsmooths. It seems to capture the trend in the data quite nicely but is still somewhat jagged. The parameter $k$ controls the appearance of the running-lines smooth. Large values of $k$ tend to produce smoother curves while small values tend to produce more jagged curves. It is more convenient to think not in terms of $k$ but instead in terms of $w=(2k+1)/n$, the proportion of points in each neighbourhood, called the {\it span}. We denote by $[\symNN(x_i)]$ the number of points in $\symNN(x_i)$. In the extreme case, if $w= 2$, so that each neighbourhood contains all of the data (note that $w=1$ won't work because of endpoint effects), the running-lines smooth is the least-squares line. On the other hand, if $w=1/n$, each neighbourhood consists only of one data point and hence the smoother interpolates the data. In the next chapter we discuss the quantitative effects of varying the span and a data-based criterion for choosing it. The running-lines smoother can be computed in O(n) operations (Exercise~2.2), and there is a simple, effective method for span selection. These can be important advantages when it is used as a building block in the iterative algorithms discussed later in this book. On the other hand, it tends to produce curves that are quite jagged so that a second stage of smoothing might be necessary. One way to improve the appearance of the running-lines smooth is through the use of a {\sl weighted} least-squares fit in each neighbourhood. The running-lines smoother can produce jagged output because points in a given neighbourhood are given equal (nonzero) weight while points outside of the neighbourhood are given zero weight.
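The unweighted running-lines smoother itself is a direct elaboration of the running mean: fit a least-squares line in each symmetric nearest neighbourhood and evaluate it at the target point. A sketch (assuming numpy, zero-based indices; this is the naive $O(nk)$ version, not the $O(n)$ updating algorithm of Exercise~2.2):

```python
import numpy as np

def running_lines(x, y, k):
    """Running-lines smoother: in the symmetric nearest neighbourhood of
    each x_i, fit a least-squares line and evaluate it at x_i.
    Assumes x is sorted; the span is w = (2k+1)/n."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    fit = np.empty(n)
    for i in range(n):
        lo, hi = max(i - k, 0), min(i + k, n - 1) + 1
        slope, intercept = np.polyfit(x[lo:hi], y[lo:hi], 1)  # local line
        fit[i] = intercept + slope * x[i]
    return fit
```

Because each local fit carries a slope, data lying exactly on a line are reproduced exactly, even in the truncated neighbourhoods at the endpoints; this is the bias correction over the running mean.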
Thus as the neighbourhoods move from left to right, there are discrete changes in the weight given to the leftmost and rightmost points. We can alleviate this problem by giving the highest weight to $x_i$, and weights smoothly decreasing as we move further away from $x_i$. ^{Cleveland's (1979)} implementation of a locally-weighted running-lines smoother, {\sl loess}, is popular and (essentially) works in this way. A full description of locally-weighted running-line smoothers is given later in this chapter; Fig.~\allsmooths\ shows one applied to the diabetes data. It is smoother than the unweighted running-lines smoother. The cost of using smooth weights is a computational one, as the shortcuts available for the unweighted running-lines smoother (Exercise~2.2) do not work for locally weighted running-lines smoothers. Henceforth we will refer to the unweighted running-lines smoother simply as the running-lines smoother. \Sectionskip \Section{Kernel smoothers} A kernel smoother uses an explicitly defined set of local weights, defined by the {\sl kernel}, to produce the estimate at each target value. Usually a kernel smoother uses weights that decrease in a smooth fashion as one moves away from the target point. \input kernelsm.tex \par The weight given to the $j$th point in producing the estimate at $x_0$ is defined by $$S_{0j}={c_0\over \lambda} d\left(\big\vert {{x_0-x_j}\over{\lambda}} \big\vert\right)\eqn{\kernel}$$ where $d(t)$ is an even function defined for $\abs{t}\geq 0$. The parameter $\lambda$ is the window width, and the constant $c_0$ is usually chosen so that the weights sum to unity, although there are slight variations on this. A natural candidate for $d$ is the standard Gaussian density: this gives the so-called {\sl Gaussian kernel} smoother.
Other popular kernels, with some theoretical justification (^{Eubank, 1988}), are the Epanechnikov kernel $$d(t)=\cases{ {3\over 4} (1-t^2),& for $\abs{t}\leq 1$;\cr 0& otherwise\cr}$$ which minimizes (asymptotic) mean squared error, and the minimum variance kernel $$d(t)=\cases{ {3\over 8} (3-5t^2),& for $\abs{t}\leq 1$;\cr 0& otherwise\cr}$$ which minimizes the asymptotic variance of the estimate. Note that the weight given to an observation is a function only of its {\sl metric} distance from $x_0$, while the weights used by the nearest-neighbour smoothers are typically a function of both {\sl metric} and {\sl rank} distance. Kernel smoothers also exhibit biased endpoint behaviour. Special kernels have been developed to overcome this bias; a simple approach is to use kernel weights in a locally-weighted straight-line fit. A Gaussian-kernel smooth for the diabetes data is shown in Fig.~\allsmooths, where once again the parameter $\lambda$ was chosen so that the approximate degrees of freedom of the fitted smooth was five, the same as that of the other smooths in Fig.~\allsmooths. \sectionskip \section{Computational issues} One can visualize the action of the kernel smooth as sliding the weight function along the $x$-axis in short steps, each time computing the weighted mean of $y$. The smooth is thus similar to a convolution between the kernel and an empirical step function defined on the data. This is indeed the case, although the practical details obscure the resemblance. For example, the kernel is usually truncated at the ends of the data, unless the data itself is periodic. Typically the kernel smooth is computed as $$\gsmooth(x_0)={\sum_{i=1}^n d({x_0-x_i \over \lambda})y_i\over \sum_{i=1}^n d({x_0-x_i \over \lambda})}, \eqn{\NadWat}$$ and so both the numerator and denominator are convolutions. 
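This normalized weighted mean can be sketched directly (assuming numpy), with the Gaussian and Epanechnikov kernels as the two choices of $d$; evaluating it over a grid of target points traces out the smooth:

```python
import numpy as np

def kernel_smooth(x, y, x0, lam, kernel="gaussian"):
    """Kernel smooth at a target x0: the weighted mean of y with weights
    d((x0 - x_i)/lambda), normalized so that the weights sum to one."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    t = (x0 - x) / lam
    if kernel == "gaussian":
        d = np.exp(-0.5 * t ** 2)          # Gaussian density, up to a constant
    else:                                   # Epanechnikov: compact support
        d = np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)
    return np.sum(d * y) / np.sum(d)
```

The constant in front of each kernel cancels in the ratio, so only the shape of $d$ matters; $\lambda$ plays the role of the window width.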
If the $x$-values are evenly spaced (and preferably a power of two in number), great savings can be made by using the FFT (fast Fourier transform) in performing the calculations. Although evenly spaced data are rare in general regression contexts, reasonable approximations can often reduce the data to a fine grid, where the FFT can be put to work. The details are laid out in Exercise~2.9. \Sectionskip\Section{Running medians and enhancements} A somewhat different approach to smoothing is based on improvements to the simple running mean smoother. First, the running mean is replaced by a running median to make the smoother resistant to outliers in the data. Then the appearance of the estimate is enhanced by applying compound operations known as {\em Hanning}, {\em splitting} and {\em twicing}, in various combinations. We will not discuss these smoothers here because they are mainly useful for evenly spaced data (sequences), especially time series, and they don't typically provide an adequate amount of smoothing for our purposes. The interested reader can pursue the references given in the Bibliographic notes. We focus on smoothers for unevenly spaced data (scatterplots), sometimes called {\em regression smoothers}. In our view, another disadvantage of the enhanced running median smoothers is that they are highly nonlinear functions of the response, and thus it is difficult to assess the amount of fitting that they do. This is an important consideration for the inferential stage of a data analysis. Many of the smoothers that we discuss are {\em linear} (section~3.4.2), and this facilitates an approximate assessment of their {\em degrees of freedom} (section~3.5). We return to the time series setting for our discussion of the Fourier analysis of smoothers in section~3.7, and a seasonal decomposition procedure in section~8.5.
\input hatplot.tex \Sectionskip \Section{Equivalent kernels} The smoothers described in this chapter are defined in quite different ways, with more different ways to come. Their {\sl equivalent kernels} are one way to compare them on common ground. All the smoothers studied in this chapter are {\sl linear} in $Y$, which means that the fit at a point $x_0$ can be written as $\gsmooth(x_0)=\sum_{j=1}^n S_{0j}y_j$, and the $S_{0j}$ depend on all the $x_i$ and on the smoothing parameter $\lambda$. Thus the weight in the fit at $x_0$ which is associated with the point $x_j$ is $S_{0j}$, and this sequence of weights is known as the {\em equivalent kernel} at $x_0$. A plot of $S_{0j}$ against $x_j$ shows which observations have an influence on the fit at $x_0$. Looking at Fig.~\hatplot, we see that the running mean smoother has weights either zero or $1/m$, where $m$ is the number of observations in the neighbourhood. We have used the same diabetes data as was used in Fig.~\allsmooths. For each smoother we have computed the equivalent kernel centered at two points: one near the boundary and one at the centre. The weights at the boundary are larger and warn us that end effects may be a problem (as indeed they are). For the running-line smoother the weights change linearly within the window (Exercise~2.3). For both the running mean and line smoothers the weights drop off abruptly to zero outside the neighbourhood, and account for their jagged appearance. The {\em loess} smooth on the other hand has a strictly local neighbourhood yet the weights die down smoothly to zero. We need to calibrate the smoothers so that they are doing approximately the same amount of smoothing, since the value of the smoothing parameter will clearly widen or narrow the equivalent kernels. We do this using the {\sl equivalent degrees of freedom}, which we describe in the next chapter. All the smoothers in Fig.~\hatplot\ have been calibrated to do about five degrees of freedom worth of smoothing. 
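Since each fit is linear in $\vec y$, the weights $S_{0j}$ at any target point can be recovered numerically by smoothing unit response vectors, one coordinate at a time. A sketch (assuming numpy, with a simple running mean standing in for the linear smoother; any of the linear smoothers in this chapter could be plugged in):

```python
import numpy as np

def equivalent_kernel(smoother, n, i0):
    """Row i0 of the smoother matrix S for a linear smoother s = Sy,
    obtained by smoothing each unit response vector in turn and reading
    off the fit at position i0."""
    row = np.empty(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0                  # response with a single 1 in position j
        row[j] = smoother(e)[i0]    # this fit is exactly S_{i0 j}
    return row

# Toy linear smoother for illustration: a running mean with half-width 2.
def toy_smoother(y, k=2):
    n = len(y)
    return np.array([y[max(i - k, 0): min(i + k, n - 1) + 1].mean()
                     for i in range(n)])

interior = equivalent_kernel(toy_smoother, 9, 4)   # weights 1/5 inside the window
boundary = equivalent_kernel(toy_smoother, 9, 0)   # larger weights near the end
```

The two extracted rows reproduce, in miniature, the pattern in Fig.~\hatplot: equal weights of $1/m$ inside the neighbourhood and zero outside, with larger weights at the boundary.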
In the top panel we see the equivalent kernel for the quartic polynomial fit. As we might expect it spreads its influence everywhere. We will return to Fig.~\hatplot\ as we encounter the remaining smoothers. \Sectionskip \Section {Regression splines} Polynomial regression has limited appeal due to the global nature of the fit, while in contrast the smoothers we have seen so far have an explicit local nature. Regression splines offer a compromise by representing the fit as a {\sl piecewise} polynomial. The regions that define the pieces are separated by a sequence of {\sl knots} or breakpoints, $\xi_1,\ldots,\xi_K$. In addition, it is customary to enforce the piecewise polynomials to join smoothly at these knots. Although many different configurations are possible (the bin smoother was one), a popular choice consists of piecewise cubic polynomials constrained to be continuous and have continuous first and second derivatives at the knots. Apparently our eyes are skilled at picking up second order and lower discontinuities, but not higher. \input BS.tex \par To illustrate this point, Fig.~\BS\ shows four piecewise cubic fits to some simulated data, with 2 interior knots and different orders of continuity. Apart from being smoother, the fit in panel (d) required fewer parameters (6 altogether) than the others. By allowing more knots, the family of curves becomes more flexible. For any given set of knots, the smooth is computed by multiple regression on an appropriate set of basis vectors. These vectors are the basis functions representing the particular family of piecewise cubic polynomials evaluated at the observed values of the predictor $X$. This approach is attractive because of its computational and statistical simplicity; for example standard parametric inferential methods can be used to test the importance of any of the parameters. 
A variant of the polynomial splines is the family of natural splines; although they are defined for all piecewise polynomials of odd degree, we discuss the natural {\sl cubic} splines. These are cubic splines with the additional constraint that the function is linear beyond the boundary knots. To enforce this condition we have to impose two constraints in each of the boundary regions: $f'''=f''=0$, which reduces the dimension of the space from $K+4$ to $K$ if there are $K$ knots. In practice it is common to supply an additional knot at each extreme of the data, and impose linearity beyond them. Then with $K$ interior knots (and two boundary knots), the dimension of the space of fits is $K+2$. Natural splines have less flexibility at the boundaries, but this tends to be a plus since the fitted values of regular regression splines have high variance near the boundary. Furthermore, for the same number of parameters, we get two more interior knots. \sectionskip \section{Computational aspects} The main difficulty when working with regression splines is selecting the number and position of the knots. A very simple approach (referred to as cardinal splines) requires a single parameter, the {\sl number} of interior knots. The positions are then chosen uniformly over the range of the data. A slightly more adaptive version will place the knots at appropriate quantiles of the predictor variable; e.g., three interior knots would be placed at the three quartiles. More adventurous schemes use data-driven criteria to select the number and positions of the knots. The main challenge is to come up with a sensible procedure, while avoiding a combinatorial nightmare. We describe some specific proposals in Chapter~9. Another computational aspect concerns the choice of basis functions for representing the splines (a vector space) for a given set of knots.
Suppose these interior knots are denoted by $\xi_1<\cdots<\xi_K$, and for notational simplicity we augment the set with two boundary knots $\xi_0$ and $\xi_{K+1}$. A simple choice of basis functions for piecewise-cubic splines, known as the truncated power series basis, derives from the parametric expression for the smooth $$\gsmooth(x)= \beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\sum_{j=1}^K\theta_j(x-\xi_j)_+^3,\eqn{\tpseries}$$ where $a_+$ denotes the positive part of $a$. Evidently (Exercise~2.4) \tpseries\ has the required properties:\smallskip {\parindent 20pt \item{(i)} $\gsmooth$ is a cubic polynomial in any subinterval $[\xi_j,\xi_{j+1})$, \item{(ii)} $\gsmooth$ has two continuous derivatives, and \item{(iii)} $\gsmooth$ has a third derivative that is a step function with jumps at $\xi_1,\ldots,\xi_K$. }\smallskip We can write \tpseries\ as a linear combination of $K+4$ basis functions $P_j(x)$: $P_1(x)=1$, $P_2(x)=x$, and so on. Each of these functions must also satisfy the three conditions and be linearly independent in order to qualify as a basis. Even without \tpseries\ it is clear that we require $K+4$ parameters to represent piecewise cubics; four for each of the $(K+1)$ cubics, less three per interior knot due to the constraints. To actually smooth some data pairs $\{x_i,y_i\}$, we would construct a regression matrix with $K+4$ columns, each column corresponding to a function $P_j$ evaluated at the $n$ values of $x$. Although \tpseries\ has algebraic appeal, it is not the recommended form for computing the regression spline. Even though a particular cubic piece is designed to accommodate one interval, it is evaluated at all points to the right of its knot and the numbers usually get large. \input BSbasis.tex \par The $B$-spline basis functions provide a numerically superior alternative basis to the truncated power series. Their main feature is that any given basis function $B_j(x)$ is nonzero over a span of at most five distinct knots.
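To make the basis-regression recipe concrete, here is a sketch (assuming numpy) that assembles the truncated power-series basis and fits by ordinary least squares. As noted above, this basis is for exposition; the $B$-spline basis is the numerically preferred route in practice:

```python
import numpy as np

def truncated_power_basis(x, knots):
    """The K+4 columns 1, x, x^2, x^3 and (x - xi_j)_+^3, evaluated at the data."""
    x = np.asarray(x, float)
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]
    return np.column_stack(cols)

def regression_spline(x, y, knots):
    """Cubic regression spline: ordinary least squares on the basis matrix."""
    P = truncated_power_basis(x, knots)
    coef, *_ = np.linalg.lstsq(P, np.asarray(y, float), rcond=None)
    return P @ coef
```

Since a global cubic lies in the span of this basis (all $\theta_j=0$), such data are reproduced exactly; with the $\theta_j$ free, the fit can bend at each knot.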
In practice this means that their evaluation rarely gets out of hand, and the resulting regression matrix is banded. Of course the $B_j$ are themselves piecewise cubics, and we need $K+4$ of them if we want to span the space. Their definition is rather simple (in terms of divided differences), but unhelpful for fully understanding their properties, so we omit it here and refer the reader to ^{de~Boor~(1978)}. Figure~\BSbasis\ shows the evaluated $B$-splines used to compute the fits in Fig.~\BS(d). In summary, regression splines are attractive because of their computational neatness, {\sl when the knots are given}. Standard linear model estimation can be applied, which will be found to be very convenient in later chapters when we use regression splines in additive models. However, the difficulty of choosing the number and position of the knots tends to be a drawback of this approach. When a small number of knots is used, the smoother can show some disturbing nonlocal behaviour. Figure~\hatplot\ shows the equivalent kernels for a natural cubic regression spline with three interior knots. The right-hand kernel does not look much different from that for the quartic polynomial! With more knots this global influence would be dampened, but we often don't have that many \df\ to spare. Another problem with regression splines is that the smoothness of the estimate cannot easily be varied continuously as a function of a single smoothing parameter; this is an attractive, if only approximate, property of most of the other smoothers that we discuss. \Sectionskip \Section{Cubic smoothing splines} This smoother is not constructed explicitly like those described so far, but instead emerges as the solution to an optimization problem.
Consider the following problem: among all functions $f(x)$ with absolutely continuous first derivatives and integrable second derivatives, find the one that minimizes $$\sum_{i=1}^n\{y_i-f(x_i)\}^2 +\lambda\int_{-\infty}^\infty \{f''(t)\}^2 \, dt\eqn{\qb}$$ where $\lambda$ is a fixed constant. This criterion satisfies the requirements for a scatterplot smoother that we mentioned earlier. The first term measures closeness to the data while the second term penalizes curvature in the function. Now it is one thing to state a criterion like $\qb$ and quite another to find the optimizing function. Remarkably, it can be shown that $\qb$ has an explicit, unique minimizer and that minimizer is a {\sl natural cubic spline} with knots at the unique values of $x_i$ (Exercise~2.6). At face value it seems that the family is overparametrized, since there are as many as $n-2$ interior knots. This would result in $n+2$ parameters, although the constraints on each end bring it down to $n$. We'll see however that the coefficients are estimated in a constrained way as well, and this can bring the {\sl effective} dimension down dramatically. A cubic smoothing spline fitted to the diabetes data is shown in Fig.~\allsmooths. It looks much like the running-lines smooth but is less jagged. The parameter $\lambda$ plays the same role as the span in the running-lines smooth. Large values of $\lambda$ produce smoother curves while smaller values produce more wiggly curves. At the one extreme, as $\lambda\rightarrow \infty$, the penalty term dominates, forcing $f''(x)=0$, and thus the solution is the least-squares line. At the other extreme, as $\lambda\rightarrow 0$, the penalty term becomes unimportant and the solution tends to an interpolating twice-differentiable function. We discuss methods for choosing $\lambda$ in Chapter~3.
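A discrete analogue helps fix ideas: replace the integrated squared second derivative by a sum of squared second differences of the fitted values. The following sketch (assuming numpy, and evenly spaced $x$ so that second differences are a fair stand-in for second derivatives) solves the resulting penalized least-squares problem in closed form; it is an illustration of the penalization idea, not the exact spline computation described below:

```python
import numpy as np

def penalized_smooth(y, lam):
    """Minimize ||y - f||^2 + lam * ||D f||^2, with D the second-difference
    operator -- a discrete stand-in for the smoothing-spline criterion.
    Closed-form solution: f = (I + lam * D'D)^{-1} y."""
    y = np.asarray(y, float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

The two extremes behave exactly as described above: as $\lambda\to\infty$ the penalty forces the second differences to zero and the fit approaches a straight line (a straight line passes through untouched for any $\lambda$), while as $\lambda\to 0$ the fit interpolates the data.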
\sectionskip \section{Computational aspects} Using the fact that the solution to $\qb$ is a natural cubic spline with $n-2$ interior knots, we can represent it in terms of a basis for this space of fits. For computational convenience, we will use the unconstrained $B$-spline basis, and write $\gsmooth(x)=\sum_1^{n+2} \gamma_i B_i(x)$, where $\gamma_i$ are coefficients and the $B_i$ are the cubic $B$-spline basis functions. As written here, $s$ lies in an $(n+2)$-dimensional space; however, the natural splines are a subspace. We can now simply replace $f$ in \qb\ by $s$ and perform the integration. Then defining the $n\times (n+2)$ matrix $ \bB$ and $(n+2)\times (n+2)$ matrix $\fat{\Omega}$ by $$ B_{ij}=B_j(x_i)$$ and $$\fat{\Omega}_{ij}=\int_{-\infty}^\infty B_i''(x)B_j''(x) \, dx,$$ we can rewrite the criterion $\qb$ as $$(\vec y-\bB\fat\gamma)^T(\vec y-\bB\fat\gamma) +\lambda\fat \gamma^T\fat{\Omega} \fat \gamma\eqn{\qd}$$ Although at face value it seems that there are no boundary derivative constraints, it turns out that the penalty term automatically imposes them. Setting the derivative with respect to $\gamma$ equal to zero gives $$(\bB^T\bB+\lambda\fat\Omega)\hatfat\gamma=\bB^T\vec y\eqn{\eqq}$$ Since the columns of $\bB$ are the evaluated $B$-splines, in order from left to right and evaluated at the {\sl sorted} values of $X$, and the cubic $B$-splines have local support, $\bB$ is lower 4-banded. Consequently the matrix $\vec M=(\bB^T\bB+\lambda\fat{\Omega})$ is 4-banded and hence its Cholesky decomposition $\vec M=\bL\bL^T$ can be computed easily. One then solves $\bL\bL^T\hatfat\gamma=\bB^T\vec y$ by back-substitution to give $\hatfat\gamma$ and hence the solution $\hat\gsmooth$ in O(n) operations. For theoretical purposes, it is convenient to rewrite the solution vector $\vgsmooth$ ($\gsmooth$ evaluated at each of the $n$ values of $X$ in the sample) in another form. 
Let $\bN$ be an $n\times n$ nonsingular {\sl natural-spline} basis matrix for representing the solution (Exercise~2.5). Denote by $\hatfat{\beta}$ the transformed version of $\hatfat\gamma$ corresponding to the change in basis. Then we can write $$\vgsmooth=\bN\hatfat{\beta}=\bN(\bN^T\bN+\lambda\fat{\Omega})^{-1}\bN^T\vec y= (\bI+\lambda \bK)^{-1}\vec y\eqn{\qf}$$ where $\bK={\bN^{-T}}\fat{\Omega} \bN^{-1}$. In terms of the candidate fitted vector $\vec f$ and $\bK$, the cubic smoothing spline $\vgsmooth $ minimizes $$(\vec y-\vec f)^T(\vec y-\vec f) +\lambda\vec f^T\bK \vec f\eqn{\nqd}$$ over all vectors $\vec f$. It is appropriate to call the term $\vec f^T\bK\vec f$ a roughness penalty, since it can be shown to be a quadratic form in second differences. The cubic smoothing spline, because of the implicit way it is defined, doesn't appear to use {\sl local averaging}. However, the equivalent-kernel weights in Fig.~\hatplot\ show that it does possess local behaviour quite similar to kernels or locally-weighted lines. The equivalent kernel is nowhere exactly zero, but it is close to zero far from the target point. Other examples of equivalent kernels are given in ^{Buja, Hastie and Tibshirani (1989)}, and an asymptotic form for the equivalent kernel of smoothing splines is derived by ^{Silverman (1984)}. One might say, then, that a cubic smoothing spline is approximately a {\sl kernel} smoother. The smoothing parameter $\lambda$ controls the shape of the kernel or weight function. As one would expect, when $\lambda$ is decreased the weights tend to concentrate more around the target point, while as $\lambda$ is increased, they spread out more. We discuss the properties of smoothing splines in more detail in Chapter~5 and Section~9.3.6. \Sectionskip \Section{Locally weighted running-line smoothers} Here we describe in more detail the locally-weighted smoother of ^{Cleveland (1979)}, currently called {\sl loess} in the S statistical-computing language.
Although any polynomial can be fitted locally, we discuss local lines. A locally-weighted straight-line smooth $\gsmooth(x_0)$ using $k$ nearest neighbours is computed in a number of steps:\smallskip {\parindent 20pt \item{(i)} The $k$ nearest neighbours of $x_0$ are identified, denoted by $\NN(x_0)$. \item{(ii)}$\Delta(x_0)=\max_{\NN(x_0)}\abs{x_0-x_i}$ is computed: the distance from $x_0$ to its furthest near-neighbour. \item{(iii)}Weights $w_i$ are assigned to each point in $\NN(x_0)$, using the {\sl tri-cube} weight function: $$W\biggl({\abs{x_0-x_i}\over {\Delta(x_0)}}\biggr) \eqn{\clev}$$ where $$W(u)=(1-u^3)^3 \quad\for\quad 0\leq u < 1 \eqn{\tri}$$ and zero otherwise. \item{(iv)} $\gsmooth(x_0)$ is the fitted value at $x_0$ from the weighted least-squares fit of $y$ to $x$ confined to $\NN(x_0)$ using the weights computed in (iii). }\smallskip \input loessm.tex \par Figure~\loessm\ depicts the situation, using the diabetes data once again. Notice that the kernel is truncated at one end due to the metric asymmetry of the neighbourhood. Cleveland also discusses the use of a robust regression within each neighbourhood, to protect against outliers. This effectively works by repeatedly smoothing the data, and at each iteration down-weighting points with large residuals. See Chapters~3 and 9 for more details. A locally-weighted running-line smooth is shown in Fig.~\allsmooths. The number of nearest-neighbours, usually expressed as a percentage or {\sl span} of the data points, is the smoothing parameter. In Figs~\allsmooths\ and \loessm\ a span of $21/43$ was used, chosen to make the degrees of freedom approximately equal to those of the other smooths. In principle, nearest neighbourhoods are preferable to symmetric nearest neighbourhoods because, for a fixed number of points, the average distance to the target point is smaller in a nearest neighbourhood (unless the predictors are evenly spaced).
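Steps (i)--(iv) translate almost line for line into code. The following is a minimal sketch (the function name is ours; it uses plain {\tt numpy}, omits the robustness iterations, and fits at a single target point rather than over a grid):

```python
import numpy as np

def loess_point(x, y, x0, k):
    """Locally-weighted straight-line fit at x0, following steps (i)-(iv)."""
    d = np.abs(x - x0)
    nbrs = np.argsort(d)[:k]                      # (i)  k nearest neighbours
    delta = d[nbrs].max()                         # (ii) furthest near-neighbour
    u = d[nbrs] / delta
    w = np.where(u < 1.0, (1.0 - u**3)**3, 0.0)   # (iii) tri-cube weights
    # (iv) weighted least-squares line over the neighbourhood, value at x0
    X = np.column_stack([np.ones(k), x[nbrs]])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[nbrs]))
    return beta[0] + beta[1] * x0
```

Because each neighbourhood fit is a weighted least-squares line, exactly linear data are reproduced exactly, whatever the span.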
In general this should result in less bias. We use symmetric nearest neighbourhoods in a running-lines smoother, however, because of their superior performance at the left and right endpoints. A nearest neighbourhood at the endpoint contains the same number of data points as a neighbourhood in the middle of the data, so that when a least-squares fit is applied there, too much weight is given to points far away from the target point. A symmetric nearest neighbourhood, on the other hand, contains only about half as many points as a full neighbourhood, and thus automatically reduces the weight given to faraway points. Nearest neighbourhoods work satisfactorily with {\sl loess} at the endpoints, however, because the tri-cube function does the job of down-weighting faraway points. The locally-weighted smoothers are popular, since they enjoy the best of both worlds. They share the ability of near-neighbour smoothers to adapt their bandwidth to the local density of the predictors, while they have the smoothness features of kernel smoothers. Although in principle it requires $O(n^2)$ operations to compute a locally-weighted smooth, the current implementations compute the fit over a grid of values of $x$, and use interpolation elsewhere. \Sectionskip \Section{Smoothers for multiple predictors} So far we have talked about smoothers for a single predictor, that is, scatterplot smoothers. What if we have more than one predictor, say $X_1,\ldots, X_p$? Then our problem is one of fitting a $p$-dimensional surface to $Y$. The multiple regression of $Y$ on $X_1,\ldots, X_p$ provides a simple, but very limited, estimate of the surface. On the other hand, it is easy conceptually to generalize the running mean, locally-weighted running-lines, and kernel smoothers to this setting. The first two smoothers require a definition of {\sl nearest neighbours} of a point in $p$-space.
{\sl Nearest} is determined by a distance measure and for this the most obvious choice is Euclidean distance. Note that the notion of symmetric nearest neighbours is no longer meaningful when $p > 1$. Having defined a neighbourhood, the generalization of the running mean estimates the surface at the target point by averaging the response values in the neighbourhood. This highlights an important detail: what shape neighbourhood do we use? Nearest-neighbourhoods defined in terms of Euclidean distance are typically spherical, but one could imagine more general neighbourhoods defined in terms of a covariance matrix $\Sigma$ of the predictors. Such a metric would consider points lying on an ellipsoid centred at the target point to be equidistant from the target. This generalization is not a purely academic issue --- it may be important if the covariates are measured in different units or are correlated. In linear regression, the coefficients automatically scale each covariate correctly. Various strategies are used in practice, an obvious one being to standardize the individual variables prior to smoothing. This might not always be the best choice, however. For example, if the underlying surface changes more rapidly with one variable than the other, one may wish to have the neighbourhoods thinner in the direction of the first variable. This can be achieved by differential scaling of the variables. \figname{\bivar}\par The kernel smoother is generalized in an analogous way to the running mean. Given two $p$-vectors $\vec x^0$ and $\vec x^i$ in predictor space, the weight given to the $i$th point for the fit at $\vec x^0$ is $S_{0i}=(c_0/\lambda) d({{\norm{\vec x^0-\vec x^i}}/\lambda})$ where $d(t)$ is an inverse distance measure and $\norm{\cdot}$ is a norm, for example squared distance. The constant $c_0$ is usually chosen to make the weights sum to unity, so that the smoother reproduces the constant function.
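As a small illustration of these weights (the function name is ours, taking $d$ to be a Gaussian density as one common choice, and Euclidean distance as the norm):

```python
import numpy as np

def kernel_weights(X, x0, lam):
    """Weights S_0i = (c0/lam) d(||x0 - xi|| / lam) for the fit at x0,
    with Gaussian d and Euclidean norm; c0 is chosen so the weights
    sum to one, so that the smoother reproduces a constant function."""
    t = np.linalg.norm(X - x0, axis=1) / lam
    w = np.exp(-0.5 * t**2)
    return w / w.sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))   # 100 points, p = 2 predictors
x0 = np.zeros(2)
w = kernel_weights(X, x0, lam=0.5)
print(round(w.sum(), 6))            # 1.0
```

The fit at $\vec x^0$ is then simply the weighted average $\sum_i S_{0i}y_i$, with points near $\vec x^0$ contributing most.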
The multi-predictor Gaussian kernel uses $d(t)$ equal to the standard Gaussian density. The resulting weight function is depicted in Fig.~\bivar. \midinsert \vbox{ \centerline{\psfig{file=figures/bivar.p,width=3in}} \figurecaption{\bivar}{ The weight function for a bivariate Gaussian kernel smoother. }* } \endinsert The choice of norm is important in this setting, and determines the shape of the neighbourhoods in predictor space. One might argue, for example, that the norm should incorporate covariance information by replacing the squared error norm with the ``Mahalanobis'' distance $$\norm{\vec x^0-\vec x^i}=(\vec x^0-\vec x^i)^T\Sigma^{-1}(\vec x^0-\vec x^i)$$ where $\Sigma$ is the covariance matrix of the predictors. Locally-weighted lines generalize equally easily to two or more dimensions. The simplest generalization hardly needs any different wording than that used to describe the univariate situation. Once a multivariate distance is defined, nearest-neighbourhoods and tri-cube weights get assigned in exactly the same way. One computes a local plane rather than a line. The cubic smoothing spline is more difficult to generalize to two or more dimensions: the so-called {\sl thin-plate spline} is one such generalization. It derives from generalizing the second derivative penalty for smoothness to a two-dimensional penalty of the form $$\int\int\left\{\left({\partial^2 f\over \partial x_1^2}\right)^2 + 2\left({\partial^2 f\over \partial x_1\partial x_2}\right)^2+\left({\partial^2 f\over \partial x_2^2}\right)^2\right\}\, dx_1dx_2.$$ The computations are $O(n^3)$, compared to the $O(n)$ for univariate splines. Details are not given here; the interested reader can find references in the bibliographic notes at the end of Chapter~3. Another generalization is known as multivariate {\sl tensor product} splines. These are also useful for generalizing univariate regression splines.
The basic idea is to construct two-dimensional basis functions by multiplying together one-dimensional basis functions in the respective predictors. We discuss these in more detail for multivariate regression splines in Chapter~9. For the smoothing version of tensor product splines, the effective basis set has dimension $n^2$, formed by the tensor product of the univariate bases. The fit will be a damped regression onto this space of basis functions. See the references at the end of Chapter~3 for details. We are not devoting much space to multi-predictor smoothers because we don't feel that they are very useful for more than two or three predictors. Indeed, their many shortcomings (e.g.~difficulty of interpretation and computation) provide an impetus for studying additive models, the central topic of this book. These comments refer to the generic multivariate smoothers as described here. We are not referring to some of the adaptive multivariate nonparametric regression methods, which might also be termed surface-smoothers, and which were designed to overcome some of these objectionable aspects. We discuss some of the problems associated with multi-predictor smoothing in detail at the beginning of Chapter~4. Despite these criticisms, multi-predictor smoothers (especially the bi-predictor) are useful in certain settings, as later examples show. \Sectionskip \Section{Discussion and bibliography} In this chapter we have described a number of scatterplot smoothers, including the running-mean and running-lines smoothers, cubic regression and smoothing splines, locally-weighted running-lines and kernel smoothers. We have also touched on surface smoothers. Our discussion has focused on smoothers with which we have some experience, and any imbalance in our introduction simply reflects this. In addition, the emphasis in the book is not on smoothers {\sl per~se}, but on smoothers as building blocks in the estimation of additive models.
Some recent theoretical results suggest that for appropriately chosen smoothing parameters, there are not likely to be large differences between locally-weighted running-lines, cubic smoothing-splines and kernel smoothers (^{Silverman, 1984}; ^{M\"uller, 1987}). In order to assess the finite-sample operating characteristics of these and other smoothers, a comprehensive Monte Carlo study, akin to the Princeton robustness study, would be very useful. A recent report by ^{Breiman} and Peters (1988) has this flavour. For a detailed bibliography, see the notes at the end of Chapter~3. Here we list some introductory references we have found useful. ^^{Eubank, R.L.} Eubank (1988) covers most of the material at various levels of detail, especially smoothing splines. ^^{Silverman, B.W.} Silverman's (1985) paper on smoothing splines not only gives an easy-to-read introduction but also touches on a variety of applications. ^^{Cleveland, W. S.} Cleveland (1979) introduces the locally-weighted running-line smoother for univariate problems, and ^^{Devlin, S.}^^{Grosse, E. H.} ^{Cleveland, Devlin and Grosse (1988)} describe the multivariate version. \Sectionskip \Section{Further results and exercises 2} \Mark{EXERCISES \ 2}% \beginexercises% \exercise For the diabetes data, compute one of each of the smooths described in this chapter for the variable \name{base deficit}. Compare qualitatively the resulting estimates in terms of notable features and smoothness. \exercise {\sl Updating formula for running-line smooth.} Suppose we add a point $(x_{j+1},y_{j+1})$ to a neighbourhood containing $j$ points.
If the means, variance of $X$, and covariance of $X$ and $Y$ for the first $j$ points are $\bar x_j$, $\bar y_j$, $S^x_j$, and $S^{xy}_j$, show that $$\bar x_{j+1}=(j\bar x_j+x_{j+1})/(j+1)$$ $$\bar y_{j+1}=(j\bar y_j+y_{j+1})/(j+1)$$ $$(j+1)S^x_{j+1}=jS^x_j+{{j+1}\over j}(x_{j+1}-\bar x_{j+1})^2$$ $$(j+1)S^{xy}_{j+1}=jS^{xy}_j+{{j+1}\over j}(x_{j+1}-\bar x_{j+1})(y_{j+1}- \bar y_{j+1}).\eqn{\upd}$$ What are the equivalent equations for deleting a point? Together these can be used to update the least-squares slope and intercept, and hence the entire fit can be computed in $O(n)$ operations. \exercise Derive an expression for $S_{ij}$, the coefficient of $y_j$ in the expression $\gsmooth(x_i)=\sum_{j=1}^n S_{ij}y_j$ for the running-line fit at $x_i$, using $\symNN(x_i)$ to denote the set of indices of the $k$ symmetric nearest-neighbours to $x_i$. \exercise Show that the truncated power series representation $$\gsmooth(x)= \beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\sum_{j=1}^k\theta_j(x-\xi_j)_+^3$$ satisfies the three conditions of a cubic spline given in the text. \exercise {\sl Basis for natural splines.} Suppose $B$ is an $n\times (K+4)$ matrix containing the cubic $B$-spline basis functions with $K$ interior knots, evaluated at the $n$ values of $X$. Let $C$ be the $2\times (K+4)$ matrix containing the second derivatives of the basis functions at the boundary points $x_1$ and $x_n$. Show how to derive from $B$ an $n\times (K+2)$ basis matrix $N$ for the natural cubic splines with the same interior knots and boundary knots at the extremes of the $X$-values. \exercise {\sl Derivation of smoothing splines; ^{Reinsch (1967)}.} Consider the following optimization problem: minimize $$\int_{-\infty}^\infty \{f''(x)\}^2 \, dx \quad \hbox{subject to}\quad \sum_{i=1}^n \{y_i-f(x_i)\}^2 \leq \sigma \eqn{\sopt}$$ over all twice differentiable functions $f(x)$.
To solve this problem, consider the Lagrangian functional $$F(f)=\int_{-\infty}^\infty\{f''(x)\}^2 \, dx + \rho \left[ \sum_{i=1}^n \{y_i-f(x_i)\}^2-\sigma+z^2\right]\eqn{\lageq}$$ where $z$ is an auxiliary variable. \smallskip {\parindent 20pt \item{(i)} To find the minimum of $F(f)$, compute $d F\{f+\delta h(x)\}/d\delta$ where $h(x)$ is any twice differentiable function. (This is a standard technique in the calculus of variations.) Integrate the resulting expression by parts (twice) and conclude that the solution is a piecewise cubic polynomial with knots at the $x_i$. Show that the solution and its first two derivatives are continuous at the knots, but that the third derivative may be discontinuous there. Show also that the second derivative is zero outside of the range of the $x_i$. \item{(ii)} Compute $\partial F(f)/\partial \rho$ and $\partial F(f)/\partial z$ and conclude that either $\hat \rho=0$, implying that the least-squares line is a solution to $\sopt$, or $\hat\rho\neq 0$, in which case the solution is nonlinear. \item{(iii)} Establish an equivalence between $\sopt$ and the penalized least-squares problem $\qb$ given in the text. } \exercise Suppose each observation has associated with it a weight $w_i$. \smallskip {\parindent 20pt \item{(i)} Derive the appropriate cubic smoothing spline using a weighted residual sum of squares. \item{(ii)} Suppose there are ties in the $x$-values. Suggest a way to handle them for smoothing splines. } \exercise {\sl Semi-parametric regression; ^^{Green, P.J.}^^{Jennison, C. }^^{Seheult, A.} Green, Jennison, and Seheult (1985).} Suppose we have a set of $n$ observations of $p$ predictors arranged in the $n \times p$ matrix $\bf X$, an additional covariate $\vec z$, and a response vector $\vec y$. We wish to fit the model $y_i=\vec x^i{\fat\beta} +f(z_i) +\varepsilon_i$ by penalized least-squares.
Construct an appropriate penalized residual sum of squares, and show that the minimizers must satisfy the following pair of {\sl estimating} equations: $$\eqalign{ {\fat\beta}&=(\X^T\X)^{-1}\X^T(\vec y-\vec f)\cr \vec f &= \smooth(\vec y -\X{\fat\beta}\given \vec z)\cr}$$ where $\smooth$ is an appropriate smoothing-spline operator, and $\vec f$ represents the function $f$ evaluated at the $n$ values $z_i$. \exercise {\sl Efficient kernel smoothing; ^{Silverman (1982)}, ^{H\"ardle (1986)}.} Consider the ^^{Nadaraya, E.A.}^^{Watson, G.S.}Nadaraya-Watson form of the kernel smooth given by equation \NadWat: $$\gsmooth(x_0)={\sum_{i=1}^n\rho({x_0-x_i \over \lambda})y_i\over \sum_{i=1}^n\rho({x_0-x_i \over \lambda})}.$$ We wish to use the FFT to compute both the numerator and denominator. With this end in mind we define a fine grid that covers the range of $X$. Describe how you would approximate the data on the grid and use the FFT to perform the computations. Pay particular attention to:\smallskip {\parindent 20pt \item{(i)} what happens at the boundaries, and \item{(ii)} a method for overcoming discontinuities due to the approximation. } \exercise Extend the methods described in the previous exercise in order to compute a kernel-weighted running-line fit. \exercise Suppose $\bS$ is an $n\times n$ smoother matrix for smoothing against the $n$ unique predictor values $\vec x=(x_1,\ldots, x_n)^T$. That is, if $s(x_i)=\sum_{j=1}^n S_{ij}y_j$, then $\bS$ has $ij$th entry $S_{ij}$. For all of the smoothers that we have described, $\bS\vec 1=\vec 1$, and for most $\bS\vec x=\vec x$. For which is the second statement true, and what are the implications of these two results? For smoothing splines the symmetric statements are also true: $\bS^T\vec 1=\vec 1$ and $\bS^T\vec x=\vec x$, since $\bS$ is symmetric in this case. Is this true for any of the other smoothers? What are the implications of these conditions?
How would you modify $\bS$ for the other smoothers in order that the first, or both, of these latter conditions are satisfied? Try some examples and see what happens to the local support properties of the equivalent kernels. \endexercises \vfill\supereject