Mathcad includes a number of functions for performing regression. Typically, these functions generate a curve or surface of a specified type which in some sense minimizes the error between itself and the data you supply. The functions differ primarily in the type of curve or surface they use to fit the data.

Unlike the interpolation functions discussed in the previous section, these functions do not require that the fitted curve or surface pass through the data points you supply. The regression functions in this section are therefore far less sensitive to spurious data than the interpolation functions.

Unlike the smoothing functions in the next section, the end result of a regression is an actual function, one that can be evaluated at points in between the points you supply. Whenever you use arrays in any of the functions described in this section, be sure that every element in the array contains a data value. Since every element in an array must have a value, Mathcad assigns 0 to any elements you have not explicitly assigned.

Linear regression

These functions return the slope and intercept of the line that best fits your data in a least-squares sense. If you place your x values in the vector vx and your sampled y values in vy, that line is given by y(x) = intercept(vx, vy) + slope(vx, vy) · x.
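The least-squares slope and intercept can be sketched outside Mathcad as well. The following uses NumPy (the data values here are made up for illustration); `np.polyfit` with degree 1 returns the same slope and intercept a least-squares line fit produces.

```python
import numpy as np

# Sample data: x values and noisy y values (illustrative only)
vx = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
vy = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Degree-1 least-squares fit; highest-order coefficient comes first
slope, intercept = np.polyfit(vx, vy, 1)

def line(x):
    # Best-fit line: y = intercept + slope * x
    return intercept + slope * x
```

For this data the fit works out to a slope of 1.99 and an intercept of 1.04.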

Polynomial regression

These functions are useful when you have a set of measured y values corresponding to x values and you want to fit a polynomial through those y values.

Use regress when you want to use a single polynomial to fit all your data values. The regress function lets you fit a polynomial of any order. However, as a practical matter, you would rarely need to go beyond n = 4.
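The regress/interp pair has a close analogue in NumPy: `np.polyfit` computes the least-squares polynomial coefficients (the role of regress) and `np.polyval` evaluates the fitted polynomial at any point (the role of interp). A minimal sketch, using data that is exactly quadratic so the fit is easy to check:

```python
import numpy as np

# Data that follows an exactly quadratic trend: y = x^2 - 3x + 2
vx = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
vy = vx ** 2 - 3 * vx + 2.0

n = 2                                # polynomial order, as in regress(vx, vy, n)
coeffs = np.polyfit(vx, vy, n)       # least-squares polynomial coefficients
y_at = np.polyval(coeffs, 2.5)       # evaluate the fit at x = 2.5, like interp
```

Because the data is exactly quadratic, the evaluated fit at 2.5 recovers 2.5² − 3·2.5 + 2 = 0.75.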

Since regress tries to accommodate all your data points using a single polynomial, it will not work well when your data does not behave like a single polynomial. For example, suppose you expect your yi to be linear from x1 to x10 and to behave like a cubic equation from x11 to x20. If you use regress with n = 3 (a cubic), you may get a good fit for the second half but a terrible fit for the first half.

The loess function, available in Mathcad Professional, alleviates these kinds of problems by performing a more localized regression. Instead of generating a single polynomial the way regress does, loess generates a different second-order polynomial depending on where you are on the curve. It does this by examining the data in a small neighborhood of the point you’re interested in. The argument span controls the size of this neighborhood. As span gets larger, loess becomes equivalent to regress with n = 2. A good default value is span = 0.75.

Figure 14-10 shows how span affects the fit generated by the loess function. Note how a smaller value of span makes the fitted curve track fluctuations in the data more effectively. A larger value of span tends to smear out fluctuations in the data and therefore generates a smoother fit.
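The core idea of loess, fitting a second-order polynomial only to the points near the location of interest, can be sketched as below. This is a simplified illustration (it takes the nearest fraction `span` of the points and omits loess’s distance weighting), not Mathcad’s exact algorithm; `local_quadratic` is a hypothetical helper name.

```python
import numpy as np

def local_quadratic(vx, vy, x0, span=0.75):
    """Fit a 2nd-order polynomial to the fraction `span` of the points
    nearest x0 and evaluate it there (the basic idea behind loess,
    without loess's distance weighting)."""
    k = max(3, int(np.ceil(span * len(vx))))   # points in the neighborhood
    idx = np.argsort(np.abs(vx - x0))[:k]      # k nearest data points
    coeffs = np.polyfit(vx[idx], vy[idx], 2)   # local second-order fit
    return np.polyval(coeffs, x0)

vx = np.linspace(0.0, 4.0, 9)
vy = vx ** 2                                   # exactly quadratic test data
y0 = local_quadratic(vx, vy, 2.0, span=0.5)    # recovers 2.0^2 = 4.0
```

With span = 1.0 every point falls in the neighborhood, so the local fit degenerates into a single global quadratic, matching the text’s observation that large span makes loess equivalent to regress with n = 2.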

regress(vx, vy, n): A vector required by the interp function to find the nth-order polynomial that best fits the data vectors vx and vy. vx is an m-element vector containing the x coordinates. vy is an m-element vector containing the y coordinates corresponding to the m points specified in vx.

interp(vs, vx, vy, x): Returns the interpolated y value corresponding to x. The vector vs comes from evaluating loess or regress using the data vectors vx and vy.

Multivariate polynomial regression

The loess and regress functions discussed in the previous section are also useful when you have a set of measured z values corresponding to x and y values and you want to fit a polynomial surface through those z values.

The properties of these functions are described in the previous section. When using these functions to fit z values corresponding to two independent variables x and y, the meanings of the arguments must be generalized. Specifically:

• The argument vx, which was an m-element vector of x values, becomes Mxy, an array with m rows and 2 columns. Each row of Mxy contains an x value in the first column and the corresponding y value in the second column.

• The argument x for the interp function becomes a 2-element vector v whose elements are the x and y values at which you want to evaluate the polynomial surface representing the best fit to the data points in Mxy and vz.

You can add independent variables by simply adding columns to the Mxy array. You would then add a corresponding number of rows to the vector v that you pass to the interp function. The regress function can have as many independent variables as you want. However, regress will calculate more slowly and require more memory when the number of independent variables and the degree are greater than four. The loess function is restricted to at most four independent variables.
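Multivariate polynomial regression also reduces to a linear least-squares problem: build a design matrix with one column per polynomial term in x and y, then solve for the coefficients. A minimal sketch for a first-degree surface (the data here is made up, chosen to be exactly planar so the recovered coefficients are easy to check):

```python
import numpy as np

# Mxy: one row per data point, columns are the independent variables x and y
Mxy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                [1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
vz = 1.0 + 2.0 * Mxy[:, 0] + 3.0 * Mxy[:, 1]   # an exactly planar surface

# Design matrix for a first-degree polynomial in two variables: 1, x, y
A = np.column_stack([np.ones(len(Mxy)), Mxy[:, 0], Mxy[:, 1]])
coeffs, *_ = np.linalg.lstsq(A, vz, rcond=None)
# coeffs recovers [1, 2, 3]: the surface z = 1 + 2x + 3y
```

Adding an independent variable means adding a column to Mxy and the corresponding terms to the design matrix, mirroring how the Mathcad arguments generalize.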

Keep in mind that for regress, the number of data values, m, must satisfy

m > (n + k)! / (n! · k!)

where n is the number of independent variables (hence the number of columns in Mxy), k is the degree of the desired polynomial, and m is the number of data values (hence the number of rows in vz). For example, if you have five explanatory variables and a fourth degree polynomial, you will need more than 126 observations.
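The bound above is the binomial coefficient C(n + k, k), which counts the terms of a degree-k polynomial in n variables. A quick check of the worked example (`min_observations` is a hypothetical helper name):

```python
import math

def min_observations(n, k):
    """Lower bound on the number of data values m for regress with
    n independent variables and polynomial degree k: m must exceed
    (n + k)! / (n! * k!), i.e. C(n + k, k)."""
    return math.comb(n + k, k)

# Five independent variables, fourth-degree polynomial:
bound = min_observations(5, 4)   # 126, matching the example in the text
```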

Generalized regression

Unfortunately, not all data sets can be modeled by lines or polynomials. There are times when you need to model your data with a linear combination of arbitrary functions, none of which represent terms of a polynomial. For example, in a Fourier series you try to approximate data using a linear combination of complex exponentials. Or you may believe your data can be modeled by a weighted combination of Legendre polynomials, but you just don’t know what weights to assign.

The linfit function is designed to solve these kinds of problems. If you believe your data could be modeled by a linear combination of arbitrary functions

y = a0 · f0(x) + a1 · f1(x) + … + an · fn(x)

you can use linfit to evaluate the coefficients a0, a1, …, an.

Anything you can do with linfit you can also do, albeit less conveniently, with genfit. The difference between these two functions is the difference between solving a system of linear equations and solving a system of nonlinear equations. The former is easily done using the methods of linear algebra. The latter is far more difficult and generally must be solved by iteration. This explains why genfit needs a vector of guess values as an argument and linfit does not.
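The linear case is exactly why linfit needs no guess values: stacking one column per basis function gives an ordinary linear least-squares system. A sketch with hypothetical basis functions (a constant, sin, and cos), using data generated with known weights so the result is checkable:

```python
import numpy as np

# Basis functions for the linear combination (illustrative choices)
F = [lambda x: np.ones_like(x), np.sin, np.cos]

vx = np.linspace(0.0, 6.0, 40)
vy = 2.0 + 0.5 * np.sin(vx) - 1.5 * np.cos(vx)   # known weights 2, 0.5, -1.5

# One design-matrix column per basis function, then solve by least squares
A = np.column_stack([f(vx) for f in F])
weights, *_ = np.linalg.lstsq(A, vy, rcond=None)
```

Solving this system recovers the weights [2, 0.5, −1.5] directly, with no iteration and no initial guess, in contrast to the nonlinear case genfit handles.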

Figure 14-12 shows an example in which genfit is used to find the exponent that best fits a set of data.

genfit(vx, vy, vg, F): A vector containing the parameters that make a function f of x and n parameters u0, u1, … best approximate the data in vx and vy. F is a function that returns an n + 1 element vector containing f and its partial derivatives with respect to its n parameters. vg is an n-element vector of guess values for the n parameters.
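The iterative character of genfit can be sketched with a tiny Gauss-Newton loop for the one-parameter exponent problem of Figure 14-12. This is not Mathcad’s algorithm, just an illustration of why a guess value and the partial derivatives are needed; `fit_exponent` is a hypothetical helper name.

```python
import numpy as np

def F(x, u):
    """Model f(x, u) = exp(u * x) and its partial derivative with
    respect to u, mirroring the vector-valued F that genfit expects."""
    f = np.exp(u * x)
    return f, x * f   # f and df/du

def fit_exponent(vx, vy, guess, iters=50):
    """Minimal Gauss-Newton iteration for a one-parameter model: a
    sketch of why genfit needs guess values (nonlinear fits iterate)."""
    u = guess
    for _ in range(iters):
        f, df = F(vx, u)
        r = vy - f                               # current residuals
        u += np.sum(df * r) / np.sum(df * df)    # Gauss-Newton step
    return u

vx = np.linspace(0.0, 1.0, 20)
vy = np.exp(1.5 * vx)                # data generated with exponent 1.5
u = fit_exponent(vx, vy, guess=1.0)  # iterates from the guess toward 1.5
```

A poor guess can send such an iteration to the wrong local minimum, which is exactly the caveat that distinguishes genfit from linfit.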

Smoothing functions

The smoothing functions take a set of y values and return a new set of y values that is smoother than the original set. Unlike the regression and interpolation functions discussed earlier, smoothing results in a new set of y values, not a function that can be evaluated between the data points you specify. Thus, if you are interested in y values between the y values you specify, you should use a regression or interpolation function.

Whenever you use vectors in any of the functions described in this section, be sure that every element in the vector contains a data value. Since every element in a vector must have a value, Mathcad assigns 0 to any elements you have not explicitly assigned.

The medsmooth function is the most robust of the three smoothing functions, since it is least likely to be affected by spurious data points. This function uses a running median smoother, computes the residuals, smooths the residuals the same way, and adds these two smoothed vectors together. The details are as follows:

• Evaluation of medsmooth(vy, n) begins with the running median of the input vector vy; we’ll call this vy’. For a window of odd width n, the ith element is given by vy’[i] = median(vy[i − (n − 1)/2], …, vy[i + (n − 1)/2]).

• It then evaluates the residuals: vr = vy – vy’.

• The residual vector, vr, is smoothed using the same procedure described in the first step. This creates a smoothed residual vector, vr’.

• The medsmooth function returns the sum of these two smoothed vectors: medsmooth(vy, n) = vy’ + vr’.

Note that medsmooth will leave the first and last (n – 1)/2 points unchanged. In practice, the length of the smoothing window, n, should be small compared to the length of the data set.
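The four steps above can be sketched directly from the description (a minimal illustration, not Mathcad’s implementation):

```python
import numpy as np

def running_median(v, n):
    """Running median with odd window width n; the first and last
    (n - 1) / 2 points are left unchanged, as medsmooth does."""
    h = (n - 1) // 2
    out = v.astype(float).copy()
    for i in range(h, len(v) - h):
        out[i] = np.median(v[i - h:i + h + 1])
    return out

def medsmooth(vy, n):
    vy_s = running_median(vy, n)   # step 1: running median of the data
    vr = vy - vy_s                 # step 2: residuals
    vr_s = running_median(vr, n)   # step 3: residuals smoothed the same way
    return vy_s + vr_s             # step 4: sum of the two smoothed vectors

vy = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0])  # data with one spike
smoothed = medsmooth(vy, 3)
```

Running this on the spiky data removes the outlier entirely, which illustrates why a median-based smoother is so robust to spurious points.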

The ksmooth function in Mathcad Professional uses a Gaussian kernel to compute local weighted averages of the input vector vy. This smoother is most useful when your data lies along a band of relatively constant width. If your data lies scattered along a band whose width fluctuates considerably, you should use an adaptive smoother like supsmooth, also available in Mathcad Professional.

Each smoothed value is a weighted average of the nearby data points, with weights given by a Gaussian kernel, and b is a bandwidth which you supply to the ksmooth function. The bandwidth is usually set to a few times the spacing between data points on the x axis, depending on how big a window you want to use when smoothing.
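A Gaussian-kernel smoother can be sketched as below. This illustrates the idea only; Mathcad’s exact kernel scaling may differ, so the constants here are assumptions.

```python
import numpy as np

def ksmooth(vx, vy, b):
    """Gaussian-kernel weighted moving average: a sketch of the idea
    behind ksmooth (Mathcad's exact kernel scaling may differ)."""
    out = np.empty(len(vy))
    for i, x in enumerate(vx):
        w = np.exp(-0.5 * ((vx - x) / b) ** 2)   # Gaussian weights around x
        out[i] = np.sum(w * vy) / np.sum(w)      # local weighted average
    return out

vx = np.arange(8.0)
vy = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 0.0, 0.0, 0.0])
smoothed = ksmooth(vx, vy, b=1.0)
```

A larger b widens the effective window, averaging over more points and producing a smoother result, which is why the bandwidth is chosen relative to the data spacing.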

The supsmooth function uses a symmetric k-nearest-neighbor linear least-squares fitting procedure to make a series of line segments through your data. Unlike ksmooth, which uses a fixed bandwidth for all your data, supsmooth will adaptively choose different bandwidths for different portions of your data.

Posted on November 21, 2015 in Statistical Functions