Code covered by the BSD License  

Highlights from
polyfitn

4.875

4.9 | 8 ratings Rate this file 201 Downloads (last 30 days) File Size: 52.6 KB File ID: #34765

polyfitn

by John D'Errico

 

25 Jan 2012

Polynomial modeling in 1 or n dimensions

| Watch this File

File Information
Description

Polyfitn is an extension of polyfit, allowing the user to create models with more than one independent variable. It also allows the user to specify a general model, for example, a quadratic model, with constant and quadratic terms, but no linear term.

For example, to fit a polynomial model to points selected from a cosine curve, we will only need the even ordered terms.

x = -2:.1:2;
y = cos(x);
p = polyfitn(x,y,'constant x^2 x^4 x^6');
p.Coefficients
ans =
    [0.99996 -0.49968 0.041242 -0.0012079]

The coefficients won't be exact of course, as I used only a finite number of terms for what is essentially a truncated Taylor series, and I had only a finite amount of points to build the model from. The exact first 4 coefficients for a cosine series would have been:

>> [1 -1/2 1/24 -1/720]
ans =
            1 -0.5 0.041667 -0.0013889

so we got the expected result.

Of course, polyfitn works in higher dimensions, as it was this problem it was really designed to solve.

x = rand(100,1);
y = rand(100,1);
z = exp(x+y) + randn(100,1)/100;
p = polyfitn([x,y],z,3);

The result can be converted into a symbolic form to view the model more simply. Here I'll use my sympoly toolbox, but there is also a polyn2sym function provided.

polyn2sympoly(p)
ans =
    0.43896*X1^3 + 1.4919*X1^2*X2 + 0.041084*X1^2 + 1.4615*X1*X2^2 - 0.095977*X1*X2 + 1.2799*X1 + 0.56912*X2^3 - 0.15306*X2^2 + 1.361*X2 + 0.94819

And of course, parameter error estimates are generated for those who wish to determine the significance of the terms generated.

I've also provided tools for evaluating these models, and to differentiate a model.

A caveat - beware the use of high order polynomials to fit your data. Just because a low order model works, a high order model is not necessarily better. High order polynomials often suffer from severe ringing between the data points. Always plot your data. Think about the model you will be building. Then plot the resulting model. Use your eyes to validate the result, more so than simply looking at an r-squared coefficient (although I do return that parameter too.)

If you do find that a high order polynomial mode is necessary because your curve is simply too complicated, consider using a regression or smoothing spline model instead.

Acknowledgements

This file inspired Gapolyfitn and Poly Val2 D And Poly Fit2 D.

MATLAB release MATLAB 7.5 (R2007b)
Tags for This File  
Everyone's Tags
approximation, curve, curve fitting, fit, fitting, function, interpolation, linear regression, modeling, multidimensional, polyfit, polynomial(2), regression
Tags I've Applied
Add New Tags Please login to tag files.
Please login to add a comment or rating.
Comments and Ratings (17)
12 Jun 2013 Daniel Higginbottom  
31 May 2013 Eric H.

Dennis Wouters,
Thanks for providing this fix. It works!

19 Mar 2013 Dennis Wouters

John, nice work.
I encountered a minor glitch.
When specifying the modelterms as an array of exponents, the variable 'varlist' was not assigned a value.
I fixed this by inserting the following at line 160:
else varlist = repmat({''},1,p);

25 Feb 2013 John D'Errico

Despite my having written polyfitn, high order polynomials (especially in multiple dimensions) are rarely a good substitute for a nonlinear model that is chosen to have the proper behavior. That is, IF you have such a model. Those polynomials tend to do nasty things to you, especially in extrapolation. And unless you have sufficient data, then the model may behave strangely even inside your data. I call this the problem of interpolation.

Basically, you need to be very careful with high order polynomial fits.

By the way, nlinfit DOES handle multiple independent variables. You suggest it does not, but there is no constraint on that. The problem is choosing an intelligent model in multiple dimensions. This is EXTREMELY difficult in general. One good reason is that a functional form like w = f(x,y,z) is so difficult to visualize.

Personally, my preference has always been to move to a spline-like model for these problems. I am often unable to pull a model out of thin air, and I would prefer to avoid a high order polynomial. My own gridfit is one simple example, with 2 independent variables.

25 Feb 2013 Karan

Hello John,

Thank you for your inputs. I really appreciate the help. From what i have understood is that the data is not sufficient enough for using such a high order polynomial fit. I am actually looking at fitting the data to my own custom equation but haven't found a model for it yet- similar to nlin fit but with multiple independent variables. If you do have any suggestions please do let me know.

Thank you.

Karan

22 Feb 2013 John D'Errico

Hi Karan,

You've fallen into a common fallacy in regression modeling. It goes like this: "If n is good, then n+1 is better, and n+2 must be really wonderful."

The fallacy starts out easily. We try a linear model on our data. It works, all right, but not very well. After all, must problems are only pretty locally linear. So we try a quadratic model. It fits better. The lack of fit is reduced. But there are still problems in the model. It is inadequate to fit the shape of our data.

Since we did so well with the quadratic, try out a third order model. Yep, it does better still. Wow. Keep pushing things, until we have a 6th order model, or 10th order. A polynomial model is just like a Taylor series approximation, so more terms must give a better result. Eventually though, it all just turns to numerical crap before our eyes. So what happened?

Here you have 3 variables. A first order model has a constant term, as well as x, y, and z terms. So 4 terms in that model. You have 74 data points, and surely had no problem estimating the corresponding coefficients.

A quadratic model adds in 6 more terms, x^2, y^2, z^2, but also the cross product terms x*y, x*z, and y*z, so a total of 10 coefficients to estimate.

The curse of our fallacy is that as we move up in order, the number of terms grows faster and faster. At 3rd order, we have 20 terms to estimate, 35 at 4th order, 56 at 5th order, and 84 at 6th order. A 10th order model would have 286 terms.

How about your data? You have only 74 data points. (In my early days when I did some statistical consulting, my clients would always ask how much data they needed. My answer was almost always "More than you have.")

Every distinct data point is at most one additional piece of information. (Replicate points do not count here. All they do is reduce the uncertainty in your estimates. So replicates are good, but not quite as valuable as distinct points in some respects.)

Worse, new data points do not always provide new information. For example, two points determine a line. A third point on that line adds no new information towards the slope or intercept of that line. Again, like a replicate, all it does is reduce the uncertainty in your estimate of the coefficients.

So what happened when you tried to estimate a 6th order model with only 74 data points? You had too little information in your data to support such a large model. The message that comes back is from MATLAB, telling you that your matrix is numerically singular (rank deficient.)

Having said all this, things can get worse yet, making the problem yet more difficult. Some sets of data are worse than others. For example, if you add 1e6 to all of your independent variables, a polynomial of some given order will still be estimable mathematically (in theory), yet it may result in a matrix that is numerically singular. This is because the linear algebra is written using double precision floating point arithmetic. The numbers may simply lie over too wide of a dynamic range for the algebra to succeed. (What happens when you raise numbers that are on the order of 1e6 to successively higher powers?)

So when you have more coefficients to estimate than data points, you will always have a rank deficient matrix, but depending on the data itself, you may still have rank deficiency problems.

Some polynomial models will become more tractable to estimate if you center and scale the variables. Thus make each variable so it has a mean of zero, and lie roughly in the interval [-1,1]. See that raising numbers that are on the order of +/-1 to higher powers does NOT create a dynamic range issue. This will force you to transform your problem, but it often allows you to gain viable estimates for an otherwise intractable problem. (The use of orthogonal polynomials is another approach that often helps here.)

I hope this helps. Really, this is a subject that I've seen taught as a semester long course, so I have only touched the surface.

John

22 Feb 2013 Karan

Thank you so much for this. I am getting this warning-
Warning: Rank deficient, rank = 64, tol = 5.8216e-010.
I have used 3 independent variables, 74 data points and used degree of 6. Could you please help me understand what this means? Thank you.

28 Dec 2012 John Mahoney

John,

Great code as usual!

It seems to fail when given only one data point.

n = 1;
x = rand(n,2);
y = exp(sum(x,2)) + randn(n,1)/100;
p = polyfitn(x,y,3);

While failing might be desirable in some cases, it should really throw an error instead. And your code does supply an n-param output for n-1 inputs, so it would not be inconsistent to handle this case.

I think it is in the logic after line 120 - checking the size of the inputs and transposing accordingly.

Thanks!

24 Oct 2012 wei  
04 Oct 2012 Scott DeWolf  
27 Aug 2012 Criz Andre Tamatoa Thomas Almonte Torrez Manlangit  
22 Feb 2012 John D'Errico

John - There are two possible questions here. First, if you have a problem with what is known as multiple right hand sides, then yes it is very vectorizable. By this I mean a problem with the same sets of independent variables, but different sets of dependent variables. Then backslash is the way to go, allowing you to factorize the equations once for a huge time savings. You would need to build the design matrix of course, but this is not a difficult thing to do.

If however, you have multiple problems, each with different sets of independent variables, then vectorization would require you to build a sparse block diagonal design matrix. As such, you can gain a bit in this effort, but not as significant of a gain as the first case. You need to build the design matrix as a sparse matrix. The trick here is to supply the blocks to blkdiag as individually sparse matrices. Thus, create the individual (sparse) design matrices in a cell array, then pass it to blkdiag as a comma separated list. Can you do this efficiently enough to make the backslash solve more efficient than a loop around the calls? Yes, but as I said, it will take some effort.

You can probably glean from my response that it will not be a trivial action to vectorize the result from this code, but that you could do things more efficiently with some time invested.

22 Feb 2012 John

Great set of functions!

Can you suggest any ways to vectorize their use? I have a scalar function that is dependent on its location in space f(x,y,z), and I calculate this function surrounding N bodies. Currently, I run a loop for each body, but is there a way to modify the code so that the backslash operator works across all of them at once, returning N sets of coefficients?

08 Feb 2012 John D'Errico

This should do it...

Look for these two lines below:

[Q,R,E] = qr(M,0);

polymodel.Coefficients(E) = R\(Q'*depvar);

Assume a column vector of weights as W, one for each data point. Change the above lines to read:

[Q,R,E] = qr(bsxfun(@times,W(:),M),0);

polymodel.Coefficients(E) = R\(Q'*(W(:).*depvar));

This should give you weighted estimates of the parameters. I think the statistics on the parameters will work too, as well as R^2, etc. Of course, you will need to pass in the weight vector too as an argument.

The above change assumes you are using a version that has bsxfun available. Its been around for a while so that should not be a problem. Even if not, a call to repmat will replace it.

08 Feb 2012 Slaven

Great function! Thank you. Would there be an easy way to convert it to do weighted least squares. I was looking to try to do it myself, perhaps you have some insights or warnings?

01 Feb 2012 John D'Errico

A linear regression (which is polyfitn) uses backslash at its core. For example, suppose one wished to use polyfit to estimate a quadratic model in one variable. The model would be

y = a1*x^2 + a2*x + a3 (+noise)

polyfit is able to estimate the three unknown coefficients [a1,a2,a3] by solving a linear (but overdetermined) system of equations in a way that minimizes the sum of squares of the residuals. Again, backslash is the engine underneath here.

Now, suppose one wished to estimate a more complex model, with two independent variables, z(x,y). I'll write it as a full two variable quadratic model, so there will be six unknown coefficients.

z = a1*x^2 + a2*x*y + a3*y^2 + a4*x + a5*y + a6 + noise

As long as you have sufficient data to estimate the six unknown coefficients, the problem is essentially the same. See that they enter the problem linearly in the same way as for the one variable problem we looked at before. Backslash again can solve the overdetermined problem to estimate the coefficients.

In fact, both polyfit and polyfitn use a QR factorization of the design matrix to achieve their goals. This is because they are both designed to also generate covariance matrix estimates on the parameters. A QR factorization is the easiest way to serve both purposes, but its all just linear algebra.

01 Feb 2012 Mr D

Thanks..your function solve my problem, but honestly I'm curious about the concept of linear regression in 2d surface. Do we have to do polyfit (1st order) on each row to get the best fit plane of the surface?

Thanks.

Contact us