Code covered by the BSD License  


5.0 | 25 ratings | 271 Downloads (last 30 days) | File Size: 51.3 KB | File ID: #34765

polyfitn

by John D'Errico

25 Jan 2012 (Updated 29 Apr 2014)

Polynomial modeling in 1 or n dimensions


File Information
Description

Polyfitn is an extension of polyfit, allowing the user to create models with more than one independent variable. It also allows the user to specify a general model, for example a quadratic model with constant and quadratic terms but no linear term.
For example, to fit a polynomial model to points selected from a cosine curve, we need only the even order terms.

x = -2:.1:2;
y = cos(x);
p = polyfitn(x,y,'constant x^2 x^4 x^6');
p.Coefficients
ans =
    [0.99996 -0.49968 0.041242 -0.0012079]

The coefficients won't be exact of course, as I used only a finite number of terms for what is essentially a truncated Taylor series, and I had only a finite number of points to build the model from. The exact first 4 coefficients for a cosine series would have been:

>> [1 -1/2 1/24 -1/720]
ans =
            1 -0.5 0.041667 -0.0013889

so we got the expected result.

Of course, polyfitn works in higher dimensions too, since that is the problem it was really designed to solve.

x = rand(100,1);
y = rand(100,1);
z = exp(x+y) + randn(100,1)/100;
p = polyfitn([x,y],z,3);

The result can be converted into a symbolic form to view the model more simply. Here I'll use my sympoly toolbox, but there is also a polyn2sym function provided.

polyn2sympoly(p)
ans =
    0.43896*X1^3 + 1.4919*X1^2*X2 + 0.041084*X1^2 + 1.4615*X1*X2^2 - 0.095977*X1*X2 + 1.2799*X1 + 0.56912*X2^3 - 0.15306*X2^2 + 1.361*X2 + 0.94819

And of course, parameter error estimates are generated for those who wish to determine the significance of the individual terms.

I've also provided tools for evaluating these models and for differentiating a model.
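For instance, here is a minimal sketch of evaluating the surface model fit above with polyvaln (the evaluation grid is only illustrative):

zhat = polyvaln(p,[x,y]);        % evaluate the cubic model at the data points
resid = z - zhat;                % residuals of that fit
[xg,yg] = meshgrid(0:0.1:1);
zg = polyvaln(p,[xg(:),yg(:)]);  % evaluate on a grid for plotting
surf(xg,yg,reshape(zg,size(xg)))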

A caveat: beware the use of high order polynomials to fit your data. Just because a low order model works, it does not follow that a higher order model is better. High order polynomials often suffer from severe ringing between the data points. Always plot your data. Think about the model you will be building. Then plot the resulting model. Use your eyes to validate the result, more so than simply looking at an r-squared coefficient (although I do return that parameter too).

If you do find that a high order polynomial model is necessary because your curve is simply too complicated, consider using a regression or smoothing spline model instead.

Acknowledgements

This file inspired gapolyfitn, polyVal2D, and polyFit2D.

MATLAB release MATLAB 7.5 (R2007b)
Comments and Ratings (53)
03 Sep 2014 Lukas Theisen  
12 Aug 2014 JinSun

Incredible. Thank you for your contribution!

10 Jul 2014 akshay

works great and should be included in standard package

28 May 2014 Christoph  
16 May 2014 Muhammad Irfan Zafar

Hi John,

I have used 'polyfitn' to find polynomials for varying parameters of a nonlinear system based on three input variables. Inserting these polynomials in three equations, I have then tried to 'fsolve' them, but I couldn't converge to one solution by inserting polynomials of different orders.

I have been going through your file exchange list to look for some other related code. I was wondering if I should try to use some spline method instead of 'polyfitn', but I am not sure how I can insert it in the equations and solve them. Also, is there any other method to solve three nonlinear equations simultaneously?

Thanks

04 Apr 2014 Johan  
02 Apr 2014 John D'Errico

I'm sorry, but I can only assume that Gautam, like Mitch, is using a quite old release of MATLAB.

If you really want to resolve the problem, you would provide sufficient information for me to know if this is the case. Since a test case as you describe works perfectly, as I show here, my suggestion is to use a reasonably recent release, and since Mitch said that he was using a 7 year old release and got that same error, I assume this is the case.

If you are indeed using a recent release, and are still having a problem, then I can only assume that you are either using it improperly, or that you have something else installed that is causing the problem. For example, when I perform this test, it runs perfectly, but I don't have a 7 year old MATLAB release, nor can I even run something that old.

indvar = rand(12,2);
depvar = rand(12,1);
m = [0 0; 1 0; 0 1; 2 0; 1 1; 0 2; 3 0; 2 1; 1 2; 0 3];
p = polyfitn(indvar,depvar,m)
p =

ModelTerms: [10x2 double]
Coefficients: [-0.0610 -4.6966 5.5355 27.4400 -24.9658 -0.4228 -32.4825 42.3187 -18.4250 5.9917]
ParameterVar: [12.5177 249.5518 129.5089 850.2142 511.0641 282.0906 504.6530 953.0042 477.2848 209.7347]
ParameterStd: [3.5380 15.7972 11.3802 29.1584 22.6067 16.7956 22.4645 30.8708 21.8468 14.4822]
DoF: 2
p: [0.9878 0.7943 0.6748 0.4460 0.3845 0.9822 0.2851 0.3040 0.4878 0.7192]
R2: 0.6497
AdjustedR2: -0.9265
RMSE: 0.1327
VarNames: {'' ''}

Without useful information from you, I cannot guess what is the problem. The information I refer to in this context would involve telling me what MATLAB version you have. It would involve telling me the complete error message, not just a fragment of that error message.

By the way, trying to fit a data set with only 12 data points with a model that has 10 terms will yield a virtually meaningless result. I would call that a case of MASSIVELY overfitting your data.

01 Apr 2014 Gautam

Mr. D'Errico, Thanks very much for your submission. It looks quite helpful, except that when I give a 12 x 2 matrix as the independent variable matrix and a 12 x 1 column matrix for the dependent variable, it displays the following error:
Undefined function or variable "varlist".

Error in polyfitn (line 233)
polymodel.VarNames = varlist;

The model I used is this:
m = [0 0; 1 0; 0 1; 2 0; 1 1; 0 2; 3 0; 2 1; 1 2; 0 3];

Any advice on how to get your function to work for me would be very helpful and greatly appreciated. Thank you very much.
Gautam.

01 Apr 2014 John D'Errico

Mitch,

It is not relevant that you think it was properly defined in your test. I simply don't know what it was defined as, because you gave me everything else but that variable. So unless you can show me a constructible example that fails, then I have no idea what you are doing. That I used a different variable name is completely irrelevant. I could as easily have named the variables I passed in as Fred and Barney and it would have run.

I also don't know if there is an issue with that old of a MATLAB release. The site tells me I wrote the tool using R2007b, so it SHOULD work for you, but it is also conceivable that one of the mods made to it over the years is now making it fail in an old release. This is something I cannot test since I cannot use that old of a MATLAB release.

So my first suggestion is to run the test I gave. Verify that it does run properly. If it does not run, then you need to see what is different about the input arguments in the test you have. OR you can send me the data you have if you don't wish to post it online.

If the issue is that your old release is indeed the problem, then you will need to learn to solve the problem using backslash (not that difficult a task) or perhaps get a copy of the stats toolbox.

31 Mar 2014 Mitch Martelli

Hi John,
Thanks for your quick reply.
I'm sorry for the missing data in my question, but my variable called err_func is your outpar and it was defined.
I checked all the variable dimensions and they are correct.
Then I tried to copy and paste your example and the same error appears:

??? Undefined function or variable "varlist".

Error in ==> polyfitn at 232
polymodel.VarNames = varlist;

Could it be a problem related to my MATLAB version (R2007b)?

Thanks in advance
Mitch

28 Mar 2014 John D'Errico

Mitch,

I don't know what you are doing. My initial guess is you tried to "run" the function, from the editor, or something like that. You don't run functions like that, although many people try to do so.

I do see that you have not defined the variable err_func in your example though, so I cannot really even guess what is your error.

inpar = rand(20,2);
outpar = rand(20,1);
model_exp= [1 0;2 0;0 1;1 1;0 2];
p = polyfitn(inpar,outpar,model_exp);

This test case runs nicely.

28 Mar 2014 Mitch Martelli

Hi John,
Thanks for your great submission.
I tried to use it on my own data, but I'm not able to set the modelterms argument correctly:

i.e.

in_par=[zeta theta];
model_exp= [1 0;2 0;0 1;1 1;0 2];
p = polyfitn(in_par,err_func,model_exp)

after running it, the following error appears:

??? Undefined function or variable "varlist".

Error in ==> polyfitn at 232
polymodel.VarNames = varlist;

Error in ==> launch_regr at 14
p = polyfitn(in_par,err_func,model_exp)

Please can you help me?

20 Feb 2014 francesco  
21 Nov 2013 John D'Errico

I suppose it still begs the question, does polyfitn need to have the ability to fit multiple right hand sides, i.e., the same model, but for multiple dependent variables?

Personally, I've found this needed only somewhat rarely, and when we did use such a capability in my previous incarnation supporting a large group of color scientists, it was really only useful for building monitor models. Since then, monitors have become more complex, thus less linear. I would conjecture that the newer monitors out there are better handled using more sophisticated models. This is certainly true for printers, etc., which are best handled using 3-d lookup tables, fit by a tool like my own gridfit.

But the fact is, I'm often surprised to find others use these tools in ways I did not expect.

So if I see at least a few people requesting this capability, I would consider adding it. Until then, my preference is for simplicity of code in this respect. (Hey, I am supposed to be retired!)

21 Nov 2013 Mark Mikofski

I completely agree with John; I threw out that suggestion flippantly, without thinking it all the way through. Sorry if I led you astray. Thanks John for your sound advice and superb mathematical tools - they are invaluable!

21 Nov 2013 John D'Errico

I'm sorry, but I completely, wholeheartedly disagree with Mark's advice to Anderson.

Systems with multiple output variables require no more than a simple loop around polyfitn. The cost of that loop is simply not significant enough to increase the complexity of the code and the complexity of the interface.

To suggest instead that one use an iterative scheme like Gauss-Newton, lsqnonlin, or lsqcurvefit is an absurd approach. You will still need to loop, OR you will need to generate a large problem with essentially a block diagonal Jacobian matrix. All of this to avoid a simple loop? Or you will write a loop, just to avoid a loop? Note that in Mark's suggestions, you will need to write a bit of code just to evaluate the model. And if you use a tool like lsqnonlin or lsqcurvefit, then you either need to provide the Jacobian, or you will need to let the tool call your objective function multiple times just to compute finite differences to generate that Jacobian.

Don't throw an iterative tool at a problem to avoid a simple loop, on a problem that is truly solvable by basic linear algebra. See that even though the iterative tools will converge quickly, they will still take extra iterations until they effectively realize that they have indeed converged. This will require an extra Jacobian evaluation.
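A minimal sketch of that simple loop around polyfitn (the variable names and model order here are only illustrative):

X = rand(50,2);                       % common independent variables
Z = rand(50,3);                       % three dependent variables, one per column
models = cell(1,size(Z,2));
for k = 1:size(Z,2)
  models{k} = polyfitn(X,Z(:,k),3);   % same model, one fit per output variable
end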

20 Nov 2013 Mark Mikofski

@Anderson, you could use newtonraphson (http://www.mathworks.com/matlabcentral/fileexchange/43097-newton-raphson-solver) or, if you have the Optimization Toolbox, either lsqcurvefit or lsqnonlin (http://www.mathworks.com/help/optim/nonlinear-least-squares-curve-fitting.html)

20 Nov 2013 Anderson

This tool looks very useful. But do I understand correctly that it will not fit systems with more than one output variable? If so, does anyone know of a tool that does polynomial fitting for systems with more than one output variable?

30 Oct 2013 John D'Errico

Eran - It might be interesting to some, but I think it will encourage people to fit models that are too high of an order, something that I am constantly forced to caution people not to do.

Since your idea would require adding more options to the code, it would make the code more complex. As well, using the tool would be less simple, since if many models are generated and returned, then tools like polyvaln and polyn2sym would not work directly, or would also need to be modified.

As far as a significant gain in efficiency is concerned, remember that whenever there are more options in a code, it becomes less efficient for the basic user. Complex code is also more likely to have unintended bugs.

So I would not put in this enhancement myself. Could you do so yourself? Well, yes, you could in theory. To be most efficient, one would need to wrap a loop around the modeling process, probably using the qrupdate function to add or remove terms from the model. I'm not sure that is worth the effort compared to a simple loop around the entire code, just calling polyfitn repeatedly, unless you were using the tool extensively.
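For reference, a minimal sketch of that simple loop (the data, the model string, and the minimum window size are only illustrative):

x = (0:100)';
y = cos(x/20) + randn(size(x))/50;
kmin = 10;                           % smallest k with enough data for the model
fits = cell(1,numel(x));
for k = kmin:numel(x)
  fits{k} = polyfitn(x(1:k),y(1:k),'constant x x^2');   % fit to the first k points
end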

27 Oct 2013 Eran

Hi again,

Thank you so much for your last answer!
Of course it was a dumb question :)

I have another question that might be a bit more tricky -
let's say that I have a set of n points.
I want to check what the polynomial fit is for every k<n (of course k is big enough to hold enough data for the fit), meaning I want to have (almost) n polynomial fittings.
Of course I can do a "for" loop and get my answer... but that is very inefficient.

Is there any way to modify your code slightly so that this would work?

I think that the main problem is in the part where you define the "design matrix".

Thanks again for the quick help!

23 Oct 2013 John D'Errico

As a continuation, let me explain it like this, because this is a common problem in regression.

When you create a dataset where y = x/2, your data does not fill a region of the (x,y) plane. Instead, it lives entirely on the line y = x/2. In effect, x and y are completely dependent on each other; given x, we know the value of y.

The problem is, we have no information about what happens for z when your points fall off that line, so no estimation of the coefficients of a full model can be satisfactory.

To help convince you of what has happened, let's start with the original function z that you created.

z=4+x+x.^2+x.*y;

Substitute the known relation for y into z, to get it as a function only of x. Then we would have:

z = 4 + x + 1.5*x^2

See that we can model z as a function only of x.

out=polyfitn(x,z,2)
out =
ModelTerms: [3x1 double]
Coefficients: [1.5 1 4]
ParameterVar: [1.5362e-32 1.6407e-28 7.6808e-26]
ParameterStd: [1.2394e-16 1.2809e-14 2.7714e-13]
R2: 1
AdjustedR2: 1
RMSE: 9.327e-13
VarNames: {'X1'}

No warning was made here, and the fit was essentially perfect, yielding the values for the coefficients we would have expected. I could as easily have done the fit for z as a function of y.

23 Oct 2013 John D'Errico

Hi Eran,

But think about what you are doing.

At first glance I made the rapid assessment that x was moderately large, and thus this was just a scaling issue. Then I looked at your problem. This is the line that causes the problem!

y = x/2

Your model is quadratic in x and y, in the sense that it has terms x.^2, x.*y, and y.^2. Of course, you hope that polyfitn will tell you that the coefficient of x.^2 was zero, as it implicitly was when you created the data.

But you created y as a simple multiple of x. You should see that x.*y == 2*y.^2 == x.^2/2.

There is NO linear algebra scheme in the world that can intelligently recover the coefficients of these terms, and know which set of coefficients you started with. The solution simply is not unique. This is why you get a warning about singularity. You will get an answer, but not a unique one.

The story here is garbage in, garbage out. Your data does not adequately support estimating the model you wish to fit.

23 Oct 2013 Eran

Hi John,

Thank you very much for this much-needed function. It's a shame that MATLAB doesn't support this...

I tried using this function, and encountered a small problem.
I used:
x=0:100;
y=x/2;
z=4+x+x.^2+x.*y;
in=[x' y'];

out=polyfitn(in,z,2);

and what I get is a warning about the singularity of the matrix "R" produced by the function "qr". That of course happens when I use "\", which practically uses the inverse of "R".
This happens twice in the lines after using the function "qr".

Do you have any advice on how to solve this?

I'm asking this question of course, because I find that the coefficients produced by "polyfitn" are very different from those you would expect...

Thanks you very much!

23 Oct 2013 John D'Errico

Hi Alex,

I think you wanted ezsurf, not ezplot. For example, this worked for me:

xy = rand(100,2);
z = 3 + 2*sum(xy,2) + sum(xy.^2,2) + 4*prod(xy,2) + randn(100,1)/10;
model = polyfitn(xy,z,2);
fun = polyn2sym(model);
ezsurf(fun)

22 Oct 2013 ECOSat Team

Hi John,

These functions are wonderful!

I'm using it to model the intersect of two perpendicular cones, which provides me with a parabola in 3D space. I was wondering, is there an inbuilt way to plot fitted curves in their proper dimensions? I tried using your polyn2sym tool and then using ezplot, but it fails to properly show the curve in 3D.

Thanks,
Alex
Ecosat Team

02 Oct 2013 John D'Errico

Hi Raymond,

Those parameters are derived to apply to models with noise in them. For example, suppose I was given 10 measurements of a given process at a single point. The least squares best estimate of the constant model, thus y=c, is given by the mean of the samples. At the same time, we could compute an estimate of our uncertainty in that estimate of the mean, using it to compute approximate confidence intervals around the parameters. We can do similar things with more complex regression models, deriving estimates of both the parameters in the model and our uncertainty around those parameters based on the perceived noise in the data. The noise is perceived to be essentially what is left over after the model is estimated. Again, this will give us estimates of standard deviations that we can put on the parameters themselves. (Multiply the provided standard deviation by a Student's t-statistic for an approximate confidence interval at your required level of confidence. Using approximately +/- 2 or 3 sigma is a common thing to do here to get the confidence interval.)

In fact, most of the time when you have more data points than parameters to fit a model, we can also form estimates of our uncertainty in those parameters. The more data you have, the better of course, as it allows us better estimates of the parameters. A nice thing is we don't even need pure replicates to form that estimate of our uncertainty in the parameters. The uncertainty is based on what the model cannot predict in the data.

Traditionally for polynomial models, these standard deviation estimates are one way to decide to drop terms in a model. If the confidence interval includes zero as a possibility, then we might conclude that we cannot distinguish the provided estimate of a given parameter from zero.

At least this is true in a traditional model where we are trying to fit the correct model to our data, and all that obstructs this process is the presence of random (Gaussian) noise. The above approximations also assume the parameters are independent, so depending on the data, x and x^2 will usually not be orthogonal variables. That lack of orthogonality introduces correlation in the regression estimates, which is unfortunately ignored in the approximation used for the reported parameter standard deviations.

There are additional problems with this methodology when there is lack of fit in the model. Here we are trying to fit one model to the data, but in reality, the process is not truly based on that model. Our choice of model may be only an approximation to reality, or it may simply be wrong. Suppose you have a curved surface, and you attempt to estimate a linear model. In this case, there will be lack of fit, because the data is not well represented by the model.

Of course, the linear algebra does not understand that lack of fit is not just pure noise. Mathematics is a bit dumb in this respect. It cannot see a pattern in the residuals as we can. It simply believes the residuals are pure noise, and uses that noise to produce estimates of our uncertainty in the parameters using standard statistical techniques.

In the end, what I'm trying to say is these standard deviations are rough measures of how well the software thinks the parameters are estimated, based on the data you provided, based on several approximations, and based on the validity of your model for this data.
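A minimal sketch of that approximate confidence interval (this assumes tinv from the Statistics Toolbox, and the DoF and ParameterStd fields shown in the struct displays elsewhere on this page; the data is only illustrative):

p = polyfitn(rand(30,1),rand(30,1),2);       % any polyfitn result
tcrit = tinv(0.975,p.DoF);                   % two-sided 95% t-statistic
lo = p.Coefficients - tcrit*p.ParameterStd;  % approximate lower confidence bounds
hi = p.Coefficients + tcrit*p.ParameterStd;  % approximate upper confidence bounds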

01 Oct 2013 Raymond Tsang

Hi John,

This is a great tool! I'm using it to fit a quadratic function to a magnetic field over a 3D lattice.

I'm interested in the uncertainty in the fit parameters. Is it correct to interpret "ParameterStd" as this uncertainty? (If not, what is?) And it is a little puzzling to me as to how "ParameterStd" is determined given that I have not provided the measurement uncertainty to the code. Can you explain a little bit about how this "ParameterStd" is computed?

Thank you!
Raymond

22 Sep 2013 John D'Errico

yonatan - Bound constraints on the coefficients are not one of the options. You can instead use a tool like lsqnonneg for basic non-negativity constraints on the coefficients. For example, if x and y are column vectors then to get non-negative coefficients for a and b as a vector, you would do this:

ab = lsqnonneg([x, x.^2],y);

For more complicated constraints on the coefficients, you would generally want a tool like lsqlin from the optimization toolbox. For example, if you wanted to constrain only one of the coefficients or you wanted constraints on the sum of the coefficients, then lsqlin would be useful. Lacking that toolbox, there are other tools like my own lee from the file exchange that can do some things for you along these lines, and with some effort you can always convert many general problems into one that lsqnonneg can handle.
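A complete minimal sketch of the lsqnonneg approach for f(x) = a*x + b*x^2 (the data here is only illustrative):

x = (0:0.1:2)';
y = 1.5*x + 0.7*x.^2 + randn(size(x))/20;  % simulated data with positive a and b
ab = lsqnonneg([x, x.^2], y);              % ab(1) = a, ab(2) = b, both constrained >= 0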

22 Sep 2013 yonatan

This is great!
Is there any way to constrain the function to have positive coefficients only?
For example, f(x) = a*x + b*x^2 for positive a and b?
Thanks in advance.

04 Sep 2013 Andrew

This is great!

01 Sep 2013 Syeda  
02 Aug 2013 Chenyun Pan  
15 Jul 2013 John D'Errico

Martin - You would need to list the terms in the model. E.g., to specify a model that has total order no higher than 3, but no cubic terms in y, you might specify the modelterms argument as

modelterms = [3 0;
2 1;
2 0;
1 2;
1 1;
1 0;
0 2;
0 1;
0 0]

Alternatively, you may do it by giving a list of terms as a string for that same model:

modelterms = 'x^3, x^2*y, x^2, x*y^2, x*y, x, y^2, y, constant'

15 Jul 2013 Martin

Hi John, first of all - great work!
My question: is it possible to use two different orders for my polynomial? I managed to fit my function (3D) with a polynomial of order 6 (n=6), but it would be great to fit my function with a polynomial of order e.g. 5 in one direction, and 6 in the other one (m and n).
z(x,y) = a(n)*x^n + b(m)*y^m + a(n-1)*x^(n-1) + b(m-1)*y^(m-1)...

11 Jul 2013 John D'Errico

Jeff - The reason why polyfitn returns a struct is because the model may be somewhat complicated, with general terms specified by the user in several variables. A 1-d polynomial is far easier to describe as a list of coefficients, since the 1-d polyfit always assumes a very specific model form.

However, it is no problem at all to extract the coefficients from the struct as returned. For example, consider this simple model z(x,y).

% Some simple data...
x = rand(10,1);
y = rand(10,1);
z = x + 2*y + 3 + randn(size(x))/10;

% Generate a model
mdl = polyfitn([x,y],z,{'x','y','constant'})
mdl =
ModelTerms: [3x2 double]
Coefficients: [0.9706 2.0162 3.0302]
ParameterVar: [0.0214 0.0266 0.0093]
ParameterStd: [0.1462 0.1631 0.0966]
R2: 0.9787
AdjustedR2: 0.9726
RMSE: 0.1180
VarNames: {'x' 'y'}

Of course, the coefficients are just numbers in a field of the struct. So we can extract the coefficients as a vector:

format long
C = mdl.Coefficients
C =
0.970599545975740 2.016179608143575 3.030225230230983

They are not exactly what I put in, but that random noise on the data corrupted things a bit.

Or we could even have converted the struct to a symbolic polynomial, either a sym, or my own sympoly form.

polyn2sympoly(mdl)
ans =
0.97059954597574*x + 2.01617960814358*y + 3.03022523023098

And we can evaluate the polynomial on our own without need of polyvaln, as long as you know what the model is. (And since the exponents for each term are stored in the struct, that is easy too.)

C(1)*.5 + C(2)*.25 + C(3)
ans =
4.019569905254746

For comparison:
polyvaln(mdl,[.5,.25])
ans =
4.019569905254746

11 Jul 2013 Jeff

John - wonderful work. Question: I think I've read that the coefficients solved for by polyfitn can only be interpreted by polyvaln. Is that correct? If I want to use polyfitn to find coefficients for a data fitting exercise, and then apply them to future datasets outside of MATLAB (and thus without polyvaln), is that not possible? Or did I misunderstand something?

12 Jun 2013 Daniel Higginbottom  
31 May 2013 Eric H.

Dennis Wouters,
Thanks for providing this fix. It works!

19 Mar 2013 Dennis Wouters

John, nice work.
I encountered a minor glitch.
When specifying the modelterms as an array of exponents, the variable 'varlist' was not assigned a value.
I fixed this by inserting the following at line 160:
else varlist = repmat({''},1,p);

25 Feb 2013 John D'Errico

Despite my having written polyfitn, high order polynomials (especially in multiple dimensions) are rarely a good substitute for a nonlinear model that is chosen to have the proper behavior. That is, IF you have such a model. Those polynomials tend to do nasty things to you, especially in extrapolation. And unless you have sufficient data, then the model may behave strangely even inside your data. I call this the problem of interpolation.

Basically, you need to be very careful with high order polynomial fits.

By the way, nlinfit DOES handle multiple independent variables. You suggest it does not, but there is no constraint on that. The problem is choosing an intelligent model in multiple dimensions. This is EXTREMELY difficult in general. One good reason is that a functional form like w = f(x,y,z) is so difficult to visualize.

Personally, my preference has always been to move to a spline-like model for these problems. I am often unable to pull a model out of thin air, and I would prefer to avoid a high order polynomial. My own gridfit is one simple example, with 2 independent variables.

25 Feb 2013 Karan

Hello John,

Thank you for your inputs. I really appreciate the help. From what I have understood, the data is not sufficient for using such a high order polynomial fit. I am actually looking at fitting the data to my own custom equation but haven't found a model for it yet - similar to nlinfit but with multiple independent variables. If you do have any suggestions please do let me know.

Thank you.

Karan

22 Feb 2013 John D'Errico

Hi Karan,

You've fallen into a common fallacy in regression modeling. It goes like this: "If n is good, then n+1 is better, and n+2 must be really wonderful."

The fallacy starts out easily. We try a linear model on our data. It works, all right, but not very well. After all, most problems are only locally pretty linear. So we try a quadratic model. It fits better. The lack of fit is reduced. But there are still problems in the model. It is inadequate to fit the shape of our data.

Since we did so well with the quadratic, try out a third order model. Yep, it does better still. Wow. Keep pushing things, until we have a 6th order model, or 10th order. A polynomial model is just like a Taylor series approximation, so more terms must give a better result. Eventually though, it all just turns to numerical crap before our eyes. So what happened?

Here you have 3 variables. A first order model has a constant term, as well as x, y, and z terms. So 4 terms in that model. You have 74 data points, and surely had no problem estimating the corresponding coefficients.

A quadratic model adds in 6 more terms, x^2, y^2, z^2, but also the cross product terms x*y, x*z, and y*z, so a total of 10 coefficients to estimate.

The curse of our fallacy is that as we move up in order, the number of terms grows faster and faster. At 3rd order, we have 20 terms to estimate, 35 at 4th order, 56 at 5th order, and 84 at 6th order. A 10th order model would have 286 terms.
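Those counts follow from a simple formula: the number of terms of total degree at most d in k variables is nchoosek(d+k,k). A minimal check in MATLAB:

k = 3;                                   % three independent variables
arrayfun(@(d) nchoosek(d+k,k), 1:6)      % returns 4 10 20 35 56 84
nchoosek(10+k,k)                         % 286 terms at 10th order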

How about your data? You have only 74 data points. (In my early days when I did some statistical consulting, my clients would always ask how much data they needed. My answer was almost always "More than you have.")

Every distinct data point is at most one additional piece of information. (Replicate points do not count here. All they do is reduce the uncertainty in your estimates. So replicates are good, but not quite as valuable as distinct points in some respects.)

Worse, new data points do not always provide new information. For example, two points determine a line. A third point on that line adds no new information towards the slope or intercept of that line. Again, like a replicate, all it does is reduce the uncertainty in your estimate of the coefficients.

So what happened when you tried to estimate a 6th order model with only 74 data points? You had too little information in your data to support such a large model. The message that comes back is from MATLAB, telling you that your matrix is numerically singular (rank deficient.)

Having said all this, things can get worse yet, making the problem yet more difficult. Some sets of data are worse than others. For example, if you add 1e6 to all of your independent variables, a polynomial of some given order will still be estimable mathematically (in theory), yet it may result in a matrix that is numerically singular. This is because the linear algebra is written using double precision floating point arithmetic. The numbers may simply lie over too wide of a dynamic range for the algebra to succeed. (What happens when you raise numbers that are on the order of 1e6 to successively higher powers?)

So when you have more coefficients to estimate than data points, you will always have a rank deficient matrix, but even with enough data points, depending on the data itself, you may still have rank deficiency problems.

Some polynomial models will become more tractable to estimate if you center and scale the variables. Thus make each variable so it has a mean of zero, and lie roughly in the interval [-1,1]. See that raising numbers that are on the order of +/-1 to higher powers does NOT create a dynamic range issue. This will force you to transform your problem, but it often allows you to gain viable estimates for an otherwise intractable problem. (The use of orthogonal polynomials is another approach that often helps here.)
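A minimal sketch of that centering and scaling step (the variable names are illustrative; bsxfun keeps it compatible with older releases):

X = 1e6 + rand(74,3);                            % poorly scaled raw data
mu = mean(X);
halfspan = (max(X) - min(X))/2;
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mu), halfspan);  % centered, roughly in [-1,1]
y = rand(74,1);
p = polyfitn(Xs,y,3);                            % fit in the scaled variables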

I hope this helps. Really, this is a subject that I've seen taught as a semester long course, so I have only touched the surface.

John

22 Feb 2013 Karan

Thank you so much for this. I am getting this warning-
Warning: Rank deficient, rank = 64, tol = 5.8216e-010.
I have used 3 independent variables, 74 data points and a degree of 6. Could you please help me understand what this means? Thank you.

28 Dec 2012 John Mahoney

John,

Great code as usual!

It seems to fail when given only one data point.

n = 1;
x = rand(n,2);
y = exp(sum(x,2)) + randn(n,1)/100;
p = polyfitn(x,y,3);

While failing might be desirable in some cases, it should really throw an error instead. And your code does supply an n-param output for n-1 inputs, so it would not be inconsistent to handle this case.

I think it is in the logic after line 120 - checking the size of the inputs and transposing accordingly.

Thanks!

24 Oct 2012 wei  
04 Oct 2012 Scott DeWolf  
27 Aug 2012 Criz Andre Tamatoa Thomas Almonte Torrez Manlangit  
22 Feb 2012 John D'Errico

John - There are two possible questions here. First, if you have a problem with what is known as multiple right hand sides, then yes it is very vectorizable. By this I mean a problem with the same sets of independent variables, but different sets of dependent variables. Then backslash is the way to go, allowing you to factorize the equations once for a huge time savings. You would need to build the design matrix of course, but this is not a difficult thing to do.

If however, you have multiple problems, each with different sets of independent variables, then vectorization would require you to build a sparse block diagonal design matrix. As such, you can gain a bit in this effort, but not as significant a gain as in the first case. You need to build the design matrix as a sparse matrix. The trick here is to supply the blocks to blkdiag as individually sparse matrices. Thus, create the individual (sparse) design matrices in a cell array, then pass it to blkdiag as a comma separated list. Can you do this efficiently enough to make the backslash solve more efficient than a loop around the calls? Yes, but as I said, it will take some effort.

You can probably glean from my response that it will not be a trivial action to vectorize the result from this code, but that you could do things more efficiently with some time invested.
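A minimal sketch of the first (multiple right hand sides) case, using a full two variable quadratic as the design matrix purely for illustration:

x = rand(100,1); y = rand(100,1);
M = [x.^2, x.*y, y.^2, x, y, ones(size(x))];   % shared design matrix
Z = rand(100,5);                               % five sets of dependent data, one per column
A = M\Z;                                       % one factorization; one column of coefficients per set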

22 Feb 2012 John

Great set of functions!

Can you suggest any ways to vectorize their use? I have a scalar function that is dependent on its location in space f(x,y,z), and I calculate this function surrounding N bodies. Currently, I run a loop for each body, but is there a way to modify the code so that the backslash operator works across all of them at once, returning N sets of coefficients?

08 Feb 2012 John D'Errico

This should do it...

Look for these two lines below:

[Q,R,E] = qr(M,0);

polymodel.Coefficients(E) = R\(Q'*depvar);

Assume a column vector of weights as W, one for each data point. Change the above lines to read:

[Q,R,E] = qr(bsxfun(@times,W(:),M),0);

polymodel.Coefficients(E) = R\(Q'*(W(:).*depvar));

This should give you weighted estimates of the parameters. I think the statistics on the parameters will work too, as well as R^2, etc. Of course, you will need to pass in the weight vector too as an argument.

The above change assumes you are using a version that has bsxfun available. It's been around for a while, so that should not be a problem. Even if not, a call to repmat will replace it.

08 Feb 2012 Slaven

Great function! Thank you. Would there be an easy way to convert it to do weighted least squares? I was looking to try to do it myself; perhaps you have some insights or warnings?

01 Feb 2012 John D'Errico

A linear regression (which is polyfitn) uses backslash at its core. For example, suppose one wished to use polyfit to estimate a quadratic model in one variable. The model would be

y = a1*x^2 + a2*x + a3 (+noise)

polyfit is able to estimate the three unknown coefficients [a1,a2,a3] by solving a linear (but overdetermined) system of equations in a way that minimizes the sum of squares of the residuals. Again, backslash is the engine underneath here.

Now, suppose one wished to estimate a more complex model, with two independent variables, z(x,y). I'll write it as a full two variable quadratic model, so there will be six unknown coefficients.

z = a1*x^2 + a2*x*y + a3*y^2 + a4*x + a5*y + a6 + noise

As long as you have sufficient data to estimate the six unknown coefficients, the problem is essentially the same. See that they enter the problem linearly in the same way as for the one variable problem we looked at before. Backslash again can solve the overdetermined problem to estimate the coefficients.

In fact, both polyfit and polyfitn use a QR factorization of the design matrix to achieve their goals. This is because they are both designed to also generate covariance matrix estimates on the parameters. A QR factorization is the easiest way to serve both purposes, but its all just linear algebra.
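A minimal sketch of that backslash solve for the two variable quadratic above (the data is only illustrative):

x = rand(50,1); y = rand(50,1);
z = 1*x.^2 + 2*x.*y + 3*y.^2 + 4*x + 5*y + 6 + randn(50,1)/100;
M = [x.^2, x.*y, y.^2, x, y, ones(size(x))];   % design matrix for the six terms
a = M\z;                                       % least squares estimates of a1..a6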

01 Feb 2012 Mr D

Thanks.. your function solves my problem, but honestly I'm curious about the concept of linear regression on a 2D surface. Do we have to do polyfit (1st order) on each row to get the best fit plane of the surface?

Thanks.

Updates
29 Apr 2014

Added a p-value on the coefficients.
