??? 08/05/03 21:13
#52160 - RE: Extrapolation - Curve Fitting for table
Responding to: ???'s previous message
Hello James,
using the wrong interpolation or extrapolation technique can produce spectacular results, in a negative sense! There are some traps, and many people make big mistakes.

Curves which do not differ much from a straight line can be well approximated by an n-th order power-series polynomial:

y = a0 + a1 * x + a2 * (x**2) + ... + an * (x**n)

An n-th order polynomial is fully determined by these n+1 parameters a0, a1, ..., an. Assume you know for SURE that your reference curve is fully represented by an n-th order polynomial; then a set of n+1 data points is enough to determine these n+1 parameters. But these data points must have one very particular property: they MUST be free of any error!!! They must be reference points, coinciding exactly with the reference curve. Whenever your data points are NOT free of error, a power-series polynomial fit CAN end in total disaster! In those cases a statistical scheme must be used, the so-called 'least-squares fit'.

The normal procedure to find the parameters a0, ..., an is to insert n+1 data points into the above formula, leading to a system of n+1 equations which must be solved simultaneously. But, again: this method will only work trouble-free when all data points are reference points. When the data points show some error, the results can be quite unsatisfying. Let's look at some examples. Assume your x-range is 0...20, and in this range you have data points. Extrapolation shall be done up to x = 100.

1. You know for SURE that your reference curve is a straight line. From the two reference points (0/0) and (20/20) you can determine a0 and a1:

0 = a0 + a1 * 0
20 = a0 + a1 * 20

It follows that a0 = 0 and a1 = 1. Extrapolation leads to (100/100). So far, so good.

Now assume that one data point is not error-free. Take the data points (0/0) and (20/19), where the second point shows an error of 5%. Then extrapolation results in (100/95), which also shows an error of 5%.

But now take the following two data points: (15/14.25) and (20/20). The first data point again shows an error of 5%. Now our method yields a0 = -3 and a1 = 1.15, and extrapolation results in (100/112). The error is now 12%! And not only the extrapolated points show an error, but also points inside the x-range of 0...20: for x = 0 we now get (0/-3) instead of (0/0)!

Now you will say: 'You must use the whole x-range between 0 and 20.' That's right, but not enough. Have a look at the following example.

2. Assume you know for SURE that your reference curve is fully represented by a 2nd-order power-series polynomial. From the three reference points (0/0), (10/1) and (20/4) you can determine a0, a1 and a2. Solving the system of three equations gives a0 = 0, a1 = 0 and a2 = 0.01. It is worth noting that any other set of three reference points results in exactly the same parameters a0, a1 and a2 - but only when these data points are reference points, that is, only when they coincide with the reference curve!!!

Now assume that one of these data points shows an error of 5%. Take the data points (0/0), (10/0.95) and (20/4). Solving the system of equations gives a0 = 0, a1 = -0.01 and a2 = 0.0105. Extrapolation leads to (100/104). The error is only 4%, so this extrapolation is acceptable.

But now look what happens with the data points (0/0), (19/3.43) and (20/4), where the second data point shows an error of 5% (the correct value would be 3.61). Solving the system of equations gives a0 = 0, a1 = -0.18947368 and a2 = 0.019473684. Now extrapolation leads to (100/175.79). The error is now about 76% (!!), which is quite unacceptable.
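For illustration, here is a minimal Python sketch of this 'insert the points and solve' method (I'm assuming NumPy here; the names fit_exact and evaluate are just mine, not from any library) which reproduces the numbers of examples 1 and 2:

import numpy as np

def fit_exact(points, order):
    # Fit y = a0 + a1*x + ... + an*x**n exactly through n+1 data points
    # by solving the resulting (n+1)x(n+1) system of equations.
    x = np.array([p[0] for p in points], dtype=float)
    y = np.array([p[1] for p in points], dtype=float)
    V = np.vander(x, order + 1, increasing=True)   # columns: x**0, x**1, ...
    return np.linalg.solve(V, y)                   # coefficients a0 ... an

def evaluate(coeffs, x):
    return sum(a * x**k for k, a in enumerate(coeffs))

# Example 1: straight line, first point 5% low, points close together:
a = fit_exact([(15, 14.25), (20, 20)], order=1)
print(a, evaluate(a, 100))   # a0 = -3, a1 = 1.15 -> y(100) = 112 (12% error)

# Example 2: parabola, bad point (19/3.43) right next to (20/4):
a = fit_exact([(0, 0), (19, 3.43), (20, 4)], order=2)
print(a, evaluate(a, 100))   # y(100) = 175.79 (about 76% error)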
We conclude: when using data points that are not error-free, it is not only important to use the whole x-range when choosing data points, but also to have them spaced roughly equally! Or in other words: never use non-error-free data points that lie very close to each other, like (19/3.43) and (20/4) in our example. It is worth noting that the situation becomes even more disastrous when higher-order polynomials are involved. Don't forget that the terms grow the more rapidly the higher the order n is; e.g. 100**3 is much bigger than 100**2. Of course, when the data points are error-free, that is, when they are reference points, there is no such limitation.

3. Let's repeat our second example, which used the reference points (0/0), (10/1) and (20/4). What happens if we take an additional reference point and try to construct a 3rd-order power-series polynomial? Let's additionally take (5/0.25). Now we insert our four reference points into

y = a0 + a1 * x + a2 * (x**2) + a3 * (x**3)

and we have four equations to solve simultaneously, which results in a0 = 0, a1 = 0, a2 = 0.01 and a3 = 0. So nothing has changed! The attempt to expand our 2nd-order reference curve into a 3rd-order one has failed. This behaviour is generally valid if the data points are error-free, that is, if they are pure reference points.

4. But what happens if we add a fourth data point while the data points are NOT error-free? Take the data points (0/0), (10/1), (19/3.43) and (20/4), where the point (19/3.43) again shows an error of 5%. Solving the system of equations results in a0 = 0, a1 = 0.210526312, a2 = -0.021578947 and a3 = 0.0010526315. The first thing we see is that the 3rd-order term does not vanish - but we assumed to know for SURE that the reference curve is only parabolic! And if we now extrapolate, we end up at (100/857.89), which represents an awful error of about 760%!!!!!!! Keep in mind that only ONE data point was assumed to be not error-free! If all data points show some error, things get even worse... Again it is worth noting that the situation becomes even more disastrous if higher-order polynomials are involved.

Never try to overcome the lack of precision of the data points by using more data points than necessary! Generally speaking: if you know for sure that your reference curve is an n-th order polynomial, never try to expand the power-series polynomial to a higher order! Never use more than n+1 data points! This is generally valid, even if the data points span the whole x-range and are equally spaced on the x-axis. Only if the data points are error-free, that is, if they are reference points, can such an approach be made at all - but it is useless anyway, because all higher-order coefficients will vanish then.
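To see this blow-up in numbers, here is a small sketch (again just illustrative Python with NumPy, not part of any particular library) that forces a 3rd-order polynomial exactly through the four data points of example 4:

import numpy as np

# Four data points, (19/3.43) being 5% off the parabola y = 0.01 * x**2:
x = np.array([0.0, 10.0, 19.0, 20.0])
y = np.array([0.0, 1.0, 3.43, 4.0])

# Force a 3rd-order polynomial exactly through all four points:
V = np.vander(x, 4, increasing=True)   # columns: 1, x, x**2, x**3
a = np.linalg.solve(V, y)              # a0, a1, a2, a3
print(a)                               # a3 does NOT vanish!

y100 = sum(c * 100.0**k for k, c in enumerate(a))
print(y100)                            # about 857.89 instead of 100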
If your data points are not error-free, the following procedure is necessary: MAKE YOUR DATA POINTS ERROR-FREE! This can be achieved by averaging over a large number of data points, all being IDENTICAL in x. Keep in mind that the whole available x-range should be used, and that all data points should be equally spaced on the x-axis. Even if you get your data points almost error-free, never use data points which lie very close to each other.

In our example this could look like the following: if you are sure that your reference curve is parabolic (a 2nd-order polynomial), then use three data points at about x = 0, x = 10 and x = 20, if x = 0...20 is your available x-range. Make the data points error-free by averaging over a big number of data points being identical in x = 0, x = 10 and x = 20. Don't use that data point at x = 19, even if its error is drastically decreased by averaging! Insert these three error-free data points, which are now reference points, into the 2nd-order power-series polynomial, as we did above, and solve the system of three equations simultaneously. This will give you 'good' coefficients, which guarantee the best extrapolation quality.

If you are not sure whether your reference curve is parabolic, but suspect that a 3rd-order term is involved, take a fourth data point into the calculation, perhaps at about x = 5 or x = 15. This data point must also be error-free, of course, which can be achieved by averaging in most cases. If your assumption was wrong, and the reference curve was indeed parabolic, nothing bad happens: the 3rd-order coefficient will almost vanish. This is a good criterion for how good your fit is. But please keep in mind that it is only valid if the data points used are really nearly error-free!

5. There are many situations where the data points cannot be made error-free by averaging - either because there are not enough data points for averaging, or because the errors are so high that even with averaging the resulting errors are too big. Also, there are many situations where the data points are not equally spaced. What then? In these cases only a statistical approach can help. One widely used method is the 'least-squares fit'. Assume we know for sure that our reference curve is parabolic. Then we take the term

y = a0 + a1 * x + a2 * (x**2)

and vary the coefficients in order to minimize

S = sum over all i of (yi - a0 - a1 * xi - a2 * (xi**2))**2

If we take as an example the four data points from example 4, (0/0), (10/1), (19/3.43) and (20/4), we get the following sum:

S = (0 - a0 - a1*0 - a2*0)**2
  + (1 - a0 - a1*10 - a2*100)**2
  + (3.43 - a0 - a1*19 - a2*361)**2
  + (4 - a0 - a1*20 - a2*400)**2

S is minimized when the partial derivatives vanish, that is, when dS/da0 = 0, dS/da1 = 0 and dS/da2 = 0. This leads to a system of three equations, which must be solved simultaneously. The result is a0 = 0.004578469675, a1 = -0.0002034874017 and a2 = 0.009781250876. If we now extrapolate, we end up at (100/97.80), which represents a quite acceptable error of 2.2%. Compare this result with the result of example 4!

This method of least-squares fit is very powerful when very many data points are put into the calculation. Even if the error of each individual data point is rather big, the least-squares fit is very often surprisingly precise.

S is a direct measure of the goodness of fit. Assume you are not sure whether your set of data points actually represents a parabolic reference curve. Then just try a higher-order power-series polynomial, minimize S in the way described above, and if S is now smaller than the minimized S of the parabolic least-squares fit, then it is more probable that the reference curve must be represented by a 3rd-order polynomial instead of a parabola. It may even be that still higher-order polynomial terms are necessary. In all these cases S is a very good measure of the goodness of fit.
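As a sketch of how this least-squares fit might look in practice (assuming Python with NumPy; numpy.polyfit minimizes exactly the sum of squares S defined above):

import numpy as np

x = np.array([0.0, 10.0, 19.0, 20.0])
y = np.array([0.0, 1.0, 3.43, 4.0])

# Least-squares parabola; polyfit returns the coefficients with the
# highest power first, so reverse them to get a0, a1, a2.
c = np.polyfit(x, y, deg=2)
print(c[::-1])                # ~ a0 = 0.00458, a1 = -0.000203, a2 = 0.009781

print(np.polyval(c, 100.0))   # ~ 97.80, only about 2.2% error

# S itself, as a measure of the goodness of fit:
S = np.sum((y - np.polyval(c, x))**2)
print(S)

One caveat on comparing S across polynomial orders: with only four data points a 3rd-order fit would drive S all the way to zero, so this comparison only becomes meaningful when there are clearly more data points than coefficients.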
Kai


