Stat analysis of curvefitting

 Roundtable Knight
 Posts: 125
 Joined: Mon Apr 21, 2003 11:04 pm
 Location: California
Stat analysis of curvefitting
Traders often talk about the different ways they try to avoid curve fitting in system design. Elements like multiple markets, long test periods, long/short symmetry, simplicity of rules, and out-of-sample testing are usually mentioned.
These are all very subjective criteria though: what may seem like too many rules to one person can seem like just the right amount to somebody else.
Now this problem of curve fitting is an issue not only in trading, but in any discipline where models are designed based on sets of empirical data. So I have a hard time believing that this subject has not been more rigorously defined and explored among mathematicians and statisticians.
Does anyone know of any quantitative methodologies that have been developed and could be used to estimate the degree to which a particular model is curve-fit to the data?
bbc
Measure of Curve Fitting
There was a third-party program for TradeStation named Investor's Reality Check. I don't know if it is still sold or supported.
Essentially, it was a very sophisticated, bootstrapped form of a statistical p-test which, given a benchmark, measured the probability that results as good as the backtest's could arise by chance. The smaller the p-value, the harder it is to put the results down to luck. The p-test did not assume a normal distribution.
Please don't ask for technical or sales details because I've told you all I know. Perhaps a web search might turn something up.
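For readers who want a feel for what a bootstrapped test against a benchmark looks like, here is a minimal sketch. It is not the algorithm used by the product mentioned above (those details are not given in this thread); it is a plain one-sided bootstrap on the mean excess return, and the function name, argument names, and defaults are all made up for illustration.

```python
import random
import statistics


def bootstrap_p_value(system_returns, benchmark_returns, n_resamples=5000, seed=0):
    """One-sided bootstrap p-value for H0: mean excess return over the benchmark <= 0.

    Hypothetical helper for illustration; no normal distribution is assumed.
    """
    rng = random.Random(seed)
    excess = [s - b for s, b in zip(system_returns, benchmark_returns)]
    observed = statistics.fmean(excess)
    # Centre the excess returns so the resampled data obey the null hypothesis.
    centred = [x - observed for x in excess]
    n = len(centred)
    hits = 0
    for _ in range(n_resamples):
        # Resample with replacement and see how often luck alone matches the backtest.
        if statistics.fmean(rng.choices(centred, k=n)) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of exactly 0.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Made-up per-period returns, purely to show the call.
    system = [0.02, 0.01] * 30
    benchmark = [0.0] * 60
    print(bootstrap_p_value(system, benchmark))
```

Note that a test like this judges one finished system against one benchmark. It says nothing about how many other systems were tried and discarded along the way.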
Monte Carlo simulations can help here, though this method brings its own baggage. Taleb's book "Fooled by Randomness" is good to get a sense of the method.
CSI has MC functionality in its Trading System Performance Evaluator (TSPE) that might be worth a look. You can read about it on their website: www dot csidata dot com. Use their search function to look it up.
Ken

 Roundtable Fellow
 Posts: 84
 Joined: Thu May 29, 2003 12:11 pm
 Location: Eugene, OR
 Contact:
Here is a great book that does just what you are talking about. I HIGHLY recommend this book for any system developer: The Encyclopedia of Trading Strategies by Jeffrey Owen Katz and Donna L. McCormick. It costs $42 at Amazon:
http://www.amazon.com/exec/obidos/tg/de ... ingblox20
I just bought the book 2 weeks ago because it has a lot of examples that were done using C++, which I am now coding my system in. The authors use a lot of interesting techniques, most notably a t-test. The t-test gives you an estimate of how much of your results are due to curve fitting (assumes a normal distribution which is probably not true but due to the central limit theorem we can assume the error due to this to be too small to matter). This book won't give you any good systems to trade but will show you how to design and then scientifically test your system. I hope that this helps.
Chris
Last edited by Chris Murphy on Thu Jul 24, 2003 1:40 pm, edited 1 time in total.
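As a rough illustration of the t-test idea in the post above, here is a minimal sketch of a one-sample t-statistic on per-trade returns, testing whether the average trade differs from zero. This is the textbook formula, not code from the book; the function name and the sample numbers are assumptions made for the example.

```python
import math
import statistics


def t_statistic(trade_returns):
    """Return (t, degrees_of_freedom) for H0: the mean trade return is zero."""
    n = len(trade_returns)
    mean = statistics.fmean(trade_returns)
    # Sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(trade_returns)
    standard_error = stdev / math.sqrt(n)
    return mean / standard_error, n - 1


if __name__ == "__main__":
    # Made-up trade results, only to show the call.
    t, df = t_statistic([1.0, 2.0, 3.0, 4.0, 5.0])
    print(t, df)
```

The resulting t is then compared against a t-distribution table for the given degrees of freedom.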

 Roundtable Knight
 Posts: 125
 Joined: Mon Apr 21, 2003 11:04 pm
 Location: California
Chris Murphy wrote: The authors use a lot of interesting techniques, most notably a t-test. The t-test gives you an estimate of how much of your results are due to curve fitting (assumes a normal distribution which is probably not true but due to the central limit theorem we can assume the error due to this to be too small to matter).
I spent a bit of time poring over some statistics books, and it does seem that for large enough samples (40+), the t-distribution is a good way to test the validity of your test results.
Now here is an interesting observation. When selecting a t-distribution for your analysis, you first define the degrees of freedom that you have. Those are defined as (# independent observations - # parameters used). Now if you look at the t-distribution table in any stat book, you find that for df > ~25, the distribution largely converges to the normal distribution.
Now this implies that if you are using a large enough sample, then you can ratchet up the number of parameters in your system as high as your total number of data points minus twenty-five or so, without any loss of statistical robustness. In other words, if you have 100+ data points for your testing, then a twenty-parameter system is essentially as robust as a 3-parameter system. This certainly flies in the face of common wisdom.
bbc
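The convergence of the t-table toward the normal distribution can be checked numerically. Standard tables give two-sided 95% critical values of about 2.571 at df = 5, 2.060 at df = 25, and 1.984 at df = 100, against 1.960 for the normal distribution. The sketch below recomputes such values with the standard library only (Simpson's rule plus bisection); the function names are made up for illustration. It speaks only to the shape of the reference distribution, not to whether a fitted system will hold up.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution, computed in log space for large df."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """CDF for t >= 0, integrating the density over [0, t] with Simpson's rule."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, confidence=0.95):
    """Two-sided critical value found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for df in (5, 10, 25, 100, 1000):
        print(df, round(t_critical(df), 3))
```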

 Roundtable Fellow
 Posts: 84
 Joined: Thu May 29, 2003 12:11 pm
 Location: Eugene, OR
 Contact:
This is actually not so, and the book explains why. This is totally off the top of my head, so my response may not be "technically" correct. With each parameter you use, you will optimize it and play with it until you get a result that you feel comfortable with. With each optimization you lose degrees of freedom. So getting a program with 1000 parameters to have a good t-score would be next to impossible, because you would be optimizing over a near-infinite number of combinations. The more combinations, the more degrees of freedom lost. With only 3 parameters you lose fewer df's. The book actually has a unique algorithm that calculates an adjusted t-score to account for parameter optimization. Of course you could always, on one of your first few attempts, get your 1000 parameters all optimized and still have a good t-score, but then again you could also get struck by lightning, die, and win the lottery all at the same time.
Chris
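The point about trying many combinations can be illustrated with a small simulation. This is not the book's adjustment algorithm (which is not reproduced in this thread); it is only a sketch, under the assumption that every "system" is pure noise with no edge at all, showing that the best t-score found tends to rise as more combinations are tried. All names and defaults are hypothetical.

```python
import math
import random
import statistics


def t_score(returns):
    """One-sample t-score of the mean return against zero."""
    standard_error = statistics.stdev(returns) / math.sqrt(len(returns))
    return statistics.fmean(returns) / standard_error


def best_t_score(n_combinations, n_trades=50, seed=1):
    """Best t-score among n_combinations systems whose trades are pure noise."""
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(n_combinations):
        # Each "parameter combination" yields trades with a true mean of zero.
        returns = [rng.gauss(0.0, 1.0) for _ in range(n_trades)]
        best = max(best, t_score(returns))
    return best


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(best_t_score(n), 2))
```

Because the winner is picked after the fact, its raw t-score overstates the evidence, which is the reason an adjustment for the amount of optimization is needed.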

 Roundtable Knight
 Posts: 125
 Joined: Mon Apr 21, 2003 11:04 pm
 Location: California
Chris Murphy wrote: This is actually not so, and the book explains why. This is totally off the top of my head, so my response may not be "technically" correct. With each parameter you use, you will optimize it and play with it until you get a result that you feel comfortable with. With each optimization you lose degrees of freedom. So getting a program with 1000 parameters to have a good t-score would be next to impossible, because you would be optimizing over a near-infinite number of combinations. The more combinations, the more degrees of freedom lost. With only 3 parameters you lose fewer df's. The book actually has a unique algorithm that calculates an adjusted t-score to account for parameter optimization. Of course you could always, on one of your first few attempts, get your 1000 parameters all optimized and still have a good t-score, but then again you could also get struck by lightning, die, and win the lottery all at the same time.
The degrees of freedom are the difference between your sample size and your number of parameters. Hence, with a large sample size, you can have a very substantial number of parameters while maintaining a very high degree of fitness. And according to the t-distribution table, once you are past ~25 degrees of freedom, the benefit of additional degrees of freedom is minimal.
In this case, the major benefit of higher degrees of freedom is in case your result distribution significantly deviates from normal. However, this has nothing to do with the number of parameters in your model.
bbc

 Roundtable Fellow
 Posts: 84
 Joined: Thu May 29, 2003 12:11 pm
 Location: Eugene, OR
 Contact:
What I am saying is that when you have a lot of parameters, your t-score is adjusted, which makes it harder to say that your model is extracting an inefficiency rather than just being lucky. So yeah, you can have a lot of df's, but they won't do you any good because your t-test gets adjusted for it. Do you want a lot of df's or a good and scientifically valid model? My guess is a good model, and the more parameters, the harder this is to prove. The book goes into this in depth, and I cannot recommend it enough if analysis of your system is what you are looking for. The book won't provide any systems that make money, but its approach to judging systems is worth the read.
Stat analysis of curve fitting.
First up, if by curve fitting you mean getting a sample of data and running your system over it until you have tweaked it enough to get the optimal result, then I'd call that curve fitting.
If you mean testing your method to get the different elements to work together properly, then I think that is OK. For example: when the 50-day MA crosses over a rising 300-day MA, go long; when the 50-day MA crosses under the 300-day MA, exit. This is a method. A quick look at maximum favourable and adverse excursions shows that favourable trades give back 50% of profits, so you want a trailing stop, and that bad trades exceed your base stop level, so you want a worst-case stop. As soon as you add them in with the MAs, the results of your previous back tests change.
To get it to work now, modifications will have to be made.
So you need to select some test data that is comparable with the data you'll be trading, to iron out the bugs.
To use statistics to see if you are curve fitting you must use two samples.
Pick two equivalent data samples and call one 'test 1' and the other 'test 2'. Before any changes are made to the original method, run the method over 'test 1', saving the results and the analysis of them. Then run the method over 'test 2', again saving the results and the analysis. Compare both analyses and note similarities, looking at such things as means, variance, quartiles, etc.
The t-test mentioned earlier could be suitable because it is suited to making inferences from small samples, and a test run generally produces only a few trades. If you want to run your modifications over masses of data every time, it can take ages. For example, three parameters with ten values each is a thousand runs (10*10*10), a likely scenario with an MA exit, a worst-case stop and a trailing stop, all of which have to be made to work together. If you run that over a watchlist of twenty samples, each set at ten years' worth of data, then that's one thousand runs over two hundred years of data. Unnecessary, when any statistics book will show you the sample sizes necessary for satisfactory analysis of test results. The t-test requires a normal distribution; this is a limitation if it is to work properly. If the successful shorts are posted in the frequency distribution as negatives and the successful longs as positives, then the mean will be zero, thus resembling normal by being balanced.
Once you have the results of the two initial runs on 'test 1' and 'test 2' and have saved them and their comparison details, you can start modifications.
From then on, when you test the modifications you have made, only do so on the 'test 1' data.
When you are happy with the modifications, rerun over both 'test 1' and 'test 2' and see if the results are still similar. If they are as comparable as they were before, then it's not curve fit. If the modified method posts great results on 'test 1' and 'test 2' is noticeably different now, then the system is curve fitted to the 'test 1' data.
Back testing can be a long and frustrating business; anything can change the results, and the more variables you have, the worse it is. Keep it simple, sunshine.
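The 'test 1' versus 'test 2' comparison described above can be sketched as follows. The post does not say which statistic to use for the comparison, so the Welch two-sample t-statistic here is a swapped-in assumption, as are the function names and the sample figures.

```python
import math
import statistics


def summarize(results):
    """Mean, variance and quartiles of a list of per-trade results."""
    q1, median, q3 = statistics.quantiles(results, n=4)
    return {
        "n": len(results),
        "mean": statistics.fmean(results),
        "variance": statistics.variance(results),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def welch_t(test1, test2):
    """Welch's t-statistic for the difference in mean result between two samples."""
    se = math.sqrt(statistics.variance(test1) / len(test1)
                   + statistics.variance(test2) / len(test2))
    return (statistics.fmean(test1) - statistics.fmean(test2)) / se


if __name__ == "__main__":
    # Made-up trade results for the two data samples.
    test1 = [3.0, 4.0, 5.0, 6.0, 7.0]
    test2 = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(summarize(test1))
    print(summarize(test2))
    print(welch_t(test1, test2))
```

Run it once before any modifications and once after. A gap between the two samples that was not there at the start is the warning sign described in the post.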
I get headaches when math goes beyond the basic + - * / =, so what I'm going to say certainly isn't grounded in math. Please rip it apart and tell me what you think.
I like what bbc first wrote about the larger the sample, the more parameters allowed, and I like that Jimsta makes a comparison between two sets of data.
My thinking is that if I have a system that produces 30,000 trades (occurrences), and the system doesn't have a crazy number of different conditions/parameters... say >25 (feel free to pick a number), I will consider the results robust. But that is the quick and dirty way of looking at things. I think another look needs to be performed, one that counts how many times each condition/parameter touches the data.
Let's say I have a long-only S&P system with only 6 conditions/parameters. But one of the conditions is specific to Black Monday '87. This condition only touches or impacts the data once, but will have a major impact on results. While the macro look at the system (6 parameters and profitable results) might give the system a pass, shouldn't there be a micro look to see the impact of each condition/parameter?
That leads to the logic that if each of my conditions impacts the data at least 5,000 times, I am comfortable with that condition, and, if applied properly, I would welcome more conditions/parameters that touch the data a high number of times and relate to the overall profitability or risk control of the system. If I have a condition that touches the data only 20 times out of the 30,000 trades, shouldn't that be looked at differently than a condition that touches it 10,000 times?
So maybe the question should be rephrased to be not about the quantity of parameters, but about the quality of parameters...
Thoughts?
Gordon
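The "micro look" suggested above, counting how often each condition touches the data and how much profit rides on it, could be sketched like this. The trade representation (a P&L figure plus the set of conditions that fired) and every name in the code are assumptions made for illustration.

```python
from collections import defaultdict


def condition_impact(trades):
    """Per-condition trade count, total P&L, and share of all trades.

    trades: iterable of (pnl, conditions) pairs, where conditions is the set of
    condition names that fired on that trade (a hypothetical layout).
    """
    stats = defaultdict(lambda: {"count": 0, "pnl": 0.0})
    total = 0
    for pnl, fired in trades:
        total += 1
        for name in fired:
            stats[name]["count"] += 1
            stats[name]["pnl"] += pnl
    return {
        name: {**s, "share_of_trades": s["count"] / total}
        for name, s in stats.items()
    }


if __name__ == "__main__":
    # Made-up trades: one condition fires on a single, very large winner.
    trades = [
        (1.0, {"trend"}),
        (-0.5, {"trend"}),
        (10.0, {"trend", "crash_1987"}),
    ]
    for name, s in condition_impact(trades).items():
        print(name, s)
```

A condition with a tiny count but a large share of the total P&L is the kind that deserves the closer look.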
I want to follow up on my previous post. As you can tell, I don't subscribe to the notion of loss of degrees of freedom in the typical sense.
But in the typical sense, when looking at a system with the statistical models that are out there, would one want to treat a system that goes long and short as two separate systems or as one system?
From what I have read, or better put, my translation of what I have read, long and short are normally considered one system. While I am a big fan of symmetrical parameters, I would think it is best to treat the total system as two separate systems.
So if 6 is the number of parameters allowed, should a complete system be allowed 12? Six for the long system and six for the short system?
Thoughts/Opinions?
Gordon

 Roundtable Fellow
 Posts: 84
 Joined: Thu May 29, 2003 12:11 pm
 Location: Eugene, OR
 Contact:
Sir G,
I would only feel comfortable if I used the same parameters for both long and short. I would feel comfortable using different values for the parameters if I could understand why they should be different statistically or theoretically. If I simply optimized them to different values, I would not feel comfortable unless there was reasoning behind why the two values should be different.
Chris

 Roundtable Knight
 Posts: 122
 Joined: Thu Apr 17, 2003 9:49 am
The future performance of a method is independent of the procedure used to discover the method. If Alan tests 500,000 different parameter value settings in order to find the "best performing" one, the markets don't know that. Performance is not a function of "how many different things did you try?", performance is a function of "what did you finally choose?"
Beth mails a diskette to a Robo Broker and says "trade this." The only thing that determines performance is what's on the diskette. How it got there is completely immaterial. How many computer runs Beth performed, and how many Optimizations she deployed, doesn't matter. The only thing that matters is what's on the diskette.
Therefore the thing to focus your efforts upon is figuring out how to evaluate a system. If you believe that systems can be "non robust", figure out how to test for this. If you think systems can be "too sensitive to small changes", figure out how to test it. If you think it's possible for a system to be "over fitted", figure out how to test a system and determine if it's over fitted. Apply all your insight and expertise and precious time to figuring out how to test systems.
Then, test absolutely everything under the sun. Test one hundred million variations of EMA Crossover with trailing stops and pyramiding and portfolio heat limits. There are 5 parameters, try 40 values of each one. That's 40*40*40*40*40 different settings. 102,400,000 runs. If your test procedures are good, you can decide whether or not any of them are "tradeable", even though you are performing that fearful "optimisation" heresy.
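The run count above is easy to verify, and one of the checks mentioned in the post (is a system "too sensitive to small changes"?) can be sketched in a few lines. The post does not prescribe a specific test, so the neighbourhood check below is only one possible approach, and the scoring function, parameter names, and step sizes are all hypothetical.

```python
def neighbourhood_scores(score, setting, steps):
    """Score every setting reached by nudging one parameter up or down one step."""
    results = []
    for i, step in enumerate(steps):
        for direction in (-1, 1):
            nudged = list(setting)
            nudged[i] += direction * step
            results.append(score(tuple(nudged)))
    return results


def sensitivity(score, setting, steps):
    """Drop from the chosen setting's score to its worst immediate neighbour."""
    return score(setting) - min(neighbourhood_scores(score, setting, steps))


if __name__ == "__main__":
    total_runs = 40 ** 5  # 5 parameters, 40 values each: 102,400,000 runs
    print(total_runs)

    # Toy stand-in for a backtest: a smooth peak at (50, 300).
    def toy_score(params):
        fast, slow = params
        return -((fast - 50) ** 2) - (slow - 300) ** 2

    print(sensitivity(toy_score, (50, 300), (5, 10)))
```

In real use `score` would wrap a full backtest. A large drop from a one-step nudge suggests the chosen setting sits on a narrow spike rather than a broad plateau.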
Everything Mark said I agree with. It isn't the number of parameters that counts (the number is just an indicator that a system might be curve fitted); it's the robustness of the system that comes out of design and testing.
Also I agree with Gordon's point about Short System vs Long System. I had a conversation with a well-respected system designer on this same subject, because my testing and discretionary trading all indicated that markets (read: people) behave differently depending on whether the direction is up or down and whether the recent slope is normal or extreme. Said system designer had also found this with all but one of his systems, but he still would not implement systems that way because Futures Truth etc. used the number of parameters as a heuristic red flag for curve fitting.
To me the most obvious downside of separating your short and long systems is that you halve the number of test outcomes. This can be at least partially overcome though.
John
I also would say I'd rather the parameters not be split to accommodate the short and long side. However, if I were going to do it, one of the reasons is that in my experience equity markets tend to fall much faster than they rise. Positively correlated with this is the increase in ATR levels as the market falls (anybody notice how small these levels are today?).
I remember doing short-term pattern recognition research, which I have yet to complete, and the long entry had to be quite a bit different from the short entry due to this characteristic. Good trading.
Word meanings
Firstly I would like to thank the authors of those recent comments; some of their contributions brought me to this forum, after their excellent contributions to Chuck's forum.
It depends what is meant by the words used. Data is what the system is run over, e.g. the OHLC EoD for the chosen vehicle, like GM from the Dow 30. The samples are the results the system produced. The parameters are the elements of the system that affect the result. The components are: the trader, the data, and the system parameters.
There were several remarks about the size of the samples (results). In statistics a population is the whole picture and samples are the part that is considered, e.g. a pollster says the 1000 people who were rung about smoking are a reflection of the smoking habits of the population at large. The t-test was suggested because it is suited to small sample sizes, say about 30 to 50. Statistics can be used for quantifying outcomes and making inferences from the analysis.
I separate long and short only when testing results. They are not separated in the parameter design; they are separated only to see that the results are similar, while the analysis in general is done on both together.
The data is an important issue. Sir G pointed out that adjusting the data changes the outcome considerably. Mark's comments about the Robo Broker imply that the only thing being changed was the data. No matter what testing is or has been done, the data that the future will deliver is unknowable. A reasonable look at a vehicle's data over the years shows how it changes with time. So no matter what your particular system creation method, it is an act of faith that the data the future will deliver is of the type that your system can handle. This is why optimizing is considered folly.
As it is the parameters (stops, MA, whatever) that determine how the data is responded to, then in my opinion, as the only properly adjustable component available, they should be the subject of interest when testing performance on known data. It's not the data component that should be changed to get the results desired. The only way you can see if your parameter changes are worthwhile is if the unchanged data produces better results that are not curve fitted.
PS
What Mark actually said, was that the parameters are the only constant in the situation he described. I agree.
When I said in my most recent post that the data was the only thing that had changed, I ignored the component that says who the user of the system was. What I meant was to focus on the fact that the data put to the system parameters was probably different from what the designer had been using. His point reinforces the point that the parameters are all we really have to play with, i.e. the focus should not be on the data or the user.
Even so, we need the data to test with, and that data must be used properly, as must the testing.