Benchmarking

Discussions about the testing and simulation of mechanical trading systems using historical data and other methods. Trading Blox Customers should post Trading Blox specific questions in the Customer Support forum.
corvus
Senior Member
Posts: 31
Joined: Sat Sep 01, 2007 11:41 pm

Benchmarking

Post by corvus » Tue Nov 06, 2007 11:45 am

This is the first topic I have started and I am hoping it's strong enough to generate lots of results...

Before I start, let me state that while I don't think there is any such thing as a perfect system, I do think it quite useful to look at the performance of other systems as benchmarks. If you can't beat the risk-free rate of return, why would you put the capital at risk in the first place? And if you couldn't beat a passive index, possibly with the addition of a simple set of trading rules, why bother with the cost of trading actively?

So, with that in mind, I'd be interested in knowing what sort of results various traders have had -- in two ways.

First, backtesting versus actual results. I'm aware of the optimization paradox mentioned in WOTT, but I am interested in a bit more of the specifics. For example, is the variation 5% or 50% over a fair stretch of time?

Second, general benchmarks. What is a good ballpark of system performance? Are we talking about 20% for the past three years (as I think Chris67 wrote), or approximately three times the S&P 500, as I read from another poster?

I know there is a lot left unsaid here (risk per trade, markets, initial capital, etc) and it's intentional as I would rather leave things open. If you want to indicate the general system (turtle, ADX, a suite of strategies, etc) fine. But I'm not really too interested in the details as I imagine most people consider them somewhat private.

That being said, I'll start.

I put approximately 100k into two CTA traded systems with zero correlation to each other. The backtesting over the past 3-1/2 years shows:

Average Rate of Return of 72.5%
Max Drawdown 11.6%
Sharpe ratio 2.28
Sterling ratio 3.36
Sortino ratio 5.87
MAR 6.27

The actual results are much less impressive (enough for me to select TB for the next infusion of investment capital allocated to equities), but not worrying.

Results YTD: 1.5% gain


Thanks and good trading,

Corvus

TrendMonkey
Roundtable Knight
Posts: 154
Joined: Fri Apr 22, 2005 9:14 pm
Location: Vancouver, Canada

Post by TrendMonkey » Tue Nov 06, 2007 12:38 pm

Hopefully for you YTD began yesterday?

sluggo
Roundtable Knight
Posts: 2986
Joined: Fri Jun 11, 2004 2:50 pm

Post by sluggo » Tue Nov 06, 2007 9:33 pm

I like to build up a mental "data base" of backtested performance results, to compare new systems against old (previously tested) systems. Here then is some data obtained when testing the presupplied "Bollinger Band Breakout" system that comes with Blox. I used the Builder Edition to print out some additional statistics, notably duration of average trade in "market days" (5 market days per week, 252 market days per year), and R Cubed from the book Way Of The Turtle.

Of course I had to make a zillion assumptions when setting up and running these simulations, just as you must make your own assumptions when you run your own simulations. I had to choose a slice of history to simulate: I picked the last 20 years, from 1/1/1987 to today. I had to choose a portfolio: I picked a basket of 86 futures markets around the globe, the most liquid markets. This defeats any accusation of "cherry picking" (selecting only the "best" markets), but it raises the spectre of "an unbalanced portfolio". You could overcome this objection by using sector position-count limits or sector-risk limits; but, as I was using the built-in Bollinger Breakout system, which doesn't include these limits, I did not. If you want to explore further, you can download a version of the Bollinger Breakout system which does include sector-risk limits, from (this) page in the Blox Customers area of the website.

I ran the Bollinger Breakout system over a range of parameter settings, as shown in the image below: 10 risk values X 19 Close Average values X 26 Entry Threshold values = 4940 different simulations. After saving the results into a spreadsheet, one row per simulation, I like to plot (column This) versus (column That) as a scatterplot, and ponder the resulting diagram. Five such scatterplots are attached. Each scatterplot contains 4940 dots, one dot per simulation.
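The bookkeeping sluggo describes (one simulation per parameter combination, one spreadsheet row per simulation) can be sketched outside Blox as well. Here is a minimal Python sketch; the `run_simulation` function is a hypothetical stand-in for a real backtest engine, and the grid values are made up for illustration, not Blox's actual parameter ranges:

```python
from itertools import product

# A grid matching the post's dimensions: 10 x 19 x 26 = 4940 combinations.
risk_values = [0.25 * i for i in range(1, 11)]        # 10 risk settings (hypothetical)
close_avg_values = [20 + 20 * i for i in range(19)]   # 19 Close Average lengths (hypothetical)
entry_thresholds = [1.0 + 0.1 * i for i in range(26)] # 26 Entry Threshold values (hypothetical)

def run_simulation(risk, close_avg, threshold):
    """Placeholder for a real backtest; returns (CAGR, MaxDD) as fractions."""
    cagr = risk * 0.4 + threshold * 0.02       # stand-in arithmetic only
    maxdd = risk * 0.6 + close_avg * 0.0005
    return cagr, maxdd

# One row per simulation, exactly as you would save into a spreadsheet.
rows = []
for risk, close_avg, threshold in product(risk_values, close_avg_values, entry_thresholds):
    cagr, maxdd = run_simulation(risk, close_avg, threshold)
    rows.append({"risk": risk, "close_avg": close_avg,
                 "threshold": threshold, "cagr": cagr, "maxdd": maxdd})

print(len(rows))  # 4940 rows, one dot per row in a scatterplot
```

Each row then becomes one dot when you plot (column This) versus (column That), e.g. `cagr` against `maxdd`.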

I find the plots illuminating.

Figure 1 shows that the relationship between CAGR and MaxDD is not particularly linear; for each MaxDD there is a wide range of possible CAGRs, and vice versa. Picking your MaxDD does not automatically determine your CAGR. As Howard Brazzil observed (link), if you're operating on the upper frontier (topmost, leftmost parameter sets in Figure 1), increasing your MaxDD by 1.1X increases your CAGR by more than 1.1X. Surprise!

Figures 2 and 3 show that gain-to-pain ratios like Sharpe and R-Cubed have an "optimum" trade duration of approximately 7-12 months (150 to 250 market days) on this system. The gain-to-pain curves peak around there. Trading more quickly (left edge of the plot) or more slowly (right edge) gives poorer results.

Figures 4 and 5 display the nonlinear relationship between the Sharpe Ratio and the MAR Ratio / R Cubed measure. The plots are curved; they aren't a straight line. This shows that the various different ways to measure gain-versus-pain are NOT identical to one another.

You can think of these plots as "trade-off diagrams"; they show the uncomfortable choices you need to make when selecting parameters for this system. They also highlight the potential peril of "optimization" (choosing the one parameter set among thousands which gives the "best" performance): the plots show whether the Best parameter set is, or is not, representative of many other parameter sets. For example, in Figure 3 there is one parameter set with an R Cubed value of 3.6, while the vast bulk of parameter sets have an R Cubed of 2.5 or less. You may wish to ponder whether the "best" parameter set is really that much better than the others. Or is it an accidental "freak of nature", a mathematical anomaly unlikely to be encountered in the future?

Hope you enjoy the data.

EDIT: added trade-off paragraph
Attachments:
chart2.png (Figure 2)
chart1.png (Figure 1)
BB_parameters.png (Bollinger simulation parameters)
Last edited by sluggo on Fri Nov 09, 2007 5:12 am, edited 2 times in total.

sluggo
Roundtable Knight
Posts: 2986
Joined: Fri Jun 11, 2004 2:50 pm

Post by sluggo » Tue Nov 06, 2007 9:35 pm

the other figures (only 3 attachments per post!)
Attachments:
chart5.png (Figure 5)
chart4.png (Figure 4)
chart3.png (Figure 3)

Turtle40
Roundtable Knight
Posts: 201
Joined: Wed Oct 19, 2005 1:53 pm
Location: Guernsey, Channel Islands

Post by Turtle40 » Wed Nov 07, 2007 2:21 am

Hi Corvus,

I have found it difficult sometimes to learn TB and to test "correctly". After all I guess most of us are essentially self-taught in this field, clearly starting from different levels of trading experience and so on.

I would suggest, for further reading, two of the books mentioned on the TB website:

Design, Testing and Optimization of Trading Systems - Pardo
Mechanical Trading Systems - Weissman

These have helped me enormously. I wish I had read them much earlier(!)

Good Luck.

svquant
Roundtable Knight
Posts: 126
Joined: Mon Nov 07, 2005 3:39 am
Location: Silicon Valley, CA

Post by svquant » Thu Nov 08, 2007 12:20 am

Interesting graphs sluggo - thanks!

One thing that stood out for me was just how few points were above the diagonal line in the CAGR vs MaxDD graph. Many of us know from our testing or live trading that it takes a lot of effort to get above the 1:1 ratio of CAGR to MaxDD, but it is good to "see the picture" every so often.

sluggo
Roundtable Knight
Posts: 2986
Joined: Fri Jun 11, 2004 2:50 pm

Post by sluggo » Thu Nov 08, 2007 6:41 am

svquant wrote:One thing that stood out for me was just how few points were above the diagonal line in the CAGR vs MaxDD graph.
The diagonal line is where CAGR = MaxDD, in other words, where MAR Ratio = 1. Points above the diagonal line are points whose MAR Ratio is larger than 1.

Luckily, if you store your simulation results in a spreadsheet, it is a simple matter to plot the column containing (MAR Ratio) against the column containing (MaxDD), as shown below. The points you are interested in are the ones above the (MAR Ratio = 1.0) horizontal line. As Howard Brazzil says, if you want to get a big MAR you're gonna get drilled by some painful drawdowns too.
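The arithmetic behind the diagonal line is just MAR = CAGR / MaxDD. A quick sketch of computing both from an equity curve, using made-up yearly equity points (a real calculation would use daily data and Blox's exact definitions may differ):

```python
def cagr(equity):
    """Compound annual growth rate; equity assumed sampled once per year."""
    years = len(equity) - 1
    return (equity[-1] / equity[0]) ** (1 / years) - 1

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for x in equity:
        peak = max(peak, x)
        worst = max(worst, (peak - x) / peak)
    return worst

equity = [100, 130, 110, 160, 200]   # made-up yearly equity points
mar = cagr(equity) / max_drawdown(equity)
print(round(cagr(equity), 4), round(max_drawdown(equity), 4), round(mar, 2))
```

Here CAGR (about 18.9%) exceeds MaxDD (about 15.4%), so MAR is above 1: a point above the diagonal.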

Enjoy!
Attachments:
chart6.png (Figure 6)

corvus
Senior Member
Posts: 31
Joined: Sat Sep 01, 2007 11:41 pm

Post by corvus » Fri Nov 09, 2007 11:21 am

Sluggo: I really appreciate the considered and detailed response. It wasn't exactly the answer I was expecting, but in several ways it's more beneficial and thought-provoking.

One question immediately comes to mind though. You said that you keep a mental database of systems. Do you compare the systems against an arbitrary standard and discard the ones that you see as low-probability performers? For example, a system with a MAR of X, an R-cubed of Y, and a Max DD of Z warrants consideration; worse systems do not. The reason I ask is that in different markets (trending or not, volatile or not) different strategies work better than others, and I am wondering if this is part of your decision-making calculus.

Turtle40: I'll take the suggestion and get them. I've been looking at them already, and that is pretty much all it will take to push me over the edge.

Thanks

sluggo
Roundtable Knight
Posts: 2986
Joined: Fri Jun 11, 2004 2:50 pm

Post by sluggo » Tue Nov 13, 2007 9:00 am

Think positively: some day (perhaps even today) your account will be large enough to trade a Suite of several different systems, simultaneously. Thus you'll benefit from saving all the systems you test and all the test results you generate, because you will want to revisit them later when it's time to put together a Suite (or a larger Suite than you have right now) of a number of systems. So I wouldn't recommend using the word "discard" when discussing system test results; I much prefer to "archive" rather than discard.

The way to improve your performance at task X is to perform task X over and over. "How do I get to Carnegie Hall? Practice, practice." How do you become a good Chess player? Play a lot of games of Chess. How do you improve your game of golf? Play a lot of rounds of golf. How do you become a black belt in Judo? Train many months and practice a lot of Judo. How do you become an accomplished mechanical trading systems tester? Test a lot of mechanical trading systems.

As you're testing systems and archiving the results, you will start to notice patterns. You'll see that systems you really like have one set of characteristics, while systems you don't like at all, have another set of characteristics. Nobody can tell you what these characteristics will be, because they're not you, they don't know what you will or won't really like. You might be unfazed by "Open Equity Drawdown", for example. Some people are, others vehemently are not. There are dozens of different ways to "measure" the "goodness" of a mechanical trading system: Sharpe, MAR, R-Squared, R-Cubed, Sortino, Ulcer Index, and so forth. Ask yourself: why so many? The answer is, different traders have different ideas about "what is good" and so they need different tools to measure "how much do I like this system". This is why Blox produces such a large amount of output, and gives you Preference check boxes to enable or disable it, piece by piece: because different traders pay attention to different aspects of system behavior. What's important to Anderson is unimportant to Zimmerman and vice versa.
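As a concrete illustration of why the different measures disagree, here is a sketch of two of them computed from the same return series. The formulas are the common textbook versions (risk-free rate assumed zero for simplicity); Blox's exact definitions may differ, and the monthly returns are made up:

```python
import statistics

def sharpe(returns, periods_per_year=12):
    """Annualized Sharpe ratio: mean over standard deviation of ALL returns."""
    return statistics.mean(returns) / statistics.stdev(returns) * periods_per_year ** 0.5

def sortino(returns, periods_per_year=12):
    """Annualized Sortino ratio: penalizes only the downside returns."""
    downside = [min(r, 0.0) for r in returns]
    downside_dev = (sum(d * d for d in downside) / len(returns)) ** 0.5
    return statistics.mean(returns) / downside_dev * periods_per_year ** 0.5

monthly = [0.03, -0.01, 0.05, -0.02, 0.04, 0.01]   # made-up monthly returns
print(round(sharpe(monthly), 2), round(sortino(monthly), 2))
```

Because Sortino ignores upside volatility, the two ratios can rank the same pair of systems differently whenever the return distributions are skewed, which is exactly why a trader must decide which measure matches what they personally care about.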

While you're building your physical database of system test results, consisting of printed pages and burned CD-ROMs and directory hierarchies on your hard drive, you will also be building a mental database, of opinions and impressions and patterns. Eventually you will have tested enough systems and studied enough results and looked into your heart to decide what's important to YOU enough times, that you will have effectively internalized the important aspects of the archive. You'll still keep the physical archives, and indeed you'll add to them when you generate new test results, but you'll find that you won't need to consult the archives as much as you used to. You will have achieved the state of Knowing Yourself and in particular, knowing what you like.

I don't possess a shortcut that will let you achieve enlightenment without doing a ton of work beforehand. And I'm dubious that such a shortcut even exists, but of course would be delighted to be proven wrong.

If it sounds like too much work and not enough reward, congratulations, that is a discovery of Self Knowledge that few people attain. If so, this passage from the first Market Wizards book (p.167) might comfort you:
Interviewer: What is the most important advice you can give the average trader?

Wizard: That he should find a superior trader to do his trading for him, and then go find something he really loves to do.

corvus
Senior Member
Posts: 31
Joined: Sat Sep 01, 2007 11:41 pm

Post by corvus » Tue Nov 13, 2007 10:37 am

I'm fairly certain I understand what you are saying: it's smart to keep the results of your research so that you can use them as either all or part of your overall trading system, in the event they fit with the plan. But you probably would not field one of the poor instantiations of a system unless you had a cogent theory behind it. Perhaps this would be clearer if I used your example.

On chart 3, a system with an average trade duration of 400 days and an R-cubed of 0.5 wouldn't seem to make the cut when compared with one that is more along the lines of 200 and 2.5, respectively. You might test the suite with these two along with a few more at different points along the curve just to make sure, but unless there were some unexpected results, it would probably not pan out. Am I wrong here?

It seems somewhat like picking tomatoes at a store. If it isn't red I don't really want it. If it's damaged then the same is true. Same for hard, unripe ones that only appear to be ripe. Once you are in that range, it pretty much comes down to chance. Again, is this reasonable?

Cross-applying that way of thinking to the same graph, the ones I would be willing to look at are more likely to be, say, above an R-cubed of 2, or maybe ones with an average trade duration of between 150 and 250 market days and an R-cubed of 1 or even 1.5. I'm sure the cutoffs vary by system and market, and that is where I think the mental database (or high-water mark) comes in.

If I'm on track, the point I'm making is that it seems likely 70% (or more) of the systems wouldn't warrant secondary consideration under most if not all circumstances; this might be a moot point with modern-day storage being so large. But managing the huge number of results seems like it would become an ever-increasing task. I'm wondering about the heuristics of all this.
