A discussion about generating synthetic data?

Discussions about the testing and simulation of mechanical trading systems using historical data and other methods. Trading Blox Customers should post Trading Blox specific questions in the Customer Support forum.
yoyo2000
Roundtable Fellow
Posts: 58
Joined: Fri Jan 30, 2004 10:37 pm

A discussion about generating synthetic data?

Post by yoyo2000 »

How about generating synthetic data to test some trading ideas or methods? Does any logical and reasonable way to generate such data exist?

Is anyone interested in this?
TradingWindow
Contributing Member
Posts: 9
Joined: Sat Aug 28, 2004 7:53 am
Location: London, UK (Europe)

Re: A discussion about generating synthetic data?

Post by TradingWindow »

yoyo, try

"Options, Futures, and Other Derivative Securities by John Hull"

tw
Forum Mgmnt
Roundtable Knight
Posts: 1842
Joined: Tue Apr 15, 2003 11:02 am
Contact:

Post by Forum Mgmnt »

In order to answer this question properly, we need to know why you want this data and why you think synthetic data is somehow better than actual historical data for this purpose.

There are lots of ways to generate synthetic data but each is better suited to some specific purposes than to others.

- Forum Mgmnt
Jake Carriker
Site Admin
Posts: 1493
Joined: Fri Sep 12, 2003 10:32 am
Location: Austin, Texas

Post by Jake Carriker »

I would like to talk about this topic. I propose that the forum members evaluate the following method of creating synthetic data series. It is outlined in Tushar Chande's book "Beyond Technical Analysis". I don't consider it to be off limits for description here, as it is presented in the book as an example of how to create synthetic data, not a proprietary methodology.

Here is how it is done. Convert standard O,H,L,C price data into "delta" format by subtracting the previous close from the current open, high, low, and close in turn. Thus, you end up with each bar being described entirely in relation to the previous close. For example, given delta values of .01, .05, -.03, -.01 and a previous close of $10, we can infer that today's price data is 10.01, 10.05, 9.97, 9.99. Each bar of data is converted to the delta format, and bars should be numbered from 1 to N (where N is the last bar in the series).

Next, the bar numbers are "scrambled" as follows: generate a random whole number (X) between 1 and N. Assign the delta data found in bar #X to "bar #1" of the synthetic series. Repeat the process for bar #2, bar #3, and so on to generate a theoretically infinite number of bars of synthetic data.
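To make the mechanics concrete, here is a minimal sketch of the scramble in Python (the column names, function name, and seed are illustrative assumptions, not anything from Chande's book):

Code: Select all

import numpy as np
import pandas as pd

def scramble_deltas(bars: pd.DataFrame, n_out: int, seed: int = 0) -> pd.DataFrame:
    """Chande-style scramble: express each bar as deltas from the prior close,
    draw historical bars at random (with replacement), and chain them into a
    new synthetic series."""
    rng = np.random.default_rng(seed)
    prev_close = bars["Close"].shift(1)
    deltas = bars[["Open", "High", "Low", "Close"]].sub(prev_close, axis=0).dropna()

    picks = rng.integers(0, len(deltas), size=n_out)  # random bar numbers, repeats allowed
    rows, close = [], float(bars["Close"].iloc[0])    # start from the first real close
    for i in picks:
        d = deltas.iloc[i]
        bar = {col: close + d[col] for col in ("Open", "High", "Low", "Close")}
        rows.append(bar)
        close = bar["Close"]                          # next deltas apply to this synthetic close
    return pd.DataFrame(rows)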

What you end up with is daily price changes that correspond to historical price changes but are presented in a different order. Price changes can also repeat or be skipped (e.g., October 1987 might happen 5 times or 0 times in your new synthetic series). The idea is that the "character" of a market is preserved while creating a totally unique data set. If the theory is correct, the method might create totally new data that retains the "signature" of the individual market it was created from.

The purpose of creating synthetic data in this manner is to create plenty of grist for the hypothetical backtesting mill. I don't know whether there is merit in the idea or not, but it certainly is interesting. If it works, it would make it easier to validate the robustness of a strategy.

What do you think?

Best,
Jake
ksberg
Roundtable Knight
Posts: 208
Joined: Fri Jan 23, 2004 1:39 am
Location: San Diego

Travesty

Post by ksberg »

Jake, I think the method has merit, but it needs additional information to preserve market characteristics. As described, I believe the method is too random. We could borrow synthetic-generation concepts that have already proven to preserve characteristics in other fields such as language and music. For instance, the Travesty algorithm has been used to create pseudo works of Mozart, Bach, Poe, or Shakespeare. What Travesty adds is the probability of seeing a random event after a series of related events. Afterwards, it's the sequence of events that adds the characteristics that remind people of the original.
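To illustrate the flavor of that idea (this is not the Travesty code itself, just a rough sketch of conditional sampling on quantized daily changes; the bucket size, order, and names are arbitrary assumptions):

Code: Select all

import random
from collections import defaultdict

def travesty_like(changes, order=2, n_out=1000, seed=0):
    """Instead of drawing each daily change independently, draw it conditioned
    on the previous `order` changes (quantized into crude buckets), so short
    sequences that were common historically stay common in the output."""
    rng = random.Random(seed)
    buckets = [round(c, 2) for c in changes]          # crude quantization of each change
    table = defaultdict(list)
    for i in range(order, len(changes)):
        table[tuple(buckets[i - order:i])].append(changes[i])

    state, out = tuple(buckets[:order]), []
    for _ in range(n_out):
        c = rng.choice(table.get(state) or changes)   # fall back to an unconditional draw
        out.append(c)
        state = state[1:] + (round(c, 2),)
    return out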

I don't know what key characteristics are important to preserve, but I'm guessing price and volatility are two minimal components.

Cheers,

Kevin
yoyo2000
Roundtable Fellow
Posts: 58
Joined: Fri Jan 30, 2004 10:37 pm

Post by yoyo2000 »

TradingWindow, sorry, but could you tell me in which chapter the book discusses this topic? I couldn't find it. :(

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"In order to answer this question properly, we need to know why you want this data and why you think synthetic data is somehow better than actual historical data for this purpose. "

Faith, the reason I want this data is that I don't have enough relevant market data to backtest some of my trading ideas, for example, a method that generates only a few trading signals per year. A statistical analysis (especially one assuming a normal distribution) needs at least 30 samples, so I have to collect more market data to generate more trade examples, and the more market data, the more reliable the conclusion. We can get the examples from several markets, but getting them from markets that share the same characteristics is another reasonable way. Of course, the best data is real market data, but synthetic data with the same characteristics as the market should be a good second choice when the real data isn't enough. And if I only want to trade, for instance, the currency market, it seems to mean little to test across all tradable markets; only when I trade all the signals the system generates can I gain the edge of the system.

And I'm sure, Faith, your intelligent suggestions will be very useful to me.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Faith, Carriker, and ksberg, I've generated some synthetic data by the means mentioned in the book "Beyond Technical Analysis", and the pics are packed in the attachment.

And I'm studying another way mentioned in the book "Trading System Analysis" by Robert Barnes, called the price vector model, but I have no results yet. Since the book doesn't give the exact steps, here is the theory in brief:

http://www.elitetrader.com/vb/showthrea ... post583206

It's hard for me to tell the differences between the two approaches above; could you please help me point them out? (I would much appreciate recommendations of other, better methods.)
Attachments
data.rar
(123.57 KiB) Downloaded 1239 times
yoyo2000
Roundtable Fellow
Posts: 58
Joined: Fri Jan 30, 2004 10:37 pm

Post by yoyo2000 »

PS:

The way mentioned in the book "Beyond Technical Analysis" (the delta model, for short) seems simple, but I feel the price vector model is more reasonable. Any suggestions in detail?
thanks
Forum Mgmnt
Roundtable Knight
Posts: 1842
Joined: Tue Apr 15, 2003 11:02 am
Contact:

Post by Forum Mgmnt »

yoyo,

We're pretty informal here; addressing me as c.f. is generally better than Faith, and that's probably true for Jake as well.

It seems to me that what you are going to do here is create a very sophisticated platform for building a curve fit system.

I strongly discourage you from proceeding down the course you outline using synthetic data. I don't believe there is any way that you can generate synthetic data that has more information than the underlying market data on which it is based. So if you don't have enough data points to draw any valid statistical conclusions, you don't have enough information, period.

Consider two possible outcomes:

1) Your tests over synthetic data have excellent results. How will you know this is not due to the algorithm you used?

2) Your tests over synthetic data have poor results. How will you know if this is not due to the same problem?

If you generate synthetic data using a purely algorithmic approach, and one that is not based on rearranging past data, you run a much higher chance that you'll come up with a trading system that works great in your theoretical world, but you won't have any way of knowing whether that world resembles the real world where you'll end up placing your trades.

The only way to get more information is to use more market data, even if that data comes from markets you don't intend to trade.

- Forum Mgmnt

P.S. I think there are places for synthetic data but I'd only use it as a robustness check, after you've already decided the system is good enough to trade based on historical market data. Monte-Carlo simulations are a variation on this idea.

Even here, I think there are better ways to get the same result.
Jake Carriker
Site Admin
Posts: 1493
Joined: Fri Sep 12, 2003 10:32 am
Location: Austin, Texas

Post by Jake Carriker »

Forum Mgmnt, I agree with your idea of using synthetic data as a confirmation check for system robustness. Another way to approach the situation is to develop a system entirely on synthetic data and then do the out of sample testing on the real data. This is one way to keep your system from being curve fit from several optimization loops through real data.

I think some of the same pitfalls exist with either method. The primary one is that you must make some assumptions in order to create synthetic data. If these assumptions do not hold, testing results will not be valid. Of course, market activity can change and invalidate prior assumptions in real data too. However, the difference is that real market data is self-evolving (that is, the latest tick necessarily reflects all the evolution of price behavior up to the present moment). With synthetic data, the initial assumptions about the data set will not evolve at all, but simply be recombined in new ways. Granted, the larger the data sample that serves as the model for the synthetic series, the greater the chances of the model being robust. However, if there is a major change in the structure of a market, the person using synthetic data needs to heed this just as much as, if not more than, the person using real data.

On a different note, how does anyone feel about the fact that the Chande method does not normalize for absolute price level? What do you think about measuring the delta as a percentage change and applying that percentage change to the new series, rather than using absolute price changes? Intuitively, if there is a $1.00 delta, this is a very different percentage move if the price of the instrument is $1000 or $1. However, using backadjusted data, I also get commodity prices that are small positive or even negative numbers in the more distant past. It would not make much sense to adjust these in such a manner.
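For concreteness, a percent-delta conversion might look something like this minimal sketch (the function names are illustrative; as noted above, it breaks down when backadjusted prices sit near zero or go negative):

Code: Select all

def to_pct_deltas(o, h, l, c, prev_close):
    """Express a bar as fractional changes from the prior close."""
    return tuple((x - prev_close) / prev_close for x in (o, h, l, c))

def apply_pct_deltas(deltas, prev_close):
    """Rebuild a bar by applying the fractional deltas to a new prior close."""
    return tuple(prev_close * (1.0 + d) for d in deltas)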

Lastly, c.f. wrote:
Even here, I think there are better ways to get the same result.
Could you share some of your ideas about the better ways? I am interested.

Best,
Jake

P.S. - Yes, please address me by first name.
ksberg
Roundtable Knight
Posts: 208
Joined: Fri Jan 23, 2004 1:39 am
Location: San Diego

Valid cases for synthetic data

Post by ksberg »

(BTW: I prefer Kevin)

IMHO, synthetic data is a sophisticated concept that requires careful crafting and application. I think there are cases to be made for synthetic data, but they are special cases to be sure. One that comes to mind is the introduction of the EuroCurrency. Traders needed a way to understand how a new market might trade, so modeling the EC was appropriate.

Here are a few cases where it may be useful:
1. Too little data is available, so data is approximated from other sources (e.g. EC)
2. There is a structural shift in how the contract is defined (e.g. S&P BigPointValue changes)
3. An open outcry market will be traded electronically (support/resistance points are fuzzier in ES than SP)
4. Long-term economic fundamentals are similar, but volumes are significantly different.
5. You believe black-swan price shocks will be more frequent because of political instability.

I'm sure there are others. However, I think c.f.'s point is key here: use it only as a robustness check, not as the basis for evaluating a new system. Synthetic data derived only from existing data will not provide additional information, only more data. But synthetic data derived from multiple sources actually can produce something new. Going back to the music example, think of combining Elvis, reggae, and Led Zeppelin ... you will have something entirely new (and a bit zany).

Cheers,

Kevin
yoyo2000
Roundtable Fellow
Posts: 58
Joined: Fri Jan 30, 2004 10:37 pm

Post by yoyo2000 »

Well, I like the informal mood here. :)

Forum Mgmnt, my initial thought was to backtest some methods on synthetic data derived from real market data, and carry the ones that show positive expectancy over to the real historical data of the markets I trade, for an out-of-sample test. The logic behind it is: the synthetic data shows a wider variety of market actions similar to the real market, so if I get a good mark on the synthetic data, the probability of getting a good mark on the real market should be high. But now it seems there is no necessary connection between them...

IMHO, many of my ideas have been knocked down since the discussion began. That's a good thing, and I hope I will learn more from the discussion.


"1) Your tests over synthetic data have excellent results. How will you know this is not due to the algorithm you used?

2) Your tests over synthetic data have poor results. How will you know if this is not due to the same problem? "

In fact, if I got a good mark on some synthetic data, I'd like to try to prove whether it's due to the algorithm I used to generate it, since my underlying assumption is that the algorithm behind the synthetic data is distinct from anything derived from the real data (that's also where the legal principle comes from that you're innocent until proven guilty). But until now I have really had no idea how to distinguish the two in a rigorous statistical or logical way (if one exists).

The synthetic data is generated randomly; in this regard, it seems not to be curve-fitting. But it has the same distribution as the real data, and no new underlying features will appear (from Jake's point); in this regard, it is curve-fitting. How do we distinguish the two?
In my opinion, this is the essential problem with synthetic data.

I've got a headache, and am looking for hints or guidance on this...


~~~~~~~~~~~~~~~~~~~~~~~~~~~

Jake, your first paragraph points out a key issue. One of the shortcomings of synthetic data is that no new situations are generated, which is of course unrealistic. Maybe Kevin's suggestion helps, but the lack of the self-evolving characteristic is still there.

In my opinion, normalizing Chande's price model into percentage mode should be better. Furthermore, the performance of the trading method should be percent-based when we test on backadjusted data.

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kevin, regarding your viewpoint that "... synthetic data derived from multiple sources actually can produce something new":
I agree that the synthetic data is new to each market, but relative to the whole set of markets the data is derived from, it still provides only more data, not more information. And maybe in each market the probability of some market actions will be statistically higher, even though they seldom or never occurred in the past. That's why they look more zany and discordant. I'm not sure whether this is reasonable...
shakyamuni
Roundtable Knight
Posts: 113
Joined: Wed Jun 04, 2003 9:44 am
Location: Somewhere, Hyperspace

Post by shakyamuni »

I've got a headache, and am looking for hints or guidance on this...
You might try running some simulations with real market data instead of imaginary synthetic data and see if the headache persists.
Nic Garvey
Contributing Member
Posts: 9
Joined: Thu Apr 01, 2004 9:45 am
Location: London, UK

Post by Nic Garvey »

I think that the method Jake Carriker outlines above might be a useful way to test various correlation ideas.

If we create the data so that the scrambling is the same across markets, the correlation of the markets will be maintained (e.g., the 1987 crash will occur simultaneously across all markets, as in reality). Then correlation strategies can be tested against this synthetic data. Are the correlation limits improving performance as expected?

Another approach might be to test the validity of your correlation criteria. Say you use correlation limits that you chose based on a correlation study in Excel. Scramble the data for each market individually so that correlations do not behave as they once did and new ones may arise (e.g., you might find synthetic ED is closely correlated with synthetic CL). Now perform the same correlation study in Excel and change the correlation limits to reflect the new correlations of the synthetic markets. Test the synthetic data with correlation limits versus the synthetic data without correlation limits. How does this compare with a similar test on actual data?
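A minimal sketch of the first idea (assuming every market has already been converted to a list of per-bar deltas; the names and seed are illustrative): one shared index sequence drives the scrambling of every market, so bar k of each synthetic series comes from the same historical day. Using a different seed per market would give the independently scrambled variant in the second approach.

Code: Select all

import numpy as np

def scramble_portfolio(delta_series: dict, n_out: int, seed: int = 0) -> dict:
    """Apply ONE random index sequence to every market's delta list so that
    cross-market events (e.g. a common crash day) stay lined up."""
    rng = np.random.default_rng(seed)
    n = min(len(d) for d in delta_series.values())
    picks = rng.integers(0, n, size=n_out)  # the same historical bars for every market
    return {name: [deltas[int(i)] for i in picks] for name, deltas in delta_series.items()}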
Jake Carriker
Site Admin
Posts: 1493
Joined: Fri Sep 12, 2003 10:32 am
Location: Austin, Texas

Post by Jake Carriker »

Here is an update on my experience with the Chande method of creating synthetic data.

I decided to take a relatively simple system that has known (to me) performance characteristics on a broad portfolio of 50 futures markets and subject it to synthetic data for the same markets. The synthetic data is derived using the method outlined in my previous post, and the "seed" data is the very same data that the "test system" has been backtested on.

I created a bit over 60 "years" of synthetic data for each of the fifty markets. I then tested the system over the last 20 "years" of this data (the length of the system's actual historical backtest). I was somewhat unsurprised to see that the system fell apart. Performance fell from a CAGR of 50% plus and a MAR around 1.5 to negative performance and a drawdown of >70%. The numbers just got worse as I added more "years" of data to the test.

Now, one of a few different conclusions may be true. First, I may have truly goofed in the data creation process. I don't think this is true, as I visually inspected all of the charted data, and it looks very similar to the actual markets it was derived from. I also checked for huge gaps, etc., that are signs of bad data. However, it is possible that I messed up. Second, synthetic data built in this manner may not capture the subtle qualities of price movement that allow a simple trend following system to extract profits. Third, the data may be fine and my system, simple though it is, may be a curve fit piece of #$*&. Fourth, my system may be fine and the data may be fine, and it is only that my rose colored glasses have been removed. It is possible that almost every system that anyone has made money with over the last 30 years has worked by a sheer run of luck, and if traded in most "alternate universes" (or perhaps on tomorrow's data) would fall apart.

Now, I have lots of circumstantial evidence that #4 is not true, but I can't prove it. I also have some evidence (few parameters, consistent profitability over many markets, successful out-of-sample testing, etc.) that #3 is not true. As for #1, it is not likely that such consistently bad performance would be the result of errors in data creation. The kind of price spikes and disjointed action that comes from that is easy to spot. Therefore, I am left to ponder whether #2 is the culprit.

Does anybody else have some research results from work with synthetic data? Perhaps other methods give better results, or maybe synthetic data is not where it's at. Of course, the scariest thing is if the results I got are not the result of the data, but evidence that backtesting should give us less confidence than it often does.

Best,
Jake
Roger Rines
Roundtable Knight
Posts: 2038
Joined: Wed Oct 06, 2004 10:52 am
Location: San Marcos, CA

Post by Roger Rines »

Jake,
I found your ideas really interesting. Here are some thoughts on what I'm finding in my synthetic contract testing.
Jake Carriker wrote:Forum Mgmnt, I agree with your idea of using synthetic data as a confirmation check for system robustness. Another way to approach the situation is to develop a system entirely on synthetic data and then do the out of sample testing on the real data. This is one way to keep your system from being curve fit from several optimization loops through real data.
Another method that might work is to test the data using methods that have worked well for a long time on that market series. Using known good robust methods can be helpful in finding the scrambled data that has retained the market's characteristics if they produce reasonable results. These selected files could then be used to test new methods on some of the scrambled data. When the new method is deemed viable, it could then be tested on the remaining data series including the real market file. If the testing going both ways works, it would seem we would have an extended data series that could help us get a better feel for unseen market conditions.

Jake Carriker wrote: I think some of the same pitfalls exist with either method. The primary one is that you must make some assumptions in order to create synthetic data. If these assumptions do not hold, testing results will not be valid. Of course, market activity can change and invalidate prior assumptions in real data too. However, the difference is that real market data is self-evolving (that is, the latest tick necessarily reflects all the evolution of price behavior up to the present moment). With synthetic data, the initial assumptions about the data set will not evolve at all, but simply be recombined in new ways. Granted, the larger the data sample that serves as the model for the synthetic series, the greater the chances of the model being robust. However, if there is a major change in the structure of a market, the person using synthetic data needs to heed this just as much as, if not more than, the person using real data.
A possible solution for the problem you mention about the synthetic data not growing would be to preserve the seed value used to generate the original sequence. That way, the random sequence generator would produce the same number sequence when it builds the new file. If the source data file has grown, some of the sequence might change, but in general the overall sequence for building the file would be the same. Another possible way to do this would be to use an expanded data range on the original data series. This would allow the user to generate a longer sequence from the original source.

Seed preservation could have other uses as well. For example, if a data series tested well with known good methods, that same seed value could be used to generate another market series to see how the spread relationship would survive under testing by known good systems.
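A tiny illustration of the mechanism (not Roger's software, just the general idea of a preserved seed): the same seed always reproduces the same ordering, which is what would let a "good" sequence be regenerated later or reused to build a correlated market.

Code: Select all

import numpy as np

def scramble_order(n_source_bars: int, n_out: int, seed: int) -> np.ndarray:
    """The same (seed, n_source_bars, n_out) always yields the same ordering of
    source bars, so a synthetic file can be rebuilt or applied to another market."""
    return np.random.default_rng(seed).integers(0, n_source_bars, size=n_out)

order_a = scramble_order(5000, 15000, seed=42)
order_b = scramble_order(5000, 15000, seed=42)
assert (order_a == order_b).all()  # identical ordering from the preserved seed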
Jake Carriker wrote: On a different note, how does anyone feel about the fact that the Chande method does not normalize for absolute price level? What do you think about measuring the delta as a percentage change and applying that percentage change to the new series, rather than using absolute price changes? Intuitively, if there is a $1.00 delta, this is a very different percentage move if the price of the instrument is $1000 or $1. However, using backadjusted data, I also get commodity prices that are small positive or even negative numbers in the more distant past. It would not make much sense to adjust these in such a manner.
This was a really great idea that I'm going to try testing. To be sure I really understand your idea, here is the approach by which I'll disassemble and reassemble the data:

Code: Select all

   '   To disassemble a price record's elements (Elements = Opn, Hi, Lo, Cls)
   Price_Element = (Current_Price_Element - Previous_Close) / Previous_Close

   '   To reassemble a price record from its elements
   New_Price_Element = (Previous_Close * Price_Element) + Previous_Close

If this is the intent, then I'll incorporate the process in the software I'm using here. If you or others would like to do some testing with this software it is available for the asking.
Roger Rines
Roundtable Knight
Posts: 2038
Joined: Wed Oct 06, 2004 10:52 am
Location: San Marcos, CA

Post by Roger Rines »

Jake Carriker wrote:Here is an update on my experience with the Chande method of creating synthetic data.

[SNIP]

I created a bit over 60 "years" of synthetic data for each of the fifty markets. I then tested the system over the last 20 "years" of this data (the length of the system's actual historical backtest). I was somewhat unsurprised to see that the system fell apart. Performance fell from a CAGR of 50% plus and a MAR around 1.5 to negative performance and a drawdown of >70%. The numbers just got worse as I added more "years" of data to the test.

[SNIP]

Does anybody else have some research results from work with synthetic data? Perhaps other methods give better results, or maybe synthetic data is not where it's at. Of course, the scariest thing is if the results I got are not the result of the data, but evidence that backtesting should give us less confidence than it often does.
Jake,
I've noticed this same result but didn't see your possible problems, because one of the systems I've used so far has been a solid performer from around 1988 to right now. With this known good method, the issue I'm considering is that some generated data series lose too much of the market's characteristics to be considered acceptable. To probe this, I kept generating contracts using different seed values to see what would happen. Those results showed that one simple system did fairly well across the series when you look at the series in aggregate, but a more complicated system (known good for years) didn't fare as well on the same series. In Tushar's book he recommends using the aggregate results for your expectations. This is a reasonable assumption, but I'm not sure some file-sequence selection wouldn't serve the process a little better. Tushar doesn't speak to this point in his book, so I have nothing other than limited test results to indicate whether this might be a good improvement or just chance.

On this same point, when I began looking at some of the trades in the bad-results files I noticed they weren't running as long as the trades in the good files. I then tested the trendiness of the synthetic series using a crude Major_Swing algorithm to see how it compared to the original market file. Those results also showed the trends between the major peaks were significantly shorter.
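Roger doesn't describe the Major_Swing algorithm itself; purely for illustration, a generic zig-zag-style check (the threshold and names are arbitrary assumptions, and it assumes positive prices) could compare average swing length between the real and synthetic files:

Code: Select all

def swing_lengths(closes, threshold=0.05):
    """Crude zig-zag: track the running extreme of the current swing and end the
    swing once price retraces more than `threshold` (fraction of the extreme).
    Shorter average swing lengths suggest less trendy data."""
    lengths, start, extreme, direction = [], 0, closes[0], 0
    for i in range(1, len(closes)):
        px = closes[i]
        if direction >= 0 and px >= extreme:
            extreme, direction = px, 1                # new high of an up-swing
        elif direction <= 0 and px <= extreme:
            extreme, direction = px, -1               # new low of a down-swing
        elif abs(px - extreme) / abs(extreme) > threshold:
            lengths.append(i - start)                 # swing reversed: record its length
            start, extreme, direction = i, px, -direction
    return lengths

Comparing the average of swing_lengths() for a synthetic file against the original market gives one rough measure of how much trendiness the scramble kept.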

All this was done using the Swiss Franc currency file going back to 1982. This market is a known good trend market, as are some of the other major currency markets, so it seems some of the random sequences do lose the market's characteristics.
bolter
Full Member
Posts: 11
Joined: Wed May 25, 2005 2:33 am
Location: Singapore

Post by bolter »

Roger,
Great job on the software. I've been reading charts for 20 years and I cannot visually tell the difference between your synthetic data and the original time series.

Several questions if you don't mind:

1. I wonder if you have given any thought to some method of retaining the short-term serial correlation evident in most markets? The premise here is that each bar is not truly independent but rather is somewhat influenced by recent market action, i.e., the market has a short-term memory. Does this suggestion have any merit?

2. Can you please explain the significance of the "Use Static Sequence Series" option and where its use would be recommended.

3. How do you incorporate synthetic data in your testing process? I know this is a fairly open-ended question, but I am a synthetic data virgin and any pointers from your experience would be very useful.

Thanks and regards,
bolter
Roger Rines
Roundtable Knight
Posts: 2038
Joined: Wed Oct 06, 2004 10:52 am
Location: San Marcos, CA

Post by Roger Rines »

bolter wrote:Roger,
Great job on the software. I've been reading charts for 20 years and I cannot visually tell the difference between your synthetic data and the original time series.
Thank you for the kind words.
bolter wrote: 1. I wonder if you have given any thought to some method of retaining the short-term serial correlation evident in most markets? The premise here is that each bar is not truly independent but rather is somewhat influenced by recent market action, i.e., the market has a short-term memory. Does this suggestion have any merit?
This is an interesting question that will need some discussion to explore and implement. Here is why: my first reaction is to wonder how we would label the actual data records so that we could then retrieve them in a way that would emulate the history of the market, while still using a synthetic process of shuffling the data.

While the question sounds simple, it could pose a significant challenge that I don't know how to solve. With that said, I'm really open to ideas on how it might be accomplished, so it can be tested to see if it adds to our knowledge and ability.
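One standard possibility, offered here only as a suggestion (it is not the software's method and is not from Chande's book): resample contiguous blocks of bars rather than single bars, so whatever serial correlation lives inside each block survives the scramble. A rough sketch:

Code: Select all

import numpy as np

def block_scramble(deltas, n_out, block=5, seed=0):
    """Shuffle runs of `block` consecutive delta-bars instead of single bars,
    preserving short-term serial correlation within each copied run."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n_out:
        start = int(rng.integers(0, len(deltas) - block))  # random block start
        out.extend(deltas[start:start + block])            # copy the whole run of bars
    return out[:n_out]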

bolter wrote: 2. Can you please explain the significance of the "Use Static Sequence Series" option and where its use would be recommended.
In the next release of the software, which I hope will be today, I try to explain this in the Help file. However, for those who don't have the Help file, here is a brief outline:

First a definition for "Seed Value"
A Seed Value is a number that determines the ordering of records in the newly created file.

A possible use:
When a user wants to compare the file generation methods, it would be helpful if they had the ability to re-use the same ordering of records. By using the same ordering of records, the methods of capturing and then recreating the data will only differ by the various options the software allows for creating a new file.

In addition, if a user finds a sequence that tests well with known good trading method(s), they might want to create another, correlated market with that same sequence. To do this, the software allows the user to enter the Seed value that was used to create the file that tested well when creating the new file.

Did I answer the question enough?
bolter wrote: 3. How do you incorporate synthetic data in your testing process? I know this is a fairly open-ended question, but I am a synthetic data virgin and any pointers from your experience would be very useful.
Right now this is all an experiment, so I'm going into this with my eyes wide open on how this might turn out. What has been surprising and encouraging is how much the software is teaching us about how the systems react to different market profiles.

It has also been a surprise to find that some files seem to be better than others, and how some synthetic methods don't seem to work at all.

To identify or select files that appear to work as an out-of-sample data series, I've been testing the synthetic data using a proven and an unproven system. We have other proven methods that we'll use later, but for now we are still trying to discover the interesting questions, and most of those have been about how to change the software so we can test the growing list of ideas. For example, going into this project we had no clue we would want to use a static Seed value to create a file. However, once we discovered that some random file sequences seem to lose the market's characteristics and some don't, it was decided we needed to understand how other markets would perform if we used that same ordering sequence for different markets.

We also wanted to know how the different synthetic processes would compare. To do this we needed to take the ordering sequence out of the way and that required us to provide a way of using a static, or known Seed value.

Because this is all an experiment at this stage, we are not letting the results we get from synthetic data influence our appraisal of any new methods.

As can be seen above, we don't know enough about synthetic data to know how we'll use it in our system testing. If we had to give some answer today, we would say that for us to use any synthetic data, we would need to test each file created on a series of known good systems, and then believe the results produced by the synthetic data files were representative of what we get when we test those same known good systems on the original market data files. More than likely, that testing by known good methods would produce a selected file series. At that point, we would want to create some new methods using these same selected files and then test the new methods on the original market files. If at that point the results kept us believing the results were representative, then I think we would probably consider the selected files as valuable contributors in our test data stable. This would then give us an unlimited amount of out of sample data to use as new system ideas surface and are implemented.
bolter
Full Member
Posts: 11
Joined: Wed May 25, 2005 2:33 am
Location: Singapore

Post by bolter »

Roger Rines wrote: With that said, I'm really open to ideas on how it might be accomplished, so it can be tested to see if it adds to our knowledge and ability.
I've noticed this option in several Monte Carlo tools. I presume it is a very similar problem to solve, except for dates and not trades. I've e-mailed you an example.

In addition to various trend-following models I also run some shorter-term models that are based on various patterns/sequences in the market action. The latter obviously would require synthetic data where these short-term characteristics were preserved. Or maybe this is pushing the bounds of what's achievable without prejudicing the outcome. Thoughts?
Roger Rines wrote: Did I answer the question enough?
That's clear to me now thanks.

As for your ideas on incorporating synthetic data as part of testing .... you've provided me with plenty to think about. I'll set aside some time to do some experimenting and share with you any revelations I may have.

All the best,
bolter
Roger Rines
Roundtable Knight
Posts: 2038
Joined: Wed Oct 06, 2004 10:52 am
Location: San Marcos, CA

Post by Roger Rines »

bolter wrote:
I've noticed this option in several Monte Carlo tools. I presume it is a very similar problem to solve, except for dates and not trades. I've e-mailed you an example.
Your screen image of the MathLab Options selection arrived, and it was interesting to see what they make available. After seeing your image I did a search of the net to see if there was any source code available that I could use and learn from on how to implement serial/auto correlation on market data. So far nothing has been found that is complete enough to execute as a function for our software. For certain, someone will know how to do this, and hopefully they'll respond. When that happens it will be interesting to see how the synthetic files that test well with known good systems compare to the original market data. Knowing how the files that test well compare, as well as how the files that don't test well compare, might give us a bandwidth filter to use when generating synthetic data.
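For what it's worth, a bare-bones serial correlation check is short enough to sketch (illustrative only; the lag range and the use of daily changes are assumptions). Comparing the lag profile of a synthetic file against the original market would be one crude version of that bandwidth filter.

Code: Select all

import numpy as np

def autocorr_profile(changes, max_lag=10):
    """Lag-1..max_lag autocorrelation of a series of daily changes."""
    x = np.asarray(changes, dtype=float)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return [float(np.dot(x[:-k], x[k:])) / denom for k in range(1, max_lag + 1)]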

bolter wrote:
In addition to various trend-following models I also run some shorter-term models that are based on various patterns/sequences in the market action. The latter obviously would require synthetic data where these short-term characteristics were preserved. Or maybe this is pushing the bounds of what's achievable without prejudicing the outcome. Thoughts?
At this stage I don't have much more than anecdotal evidence of how the short-term system described and reported on in the software's Help file performs. That simple, quick in-and-out system also seems to find that some synthetic contracts don't work well, but others do very nicely as surrogates. It doesn't use short-term patterns, so it will be interesting to hear what you learn from your testing.

As for "pushing the bounds", sometimes keeping it simple works best when trading. We might find that our statistics give us data, but don't help us as much as a known good system's test results in finding a market file surrogate. Of course I don't know where this will come out after we've been testing this for a long period, so by keeping our options open and available will help us to preserve the data and that will give us history to judge at a later time.

We might also find that if the market loses the important characteristics we need, nothing will work with those files. However, my long-term system uses a pattern entry process, and when it doesn't find the pattern it expects, it doesn't generate the signal and the performance fails. At this point I don't think we know enough, so keeping good notes will be our only hope of finding the path to the best decisions.
bolter wrote:
As for your ideas on incorporating synthetic data as part of testing .... you've provided me with plenty to think about. I'll set aside some time to do some experimenting and share with you any revelations I may have.
Thanks for the kind words and support. It was my hope that we would expand what is known about synthetic data. In the past when this was discussed, people would take a position but wouldn't share any facts supporting it. The same is also true when I hear that some players with "large hands" are using synthetic data in their method development, but the facts about how the data was created never follow.

From what I can tell at this stage, it looks like the currencies will be one of the markets where synthetic data will work for out-of-sample testing of long-term systems. As time permits, we'll probably find other markets as well. Let's keep sharing until we find the pony that is filling the room.