WARNING: The following rant is somewhat technical. It’s a departure from the economics/finance of quantitative equity into the actual technical implementation and the challenges that are often faced. If you don’t know what threading, serialization, and parallel computing are, you’ll probably want to stop reading right here!
I use Matlab to analyze the gigabytes of equities data I need to put on trades. Last week I ran into a frustrating brick wall with Matlab that I can’t seem to get around. Hopefully this helps someone else in a similar situation.
Until a couple of weeks ago, it would take 8 hours for my program to simulate 20 years of trading activity and spit out results. Clearly this is annoying since if I want to change a quantitative factor or two and see how things would look, I have to let it run for 8 hours! At this point, I was fully utilizing Matlab’s parallel computing features (which actually helped considerably – without them I would have seen a run time of 20+ hours!).
I had the idea of implementing some kind of caching scheme to speed up the data retrieval from a heavily normalized database. That took a bit of programming to implement, but is now working like a charm. Run time was reduced from 8 hours to about 3 hours. That’s a big improvement, but faster still would be better.
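The idea behind the caching scheme can be sketched in a few lines of Java (the layer Matlab’s database toolbox sits on anyway). The class and key scheme here are hypothetical, not my actual code: the point is just that a lookup keyed on what was queried lets repeated requests skip the database round-trip.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: cache monthly query results keyed by a string
// like "table:month" so repeated requests never touch the database.
public class QueryCache {
    private final Map<String, double[]> cache = new HashMap<>();

    // 'loader' stands in for the real database fetch.
    public double[] get(String key, Function<String, double[]> loader) {
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        QueryCache c = new QueryCache();
        final int[] fetches = {0};
        Function<String, double[]> loader = k -> {
            fetches[0]++;                     // count real "database" fetches
            return new double[]{1.0, 2.0};
        };
        c.get("prices:2001-01", loader);
        c.get("prices:2001-01", loader);      // second call is served from the cache
        System.out.println(fetches[0]);       // prints 1
    }
}
```

The win comes from the simulation revisiting the same months many times: after the first fetch, every subsequent lookup is a hash-map hit instead of a query against a heavily normalized schema.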
To understand where the bottleneck was, I killed off one factor at a time and measured how runtime was affected. In the end, I found that killing all factors (so for each month my program essentially connects to the database and disconnects) made an inconsequential difference. That is, the majority of the time is spent in the database connect/disconnect (something I, as a former programmer, should have known quite well!).
So, the solution: reuse the database connections (in the technical world, known as connection pooling). I spent many hours investigating Matlab’s database toolbox and its ability to pool connections. To make a long story short, Matlab’s database connectivity (built on Java JDBC drivers) doesn’t support connection pooling. I suppose if I were running it in an application server (Tomcat, something else…), I might be able to make it work. But I have neither the environment, expertise, nor desire to go down that path.
So, I decided to implement my own pooling. To do this, I wrote a Java connection pool manager. That is actually a simple task, since a connection pool manager is just a vector of connections plus the logic to hand back an idle connection for reuse. Having never done Matlab/Java integration before, it took me a bit to figure out how to make it all work, but in short order I was there. All set, right? No…
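To make “a vector of connections and the logic to reuse an idle one” concrete, here is a minimal sketch of such a manager. It is not my actual class: in the real version the pooled objects would be `java.sql.Connection` instances opened via `DriverManager`, but a generic type stands in here so the sketch is self-contained and runnable.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Minimal connection pool sketch: a stack of idle connections plus the
// logic to hand one out on acquire() and take it back on release(),
// opening a new connection only when no idle one is available.
public class SimplePool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;   // stands in for DriverManager.getConnection(...)
    private int created = 0;

    public SimplePool(Supplier<T> factory) {
        this.factory = factory;
    }

    // Reuse an idle connection if one exists; otherwise open a new one.
    public synchronized T acquire() {
        if (!idle.isEmpty()) return idle.pop();
        created++;
        return factory.get();
    }

    // Return a connection to the pool instead of closing it.
    public synchronized void release(T conn) {
        idle.push(conn);
    }

    public synchronized int connectionsCreated() { return created; }
}
```

Acquire, release, and acquire again, and you get the same object back with only one connection ever opened: exactly the connect/disconnect overhead the monthly loop was paying for over and over.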
I changed my code to use the connection pooling rather than native Matlab database connections. However, it blew up due to the parallelization. What I didn’t realize going in is that when Matlab parallelizes code, it serializes the workspace objects, ships the work out across its pool of workers, and then deserializes the objects on each worker. Unfortunately, the Matlab database object doesn’t support serialization. Thus, there was no way to pass a connection object into parallel code. If I eliminate the parallel code, then I can use connection pooling, but everything will run synchronously on a single worker – eliminating the need to even use multiple connections, and much of the performance boost I’ve seen so far.
I also learnt along the way that a Matlab database object isn’t the same as the underlying JDBC database object. Matlab encapsulates the JDBC equivalent in some custom structures. Once I realized this, I thought that maybe I could get around this issue by using JDBC objects directly rather than their Matlab equivalents. However, the JDBC connection objects also don’t support serialization, bringing me back to a dead end.
So, I’m left with a frustrating situation:
- The only way to keep the parallelization is to have a single connection pool hand connections to the parallel code. Having multiple connection pools (one per worker) is silly and no better than where I am now. But passing a connection from a single pool into the parallel code requires serialization, which JDBC connections (and thus Matlab connections) don’t support.
- I could skip parallelization altogether and run everything single threaded: this’ll likely take me back to 20+ hours!
- Leave things the way they are: the parallel code creates a connection, works with the data, and disconnects. This totals a 3 hour runtime, but seems to be the lesser of all evils.
- Rebuild an environment that uses some kind of application server that provides the connection pooler. I’m not even sure what this would look like, how Matlab would play in the environment, and the necessary support tasks involved in maintaining the environment. This is probably the least attractive alternative.
- Use .NET connections in Matlab, since .NET provides connection pooling natively when using SQL Server. This sounds like a great option, but a .NET connection isn’t a JDBC connection, so I can’t use the already existing Matlab database infrastructure. I’d have to rewrite all database operations: reading, updating, etc.
- Scrap Matlab altogether and switch to a (real) programming language like Java or C# (most likely the latter in my case). This would allow me to use connection pooling without a bunch of custom Matlab database connectivity work. Of course, I’m not sure how this would compare in terms of execution speed for large array operations (which Matlab is famously good at).
Conclusion: for now I think I’ll leave things as they are. I have “bigger fish to fry” than the details of the simulator implementation. Besides, I’m not sure how much run time all the work the .NET or full-rewrite alternatives demand would actually save me. If I go from 3 hours to 2.5, the exercise seems pointless. If I can get from 3 hours to under an hour (which is what I think would happen), it may make sense. Either way, this’ll have to wait until I have nothing else to do and feel like rewriting all my infrastructure.