Thursday, September 1, 2011

Modern Portfolio Theory - A Python Implementation


I was surprised last week to find there was no accessible Python implementation of the calculation of the Efficient Frontier (as defined by Markowitz in his presentation of Modern Portfolio Theory ~1957).* Since the problem seemed simple to solve with the tools at hand, I set out to "right the wrong" and develop an open implementation (available under an open, BSD license via github) that meets my needs.

In the process of building the basic metrics, I was able to experiment with NumPy's datetime support -- as mentioned in my earlier post. The API's are still a bit in flux, but I thought I would write up a full workflow using the tools as they are. I would greatly appreciate feedback/contributions/suggestions.  So let's dive in ...

Thursday, August 25, 2011

NumPy's New datetime Support and a Financial Database Roundtrip

Last month when I was in Austin, TX at the 10th Annual (!) SciPy Conference I asked around about the status of native datetime support in NumPy. Based on the response of my good friends at Enthought, I've been living under a rock for a while -- it's been finished for some time now.

So, a month later, I've wrangled some time to take a look. I also happened to have a particular problem I wanted to solve. Here's a teaser preview of the output of my Modern Portfolio Theory implementation:



I'll get into the details of that implementation in a later post. For now, let's tackle some data handling.

The immediate task I undertook was to refactor a simple asset price data fetcher (from Yahoo Finance, in this particular case) that I use. I always end up hammering the Yahoo Finance server to pull down price data when I'm doing some ad-hoc analysis, and I thought I should build a simple function to stash that data away locally in a sqlite database file. I know there are other approaches for fetching and manipulating this data out there -- notably pandas aggregation functions, paired with the matplotlib finance functions. I've looked to those tools for guidance, but wanted to have a bit more functional approach, fewer (or at least different) dependencies, and SQL as an interface to the data. I'll say up front that all the code is available under a BSD license via github.

First, I cleaned up the old data fetcher and made it more functional (in the programming sense -- rather than object oriented). Really, I just wanted to be able to do this:



I prefer my api for this over the matplotlib api mainly because I don't want to import datetime, followed by all the startdate = datetime.datetime(2000,1,1) nonsense, just to get a date (pun intended). Anyway, once I've got my data, I want to stuff it into a sqlite db file and be able to (losslessly) pull it back out into a numpy array. This seems fairly simple because pysqlite offers conversion support for python datetime objects. Unfortunately, I wanted dates stored as floats of seconds since the epoch, so I had to do a bit more work. This resulted in the code:



These two functions provide for accurate round-tripping despite the nuances of daylight savings time treatment by numpy and sqlite.

Another important point in the use of the converter function is how one determines that the type is a "datetime." Sqlite uses datetime as a designator for a built-in type -- I'm just overriding that when I register the converter function. In order to have the proper type recognized, we need to specify the detect_types argument when we instantiate our connection to the database for INSERT-ing or SELECT-ing. Also, notice how I specify the type using the as keyword in my SELECT statement:



Now, I have a nice interface for this kind of interaction with my data:



I'm still looking into potential problems with how queries might perform where I've indexed a float column for the date. Also, any suggestions about how to better handle the DST issue would be most welcome. It's pretty hackish now and I'd like to avoid wading into the morass that is Python's congenital complex about round-tripping of timestamps and how to properly do timezones.

Next step: Port some metrics code over to use the numpy datetimes rather than my manual handling of dates. I'll be sharing this code as well as an explanation in a later post.

UPDATE: I've added a handy convenience function, price_utils.symbol_exists to check what data is in the sqlite database for a particular symbol.



It returns a tuple of the form (num_recs, first_date, last_date). Again, the code is available via github. Enjoy!