Tuesday, October 6, 2009

TDD in Python: P&L Statement - Step 2

As a continuation of my TDD experiment, I've implemented a bit more functionality toward a P&L statement. As usual, I've posted all the code to github. This exercise implements the queueing I listed as a need in my first post on this subject. So, for a "story problem":

Eddie buys 1000 shares of AAPL at 10:30am for 185.25 and later, at 11:00am he buys another 1500 shares for 184.00. As he unrolls his holdings, Eddie sells 1500 shares for 186.00. What is Eddie's realized gain/loss? What's the cost basis for Eddie's current holdings?

For this post, I'd like to take a first step in solving my story problem by defining my queue a bit better. The real answer to the problem above is "it depends on whether you use fifo or lifo queueing." I've conceived a Holding object to serve as my queue for any positions I've entered for a particular instrument (as uniquely identified by its symbol). So, let's define the behavior we expect out of a fifo-ordered Holding queue in the form of a test:

def portfolio_holding_fifo_test():
""" test of 'fifo' and 'lifo' queuing by adding and removing
a position.
"""
p1 = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)
p2 = position.Position(symbol="AAPL", qty=1500, price=184.00,
multiplier=1., fee=7.0, total_amt=276000.,
trans_date=1054202245.63)
p3 = position.Position(symbol="AAPL", qty=-1500, price=186.00,
multiplier=1., fee=7.0, total_amt=279000.,
trans_date=1055902486.22)

h = portfolio.Holding()

h.add_to(p1)
h.add_to(p2)

h.remove_from(p3, order='fifo')

assert h.qty==1000
assert len(h.positions)==1
# simple check to make sure the position that we expect is left over...
p = h.positions[0]
assert p.price==184.00

As you can see from lines 5-13 we use our earlier Position object to define the positions that we're holding or removing. I then introduce the notion of a holding object which I can add_to or remove_from. My assertions at the end of my test are some not-very-comprehensive checks to see if the queue behaves as I would expect ... the first position is drawn from when removing the final position, then any remaining shares are drawn from the second position.

Since I expect I'll need to generalize my queuing, I'll write a similar test with 'lifo' ordering:

def portfolio_holding_lifo_test():
""" test of 'lifo' queuing by adding and removing
a position.
"""
p1 = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)
p2 = position.Position(symbol="AAPL", qty=1500, price=184.00,
multiplier=1., fee=7.0, total_amt=276000.,
trans_date=1054202245.63)
p3 = position.Position(symbol="AAPL", qty=-1500, price=186.00,
multiplier=1., fee=7.0, total_amt=279000.,
trans_date=1055902486.22)

h = portfolio.Holding()

h.add_to(p1)
h.add_to(p2)

h.remove_from(p3, order='lifo')

assert h.qty==1000
assert len(h.positions)==1
# simple check to make sure the position that we expect is left over...
p = h.positions[0]
assert p.price==185.25

So, I've written my tests first -- now for the code. I create my Holding class and put it in a file called portfolio.py. Setting up the class and creating my add_to method was a cinch:

# Enthought imports
from enthought.traits.api import (HasTraits, Float, Instance, List)

# Local imports
from position import Position

class Holding(HasTraits):
""" Queue for held positions in the same security (as identified
by symbol). The removal of entries are handled in a 'fifo'
or a 'lifo' order, depending on the order argument of the
remove_from method.
"""

# Total quantity for a particular holding
qty = Float

# List of positions making up the holding
positions = List(Instance(Position))

def add_to(self, position):
self.qty += position.qty
self.positions.append(position)

As you can see, I've again used Traits to make my life easier. In this case I've got a simple qty Float value which I intend to use for ease in checking the total quantity held in my Holding object (as well as a checksum). I also have defined a List of positions as another Trait of my class. From there, the add_to method is pretty much the obvious implementation.

The hard part is in the remove_from, which actually implements the various behavior of the queue ordering. I've punted (for now) on my original goal of a 'wifo' (worst-in-first-out) queue order, as I suspect I'll need to refactor my position class to handle different sorting. The important bits of the code are in the handling of a full position, a partial position, or multiple positions depending of the quantity to be removed. Rather than reproducing all the code here, I'll just provide the link. I use recursion in the case of multiple removals, otherwise it's pretty easy to follow.

This was a fairly simple step, yet I feel much closer to my portfolio/statement goals -- and I feel like I can "turn my back" to some extent on the code I've written so far. That is, it is reasonably tested code that meets the api I've defined by my tests.

Some TDD observations:
1. It is getting a little easier to think in terms of tests first as I've done this second exercise.
2. I did write some additional tests to try to get at the edges of the functionality I was looking for, and to provide more comprehensive regression coverage.
3. Small tests force you into a good habit of writing small chunks of code that work, heightening the impression of progress.

Next up: 1) aggregating the holdings into a real portfolio and 2) introducing a method of logging activity and reporting performance in the form of a statement.

As always, comments or corrections are welcome!

Tuesday, September 29, 2009

Test Driven Python Development Experiment: P&L Statement - Step 1

I began trading equities in my personal portfolio earlier this Spring using an account with Interactive Brokers (whom I'd recommend for their nice tools, good access to products, and published APIs). One of the issues I've had with them, however, is the lack of clarity in a daily P&L statement. Without bogging down in too many details, it boils down to the fact that I'd like to see a few things that are not provided out-of-the box by Interactive Brokers:

  • Realized gain rolled up for each day.

  • An intraday, calculated P&L in the event that I wish to use it as part of an automated trading system.

  • A choice between FIFO, LIFO and what I call WIFO (worst-in-first-out, a more accurate and better-rhyming name for HIFO) handling of trades.

Developing these capabilities seems like an excellent exercise for me to experiment with Test Driven Development (TDD) and share my experiences with others, so consider this a test for TDD using a below-average developer with simple needs. The current state of the code is available at github.

Position Object


The first thing needed is a data structure to hold information about each position. The data populating this data structure will come from a text log of transactions, or, potentially, a database with transaction records. Interactive Brokers provides several formats of text output. I chose to use a text file provided in a TradeLog format. Looking at a few lines of a typical log gives an idea of the type and format of data provided:

ACCOUNT_INFORMATION
ACT_INF|UXXXXXXX|Test Company|Advisor Client|1111 Main Street, San Antonio TX 78201 United States

OPTION_TRANSACTIONS
OPT_TRD|305665068|APVGG|AAPL JUL09 135 C|ISE|BUYTOOPEN|O|20090512|10:22:55|USD|23.00|100.00|5.55|12765.00|-16.10|1.00
OPT_TRD|305668016|APVGG|AAPL JUL09 135 C|NASDAQOM|SELLTOCLOSE|C|20090512|10:24:59|USD|-23.00|100.00|5.61|-12903.00|-10.35|1.00
OPT_TRD|305681552|APVGG|AAPL JUL09 135 C|ISE|BUYTOOPEN|O|20090512|10:36:59|USD|46.00|100.00|5.30|24380.00|-32.20|1.00
OPT_TRD|305721409|APVGG|AAPL JUL09 135 C|ISE|BUYTOOPEN|O|20090512|11:25:29|USD|47.00|100.00|5.15|24205.00|-32.90|1.00
OPT_TRD|305767844|APVGG|AAPL JUL09 135 C|ISE|BUYTOOPEN|O|20090512|12:30:38|USD|23.00|100.00|4.70|10810.00|-16.10|1.00
OPT_TRD|306301447|APVGG|AAPL JUL09 135 C|NASDAQOM|BUYTOOPEN|O|20090513|14:19:58|USD|30.00|100.00|3.19|9570.00|-13.50|1.00

The optimal way to put this information into a data structure in python would probably be a to use a numpy structured array. However, since performance is not an issue (at this time), and I'd like to use this problem as a TDD test for myself, I'll forgo the numpy approach and use more of a brute-force, object-oriented method.

If I were to codify some requirements for my data structure by writing a test, it might look something like this:


import position

def position_test():
""" Dead simple test to see if I can store data in my position object
and get it back out
"""
p = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)

assert p.price==185.25
assert p.description==""
assert p.display_date=="05/22/2003"
assert p.display_time=="08:11:08"

OK, so there are several things implicit in this test (don't worry, I'll break all those asserts into different tests before going forward). One obvious point is that I seem to require some fairly sophisticated handling of datetimes and default values. Other than that, we have a big pile of keyword arguments and a bunch of attributes in a fairly brain-dead object. The conventional approach is to build a giant __init__ with a bunch of lines like this nonsense:

if not description is None:
self.description = description
else:
self.description = 0.0

Frankly, we deserve better than to waste our lives on silly boilerplate code for initialization, validation and defaults. This is where Traits comes to the rescue (there's a nice tutorial here). I'll not go into all the details here, but you'll see how how traits allows us to have much cleaner code as I progress.

As a very simple first test, let's take our first assert and make a stand-alone test:

def position_attribute_test():
""" Dead simple test to see if I can store data in my position object
and get it back out
"""
p = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)

assert p.price==185.25

We can solve it with the following code, which includes all the fields (attributes/traits) I think I'll need:

from enthought.traits.api import (HasTraits, Enum, Float, Int,
Str)

class Position(HasTraits):
""" Simple object to act as a data structure for a position

While all attributes (traits) are optional, classes that contain or
collect instances of the Position class will probably require the following:
symbol, trans_date, qty, price, total_amt

"""


side = Enum("BUYTOOPEN", ["SELLTOCLOS", "BUYTOOPEN", "SELLTOOPEN", "BUYTOCLOSE"])
symbol = Str
id = Int
description = Str
trans_date = Float
qty = Float
price = Float
multiplier = Float(1.0)
fee = Float
exchange_rate = Float(1.0)
currency = Str("USD")
total_amt = Float
filled = Str
exchange = Str

I now can throw my test code into a directory called "test" adjacent to this module (which I named position.py) and I use the excellent nosetest tool to harvest and run any tests I've got. It passes! ... but that was actually not much of a challenge, so let's look at another test function:

def position_initialization_test():
""" Test to see if I can handle fields for which I provide no data.
"""
p = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)


assert p.description==""

... this is similarly a soft pitch over the middle of the plate. Traits has already solved this for us. The test passes with no change to the code.

How about something a little more challenging:

def position_dates_test():
""" Test to see if I'm handling dates correctly
"""
p = position.Position(symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1053605468.54)

assert p.date_display=="05/22/2003"
assert p.time_display=="08:11:08"

So now I have to actually write some code. I won't go into all the painful issues with Python date handling, I'll just provide a link to the date utility code that I wrote to help smooth things over.

I do want to step back and write a test that defines a need for date handling that is currently a bit broken (at least in Python 2.5) -- converting a python datetime to a timestamp and a timestamp to a python datetime:

from date_util import dt_from_timestamp, dt_to_timestamp


def date_util_test():
""" Simple test of correctly transforming a timestamp to a python datetime,
and back to a timestamp
"""

# test a not-very-random sequence of times
for ts in range(0, 1250000000, 321321):
# simply see if we can round-trip the timestamp and get the same result
dt = dt_from_timestamp(ts)
ts2 = int(dt_to_timestamp(dt))
assert ts2 == ts


Picking back up with my position_dates_test ... the instructive part of what I need to do here is in using a Property Trait as a good human interface for my trans_date. The trans_date value is actually a float value indicating seconds (and fractional seconds) since the epoch (which happens to fall in the same year that we put the first man on the moon, the summer of love, Woodstock and, notably, the year of my birth -- 1969!). While the current tally of seconds since 1969 may be at the front of my mind, it's not a particularly good way to represent dates for most people. In fact, most GUIs that handle datetime entries separate the date and time. This makes sense -- we're used to interacting with the two with separate pieces of hardware (a calendar and a clock). The date and time properties are defined after the other traits in my class like so:

date_display = Property(Regex(value='11/17/1969',
regex='\d\d[/]\d\d[/]\d\d\d\d'),
depends_on='trans_date')
time_display = Property(Regex(value='12:01:01',
regex='\d\d[:]\d\d[:]\d\d'),
depends_on='trans_date')

###################################
# Property methods
def _get_date_display(self):
return dt_from_timestamp(self.trans_date, tz=Eastern).strftime("%m/%d/%Y")

def _set_date_display(self, val):
tm = self._get_time_display()
trans_date = datetime.strptime(val+tm, "%m/%d/%Y%H:%M:%S" )
trans_date = trans_date.replace(tzinfo=Eastern)
self.trans_date = dt_to_timestamp(trans_date)
return

def _get_time_display(self):
t = dt_from_timestamp(self.trans_date, tz=Eastern).strftime("%H:%M:%S")
return t

def _set_time_display(self, val):
trans_time = datetime.strptime(self._get_date_display()+val, "%m/%d/%Y%H:%M:%S")
trans_time = trans_time.replace(tzinfo=Eastern)
self.trans_date = dt_to_timestamp(trans_time)
return

The getter and setter methods behave just as you would expect, and the whole thing provides a meaningful interface to the timestamp stored in my trans_date trait. There are some imports that I don't show here for brevity, but it's fairly clean code considering what it does. The best part of all -- my test passes!

Finally, I want to try another test, because I know I'll need to sort my Position objects within a collection. Here's a simple test function:

def position_sort_test():
""" Test to see if I can collect and sort these properly.
The objective is to have the objects sort by the trans_date
trait.
"""


p0 = position.Position(id=102, symbol="AAPL", qty=1000, price=185.25,
multiplier=1., fee=7.0, total_amt=185250.,
trans_date=1045623459.68)

p1 = position.Position(id=103, symbol="AAPL", qty=-1000, price=186.25,
multiplier=1., fee=7.0, total_amt=-186250.,
trans_date=1053605468.54)

p2 = position.Position(id=101, symbol="AAPL", qty=500, price=184.00,
multiplier=1., fee=7.0, total_amt=62000.,
trans_date=1021236990.02)

plist = [p0, p1, p2]

plist.sort()

assert plist[0]==p2
assert plist[1]==p0
assert plist[2]==p1

This just collects a few positions into a list, calls the .sort method on the list and checks to see if it accurately sorts by the trans_date. The method I add to my class to accomplish this is actually quite simple. I define a __cmp__ method which correctly compares objects of this class and I'm good to go:

# support reasonable sorting based on trans_date
def __cmp__(self, other):
if self.trans_date < other.trans_date:
return -1
elif self.trans_date > other.trans_date:
return 1
else: return 0

This pretty much wraps up the functionality I want (so far) in my position class. This is evidenced by my nosetests results:

Oh, and one more thing -- thanks to the miracle of traits I can call the edit_traits method on any position object and get a nice form with all the fields. I've added a view definition to my position class to pare this down a bit:

traits_view = View(Item('symbol', label="Symb"),
Item('date_display'),
Item('time_display'),
Item('qty'),
buttons=['OK', 'Cancel'], resizable=True)

Now if p is a Position object then calling p.edit_traits() pops up the following dialog:

Again, all the code for this is available via github.

Given this simple exercise in TDD, my initial impression is that it's helpful for ensuring good test coverage for features, but I really need to go back and write corner-case tests and tests which will pick up minor regressions in the state of my code better. Also, I found that considering how to best design the code was often a separate exercise from designing tests -- so I could really sense I was switching mental contexts to try to drive the development with well thought-out tests. I'm willing to concede that this is unique to this developer. Overall, I think the jury is still out. I get the feeling I don't have enough experience with TDD to do it well, but hope springs eternal.

I'll handle proper queueing and adding/removing positions in a portfolio in the next post in this series. Until then, please raise any questions or corrections in the comments below.

Wednesday, August 26, 2009

Multidimensional Data Visualization in Python - Mixing Chaco and Mayavi

In a previous post, I recreated an infographic using the Chaco plotting library. Inspired by Peter Wang's lightning talk (scroll to about 5:15 in the video) at the recent SciPy Conference, I've extended this idea a bit to show the exploration of a "4D" data set (three axes and the color/size of the points) and using a 5th dimension (the date) as an interactive filter.

Since it's a whole lot easier to demonstrate than to describe, I made a short screencast of me playing with it:

video

While a bit hackish, the code is available for anyone wishing to play with or improve it.

I know I've said this before, but it bears repeating -- Mayavi is awesome.

Saturday, August 8, 2009

A Very Simple GUI Using Traits

I ran across this post today about some magic with simple syntax for GUI building that Richard Jones is playing around with. It occurred to me to give a simple example of traits usage to do the same thing.

So, thanks to traits, our class definition looks quite clean, and we can provide some manifest typing as well. I show a screen shot of my ipython session, which was launched with ipython -wthread:


So executing line [5] pops up the GUI shown. One nice thing about traits is the built-in MVC architecture, which allows me to change the value in the choices class attribute and it informs the listening label , which is updated automatically. Notice that the value f.choices inspected at the command line is updated as well:


The main drawback I see between this approach and Richard's is the size of the tool chain. The TraitsGUI piece requires wxPython (or QT--it works with either), and some dependencies.

Saturday, August 1, 2009

Infographic in Python using Chaco

This week I ran across a blog post about this New York Times infographic, which explains one of the measures of the "business cycle" based on industrial production (the data comes from the OECD originally). [Update: I also saw the post over at Juice Analytics in which they implemented this in excel.] Never one to pass up an opportunity to re-invent the wheel, I thought it would be a good exercise to implement this in Python using the excellent Chaco plotting toolkit which comes included in some python distributions. So, here's the beginning of that effort, after a few hours digging around the docs and hacking together a GUI:

This is a good start. I've posted the code for this at github.

To really flesh it out, you'd need to add in the Composite Leading Indicator data and make some of the elements update based on the selected range. It would also be cool to dynamically switch out the data for various countries, or view them concurrently. Any takers?

Information Density

I think what makes such a simple interface so compelling is that you are able to see the relationship between three pieces of data. Cross-plots are a great mechanism for visualizing relationships between two data sets, but they're made even more useful when you can highlight a range in a common index (e.g. time, in the case of time-series data, or depth, in the case of depth indexed data in the geophysics arena.) Even with a lot of information presented, the display is very clean--even sparse.

State and State-Transition

This particular graphic also reveals the "state" of the business cycle by partitioning the graph into quadrants. This data set has a very straightforward state inherent in it's construction, but one might imagine more sophisticated calculations of state decorating time series data such as this. I'm beginning to investigate the application of this to stock price streams and some derived state that can be displayed in ways that can be "replayed" and analyzed. Whether real "information" can be teased out of the data will remain to be seen, but I'll try to leverage the visual cortex to gain intuition about the data.

Any comments/suggestions about the approach are welcome.

Wednesday, July 29, 2009

Hello World (Map)

I'm resurrecting my blogging effort outside of my former company--in the hopes of sharing useful tips I discover in my new software pursuits.

One particular area of distraction for me has been in GIS and mapping. I'm using the excellent mapnik toolkit to produce mapping visualizations. Mapnik sits on a fairly heavy stack of dependencies including:
  • GDAL/OGR - libraries and utilities for dealing with a multitude of raster and shape (vector) files. These have very good python bindings
  • PostgreSQL/PostGIS - nice, but not required
  • PROJ4 - Cartographic Projections library
  • BOOST - The forever-to-compile C++ libraries that mapnik uses for wrapping C++ codes.
Getting all of this running on OSX is a huge pain in the neck--especially if I want to use my preferred Python Distribution. Much of this pain is alleviated with Framework installs of many of the dependencies from the excellent kyngchaos site (many thanks!). I'm still not convinced I have all my libraries linked properly (who needs ICU_LIBS anyway?!).

My original mapping goal was fairly simple: Develop a composite map of roads, terrain and some custom features for a nice large-format printout for my son's room. Technically, this meant a few things:
  • converting srtm elevation data [1] to nice contour shape files (gdal_contour).
  • generating a hillshade raster image from srtm data
  • generating a color relief from srtm data
  • properly composing the custom features (considering Cartographica for this, but it's a bit expensive for a hobby project and the product seems at bit...nascent. Suggestions?).
  • beautifully rendering everything using a nice scanline approach.
It's important to note that the data and the tools I've used (with the exception of Cartographica) are all freely available (some tools do have copy-left provisions, and the data are not to be 'sold' in many cases). There is a massive amount of data for my region of interest, and a half day of effort gave me something like this:


Well...there was still much work to be done. The above image used the downloadable contour sets from the two Texas counties that this image spans, and the renderer I was using (cartographica) doesn't do very good text-layout "collision detection," nor does it have very fine-grained control over things like max angles and spacing for labels. So, putting together a simple mapnik script, using my own generated contours, applying hill shading, and tweaking the color relief to make it more wife-friendly yields something like this:


I'm getting close. I need to re-append the roads data and add all my custom features (labels, points of interest, etc.) I'm pleased with the progress so far.

This exercise has led me to think about potentially different approaches. While mapnik is extremely nice, they do use boost which, unfortunately, leads to a more lengthy and difficult build/install. (I'll try to document the OS X steps that I used when I set it up on a new machine--I had to deviate from the "official" installation instructions in a couple of places.) Why didn't they use SWIG? Is the mapnik C++ layer heavily templated? Also, I'd like to see the rendering layer use my favorite stack of 2D rendering tools -- Kiva and Enable. This would allow an intermediate layer which would provide more general vector output routines. I'll try to get a handle on the level of effort for this in the coming weeks, as time allows.

----
[1] The srtm data set is one of the cooler products I've come across--a shuttle mission to map the world and provide free access to global elevation data at 10m or 30 resolution (gladly sponsored by this American taxpayer!).