Working with Financial Data

This post is the first in a series that explains how data scientists can turn their scripts into consumer-facing web applications. The intro post for this series explains the scope of the project and contains an index of other posts.

The code seen in this post can be found in the backend branch of the GitHub repo.

cd /to/desired/directory
git clone -b backend https://github.com/coshx/portfolio_optimizer.git
cd portfolio_optimizer/backend/optimizer

What’s the Big Deal with Time Series Data?

In time series data, one or more variables are measured repeatedly over time. From a statistical perspective, we might say that we observe the same experimental unit repeatedly over time. In finance, the experimental unit is often the stock market, and we observe price variables repeatedly over time. Typically, we cannot assume that observations are mutually independent (and therefore uncorrelated) because past price fluctuations often influence future prices. In financial terms, a dip in price one day can lead to buying pressure the next. This is one reason why dealing with time series data can become a tangled process.
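To make that serial dependence concrete, here is a minimal sketch (not from the application's code) that simulates a return series with momentum and measures its lag-1 autocorrelation with pandas:

import numpy as np
import pandas as pd

# Simulate a return series with momentum: each day's return leans on the
# previous day's, so observations are serially dependent by construction.
np.random.seed(0)
returns = [0.0]
for _ in range(999):
    returns.append(0.3 * returns[-1] + np.random.normal(0.0, 0.01))
returns = pd.Series(returns)

# Lag-1 autocorrelation comes back well above zero: these observations
# are not mutually independent, unlike a plain random sample.
print(returns.autocorr(lag=1))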

How Do We Optimize?

Our portfolio optimizer uses scipy.optimize to find the optimal allocations over a given period of time. For the reasons mentioned above, it is unlikely that these allocations will continue to be optimal long into the future. Cross-validation, a data partitioning technique that lowers the risk of fitting the noise in a data set, can improve the predictive value of our optimized portfolio. However, since implementing cross-validation in our web application raises many frontend UI/UX considerations, we will not implement it here. I strongly encourage others to implement some form of cross-validation before reading too much into our application's results.
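As a rough illustration of the approach, here is a minimal sketch that assumes we maximize the Sharpe ratio of daily returns; the actual objective and constraints in the application's optimizer may differ:

import numpy as np
from scipy.optimize import minimize


def optimize_allocations(daily_returns):
    """Return weights that maximize the Sharpe ratio of a returns frame."""
    n = daily_returns.shape[1]

    def neg_sharpe(weights):
        # Daily returns of the whole portfolio under a candidate allocation.
        portfolio = (daily_returns * weights).sum(axis=1)
        return -portfolio.mean() / portfolio.std()

    # Allocations must sum to one, with no shorting or leverage.
    constraints = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n
    start = np.repeat(1.0 / n, n)  # begin from an equal-weight portfolio
    return minimize(neg_sharpe, start, method='SLSQP',
                    bounds=bounds, constraints=constraints).x

Minimizing the negative Sharpe ratio is equivalent to maximizing it, which is why the objective function returns a negated value.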

Be Wary of Free Financial Data

Another reason to temper your expectations of our application's market-beating potential is the difficulty of getting accurate financial data. For any analysis upon which investment decisions are based, it is advisable to use the best data available, which likely means paying for a license. We use free data from Quandl because it is easily accessible, fairly complete, and removes the need to perform web scraping. Needless to say, don't base your next investment on the free data that powers our web application.

Splits, Mergers, and Changing Ticker Symbols

Accounting for splits, mergers, and changing ticker symbols is one reason why accurate financial data is expensive to maintain. During a stock split, a company might double the number of shares available, thereby halving the price of each outstanding share, producing a dramatic change in the raw price series even though nothing about the company's value has changed. For this reason, most daily financial data includes “Adjusted Close” values, which normalize stock prices across these splits. Additional accounting headaches arise from companies that merge, spin off, or otherwise change their ticker symbols. Tracking these assets backwards in time and modeling how an investment changes under these circumstances both add to the cost of maintaining accurate financial data.
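To see what that adjustment looks like, here is a toy sketch (with made-up prices) of how a 2-for-1 split is normalized by scaling pre-split closes by the split ratio:

import pandas as pd

# Raw closing prices around a hypothetical 2-for-1 split on the third day.
close = pd.Series([100.0, 102.0, 51.0, 52.0])

# Adjusting for the split: scale pre-split closes by the split ratio (1/2)
# so the series reflects market movement, not the share restructuring.
adjusted = close.copy()
adjusted[:2] = adjusted[:2] * 0.5
print(adjusted.tolist())  # [50.0, 51.0, 51.0, 52.0] -- no artificial 50% drop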

Survivorship Bias

A final word of caution when working with financial data is to be wary of survivorship bias. Survivorship bias arises when we implicitly assume that the universe of past stocks and securities is identical to today's universe. By only considering stocks that are traded today, we implicitly remove poorly performing stocks from consideration. Companies like Enron are not tempting to investors today, but they likely were in 2000. As you might guess, survivorship bias leads to overly optimistic results.

Python for Financial Data

Our application will leverage three key pieces of the Python data science stack:

  • numpy: numpy represents data in n-dimensional arrays and interfaces with lower-level C and Fortran code to carry out “vectorized” operations quickly.
  • pandas: pandas is essentially a wrapper around numpy that provides indexing for time series data through a data type called a DataFrame, which may be familiar to R users.
  • scipy: scipy is a collection of scientific computing libraries built on top of numpy (the broader SciPy stack also encompasses pandas). We will use the scipy.optimize library to optimize our portfolio allocations. A short sketch of the first two in action follows this list.
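For the unfamiliar, here is a minimal sketch (with a made-up ticker and prices) of what vectorized numpy operations and a date-indexed pandas DataFrame look like:

import numpy as np
import pandas as pd

# Vectorized arithmetic: numpy operates on whole arrays at once,
# with no explicit Python loop.
prices = np.array([100.0, 102.0, 101.0, 105.0])
returns = prices[1:] / prices[:-1] - 1.0

# The same prices in a pandas DataFrame gain a labeled date index.
dates = pd.date_range('2012-05-18', periods=4)
frame = pd.DataFrame({'XYZ': prices}, index=dates)
print(frame.pct_change())  # the same returns calculation, index preserved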

Time Series Data in Python

One of the more laborious tasks in a data science project is preparing data. Luckily, Quandl's Python module spits back a nicely formatted pandas data frame. The following code from utils.py grabs the Adjusted Close for a given stock ticker symbol. It then renames the sole column in this data frame to the ticker symbol.

import Quandl


def hit_quandl(symbol, start, end):
    """Gets adjusted close data for a stock."""
    # Column 6 of Quandl's YAHOO datasets is the Adjusted Close.
    price = Quandl.get("YAHOO/{}.6".format(symbol), trim_start=start,
                       trim_end=end)
    # Rename the lone 'Adjusted Close' column to the ticker symbol.
    return price.rename(columns={'Adjusted Close': symbol})
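The frame above has a single column; to produce the multi-symbol frame shown below, one frame per ticker is joined on the shared date index. Here is a hypothetical sketch (the helper name get_prices is ours, not necessarily the one in utils.py):

from functools import reduce

# Hypothetical helper (not necessarily the one in utils.py): fetch each
# symbol and join the single-column frames on their shared date index.
def get_prices(symbols, start, end):
    frames = [hit_quandl(s, start, end) for s in symbols]
    return reduce(lambda left, right: left.join(right, how='inner'), frames)

prices = get_prices(['GOOG', 'FB', 'AAPL'], '2012-05-18', '2012-05-23')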

We will set up the proper environment to run this code locally in the next post, but the following is an example of the pandas data frame that we will be working with.

$ python utils.py
                  GOOG          FB        AAPL
Date
2012-05-18  299.900615   38.230000   70.168461
2012-05-21  306.748784   34.029999   74.256482
2012-05-22  300.100412   31.000000   73.686282
2012-05-23  304.426106   32.000000   75.484211

Notice that our data are indexed by date. On its own, numpy would return an unindexed matrix with three columns and four rows. Using a pandas data frame allows us to label each of our rows with a date object that can be used to join columns and slice the data by date.
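For example, here is a small, self-contained sketch (with values hand-copied from the output above) of slicing by date and joining on the index:

import pandas as pd

# A small frame indexed by date, copied from the output above.
dates = pd.to_datetime(['2012-05-18', '2012-05-21',
                        '2012-05-22', '2012-05-23'])
goog = pd.DataFrame({'GOOG': [299.90, 306.75, 300.10, 304.43]}, index=dates)
fb = pd.DataFrame({'FB': [38.23, 34.03, 31.00, 32.00]}, index=dates)

# Slice rows by date label...
window = goog.loc['2012-05-21':'2012-05-23']

# ...and join new columns on the shared date index.
combined = goog.join(fb)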

In the next post, we will look at constructing a replicable Python environment. Then, we will use this environment to listen to HTTP requests, call our data service, optimize a portfolio, and send an HTTP response.