Building a Data Science Backend with Tornado

This post is the second in a series that explains how data scientists can turn their scripts into consumer-facing web applications. The intro post for this series explains the scope of the project and contains an index of other posts.

The code seen in this post can be found in the backend branch of the GitHub repo.

cd /to/desired/directory
git clone -b backend https://github.com/coshx/portfolio_optimizer.git
cd portfolio_optimizer/backend/optimizer/

Problem, Goal, Solution

  • Problem: We have a few Python scripts that crunch data beautifully. To share these with web users, we first need to expose our scripts to the web.
  • Goal: By the end of this tutorial, we want a server that can handle HTTP POST requests, call our scripts, and respond with the requested data.
  • Solution: Configure Tornado to handle requests, interface with our Python scripts, and respond with our scripts’ output.

Why Do It That Way?

Caveat: I knew I was going to be hosting my application on a Linode, which is a virtual private server. You may be tempted to use a PaaS such as Heroku, but most PaaS systems have issues installing some of the PyData dependencies, especially the GNU Fortran compiler. In theory, if you can get Anaconda on your PaaS, you should be in good shape. My rationale for each of the tools we use in this post is as follows:

  • Tornado:
    • Simplicity: We don’t need to serve static files or support CRUD operations. We just want to be able to handle web requests, read data into pandas, crunch data, and spit the result back to the user. Since Tornado is as much an asynchronous networking library as it is a web framework, we can get it to handle web requests without much code.
    • Written in Python: Using Tornado means that we can treat our scripts as Python modules rather than calling them as standalone scripts living on a server.
    • Configurability: Tornado allows me to expand the backend API to handle a wider variety of web requests if I want this in the future.
    • Ease of Deployment: Since we only plan on handling HTTP requests and not serving static files, we can run Tornado on our server much the same way that we run it in development.
  • Anaconda:
    • Python Dependencies: The PyData stack (SciPy in particular) requires a number of dependencies that can be a pain to install on a server. Anaconda makes this process less painful and more reliable across systems.

Installation

I use different branches in the GitHub repository for each post in this series, so once you have cloned the repo, make sure you switch to the backend branch, as noted at the top of the post. Then follow the directions under “Getting Started” in the GitHub README. The beauty of using Anaconda is that it installs both the correct Python version and the needed dependencies. When an environment is activated, conda prepends the environment's own bin directory, containing symlinks to its Python and packages, to your $PATH.
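
Once the stocks environment is activated, a quick sanity check (a hypothetical snippet, not part of the repo) confirms that the environment's interpreter is the one being used:

# With the environment active, the Python on your $PATH lives inside
# the conda environment rather than in the system install.
import sys
print(sys.executable)  # something like .../anaconda/envs/stocks/bin/python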

Install Gotchas: If you are installing on a server, like a Linode, you may encounter errors when running the backend application because certain dependencies were not included in the ./backend/environment.yml file. In my installation, libgfortran was not available, so I needed to run conda install libgfortran=1.0 from the activated stocks environment to install that dependency. The ability to troubleshoot these dependency issues relatively easily makes Anaconda a virtual necessity for getting the PyData stack online.

Backend API

Since the design of a web application’s backend can have significant ramifications for what can be done in the frontend, it pays to think carefully about the kind of data we will need to display to the user. In our case, we want the user to be able to click an “Optimize” button and get back the optimal allocations for each stock in her portfolio. After parsing the HTTP request, a sample call to our backend might look like this:

params = {'symbols': ['AAPL', 'FB', 'GOOG'],
          'startDate': '01-01-12',
          'endDate': '03-20-16',
          'initialInvestment': 1000.00}

Setting Up Tornado

Let’s set up Tornado to handle POST requests containing the above parameters. Looking at the code in ./backend/app.py reveals the key steps to doing this:

The MainHandler Class

This class allows us to write handlers for each type of HTTP request that Tornado might receive. Later, we will see that handlers can be attached to routes to define our backend’s API. We really only have one kind of request (the one mentioned above), so we only need one request handler, which we call MainHandler. In the code below, we include both a GET and a POST request handler.

class MainHandler(tornado.web.RequestHandler):
    """Handles post requests by responding with a JSON file."""
    def set_default_headers(self):
        self.set_header('Access-Control-Allow-Origin', '*')
        self.set_header('Access-Control-Allow-Methods', 'POST, GET, OPTIONS')
        self.set_header('Access-Control-Max-Age', 1000)
        #self.set_header('Access-Control-Allow-Headers', 'origin, x-csrftoken, content-type, accept')
        self.set_header('Access-Control-Allow-Headers', '*')
        self.set_header('Content-type', 'application/json')

    @tornado.web.asynchronous
    def get(self):
        """Respond to GET requests for debugging purposes."""
        self.write("Success!")
        self.finish()

    @tornado.web.asynchronous
    def post(self):
        """Respond to POST requests with optimal allocations."""
        data = json.loads(self.request.body.decode('utf-8'))
        stock_params = dict_from_data(data)
        allocs = optimize_allocations(stock_params)
        self.write(allocs)
        self.finish()

set_default_headers()

A lot could be said about the set_default_headers() method. I’ll summarize by saying that setting Access-Control-Allow-Origin to * allows any web domain to send requests to our backend, and that Access-Control-Allow-Methods permits POST, GET, and OPTIONS requests. There are a number of other HTTP header fields that can be controlled this way. These headers exist largely to manage Cross-Origin Resource Sharing (CORS), which browsers enforce whenever a page on one domain requests data from another.
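
Note that a browser performing CORS will often send a preflight OPTIONS request before the actual POST. The handler as shown doesn’t define options(), so if preflights need answering, a minimal addition to MainHandler (an assumption on my part, not code from the repo) would be:

def options(self):
    # Reply to CORS preflight requests: the headers come from
    # set_default_headers(), and 204 means "no body".
    self.set_status(204)
    self.finish()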

get()

The GET request handler is used for debugging purposes. It lets us point a browser at port 8000 (or whichever port was specified with the --port=XXXX flag) and verify that our handler is working. If you run python -m backend.app --port=3001 from the portfolio_optimizer directory (remember to have the conda environment activated first), you can see our “Success!” message in your browser at http://localhost:3001.

post()

The POST request handler is a bit more involved. Since the frontend speaks JavaScript, the request body arrives as a JSON object. We need to parse it into a Python dictionary, which the json.loads() function does. We then remap the keys of the newly parsed dictionary to more pythonic, underscore-separated names. Next, we pass these parameters to optimize_allocations(), which calls our optimization modules. Finally, we write a POST response enclosing the results of our optimization. The response is itself a JSON object, as declared by the Content-type header in the last set_header() call.
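
That key remapping is done by dict_from_data() in ./backend/app.py. A plausible sketch of it, inferred from the parameter names used elsewhere in this post, looks like this:

def dict_from_data(data):
    """Remap the camelCase JSON keys to the snake_case names our modules expect."""
    return {'symbols': data['symbols'],
            'start_date': data['startDate'],
            'end_date': data['endDate'],
            'initial_investment': data['initialInvestment']}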

Sidenote: The @tornado.web.asynchronous decorator tells Tornado not to finish the response automatically when the handler returns. Instead, the connection stays open until self.finish() is called, which lets a handler hand work off and respond later while the single-threaded I/O loop services other requests in the meantime.
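
To make the difference concrete, here is a contrived sketch (not code from the repo): the handler returns immediately, and the response is finished later from a callback scheduled on the I/O loop.

@tornado.web.asynchronous
def post(self):
    # Return right away; the connection stays open until respond()
    # calls self.finish() from the I/O loop.
    def respond():
        self.write({'status': 'done'})
        self.finish()
    tornado.ioloop.IOLoop.current().add_callback(respond)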

The main() Method

def main():
    """Runs application on httpserver with handler for '/'."""
    app = make_app()
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.current().start()

This method is the driver of our request handler. Before anything else, we call make_app() to initialize a Tornado application. We only have one application here, but we could potentially have many. After that, we “mount” our app onto a Tornado HTTPServer, set the port the server listens on, and start the Tornado I/O loop, which waits for requests.
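
For the module to be runnable with python -m backend.app, app.py also needs the standard entry point (a one-line sketch of what that presumably looks like):

if __name__ == '__main__':
    main()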

# Add command line flags
define("port", default=8000, help="run on the given port", type=int)
define("debug", default=False, help="run in debug mode")

...

def make_app():
    tornado.options.parse_command_line()
    return tornado.web.Application([
        (r"/", MainHandler)
    ])

Besides defining the routing table, which maps the root URL "/" to MainHandler, the most interesting aspect of this function is the command line parser, which looks for the --port and --debug flags when the app is run.
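
As a standalone illustration of how tornado.options works (hypothetical code, not from the repo): define() registers a flag, parse_command_line() reads the argument list, and the parsed value becomes an attribute of options.

from tornado.options import define, options, parse_command_line

define("port", default=8000, help="run on the given port", type=int)
parse_command_line(["app.py", "--port=3001"])  # args[0] is ignored
print(options.port)  # -> 3001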

Optimizing Allocations

Looping back to our goals, we have discussed how our Tornado HTTP server handles requests and returns responses using the MainHandler class. Buried within our asynchronous POST request handler is a call to optimize_allocations(), which is the connection point to the optimization modules. The great thing about using Tornado is that we can import these scripts as ordinary Python modules, so we never have to shell out to standalone scripts or shuttle data to them over I/O. Let’s look at these two modules now.

The utils Module

The hit_quandl() method in ./backend/optimizer/utils.py was reviewed in the previous post; it returns the Adjusted Close prices for a single stock. The get_data() method fetches that data for each stock in the parsed HTTP request, concatenates the columns together, and returns a data frame of Adjusted Close prices over the requested date range.

def get_data(params):
    symbols = params['symbols']
    start = params['start_date']
    end = params['end_date']

    prices = {s: hit_quandl(s, start, end) for s in symbols}
    stocks = pd.concat([prices[s] for s in prices], axis=1, join='inner')
    return stocks
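
A hypothetical call, using the remapped parameter names from the POST handler, shows what get_data() returns:

params = {'symbols': ['AAPL', 'FB', 'GOOG'],
          'start_date': '01-01-12',
          'end_date': '03-20-16'}
prices = get_data(params)
print(prices.head())  # one Adjusted Close column per symbol, one row per trading day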

The optimize Module

The optimize_portfolio() function in ./backend/optimizer/optimize.py is the entry point for the optimize module. It takes the data frame of prices from utils.get_data() (named prices) and returns a dictionary of each stock and its optimal allocation. We sort the prices data frame to ensure that each optimized allocation is paired with the correct stock symbol.

def optimize_portfolio(prices):
    prices = prices.reindex_axis(sorted(prices.columns), axis=1)
    symbols = list(prices.columns.values)
    allocs = optimize_allocations(prices)
    allocs = allocs / np.sum(allocs)

    return {k: v for (k, v) in zip(symbols, allocs)}

The optimize_allocations() function is the meat of the module. It defines an “error” function that is really just the Sharpe ratio multiplied by -1. The Sharpe ratio is a measure of risk-adjusted returns: it rewards portfolio returns but penalizes portfolios that take on more risk. We want to maximize the Sharpe ratio of our portfolio, but since scipy.optimize only has a minimize function, we instead minimize -1 * sharpe_ratio.

The bnds parameter sets the range of values over which to optimize: one (0, 1) tuple for each stock in the portfolio, so every allocation stays between 0 and 1. The cons parameter constrains the optimization so that the sum of all allocations is 1.0. Finally, scipy.optimize.minimize returns a special OptimizeResult object, so we pull the solution array, x, out of that object. If you want to read more about how scipy performs this optimization, you can read about it in the documentation.

def optimize_allocations(prices):
    # 1. Define 'error' function
    def sharpe(weights):
        return get_portfolio_stats(get_portfolio_value(prices, weights))[3] * -1

    # 2. Set bounds
    num_stocks = prices.shape[1]
    bnds = tuple([(0, 1)])*num_stocks
    cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})

    # 3. Run optimizer
    ## Get optimizer notes by setting 'disp' to True
    result = spo.minimize(sharpe, num_stocks * [1. / num_stocks],
                          method='SLSQP', bounds=bnds,
                          constraints=cons, options={'disp': False})

    return result['x']
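
Putting the two functions together, a quick sanity check (hypothetical, assuming prices came from utils.get_data()) confirms that the constraint holds:

allocs = optimize_portfolio(prices)
print(allocs)                # e.g. {'AAPL': 0.0, 'FB': 0.45, 'GOOG': 0.55}
print(sum(allocs.values()))  # ~1.0, enforced by the 'eq' constraint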

Testing Things Out

One of the most important diagnostic tools for setting up a backend API is replicating the kind of query you want with a cURL request. After running python -m backend.app --port=3001, we can use a cURL request to hit our backend and make sure that the optimizer is working.

curl -H "Content-Type: application/json" -X POST -d '{"symbols": ["AAPL", "GOOG", "FB"], "startDate": "01-01-12", "endDate": "03-20-16", "initialInvestment": 1000.00}' http://localhost:3001

You should get back something resembling the following:

{"AAPL": 5.758872628492446e-17, "FB": 0.4510451480968203, "GOOG": 0.5489548519031797}

Next week we will look at workflows to deal with development, staging, and production to streamline frontend development.