Machine Learning for Fintech



A wide range of institutions are exposed to credit risk in large pools of loans such as credit cards, mortgages, auto loans, student loans, and business loans. These include marketplace, alternative, and traditional lenders; card issuers; investors in loans and structured finance deals backed by pools of loans such as ABSs, CLOs, CDOs, and MBSs; and loan servicers and insurers.  

Measuring and managing that risk involves three major challenges: accurate estimation of loan-level risk and correlations between loans given extensive data from various sources, analysis of pool-level risk, and optimal selection of loans for a portfolio. Loan-level risk can be estimated using a variety of classifiers from machine learning. We show how such classifiers can be leveraged as the basic building blocks for efficient risk analysis and optimization of the large pools common in practice, which often contain tens or hundreds of thousands of loans.

Our loan portfolio engine accepts as an input a choice of machine learning classifiers of loan-level risk.  Based upon the classifier chosen, the portfolio engine provides fast, accurate risk analysis and optimization of a loan portfolio in real time (whereas status quo methods can take hours or days).  For instance, the portfolio engine efficiently computes the distribution of the P&L for a loan pool.  It also quickly selects optimal portfolios of loans according to a wide range of objectives and constraints.  The portfolio engine's algorithms are based upon asymptotic methods from stochastic analysis and optimization.  
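As a rough illustration of how loan-level classifier output can feed a pool-level loss distribution, the sketch below uses a plain normal (central limit) approximation. It is not the portfolio engine's algorithm: it assumes independent defaults, which ignores the correlations between loans mentioned above, and it treats a defaulted loan's full exposure as lost. The names `pool_loss_approx`, `default_probs`, and `exposures` are hypothetical.

```python
import math
from statistics import NormalDist


def pool_loss_approx(default_probs, exposures, level=0.99):
    """Normal approximation to the loss distribution of a loan pool.

    Assumptions (illustrative only): defaults are independent, and a
    defaulted loan loses its full exposure.
    Returns (expected_loss, loss_std, loss_quantile_at_level).
    """
    if len(default_probs) != len(exposures):
        raise ValueError("need one exposure per default probability")
    # Each loan's loss is exposure * Bernoulli(p):
    # mean e * p, variance e^2 * p * (1 - p).
    mean = math.fsum(e * p for p, e in zip(default_probs, exposures))
    var = math.fsum(
        e * e * p * (1.0 - p) for p, e in zip(default_probs, exposures)
    )
    std = math.sqrt(var)
    # For a large pool the total loss is approximately normal.
    quantile = NormalDist(mean, std).inv_cdf(level) if std > 0 else mean
    return mean, std, quantile


# Hypothetical pool: 10,000 unit-size loans, each with a 2% default probability.
probs = [0.02] * 10_000
sizes = [1.0] * 10_000
expected_loss, loss_std, loss_q99 = pool_loss_approx(probs, sizes)
```

In this sketch `default_probs` would come from whichever loan-level classifier is chosen. Handling correlated defaults, as the abstract describes, would require a richer model than this independence assumption allows.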

The performance of the portfolio engine has been tested on a large mortgage data set (120 million mortgages) and a P2P loan data set (1 million loans).

Loan Portfolio Engines

Kay Giesecke, Stanford University

presentation


In this talk, Gary Kazantsev from the R&D Machine Learning group at Bloomberg will discuss the evolution and development of several key Bloomberg projects such as sentiment analysis, market impact prediction, novelty detection, social media monitoring, question answering and topic clustering. We will show that these interdisciplinary problems lie at the intersection of linguistics, finance, computer science and mathematics, requiring methods from signal processing, machine vision and other fields. Throughout, we will talk about the practicalities of delivering machine learning solutions to problems of finance and highlight issues such as the importance of appropriate problem decomposition, feature engineering and interpretability.

There will be a discussion of future directions and applications of Machine Learning in finance as well as a Q&A session.

Teaching machines to read for fun and profit

Gary Kazantsev, Bloomberg LP


Startups interested in creating data products through machine learning face a daunting obstacle: lack of data. Many get caught up in a vicious cycle in which no data leads to few users, and not having users means not being able to collect interesting data. We will discuss Startup.ML's fellowship model for engaging with startups and established data science teams with the objective of helping them overcome the cold-start challenge.

We will cover various approaches including transfer learning, word embeddings and neural network initialization.  We will also explore ways to leverage available datasets to bootstrap machine learning models in startups that have not yet gathered their own datasets. 

Machine Learning Approaches for Startups That Lack Data

Arshak Navruzyan, Startup.ML


At LendingHome machine-learned models heavily influence our loan origination process.  Mortgage credit generally is a hard machine learning problem.  Credit, loss estimation, and prepayment are three difficult sub-problems that we need to address.  Operationalizing any of these presents its own challenges.  I will discuss these three problems and cover LendingHome's approach to credit in detail.

The talk will answer questions relevant to the practitioner: How do you apply modern ML techniques to a very old problem?  Where do off-the-shelf algorithms fail?  How can you leverage both traditional and alternative credit factors?  How do you do so while delivering performant models sensible and transparent enough for a human to act on?

Our approach borrows ideas from several seemingly dissimilar areas such as Natural Language Processing and Transfer Learning.  To illustrate this we will show, for example, how at-origination, static credit models differ from sequential models that adapt over time and contrast the modeling power of each.  We will present some results on historical mortgage datasets of 60M loans.

ML Approach to Mortgage Credit

Justin Palmer, LendingHome

presentation


Stripe processes billions of dollars in payments a year and uses machine learning to detect and stop fraudulent transactions. Like models used for ad and search ranking, Stripe's models don't just score: they dictate actions that directly change outcomes. High-scoring transactions are blocked before they can ever get refunded or disputed by the card holder. Deploying an initial model that successfully blocks a substantial amount of fraud is a great first step, but since your model is altering outcomes, subsequent parts of the modeling process become more difficult:

How do you evaluate the model? You can't observe the eventual outcomes of the transactions you block (would they have been refunded or disputed?) or the ads you didn't show (would they have been clicked?). In general, how do you quantify the difference between the world with the model and the world without it?

How do you train new models? If your current model is blocking a lot of transactions, you have substantially fewer samples of fraud for your new training set. Furthermore, if your current model detects and blocks some types of fraud more than others, any new model you train will be biased towards detecting that residual fraud. Ideally, new models would be trained on the "unconditional" distribution that exists in the absence of the original model.

In this talk, I'll describe how injecting a small amount of randomness in the production scoring environment allows you to answer these questions. We'll see how to obtain estimates of precision and recall (standard measures of model performance) from production data and how to approximate the distribution of samples that would exist in a world without the original model so that new models can be trained soundly.
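One way the idea above could look in code is sketched below, under stated assumptions rather than as Stripe's actual method. Assume transactions scoring at or above a hypothetical `threshold` are blocked, except that a random fraction `explore_rate` of them is let through so their outcomes can be observed. Each allowed transaction is then weighted by the inverse of its probability of being allowed, so the weighted sample stands in for the traffic that would exist without blocking. All names here (`Transaction`, `allow_probability`, `estimate_precision_recall`) are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Transaction:
    score: float               # model score, higher means more likely fraud
    allowed: bool              # True if the payment was let through
    is_fraud: Optional[bool]   # observed only when allowed, otherwise None


def allow_probability(score: float, threshold: float, explore_rate: float) -> float:
    """Probability that the assumed production policy lets a transaction through."""
    return explore_rate if score >= threshold else 1.0


def estimate_precision_recall(
    transactions: Iterable[Transaction], threshold: float, explore_rate: float
) -> Tuple[float, float]:
    """Inverse-propensity-weighted precision and recall of the blocking rule."""
    tp = fp = fn = 0.0
    for t in transactions:
        if not t.allowed:
            continue  # outcome never observed for blocked transactions
        weight = 1.0 / allow_probability(t.score, threshold, explore_rate)
        flagged = t.score >= threshold
        if flagged and t.is_fraud:
            tp += weight
        elif flagged:
            fp += weight
        elif t.is_fraud:
            fn += weight
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    return precision, recall


# Hypothetical log: 25% of high-scoring transactions were randomly allowed.
sample = (
    [Transaction(0.9, True, True)] * 3
    + [Transaction(0.8, True, False)]
    + [Transaction(0.95, False, None)] * 12
    + [Transaction(0.2, True, True)] * 4
    + [Transaction(0.1, True, False)] * 20
)
precision, recall = estimate_precision_recall(sample, threshold=0.5, explore_rate=0.25)
```

The same per-transaction weights could, in principle, be passed as sample weights when training a new model, so that the training set approximates the distribution without the original model in place. A small `explore_rate` keeps fraud losses low but makes the weights large and the estimates noisier.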

Counterfactual Evaluation of ML Models

Michael Manapat, Stripe

presentation


Coinbase is the largest bitcoin wallet company, with over 2.4 million users who have opened over 3.1 million wallets to buy or sell bitcoins across 26 countries. Fraud detection is a lynchpin of our service: it allows users to purchase bitcoins instantaneously while limiting our fraud loss. In this talk, we'll look at how we detect fraud using a combination of human analysts and machine-learned systems.

In particular, we will discuss some unique challenges in a machine-learned fraud detection system, such as training on skewed data sets, learning new fraud patterns in real time, and designing A/B test environments that allow us to collect training data while limiting our losses.

Machine Learning to Fight Fraud in Bitcoin Payment Network

Soups Ranjan, Coinbase

presentation


Will Bitcoin survive? Who cares. Cryptoeconomics, due to its novel and programmable incentivization structure, extends descriptive and analytical models of economics with an element of design thinking and a dynamic toolkit that is far richer than any previous one. This includes smart-contract enforced stigmergy, swarm intelligence, new types of network topologies, and novel incentivization methods for positive behavior.

Drawing on his academic work and prototyped cryptoeconomic systems, Joel Dietz unpacks the meaning of “Swarm,” including both current trust and reputation models and projections concerning the future of distributed trust systems. In particular, he analyzes the types of business models and industries most likely to be disrupted by distributed trust and the application of machine learning to these networks.

The Trust Web: Network topological review of cryptoeconomics and blockchain with implications for the future of everything

Joel Dietz, Swarm

presentation


We introduce Bayesian Global Optimization as an efficient way to optimize a system's parameters when evaluating them is time-consuming or expensive. The adaptive sequential experimentation techniques described can be used to help tackle a myriad of problems, including optimizing a system's click-through or conversion rate via online A/B testing, tuning parameters of a machine learning prediction method or expensive batch job, designing an engineering system, or finding the optimal parameters of a real-world physical experiment.

We explore different tools available for performing these tasks, including Yelp's MOE and SigOpt. We will present the motivation, implementation, and background of these tools. Applications and examples from industry and best practices for using the techniques will be provided.

Bayesian Global Optimization

Scott Clark, SigOpt

presentation