Ryan is an Assistant Professor of Computer Science at the Harvard School of Engineering and Applied Sciences. He will talk about how Bayesian optimization can take humans out of the machine learning loop. Ryan recently founded Whetlab, a company that makes it easy to use these tools to get machine learning pipelines going on novel data without having to hire a room full of PhDs.
Matei Zaharia, MIT & Databricks
Matei is an Assistant Professor in computer science at MIT and CTO of Databricks, the company commercializing Apache Spark. He started the Spark project while he was a PhD student at Berkeley and has done research on a variety of topics in large-scale computer systems. Matei will be discussing the steps required to build efficient end-to-end data pipelines and ongoing work in Spark to enable such pipelines.
Carl Anderson, Warby Parker
Carl is the Director of Data Science at Warby Parker. His role encompasses heading up data engineering, supporting over 20 analysts, and creating a data-driven organization. He will be leading a high-level talk on the different areas where machine learning can impact an e-commerce organization, drawing on some examples from Warby Parker, and also covering how data scientists can contribute to a data-driven culture within their organization.
Alekh Agarwal, Microsoft Research
Alekh's research focus is on problems which arise while applying machine learning techniques to massive datasets. Part of his research aims to understand the tradeoffs between learning and computation, as well as designing efficient learning algorithms that can learn under a given computational budget.
Sudarshan Sudarshan, Microsoft
Azure Machine Learning is a cloud-hosted service for creating and publishing elastic machine learning services on Microsoft Azure. In this talk, we will describe some of the capabilities of Azure ML such as collaboration, versioning and provenance integrated into the product, built-in algorithms for common machine learning tasks such as classification, regression and recommendation, tools for model validation and hyper-parameter tuning as well as integration with external tools such as R, Python, Vowpal Wabbit, Hive and SQL. We will also give an overview of the backend execution architecture and more recent features such as automated creation of scoring experiments and support for programmatic model training.
Sachin Chouksey, Microsoft
In this workshop we will follow the Cloud Data Science Process to go from a TB of data to an operationalized machine learning model in a couple of hours using resources in the cloud. We will work with an online display dataset in a Hadoop cluster to conduct exploratory data analysis, create features, build count tables against categorical features, and sample the data for use in Azure Machine Learning. We will then use Azure ML to build and train a model that predicts whether a user will click on an ad, and operationalize and consume the model as a web service.
Peter Bull, DrivenData
Peter is a cofounder at DrivenData, a platform that runs data science competitions for nonprofits, NGOs, and governments. Just like every major corporation today, these groups have more data than ever before and are eager to tap into the power of their data.
Peter will speak on the ways in which statistics, computer science, and machine learning can be applied to challenges in the social sector. The talk will address both the big-picture context of the data-for-good movement and an in-depth case study of the methods that won DrivenData’s recent competition on smart school budgeting.
Alec Radford, indico
Alec is the head of research at indico and will be leading a session on text analysis using recurrent neural networks & Theano.
Vik Paruchuri, DataQuest
Vik is the founder of Dataquest, a site that teaches you data science in your browser. He'll be talking about why it's important to make machine learning more accessible, what the best tools are, and how we can improve them.
Paul Ruvolo, Olin College
Paul Ruvolo is an Assistant Professor of Computer Science at Olin College. His research consists of both algorithmic development and applications of machine learning. On the applied side he is working on problems in robotics and assistive tech. On the algorithmic development side he has been working to combine features from both multi-task and transfer learning. In his talk Paul will present his paper "ELLA: an Efficient Lifelong Learning Algorithm". He will also discuss several extensions to the algorithm that he and his collaborators have developed.
Bugra Akyildiz, Axial & DataKind
Bugra is a Data Scientist at Axial, where he works on information retrieval and recommender systems. In his talk, he will give an overview of topic modeling and how it can be applied in a text classification scheme. Specifically, he will talk about a case study that he was a part of at DataKind using Amnesty International’s data.
Soups Ranjan, Yelp
Soups is a Data Mining Engineer at Yelp where he is using Machine Learning to make local ads more relevant. He is passionate about all things data and leads Yelp's university outreach efforts by organizing the Yelp Dataset Challenge, where he provides a dataset containing 2M+ reviews about local businesses in 10 cities world-wide for use in academia.
Yelp has a unique local ads product where businesses buy impression- or performance-based advertising to drive clicks. He will draw the curtains open on how Yelp does local ads with a brief overview of the components of the ads system, including a second-price-auction-based ad exchange and auto-bidding to set bid prices. He will also talk about problems such as click-through-rate prediction, inventory forecasting, and more.
Will Drevo, Datasight
Will is the cofounder and CEO of Datasight. He will be leading a talk entitled "How to Create a Machine Learning Startup: Building Tools for a Multibillion Dollar Industry on a Budget." His talk will discuss how building an ML startup from scratch requires more than just ML expertise. He will take us through the process of finding a product niche, interviewing customers, and starting a pilot, and prove that ML isn't just about cross-validation.
Joe Cauteruccio, DigitasLBi
Joe is a Data Scientist at DigitasLBi, a global media and technology agency. His work primarily focuses on algorithm development and the application of machine learning to marketing problems. His talk will focus on the assignment of credit to digital media, otherwise known as ad attribution. He will begin by briefly setting the stage, introducing the problem and some of the methodologies currently employed in industry. He will then recast the problem as a binary sequence classification problem, surveying a number of generative graphical model approaches, including both Markovian and Bayesian network techniques. The various techniques and their implementations are studied using both simulation and sample data.
Oren Schaedel, Versal
Oren's role at Versal encompasses internal and external analytics, data pipelines, and research projects, including clustering of content from online courses. His talk will be a high-level explanation of the challenges in semi-labeled categorization, using a combination of generative and discriminative models along with Wikipedia as a taxonomy to aid in the labeling of multi-disciplinary content with different media types.
Matthew Johnson, Harvard
Matt got his PhD from MIT in 2014 and is now a postdoc at Harvard, where he thinks about time series models and scalable inference. He'll talk about making Bayesian inference systems out of composable, reusable parts.
Eric Chiang, Yhat
Eric is a founding member, engineer, and in-house data scientist at Yhat, an NYC-based startup building platforms for enterprise analytics. Prior to Yhat, Eric held positions at Cloudera, IBM, and Genentech. Drawing on his experiences at Yhat and Genentech, he will discuss the struggles of turning data science into data-driven products from an engineer's perspective.
Michael Els, MaxPoint
Michael is a Principal Data Scientist at MaxPoint, a computational advertising company. His talk will focus on an online algorithm that was needed to control ad volume across thousands of campaigns running on multiple advertising exchanges simultaneously. The goal was to serve ads optimally across the day and maximize ad quality for each ad campaign. A real-time product was developed that incorporates a proportional-integral-derivative algorithm, Kalman filters, and particle swarm optimization.
Beth Logan, DataXu
Beth leads the Optimization Team at DataXu, responsible for the algorithms that drive their real-time advertising platform. Previously she worked in other big data fields (speech, music, biology, and medicine) and has over 30 publications and 11 issued patents.
Her presentation will describe how to build a practical machine learning system, using DataXu’s technology as an example. DataXu’s ad placement decisioning technology uses machine learning to make billions of decisions per second. However, a system that runs at scale 24x7 in over 30 countries is far removed from lab experiments. Beth will take us through some of the challenges faced and tradeoffs needed to build a practical machine learning system.
Mike Tamir, Galvanize
Mike is the Chief Science Officer at Galvanize. He will discuss text classification in situations where labeled training sets are scarce.
Supervised text classification is hampered by the need to acquire expensive labeled training sets. In the systems and methods discussed, pre-existing Word2Vec or similar algorithms are leveraged to create vector representations of documents that enable a model to be trained successfully with a drastically reduced training set. With this technique, the implementer needs only a small volume of labeled examples to train proximity thresholds, rather than devoting significant resources to traditional text classification algorithms, which typically require training sets that are orders of magnitude larger.
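As a rough illustration of the idea in this abstract (not the speaker's actual method), the sketch below averages word vectors into document vectors and learns a single cosine-similarity threshold from four labeled examples. The `WORD_VECTORS` table is a hypothetical toy stand-in for real pre-trained Word2Vec embeddings, and the function names and the midpoint rule for choosing the threshold are assumptions made for this example.

```python
import math

# Hypothetical toy vectors standing in for pre-trained Word2Vec embeddings.
WORD_VECTORS = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "team": [0.7, 0.3],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
    "shares": [0.15, 0.85],
}


def doc_vector(text):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_threshold(labeled_docs, target_label):
    """Build the target-class centroid and pick a proximity threshold.

    Assumed rule: the threshold is the midpoint between the lowest
    similarity of a positive example and the highest similarity of a
    negative example, both measured against the centroid.
    """
    pos = [doc_vector(t) for t, y in labeled_docs if y == target_label]
    neg = [doc_vector(t) for t, y in labeled_docs if y != target_label]
    dim = len(pos[0])
    centroid = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    min_pos = min(cosine(v, centroid) for v in pos)
    max_neg = max(cosine(v, centroid) for v in neg)
    return centroid, (min_pos + max_neg) / 2


def predict(text, centroid, threshold):
    """True if the document lies within the proximity threshold of the centroid."""
    vec = doc_vector(text)
    return vec is not None and cosine(vec, centroid) >= threshold


# A deliberately tiny labeled set: two examples per class.
labeled = [
    ("team goal", "sports"),
    ("match goal", "sports"),
    ("stock market", "finance"),
    ("market shares", "finance"),
]
centroid, threshold = fit_threshold(labeled, "sports")

print(predict("team match", centroid, threshold))
print(predict("stock shares", centroid, threshold))
```

With real embeddings the vectors would have hundreds of dimensions and the threshold would be validated on held-out examples, but the structure stays the same: the embeddings carry most of the knowledge, so the labeled data only has to calibrate a distance cutoff.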
Diego Klabjan, Northwestern and Opex Analytics
Diego is a professor at Northwestern University in the Department of Industrial Engineering and Management Sciences. He is also the founding Director of the Master of Science in Analytics program.
Diego will discuss machine learning techniques for web recommendation and personalization. He will cover MLlib enhancements for classification with imbalanced classes.
Clay Kim, Localytics
Clay is the Senior Manager of Data Science at Localytics and an active contributor to the Apache Spark ecosystem. He will talk about the challenge of putting automated A/B testing with Multi-Armed Bandits into production.
Colin Raffel, Columbia
Colin is a PhD student at Columbia University working in the Laboratory for the Recognition and Organization of Speech and Audio with Professor Dan Ellis. His research focuses on machine listening. In his talk, he will cover Lasagne, a recently developed library which provides utilities for easily constructing and training neural networks using Theano. Lasagne includes many state-of-the-art techniques and is meant to be as efficient as possible.
Roger Grosse, University of Toronto
Roger is a Postdoctoral Fellow in machine learning at the University of Toronto. His research focuses on representation learning, Bayesian methods, and techniques for evaluating models and algorithms. He recently graduated from MIT, where his dissertation work focused on automatically selecting probabilistic models in spaces defined compositionally in terms of simpler models. Roger is also a co-creator of Metacademy, an open-source educational web site which uses a dependency graph of concepts to generate personalized learning plans. He will talk about algorithms for speeding up the training of neural nets using tractable but accurate approximations to the curvature of the loss function.
Paco Nathan, Databricks
Paco is the Director of Community Evangelism at Databricks. He is also an O'Reilly author ("Just Enough Math", "Enterprise Data Workflows", Spark cert exam) and an advisor at GalvanizeU and Amplify Partners. He has a background in distributed systems and led data teams for several years on large-scale ML use cases.
In his workshop we will take user email list archives from an Apache project and show workflows of Spark SQL, MLlib, and GraphX to surface insights, leaderboards about the dev community, topic modeling, community analysis, and visualizations.
Bob Crovella, NVIDIA
Bob leads a team at NVIDIA that is responsible for supporting the GPU Computing products. In the morning he will lead an intro to CUDA programming session. In the afternoon he will take us through an exercise involving deep learning on GPUs. He suggests basic C/C++ programming as a prerequisite. All attendees will be set up with AWS instances for the workshops.