There seems to be very little overlap currently between the worlds of infosec and machine learning. If a data scientist attended Black Hat and a network security expert went to NIPS, they would be equally at a loss.
This is unfortunate because infosec can definitely benefit from a probabilistic approach but a significant amount of domain expertise is required in order to apply ML methods.
Machine learning practitioners face a few challenges for doing work in this domain including understanding the datasets, how to do feature engineering (in a generalizable way) and creation of labels.
A variety of datasets can be collected as a precursor to creating a training set for a machine learning model:
- Log files from systems, firewalls, proxies, routers, switches that capture in semi-structured formats network activity and user behavior
- Application level logging and diagnostics that record user/system access information and application usage
- Monitoring tools, IDS systems and SIEMs
- Network Packet Capture (PCAP) is a rather compute/storage intensive process of recording the raw ethernet frames
Some of these sources (like log formats) are readily available and fairly standardized while others will require extensive tooling and software modifications (e.g. application logging), and yet others will require a significant hardware footprint and a monitoring network that could rival the size of the real network.
Bearing in mind that the whole point of machine learning is generalization beyond the training set, thoughtful feature engineering is required to go from the identity information of IP addresses, hostnames and URLs to something that can turn into a useful representation within the machine learning model.
For example the following might be a useful feature space created from proxy logs (Franc)
lower case ratio
upper case ratio
vowel changes ratio
has repetition of '&' and '='
start with number
number of non-base64 characters
has a special character
max length of consonant stream
max length of vowel stream
max length of lower case stream
max length of upper case stream
max length of digit stream
ratio of a character with max occurrence
HTTP request status
is URL encrypted
is protocol HTTPS
number of bytes up
number of bytes down
is URL in ASCII
client port number
server port number
user agent length
number of '/' in path
number of '/' in query
number of '/' in referrer
is the second-level domain raw IP
Once the input feature space has been established, getting the label is the next challenge. For each observation of the training set, we need to determine if it is a malicious or a benign pattern.
Having a network security expert create labels (on potentially millions of observations) will be expensive and time consuming. Instead we might rely on a semi-supervised approach by leveraging publicly available threat intelligence feeds. MLSec provides a set of tools and resources for gathering and processing various threat intelligence feeds.*
Thus labels are created by doing a join between public blacklists and the collected dataset. Matching can be done on fields like IP addresses, domains, agents, etc. Keep in mind that these identity elements are only used to produce the label and will not be part of the model input (Franc).
Finally once we have the input feature space and the labels, we are ready to train a model. We can anticipate certain characteristics of input space such as class imbalance, non-linearity and noisiness. A modern ensemble method like random forest or gradient boosted trees should overcome these issues with proper parameter tuning (Seni).
Bigger issue is that this is an adversarial use case and model decay will be a significant factor. Since there is an active adversary trying to avoid detection, attack patterns will constantly evolve which will cause data drift for the model.
Some possible solutions to the adversarial issue could be the use of a nonparametric method, using online/active learning (i.e. letting the model evolve on every new observation) or having rigorous tests to determine when the model should be retrained.
To address some of the issues unique to adversarial machine learning, Startup.ML is organizing a one-day special conference on September 10th in San Francisco. Leading practitioners from Google, Coinbase, Ripple, Stripe, Square, etc. will cover their approaches to solving these problems in hands-on workshops and talks.
Franc, Vojtech, Michal Sofka, and Karel Bartos. "Learning detector of malicious network traffic from weak labels." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2015.
Seni, Giovanni, and John F. Elder. "Ensemble methods in data mining: improving accuracy through combining predictions." Synthesis Lectures on Data Mining and Knowledge Discovery 2.1 (2010): 1-126.
* This approach is limited to global IP addresses and domains and cannot be used for internal threats.