6 Steps for Applying Data Science to Security
Two experts share their data science know-how in a tutorial focusing on internal DNS query analysis.
Security practitioners are being told that they have to get smarter about how they use data. The problem is that many data scientists are lost in their world of math and algorithms and don’t always explain the value they bring from a business perspective.
Dr. Kenneth Sanford, analytics architect and sales engineering lead at Dataiku, says security pros have to work more closely with data scientists to understand what the business is trying to accomplish. For example, is compliance the goal? Or is the company looking to determine what a ransomware attack might cost it?
"It’s really important to define the business problem," Sanford says. "Something like what downtime would cost the business, or what the monetary fine would be if the company were out of compliance."
Bob Rudis, chief data scientist at Rapid7, adds that companies need to take a step back and look at their processes and decide what could be done better via data science.
"Companies need to ask themselves how the security problem is associated with the business problem," Rudis says.
Sanford and Rudis created a six-step process for building a model that analyzes internal DNS queries, with the goal of reducing or eliminating malicious queries.
1. Define the business problem
Too often security practitioners get lost in the details of the technology and don’t think through the business issue at hand. For example, if the goal is to analyze DNS requests, it’s important to decide whether you want to focus on the thousands or possibly millions of internal DNS requests or on the external DNS requests to a website or ecommerce site. Once you decide what’s more important, a data scientist can build a model to analyze those activities.
2. Decide what data sources would be best to solve the problem
Here’s where you would decide what the model would look like to solve the business problem. For example, if the company decides it wants to stop internal users from clicking on links that result in phishing attacks, it needs to build a model of all internal DNS requests. In terms of the data required, you will need a set of legitimate emails, a set of corrupted emails and the IP addresses and domains from which those emails originate. The data scientist needs to be creative and imagine a world where all the data are available.
3. Take an inventory of the data
Here’s where you have to take an inventory of the data that’s available. While you should aim for perfection, recognize the constraints. Keeping with the DNS theme, most DNS data comes from routers, mobile phones, servers and workstations. Take an inventory of the types of queries being made, then determine whether the data is in a format you can work with and whether you have the IT infrastructure available to store it and access it properly. For example, if you don’t have adequate storage, you’ll need to figure out what you need and what that investment will cost.
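As a minimal sketch of what such an inventory might look like, the Python below tallies queries by source and record type and counts rows that are not in a usable format. It assumes the DNS logs can be exported as CSV. The column names (source_type, query_type, query_name) and the sample rows are hypothetical and will differ depending on your DNS server or collector.

```python
import csv
import io
from collections import Counter

# Hypothetical sample export; real column names depend on your DNS tooling.
SAMPLE_LOG = """timestamp,source_type,query_type,query_name
2024-01-01T00:00:01,workstation,A,example.com
2024-01-01T00:00:02,server,AAAA,example.org
2024-01-01T00:00:03,workstation,A,example.net
2024-01-01T00:00:04,mobile,TXT,example.com
2024-01-01T00:00:05,router,,example.com
"""

def inventory(log_text):
    """Count queries by source and record type; count rows with missing fields."""
    by_source = Counter()
    by_type = Counter()
    incomplete = 0
    for row in csv.DictReader(io.StringIO(log_text)):
        # A row with any empty or absent field is not usable as-is.
        if not all(row.values()):
            incomplete += 1
            continue
        by_source[row["source_type"]] += 1
        by_type[row["query_type"]] += 1
    return by_source, by_type, incomplete

by_source, by_type, incomplete = inventory(SAMPLE_LOG)
print("Queries by source:", dict(by_source))
print("Queries by record type:", dict(by_type))
print("Rows with missing fields:", incomplete)
```

The count of incomplete rows gives a rough sense of how much cleanup the data needs before modeling, and the per-source totals help when estimating storage.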
4. Experiment with many data science techniques
Now it’s time to put your hands to the keyboard and experiment to find which data science technique works best. You may decide on a highly explainable linear model or a deep learning algorithm, but whatever you do, the idea is not to deploy an algorithm for the sake of doing high math. The goal should always be to pick the best way for the machine to deliver analysis that a human couldn’t do and that lets the business make good decisions. In the case of our DNS example, you will want to build models that can consistently tell you with high confidence that a DNS request is malicious.
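To illustrate the idea of trying several techniques against the same data, here is a small Python sketch that compares two candidate rules on a labeled set of hostnames. This is not Sanford and Rudis’s method. The two rules (name length and character entropy), the thresholds and the toy labeled names are all assumptions chosen for illustration, standing in for the real models you would train and compare.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the characters in text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def accuracy(predict, labeled):
    """Fraction of labeled names for which predict() matches the label."""
    correct = sum(1 for name, is_malicious in labeled if predict(name) == is_malicious)
    return correct / len(labeled)

# Toy labeled data, invented for illustration only: (hostname label, is_malicious).
LABELED = [
    ("mail", False),
    ("intranet", False),
    ("wiki", False),
    ("documentation", False),
    ("x7f9q2kz8w1p", True),
    ("q1w9e8r7t6y5", True),
]

# Two hypothetical candidate techniques; thresholds are arbitrary assumptions.
CANDIDATES = {
    "length > 10": lambda name: len(name) > 10,
    "entropy > 3.4": lambda name: shannon_entropy(name) > 3.4,
}

results = {label: accuracy(rule, LABELED) for label, rule in CANDIDATES.items()}
for label, score in results.items():
    print(f"{label}: accuracy {score:.2f}")
```

On this toy set the length rule misclassifies the long but benign name, while the entropy rule does not. In practice you would run the same kind of comparison on held-out data and weigh accuracy against how explainable each candidate is.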
5. Test for a real-world perspective
When testing, the team will want to determine whether the model generates too many false positives or too many false negatives, and whether the analysis happens fast enough to be of use to the business. It’s always important to have a real-world perspective on the purpose of the model you are building. In the DNS example, you should ask whether the model will reduce the number of malicious DNS queries the company makes internally.
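The checks described above can be sketched in a few lines of Python. The example computes the false positive rate, the false negative rate and the average time per scored query. The predictions and labels are toy values made up for illustration, and the function names are hypothetical.

```python
import time

def confusion_counts(predictions, labels):
    """Return (tp, fp, tn, fn) for parallel lists of booleans."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def error_rates(tp, fp, tn, fn):
    """False positive rate and false negative rate (0.0 when undefined)."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

def mean_latency_seconds(score, names):
    """Average wall-clock time for score() to process one name."""
    start = time.perf_counter()
    for name in names:
        score(name)
    return (time.perf_counter() - start) / len(names)

# Toy values for illustration: True means "malicious".
labels      = [True, True, False, False, False, True]
predictions = [True, False, False, True, False, True]

tp, fp, tn, fn = confusion_counts(predictions, labels)
fpr, fnr = error_rates(tp, fp, tn, fn)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Which error rate matters more is a business decision: too many false positives can bury analysts in alerts, while false negatives let malicious queries through.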
6. Follow-up and continuous improvement
Once the testing is complete, a process that can take several weeks, it’s time to put the model into production. However, it’s really important to understand that these models require constant monitoring and continuous improvement. It’s not like deploying antivirus software where every couple of weeks you will get new signatures you can update. The model has to be continuously monitored to ensure that it’s meeting the company’s goal of stopping malicious DNS queries from hitting the internal network.
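One way to make that monitoring concrete is to track how often the model’s alerts turn out to be correct and flag the model for review when that share drops. The Python sketch below assumes analysts confirm or reject each alert. The class name, window size and threshold are hypothetical choices, not values recommended by the article.

```python
from collections import deque

class PrecisionMonitor:
    """Tracks the share of recent alerts that analysts confirmed as malicious."""

    def __init__(self, window=100, min_precision=0.8):
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.min_precision = min_precision

    def record(self, was_truly_malicious):
        """Store the analyst's verdict for one alert raised by the model."""
        self.outcomes.append(bool(was_truly_malicious))

    def precision(self):
        """Confirmed alerts divided by all alerts in the window, or None if empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True when recent precision has fallen below the configured minimum."""
        p = self.precision()
        return p is not None and p < self.min_precision

# Toy usage: a small window so the drop in precision is easy to see.
monitor = PrecisionMonitor(window=5, min_precision=0.8)
for verdict in [True, True, True, True, False, False]:
    monitor.record(verdict)

print("recent precision:", monitor.precision())
print("needs review:", monitor.needs_review())
```

A falling precision is only one signal. A fuller setup would likely also track missed detections and changes in query volume over time.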