How to interpret the MITRE ATT&CK® Evaluation


In this article we’ll examine the MITRE testing methodology, compare it against what matters in the real world, and offer some practical tips for analyzing the evaluation results.

This article will tell you:

  • What the MITRE ATT&CK® evaluation is, and what it is not
  • How to ask your own questions when evaluating EDR vendors
  • How to compare telemetry, detection types, and EDR solutions

Detecting real-world cyber attacks

One of the biggest challenges for organizations is buying and implementing the right tooling to empower their security teams. Most organizations have a history of inadvertently buying ineffective tooling, or of struggling to get value from existing tooling due to noise or complexity. And while firms such as Gartner provide product guidance, it is often too high level and not based on real-world benchmarking.

To provide organizations with a more detailed analysis of tooling, MITRE launched a program in 2017 [1] to evaluate EDR vendors against the MITRE ATT&CK framework, effectively offering a publicly available, impartial benchmark. The initial results were released in 2018 and give a good overview of the kind of telemetry, alerts, interface, and output you get from each product or service listed.

The assessment was based on a real-world threat group, APT3, and provided a rich set of detection cases to measure against, covering all major areas of the cyber kill chain. However, it did not factor in how effective each product would be in a real-world environment, nor did it cover any aspect of responding to attacks. So although the evaluation is a useful starting point, it should form just one aspect of how you assess an EDR product.

An EDR Product Assessment

The Round 1 MITRE evaluation was essentially a product assessment focused on measuring EDR detection capabilities in a controlled environment, with telemetry and detections as the main assessment criteria. The output was a list of test cases and results for each, focusing mainly on detection specificity and the time taken to receive the information. Taking a simplified approach like this helps break a complex problem space like detection down into something more manageable. But does it oversimplify the problem?

Often in the world of detection it’s not finding the “bad things” that matters, but excluding legitimate activity so your team can more effectively spot anomalous activity. By testing in a noise-free environment, vendors are able to claim to “detect” test cases that would likely have been hidden by noise in the real world. MITRE clearly note this as a limitation, but it’s not obvious when reviewing the results.

Going beyond the product itself, key areas like the people driving the tool and the surrounding process/workflow are noticeably absent from the test, and these are often more important than the tool itself. As such, we’d recommend taking a holistic approach: use the MITRE evaluation as a starting point, but remain aware of its limitations and ask your own questions.

For example:

  • What are the false positive rates like in the real world?
  • Can you demonstrate capabilities that either limit noise or help draw attention to specific activity that closely matches legitimate activity?
  • Can you demonstrate a real-world end-to-end investigation? From a threat hunting-based detection, to investigation, to time-lining and response?
  • Can you issue response tasks in order to retrieve forensic data from the machine?
  • Can you contain an attacker and drive them off the network?
  • Is my detection team technically capable of driving the tool and available 24/7/365?
  • Could you benefit from a managed service and if so can they demonstrate they are able to detect advanced attacks?

But what can you learn from the existing results? And how should you interpret them?

Each vendor has its own set of results consisting of roughly 100 test cases, each with an associated Description, Technique ID, Detection Type, and Detection Notes. The first thing to note is that this is a technical assessment with technical results and no high-level scoring mechanism, so you may need to ask your technical team members (or an external party) for guidance. We’ve included an example result below.

Signal to noise

As many of the MITRE techniques closely match real-world legitimate activity, they can be prone to false positives. For example, Rundll32 usage is common across many organizations, typically making it too noisy for anyone to monitor manually, whereas Mshta is used less often, making it easier to spot. But that noise can be valuable: handled correctly, it adds fidelity.

This is a prime example of where machine learning and the broader context of monitored activity can pick meaningful signals out of the noise, calculating a risk score and only raising an alert when multiple related activities are detected in an unusual context. Your team’s efficiency can improve significantly when they focus on high-risk detections in their broader context and leave machine learning to sift through the higher-volume activity otherwise hidden in the noise.
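The idea above can be sketched as simple risk-score aggregation. This is a minimal illustration, not any vendor’s actual model: the signal names, weights, and threshold are all hypothetical, and real products use far richer context.

```python
from collections import defaultdict

# Hypothetical per-signal weights: individually noisy events
# (e.g. rundll32 launches) score low; rarer, more suspicious ones higher.
SIGNAL_WEIGHTS = {
    "rundll32_launch": 1,
    "mshta_launch": 3,
    "unsigned_dll_load": 2,
    "outbound_to_new_domain": 3,
}
ALERT_THRESHOLD = 6  # illustrative cutoff, tuned per environment

def score_host_activity(events):
    """Aggregate related detections per host and surface only hosts
    whose combined risk score crosses the alert threshold."""
    scores = defaultdict(int)
    for host, signal in events:
        scores[host] += SIGNAL_WEIGHTS.get(signal, 0)
    return {host: s for host, s in scores.items() if s >= ALERT_THRESHOLD}

events = [
    ("ws-01", "rundll32_launch"),         # common in isolation: no alert
    ("ws-02", "rundll32_launch"),
    ("ws-02", "unsigned_dll_load"),       # related activity raises the score
    ("ws-02", "outbound_to_new_domain"),  # combined context crosses the line
]
print(score_host_activity(events))  # only ws-02 is surfaced
```

The point of the sketch is the shape of the approach: no single event alerts on its own, but correlated activity on one host does.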

Figure 1 – The test results give some great technical detail, but no obvious score 

The most relevant fields here are “Detection Type” and “Detection Notes”, as they explain how the vendor performed. Together they summarize whether the vendor logged any associated telemetry and whether there were any alerts/detections related to the activity.

In the following sections we’ll look at how you can assess the importance of both “Telemetry” and “Detections”.

How to measure telemetry

The biggest prerequisite for any kind of detection is having the data to analyze in the first place. Most EDR providers collect real-time telemetry for process data, file data, network connections and services, and registry or persistence data, which together cover a large number of attacker actions. But what are the key factors to look out for here?

Collected Data – Looking at the test cases, you’ll see most products successfully collected telemetry for nearly every test case. One area where quite a few products were caught out was the Empire section, where the actors disable PowerShell logging; only a subset of products detected this activity. Outside of MITRE, you’ll find that more advanced products also collect data associated with memory anomalies and with WMI and .NET activity, which can help detect more cutting-edge attacks.
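Detecting the logging-tampering trick above comes down to watching registry telemetry for writes that switch PowerShell logging off. A minimal sketch follows; the event dictionary shape is hypothetical (real EDR telemetry schemas differ per vendor), though the policy key paths are the standard Windows ones.

```python
# Registry key paths (under HKLM) that control PowerShell logging policy.
POWERSHELL_LOGGING_KEYS = (
    r"SOFTWARE\Policies\Microsoft\Windows\PowerShell\ScriptBlockLogging",
    r"SOFTWARE\Policies\Microsoft\Windows\PowerShell\ModuleLogging",
)

def logging_tampered(registry_events):
    """Flag registry writes that set a PowerShell 'Enable*' logging
    value to 0, i.e. attempts to disable logging."""
    hits = []
    for ev in registry_events:
        if (any(ev["key"].endswith(k) for k in POWERSHELL_LOGGING_KEYS)
                and ev["value_name"].startswith("Enable")
                and ev["data"] == 0):
            hits.append(ev)
    return hits

# Hypothetical telemetry: one tampering event, one benign write.
events = [
    {"key": r"HKLM\SOFTWARE\Policies\Microsoft\Windows\PowerShell\ScriptBlockLogging",
     "value_name": "EnableScriptBlockLogging", "data": 0},
    {"key": r"HKLM\SOFTWARE\SomeVendor\App", "value_name": "EnableFeature", "data": 0},
]
print(len(logging_tampered(events)))  # 1
```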

Timing – Response times matter, and the MITRE results provide a measure of how long it might take for data/alerts to be returned to you from an endpoint. MITRE assign a “delayed” tag to anything that takes more than roughly 30 minutes. While faster data processing is a good thing, the reality is that most real-world breaches take minutes to hours to detect and contain (with an industry average of months to years). So we’d recommend focusing less on the time to receive data and more on whether you are able to detect the attack at all, and how long it takes you to contain it.

Quality – The MITRE evaluation can help you understand whether a product collects basic data for the specific test cases; it won’t, however, help you confirm that the product gives you the necessary context to complete an investigation (this comes back to the isolated-product-test versus real-world issue). For example, a process event will usually contain the path of what executed, but does it also show you the hash, certificate information, parent processes, and child processes? This is not something MITRE measures.
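To make the “context” point concrete, here is a hypothetical enriched process-event schema. The field names are illustrative, not any vendor’s real telemetry format; the point is what a richer event carries beyond the bare executable path.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessEvent:
    path: str                      # what executed (the baseline field)
    sha256: str = ""               # hash, for reputation lookups
    signer: str = ""               # code-signing certificate subject
    parent_path: str = ""          # who launched it
    children: list = field(default_factory=list)  # what it launched
    cmdline: str = ""              # full command line

# An event with investigation context: cmd.exe spawning rundll32
# tells an analyst far more than the rundll32 path alone.
ev = ProcessEvent(
    path=r"C:\Windows\System32\rundll32.exe",
    signer="Microsoft Windows",
    parent_path=r"C:\Windows\System32\cmd.exe",
    cmdline="rundll32.exe shell32.dll,Control_RunDLL",
)
print(ev.parent_path)
```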

Retention – One subtle point with the MITRE evaluation is that testing and assessment are performed back to back, so retention isn’t a factor. In the real world, retention is a huge problem, as EDR datasets can be very large, making long-term storage costly and technically challenging. As a business, it’s important to clarify how long each dataset will be stored, as this can have financial, regulatory, and operational impact. For example, if you don’t have a 24/7 team and something happens on the weekend, the data could be gone by Monday.

Understanding Detection Types

Automated alerting allows your team to scale their detection efforts and increases the reliability of detecting known indicators. Detections are a key component of the MITRE evaluation, with detection quality captured by classifying alerts as enrichments, general behaviors, or specific behaviors. In general, the more specific the indicator the better, as specific indicators create fewer alerts.

Do remember, though, that detections and alerts are just one component of your detection approach and should not be relied on alone, because alerts are “reactive” rather than “proactive”. Used correctly, alerts can help you reliably spot the easy stuff and improve your response times. The risk of taking a purely alert-based approach in highly targeted organizations is that it can create a reactive culture within your team, leading to complacency and a false sense of security. Finding the right balance between reactive, alert-based detection and proactive, research-driven threat hunting, which also addresses never-seen-before techniques that existing tools simply can’t identify as malicious, will help you catch the anomalies that tools and alerts often miss.

Comparing solutions

Although MITRE don’t score solutions, they do provide a comparison tool that lets you easily see how each solution performed for each test case.

Figure 2 – The results are comparable once you understand the detection types

It’s useful to take a holistic approach when comparing results, giving equal weighting to telemetry, detection, and how quickly results are returned (a low number of “delayed” results), as each of these aspects brings different benefits to security teams. For the detection and managed service components, you want to make sure that adequate information is provided to enable your team to respond to any notifications.
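The equal-weighting idea can be expressed as a trivial scoring function. This is purely illustrative; the vendors and coverage percentages below are made up, and in practice you would derive the inputs from the published test-case results and adjust the weights to your own priorities.

```python
def holistic_score(telemetry_pct, detection_pct, timely_pct):
    """Average three 0-100 coverage figures with equal weight:
    telemetry coverage, detection coverage, and share of
    non-'delayed' results."""
    return round((telemetry_pct + detection_pct + timely_pct) / 3, 1)

# Hypothetical vendors, not real evaluation figures.
vendors = {"Vendor A": (95, 40, 90), "Vendor B": (80, 70, 60)}
for name, figures in vendors.items():
    print(name, holistic_score(*figures))  # A: 75.0, B: 70.0
```

Note how a vendor with strong telemetry but weak detections can still score well under equal weighting, which is exactly why the weights should reflect what your team actually relies on.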

Figure 3 – Kill chain analysis providing a holistic view to compare results

Forrester have previously released a scoring tool for the MITRE evaluation. While an interesting approach, its results are heavily weighted towards detections, and, as mentioned already, using detections as your primary evaluation criterion is not a good way of measuring the overall effectiveness of an EDR tool. What matters most in a real-world breach is having the right data, analytics, detections, and response features, and, most importantly, a capable team to drive the tool.

The MITRE Evaluation

The MITRE evaluation is a great step forward for the security industry, bringing some much needed visibility and independent testing to the EDR space. MITRE themselves should be applauded for their efforts, as fairly and independently comparing solutions in such a complex problem space is very challenging.

In the short term, we are excited to announce that F-Secure has just completed the Round 2 MITRE evaluation and we will be posting the results when they are ready.

Follow us on Twitter and LinkedIn to be the first to know!