Building on research success, CITL grows and focuses on scale

In the past year a lot has happened at CITL:

  • We compared the use of application armoring techniques across thousands of applications in Windows 10, Ubuntu and Gentoo Linux, and Mac OS X.
  • We hacked up a proof-of-concept of our toolchain and looked under the hood at IoT applications and operating systems in use on “Smart” TVs. 
  • We found safety issues that seemed to have gone unnoticed for years in popular applications (e.g. Firefox and Office 2011 on OS X). 
  • We provided both high-level and detailed results of our findings and approaches in our presentations at Black Hat and Defcon. 
  • We collaborated with Consumer Reports and other public interest organizations to codify how to assess the security risks to consumers in the products they buy.
Read more

CITL Status Report

Some people have been asking when they're going to get to see all the great output and data we're generating, so this seemed like a good time to explain where we are right now.  We've realized that while we're busy executing on this plan, others might like a look behind the scenes as well.  In order to have automatically generated reports and software scores available at scale, here's what needs to happen: 

1. Automate Static Analysis measurement collection - This part is done for Windows, OSX, and Linux/ELF environments (Intel and ARM).  It takes ~1 second per binary and we're confident in the accuracy of the results we're collecting.  

2. Collect a lot of fuzzing and crash test data - Well underway, but still ongoing.  We've got about 100 cores chugging away, and enough results now to be moving on to the next step.

3. Correlate dynamic analysis results (2) with static analysis results (1) to finalize score calculations.  This is what we're working on now, and it's the main thing that has to happen before we're happy with releasing reports at scale.  

4. (Reach Goal) Gain enough confidence in our mathematical model to successfully predict dynamic results based on static results.  This will allow us to present estimated crash test results based on easily automated analysis.  
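To make steps 3 and 4 a bit more concrete, here's a minimal sketch of the kind of fit involved, using entirely made-up data and a plain least-squares model as a stand-in for whatever we ultimately settle on: treat each binary's static features as predictors and its observed fuzzing outcomes as the target, and see how much each feature shifts the result.

```python
# Illustrative only: made-up static features and fuzzing outcomes, with an
# ordinary least-squares fit standing in for the real model.
import numpy as np

# Each row: [aslr, dep, stack_guard, unsafe_call_count] for one binary (hypothetical).
static_features = np.array([
    [1, 1, 1,  2],
    [1, 0, 1, 10],
    [0, 1, 0, 25],
    [0, 0, 0, 40],
    [1, 1, 0,  5],
], dtype=float)

# Crashes per million fuzzing executions for the same binaries (hypothetical).
crash_rate = np.array([0.5, 3.0, 9.0, 20.0, 2.0])

# Fit crash_rate ~ static_features @ weights + bias.
X = np.hstack([static_features, np.ones((len(crash_rate), 1))])
weights, *_ = np.linalg.lstsq(X, crash_rate, rcond=None)
print(weights)  # per-feature contributions -> candidate inputs for score weighting
```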

Why is this third step so important?  While we know that some things make software safer (ASLR, stack guards, DEP, source fortification, etc) and that some things make software weaker (using historically unsafe functions, high complexity, etc), the industry needs better data on how much they impact software safety.  If a perfect score is 100, how many points is having ASLR worth?  Linking to insecure libraries certainly introduces risk, but how much should it impact the overall score?  We want to have a better answer to questions like these before we publish our first official software safety reports.  Having a strong model to support our risk assessments will provide our ratings with the credibility they need in order to influence consumers, developers, the security community, and the commercial world.  
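To put the "how many points is ASLR worth?" question in code form, here's a minimal sketch of the shape of a weighted scoring model. The feature names, point values, and penalties below are hypothetical placeholders, not our scoring formula; pinning numbers like these down against real crash-test data is exactly what step 3 is for.

```python
# Hypothetical illustration of a weighted safety score -- the features and
# point values are placeholders, not CITL's actual model.

# Armoring features observed in a binary and the (made-up) points each is worth.
FEATURE_WEIGHTS = {
    "aslr": 25,            # address space layout randomization
    "dep": 20,             # non-executable data (DEP/NX)
    "stack_guard": 20,     # stack canaries
    "fortify_source": 15,  # fortified libc functions
}

# Made-up penalties for risk indicators.
PENALTY_WEIGHTS = {
    "unsafe_function_calls": 2,   # per historically unsafe function (strcpy, gets, ...)
    "insecure_library_links": 5,  # per risky library dependency
}

def static_score(features, risk_counts):
    """Combine observed armoring features and risk counts into a rough 0-100 score."""
    score = sum(w for name, w in FEATURE_WEIGHTS.items() if features.get(name))
    score -= sum(risk_counts.get(name, 0) * w for name, w in PENALTY_WEIGHTS.items())
    return score

# Example: a binary with ASLR and DEP enabled, but two risky library dependencies.
print(static_score({"aslr": True, "dep": True},
                   {"insecure_library_links": 2}))  # -> 35
```

The open question, of course, is whether 25 points for ASLR or a 5-point penalty per risky library is anywhere near right, and that is what the dynamic data is for.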

In the meantime, here's an overview of the sorts of software properties we're measuring with our static analysis.  Also, stay tuned!  We're hoping to have some exciting new partnerships and efforts to announce in the coming months.  

To Upgrade or not? A look at Office 2016 for OSX

When a new version of a familiar piece of software comes out, you have to decide whether or not to upgrade.  Sometimes there's a long-awaited feature and you can't wait to get the latest and greatest; sometimes it looks an awful lot like what you've already got, and it doesn't seem worth the bother.  The security and software risk profiles of the old and new versions are another factor that should be part of this decision.  

When we looked at scores for OSX applications, the Microsoft Office 2011 suite was at the bottom of its category, and the accompanying Microsoft AutoUpdate application was at the bottom of the whole OSX environment.  Since then we've been asked how Office 2016 stacks up in comparison, and it provides an excellent example of hidden benefits of an upgrade, as it has a much better risk profile than the 2011 suite.  

Read more

Fortify Source: A Deeper Dive into Function Hardening on Linux and OS X

Source fortification is a powerful tool in modern compilers.  When enabled, the compiler inspects the code and attempts to automatically replace risky functions with safer, better-bounded versions.  Of course, the compiler can only do that when it can figure out what those bounds should be, which isn't always easy.  The developer gets little feedback on the success rate of this process: they know they enabled source fortification (-D_FORTIFY_SOURCE), but they don't get a readout of how many of their memcpy instances were actually replaced with the safer __memcpy_chk function, for example.  This matters to the consumer because seeing that a good software build practice was intended does not reveal whether the practice actually improved the safety of the resulting application.  That made us really curious to dig into the data on source fortification and its efficacy. 
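As a rough illustration of the kind of measurement involved (this is a sketch, not our tooling), on Linux you can get a feel for fortification coverage by looking at which libc symbols a binary imports: fortified call sites show up as __*_chk imports alongside, or instead of, the plain versions. The function list here is abbreviated.

```python
# Sketch: estimate fortification coverage of an ELF binary by inspecting its
# dynamic symbol imports with the standard `nm` utility.
import subprocess
import sys

# A few of the functions glibc can fortify; the real list is much longer.
FORTIFIABLE = {"memcpy", "memset", "strcpy", "strncpy", "sprintf", "snprintf"}

def fortification_summary(path):
    out = subprocess.run(["nm", "-D", "--undefined-only", path],
                         capture_output=True, text=True, check=True).stdout
    imported = {line.split()[-1].split("@")[0]
                for line in out.splitlines() if line.strip()}
    fortified = {f for f in FORTIFIABLE if f"__{f}_chk" in imported}
    unfortified = {f for f in FORTIFIABLE if f in imported}
    return fortified, unfortified

if __name__ == "__main__":
    fortified, unfortified = fortification_summary(sys.argv[1])
    print("fortified imports:  ", sorted(fortified))
    print("unfortified imports:", sorted(unfortified))
```

Note that a single binary can import both memcpy and __memcpy_chk; the compiler only fortifies the call sites whose bounds it can prove, and that gap between intent and effect is exactly what this post digs into.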

Read more

Software Application Risks on the OSX Continuum

In our previous post about the score histograms for Windows, Linux, and OSX, we promised deeper dives to come. We also noted interesting things about each continuum and reminded people that the real value is in comparing the risk present in various software within a single continuum.  Now we will take our first look at where some applications of interest live on the score continuum for OSX.  We'll look at three categories of software here: browsers, office suites, and update software.  

Read more

Our Static Analysis Metrics and Features

If you've seen our post about the score distributions in the OSX, Linux, and Windows 10 base installs, your first question is probably what goes into computing those scores.  This post provides a high-level look at the factors we consider in those static analysis scores.  

The main question we're trying to answer is, "How difficult is it for an attacker to find a new exploit for this software?"  Attackers have limited resources, and just like anyone else, they don't like to waste their time doing something the hard way if an easier path is available.  They have tricks and heuristics they use to assess the relative difficulty of exploiting software ...

Read more

Score Distributions in OSX, Win10, and Linux

The data we're sharing first is from what we refer to as our static analysis.  (Fuzzing and dynamic analysis data will be described later.)  This is the part of our data most similar to a nutrition label for software, as it focuses on the features and contents of the binary: which application armoring features are present (and how well they were done), how complex the binary is, and what the developer hygiene looks like.  Each binary gets a "local score", based on just that binary, and a "library score", based on the local scores of all the libraries it depends on (directly or indirectly, through other libraries).  These two values are combined to produce the total score.  The charts below show histograms of the total scores for each of the tested environments.  
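To give a feel for the shape of that combination, here's a simplified sketch with made-up local scores, a made-up dependency graph, and an arbitrary 50/50 blend; it is not our actual formula.

```python
# Simplified sketch of blending a binary's own ("local") score with the scores
# of everything it loads -- numbers and weighting are illustrative only.

LOCAL_SCORES = {"app": 82, "libfoo": 60, "libbar": 45, "libc": 90}
DEPENDS_ON = {"app": ["libfoo", "libbar"], "libfoo": ["libc"],
              "libbar": ["libc"], "libc": []}

def transitive_deps(name, seen=None):
    """All libraries reachable from `name`, directly or through other libraries."""
    seen = set() if seen is None else seen
    for dep in DEPENDS_ON[name]:
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, seen)
    return seen

def total_score(name, library_weight=0.5):
    local = LOCAL_SCORES[name]
    deps = transitive_deps(name)
    library = sum(LOCAL_SCORES[d] for d in deps) / len(deps) if deps else local
    return (1 - library_weight) * local + library_weight * library

print(round(total_score("app"), 1))  # -> 73.5 with these made-up numbers
```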

We have static analysis data on the base installs of all three of our initial desktop environments: OSX (El Capitan 10.11), Windows 10, and Ubuntu Linux (16.04 LTS).  Since it came to our attention as an interesting test case, these installs also include Anaconda, the big data analytics package from Continuum.  It's important to note that scores don't compare directly between environments.  If the 50th percentile mark for Windows is higher than the one for Linux, for example, that doesn't necessarily mean anything.  Each environment has different safety features available and different hazards for programmers to avoid, so the score values aren't apples to apples.  This is important enough to bear repeating: consumers should compare the application ratings we will be releasing against each other *within* a particular environment.  What we're focusing on here is the overall distribution and range of scores, as this tells us something about how consistent the development process was for that operating system/environment.  So which company appears to have the better Security Development Life Cycle (SDLC) process for their base OS?
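For anyone who wants the summary statistics spelled out: the percentile markers and histogram bins in the charts below are ordinary computations along these lines (the data in this snippet is randomly generated, purely for illustration).

```python
# Illustrative only: random stand-in data, not real CITL scores.
import numpy as np

scores = np.random.default_rng(0).normal(loc=60, scale=15, size=2000)

p5, p50, p95 = np.percentile(scores, [5, 50, 95])
print(f"5th: {p5:.0f}  median: {p50:.0f}  95th: {p95:.0f}  spread: {p95 - p5:.0f}")

counts, bin_edges = np.histogram(scores, bins=range(-25, 111, 5))  # 5-point bins
```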

We're still finalizing some data before we share reports or scores for specific software verticals, but we can share the overall landscape in each environment.  In a near-future blog post we'll add call-outs for specific applications to these charts, allowing comparisons between competing projects.  For now, we're just getting the lay of the land.  First off, here's the histogram of application scores for the base installation of OSX.  

Histogram of static analysis scores for the base install of OSX El Capitan.  

Base OSX has a roughly bimodal distribution, with the scores covering a pretty broad range: the 5th and 95th percentile lines are 67 points apart.  While there aren't any perfect scores, there are a fair number in the 90s (100 would be a "perfect score" for this high-level static view). [Note: "Perfect score" does not mean secure. Instead, it means that the appropriate safety measures and protective harnesses were enabled in the development process. As an analogy, a car that has seat belts, anti-lock brakes, air bags, active and passive restraints, etc. would receive a "perfect score". We could also call this the "bare minimum", but we're trying to encourage the industry rather than simply berate it. In the future we will release the measurements that show the level of efficacy each of these safety features achieved per application, but for now we are discussing this at a much higher level to help acclimate people to the approach.]   

Similar histogram showing the total score distribution in Ubuntu Linux.  

With Ubuntu Linux (16.04 LTS), we still have a pretty broad range of scores, but with a longer tail in the upper range and more of a normal distribution.  The 95th percentile line moves down a bit, but the other percentile marks are surprisingly close to the values in OSX.  

With Windows, we have a very different picture. Frankly, this is very impressive: 

Windows 10 score distribution.  All scores lower than -22 are from Anaconda.  Without Anaconda included, the 5th percentile is a score of 53 (instead of 40).  

Windows shows a pretty different chart from the previous two.  The bin for scores from 65-70 has about 5,500 files in it.  Even accounting for the larger number of files in the Windows base install, this is much higher than the biggest peaks on either of the other charts, indicating that the application of armoring and security features is much more uniform in the Windows environment.  Windows also has many more files with a perfect or near-perfect score than OSX or Linux did.  While all three environments had the potential for a final score slightly over 100, this is the only environment where any occur in practice.  

This distribution will become a lot broader once we bring in third-party applications.  That prediction is supported by what Anaconda's presence already does to the chart: it contributes the lowest 450 or so scores, including everything scoring -22 or lower.  [Note: Continuum, the company that packages Anaconda, is aware of these findings. Should they decide to move to a modern development toolchain with a full complement of safety features available (and frequently enabled by default), their scores will improve significantly.  We are hoping they do so, as their current product imposes significant risk on their customers, and it would be a great initial win if this got fixed. We will keep you posted on this topic.]

We are installing a wide variety of third party software onto our test systems. This will provide an understanding of the overall software ecosystem for each environment.  More importantly, though, it'll allow consumers to call out all members of a particular software category and compare them based on where they fall on this safety continuum.  

With this information, it becomes a straightforward process to choose and use software that increases the cost to the adversary and thus measurably decreases the risk to you. After all, who would want to install and run the software that makes the easiest target for no reason?

Other Industries that Inspired Us

Evaluating the risk profile of software is a technically complex task, but there are plenty of other industries where consumers have to engage in complex decision-making.  Choosing the right car, the food that best fits your diet, or a new refrigerator are all technically complex decisions, yet those industries have all developed (mandatory) labels to help consumers stay informed...

Read more

CITL's Reception at Black Hat and Defcon

So, first off, thank you!  We've been thrilled by the media coverage and the security community's reaction so far.  It was particularly exciting to hear people in other Defcon and Black Hat talks speculating on how our efforts could inform, support, or feed into theirs (Thanks, Jeremiah!).  While our main reports and content will be tailored to a broader audience, we definitely want our data to be a tool that the security research community can make use of.  We also think this detailed data will be extremely useful to the insurance industry as actuarial data to inform their cyber insurance practices....

Read more