Scaling Variant Analysis
For the past handful of years I've been really interested in static analysis, but not from the traditional appsec program perspective of shifting left and catching bugs before they get merged. Instead I use it for code exploration, vulnerability discovery, and variant analysis. I want to share a bit about how I use these tools because, truthfully, I think it helps get more value out of them and, selfishly, I want the vendors to invest more in supporting these use cases.
What I often see with teams deploying these tools is setting up CI pipelines and running the default rulesets. There's a time and a place for the default rulesets, but overall I don't find them particularly useful, especially for source code auditing and vulnerability research. Outside of the most egregious issues they're likely not going to find much, but they can be a good safety net and starting point for your SDLC. They can also be a great reference for how to implement rules for specific conditions. Why aren't these default rulesets very good? There are a lot of reasons, but I'll focus on a few.
When you look at what these tools include by default, it's typically about breadth. The vendors want to be able to say that they cover the full OWASP Top 10 or some percentage of CWEs. Otherwise, they're not going to check the box for the people buying. This isn't specific to SAST; it's common across most security products. You'll see them go for wide coverage but not deep, so you'll get coverage for some variants but not all. Building rules is hard and takes time. There's lots of variation, and you need to prioritize what's going to demo well and be most useful for the largest group of customers. There are only so many broadly applicable, low-false-positive rules you can build before you start needing to chase the long tail, which incurs diminishing returns. I think pushing rule development in house to the customers is likely the right path because you need specialized knowledge of the environment; I just don't think most customers are willing to buy into this right now. Most organizations are buying automation because they don't want to increase headcount.
A high false positive rate is the fastest way to get ripped out and not have the contract renewed. If you're breaking the build without a good reason, people are going to get angry. You're going to slow down development, and you only get so many chances until the business is fed up. Certain vulnerability classes are undecidable problems, which means there are limits to what you can detect. This is often worked around by scoping down the problem or only detecting certain conditions of a specific vulnerability class.
One of the reasons SAST is performant is the shortcuts that tools choose to take. Virtually all of them trade comprehensiveness for performance. One of the main choices many tools make is to not analyze dependencies and instead stick to the application code itself. This limits the analysis scope to a manageable amount, but it means the rules you develop need to be aware of the abstractions the libraries provide. If they aren't, you get false negatives. This means the focus on general rules becomes much less useful, especially if your organization uses in-house or more niche frameworks and abstractions.
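As a purely hypothetical illustration of what I mean by abstractions hiding sinks, imagine an in-house helper that wraps raw SQL execution. A stock injection rule written against the driver's `execute` call would flag the first application function below but not the second, unless the rule (or the tool's analysis) is taught to treat the wrapper as a sink. The names here are made up for the sketch:

```python
# query_helpers.py -- hypothetical in-house database helper
import sqlite3

def run_query(conn: sqlite3.Connection, sql: str):
    # The "real" sink lives here, one layer below application code.
    return conn.execute(sql).fetchall()

# app.py -- the application code a default ruleset actually scans
def find_user(conn: sqlite3.Connection, username: str):
    # Direct use of the driver with string concatenation: a stock SQL
    # injection rule keyed on execute() will usually catch this.
    return conn.execute("SELECT * FROM users WHERE name = '" + username + "'")

def find_order(conn: sqlite3.Connection, order_id: str):
    # The same injection, routed through the in-house wrapper. A rule that
    # only knows about execute() never sees tainted data reach a sink here
    # unless run_query is modeled as one.
    return run_query(conn, "SELECT * FROM orders WHERE id = '" + order_id + "'")
```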
If code isn't being actively developed, it's often not being scanned, because CI pipelines aren't being run. Unfortunately, a change to code is typically the entry point for kicking off a scan. This is fine if the tools are staying static, but if you're adding new rules, the tools are improving, and the analysis is getting better, you're potentially going to miss vulnerabilities. You're going to have blind spots. This was a major concern I had. You only find bugs where you look.
What did I want? Primarily the ability to perform ad hoc scans against a specific subset of an ecosystem. I wanted to be able to define an experiment, test a hypothesis, and understand the full scope of a problem across everything. I wanted to conduct surveys. I wanted to do all of this separately from the standard SAST workflow to ensure the results didn't get surfaced until the quality was there. This would give me the ability to iterate on rules and try things that didn't scale or may have been low quality, without impacting engineers or their workflow. There are a lot of ways to learn about vulnerabilities; they don't all have to come from our own discovery. Do you have an internal red team or pentest team? What about a bug bounty program? Are vulnerabilities ever found during code reviews? There are a lot of ways you can learn about the presence and types of vulnerabilities within code bases. These give you a starting point.
The thing about vulnerabilities is that, for a lot of classes, if they exist in one place there's a good chance they exist somewhere else. People joke about Stack Overflow copy/paste programming, or, maybe more appropriate today, AI-driven programming, but it's common. There's often a lot of repetitive boilerplate or similar functionality needed in multiple places, and whether it's the same engineer writing it multiple times or another one copying how it was done elsewhere, there's a good chance you'll end up with the same bug. When I talk about scale, this is what I'm talking about. Once we've found a bug in one place, how can we identify everywhere else it occurs? There's even a name for this: variant analysis.
This is something we can do manually today. That works OK if the vulnerability is fairly straightforward, the code base is small, or you only care about a single code base. Each of these factors can increase the work exponentially, as you need to manually perform data flow tracking, verify data isn't sanitized or encoded, and check for other conditions that mitigate the vulnerability. Static analysis tools can help us do this more efficiently; they can automate some of the process. If we can build rules that identify the bug and its variants, we can run them against the project and identify other cases. But what happens when we want to scale this to an entire ecosystem? For small companies with only one or a few repos, this isn't a big deal. You can probably run this on a single system manually. For large companies that have tens or hundreds of thousands of repos, that's just not viable.
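To make that concrete, here's a minimal sketch of the single-system approach, assuming a hypothetical rules file and a placeholder repo list: clone each repo and run Semgrep against it serially. This is fine for a handful of repos and falls over well before tens of thousands.

```python
# scan_repos.py -- naive, single-machine variant analysis sketch. The repo
# list and rule file are placeholders; this is not how MRVA or Code Search
# orchestrate runs.
import json
import subprocess
import tempfile

REPOS = ["org/service-a", "org/service-b"]  # hypothetical targets
RULES = "my-variant-rule.yaml"              # hypothetical Semgrep rule file

def scan(repo: str) -> list:
    with tempfile.TemporaryDirectory() as workdir:
        # Shallow clone keeps this tolerable for a few repos; at ecosystem
        # scale the clone and scan time alone make this approach unworkable.
        subprocess.run(
            ["git", "clone", "--depth", "1", f"https://github.com/{repo}", workdir],
            check=True,
        )
        result = subprocess.run(
            ["semgrep", "scan", "--config", RULES, "--json", workdir],
            capture_output=True, text=True, check=True,
        )
        return json.loads(result.stdout).get("results", [])

if __name__ == "__main__":
    for repo in REPOS:
        findings = scan(repo)
        print(f"{repo}: {len(findings)} findings")
```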
Both GitHub and Semgrep have tools targeted at this use case. In 2023 GitHub released Multi-Repository Variant Analysis, or MRVA, after a private beta. It's available today and, I believe, can be used under the same terms as the CodeQL licensing. There's a blog post that introduces it and covers it in more depth, which I encourage you to read if you're interested.
MRVA is run primarily through the VS Code CodeQL extension. Basically, you write or load up a query and then, in the left rail, you specify your targeting. GitHub provides Top 10, 100, and 1000 lists which automatically select the top n repositories for the language targeted by the query. You can also provide your own list of repos. Using the command palette you can run the variant analysis task, and a new tab will open with the results. The extension takes care of packaging everything together and sends it up to GitHub to run through a special action. You can see here that it ran against 1,000 repos in about 23 minutes. That will vary based on the query, the repos being targeted, etc., but overall it's pretty performant.
When you have findings, you can drill into them, and it functions much the way results do for individual queries. It highlights the code snippets that match the query and, when data flow analysis or taint tracking is in use, gives you clear steps through the flow so you can verify it.
Semgrep has a tool, currently in beta, called Code Search. I'm not sure how widely it's been rolled out yet, but I've had the chance to play with it. It's accessed through their web application. You can select a rule from your own registry or one of the public ones.
Targeting is currently done in one of two ways: selecting sets of your private repos, if you've added the Semgrep app to your GitHub org, or using GitHub code search to find repositories that meet your specified criteria.
The analysis runs within Semgrep's infrastructure and they handle all the orchestration. Given that it's very new, I haven't done enough real-world usage to understand what performance is like, but so far it seems in line with what I was seeing with MRVA. The only major limitation I've seen thus far is the cap of 30 repos at a time when using the search function for targeting; however, I've heard that this is just a limit of the beta and isn't intended to be present when it releases. You're provided a page that includes any findings, organized by repo. Currently this results view doesn't provide data flow traces like you can get from the CLI, which makes it hard to review when using taint tracking, but I expect that just hasn't been implemented yet since it's a feature of Semgrep today.
So we're good to go, right? Not exactly. We have some options available to us, but both lack maturity and some pretty key features and functionality. Both GitHub and Semgrep seem to be investing in these products and I'm hopeful we'll see them mature, but currently I just don't feel they're there yet.
Targeting is currently completely inadequate. CodeQL and Semgrep each do parts of it OK, but effective targeting still requires a lot of manual work and often custom tooling. With CodeQL, the only way to select targets is to either use one of the pre-generated top lists or create your own. Creating your own is an exercise left to you. You can use GitHub search to identify repos matching certain conditions, but the API enforces result limits, doesn't handle de-duping of repos, and doesn't currently provide access to the newer code search, which is more powerful; you also need to build your own tooling to actually extract a list of repos and turn it into a format the VS Code extension can work with. Semgrep integrates the search into the product but doesn't provide a way for you to supply your own targeting list or to handle batching into 30-repo segments. Both currently only support targeting repos hosted on GitHub, and if you're on Enterprise Server or another hosting platform they just don't work.
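To give a sense of the glue tooling I mean, here's a minimal sketch, assuming a personal access token in `GITHUB_TOKEN` and a made-up search query, that pages through GitHub's repository search API, de-dupes the results, and prints one `owner/repo` per line. Reshaping that output into whatever format your variant analysis tool expects is still on you:

```python
# build_target_list.py -- hypothetical glue tooling for building a target list
import os
import requests

TOKEN = os.environ["GITHUB_TOKEN"]
QUERY = "language:python topic:web stars:>100"  # placeholder search criteria

def search_repos(query: str, max_pages: int = 10) -> list[str]:
    seen: set[str] = set()
    repos: list[str] = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers={
                "Authorization": f"Bearer {TOKEN}",
                "Accept": "application/vnd.github+json",
            },
            params={"q": query, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break  # the search API caps results at 1,000 per query anyway
        for item in items:
            name = item["full_name"]
            if name not in seen:  # de-dupe across pages ourselves
                seen.add(name)
                repos.append(name)
    return repos

if __name__ == "__main__":
    for repo in search_repos(QUERY):
        print(repo)  # one owner/repo per line; reshape for your tooling
```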
Both tools are currently tied to GitHub. Given that GitHub owns CodeQL, it's unclear whether they'll ever support using MRVA with repos not hosted on the platform. Because it requires building a database and running the queries against that database, it has a hard dependency on Advanced Security. It leverages GitHub for database storage, and there isn't currently a way to move that out to another platform. Even if you wanted to use it with a different source code hosting platform, you'd still have that dependency. In talking with Semgrep I've been told they're interested in expanding beyond GitHub. It's unclear what that timeline looks like, but they definitely have less dependence on GitHub and less incentive to keep you on the platform.
This leads into the issue of compute. MRVA uses GitHub Actions for its orchestration and compute, and it has a hard dependency on it. If you're an Enterprise Server customer, I'm not sure if it's even an option or if you have to be in GitHub Cloud. It definitely doesn't allow self-hosted runners, which means you're stuck paying for GitHub-hosted compute, and that can get very expensive very fast. There's nothing public about this changing, but I'm hopeful it will eventually. With Semgrep the compute is completely opaque. When I spoke with them they said it wasn't on the roadmap but they were interested in chatting about it, so it's unclear if this will be addressed. Because it's in beta, it's also unclear how billing and pricing are going to work. I think the major issue here is that for businesses used to operating on-prem, or just non-SaaS, there isn't really a path forward.
Campaign management is non-existent in both tools, and I haven't seen anything else yet that really attempts to solve this. CodeQL's VS Code extension will save the results from previous runs, letting you refer back, but there are no management, organization, or annotation features. You can export the data, but you can't import it, and it doesn't export to SARIF, so it's not particularly useful. The data is mostly stored locally, which means it can't easily be shared between systems and you can't work with other people easily. There's no synchronization point or way to collaborate. With Semgrep there's even less; however, it's still in beta and much earlier along than MRVA. In talking with Semgrep, my understanding is that this is something that's coming, though I don't know when. I've provided feedback to them a number of times and am hopeful we'll see something eventually.
Outside of managing the orchestration and providing compute, reviewing the results is probably the most mature feature these tools have today. Both provide a good starting point and enough tooling to be effective. This starts to fall apart when you need to manage a lot of results. When you want to iterate, there's no good way to diff results, diff changes to the rules, or keep track of where you left off. If you're only looking at a small set, it's not so bad, but if you're running a campaign over the course of weeks or months, it becomes difficult to track progress.
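In practice this is something I end up scripting myself. A minimal sketch, assuming you've exported each run to a JSON file of findings with repo, path, rule ID, and line fields (the field names here are my own assumption, not any tool's schema), that shows what's new and what's resolved between two iterations:

```python
# diff_runs.py -- hypothetical helper for comparing two result exports
import json
import sys

def load_keys(path: str) -> set[tuple]:
    # Key each finding by where it was found and which rule fired. The field
    # names are assumptions about your own export format.
    with open(path) as f:
        findings = json.load(f)
    return {(f["repo"], f["path"], f["rule_id"], f["line"]) for f in findings}

if __name__ == "__main__":
    before, after = load_keys(sys.argv[1]), load_keys(sys.argv[2])
    print("new findings:")
    for key in sorted(after - before):
        print("  ", key)
    print("resolved findings:")
    for key in sorted(before - after):
        print("  ", key)
```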
Lastly, I think one of the larger issues is that these variant analysis platforms are each limited to a single static analysis tool. I think CodeQL and Semgrep are both great products. They each have their own strengths and weaknesses, and I think they complement each other very well. Depending on what I'm trying to do, I may reach for one or the other. There are other tools out there, and it's likely we'll continue to see additional ones developed and released. Orchestration is likely the hardest part to solve, outside of delivering an all-in-one SaaS product, and both of these tools do that, albeit in very limited ways. It's currently the case, and probably going to remain so, that the variant analysis tooling is tied to the vendor's own static analysis tooling, which I think is unfortunate.
In my case, none of these variant analysis tools existed when I decided I wanted this capability. We were forced to build our own tooling, which is still very immature, not very high quality, and very specific to our environment. That being said, it was enough to prove out the strategy and the value. I think there's definitely a desire for this tooling, but I don't think it will ever be the primary product or the differentiator for a venture-scale company. The real differentiation and defensible product is the static analysis tooling itself. Unless you're tied to a specific static analysis tool, you probably want to be able to swap out the static analysis portion. Sometimes grep might be enough, or you want a bespoke tool. Or maybe something new enters the market that you want to try. If so, you want to be able to use your existing variant analysis workflow in the same place as everything you've already done. I'm hopeful we'll see some development in the open source space to fill these gaps. It's something I'm interested in.
Finally, I want to give a huge shoutout to Blaine who built the majority of our implementation and helped bring this vision to life.