A Deeper Look at Modern SAST Tools
Now that we have some resources and know a bit about CodeQL and Semgrep, let’s dive in and take a deeper look at them both. While both are good examples of modern static analysis tools, they are quite different. As I said in the first blog post, I don’t think either one is necessarily right or wrong in its approach; I think they are complementary. I want to get deeper into why I feel that’s the case and talk about some of the times you may want to choose one over the other. I’m going to leave a comparison of writing CodeQL queries and Semgrep rules to another post because there is so much to cover there that I couldn’t do it justice here.
Use Cases
I’ll start by saying that there are two use cases for both CodeQL and Semgrep. The one I care about is augmenting my vulnerability research workflow to help identify security vulnerabilities in a given code base. The one that both GitHub and R2C are primarily trying to sell you is leveraging these tools as a major component of your shift left strategy. They are targeting companies that want to replace an existing SAST tool, or that haven’t rolled one out yet and need a first implementation. Their goal is integration into your CI pipeline, running either on an interval or as part of your code review process, to identify issues before they make it into production. This is clear from their licensing models, which we’ll talk about shortly.
There are several categories of issues that either tool can be used to find:
- Security vulnerabilities
- Non-security bugs
- Style enforcement/linting
Both tools come with a library of queries and rules that implement detections for each of these categories, along with the ability to write your own. I haven’t spent enough time evaluating the library that comes with each to really speak to its quality or comprehensiveness because in most of my uses, I’ve written custom rules to detect very specific conditions tailored to the code base itself rather than general rules. I mostly used these to perform variant analysis, identifying other locations where a specific vulnerability pattern existed, or to perform data flow analysis while excluding cases where a specific step was taken that would eliminate the vulnerability (e.g. sanitizing input for a SQL query). My assumption is that the generic queries and rules in these libraries are fairly limited, covering only the most egregious issues, in order to avoid false positives and apply to the widest possible set of code bases. The included libraries are something that I’d like to take a deeper look at in the future but from skimming through them, they don’t seem like they would be particularly helpful on their own for my use case.
Licensing
Licensing is actually a pretty big deal depending on what you want to do and what features you need. CodeQL licensing is fairly simple. You can use all features of the tool absolutely free so long as you are using it with open source code. The second you want to use it with closed source code (with the exception of a few narrow carve outs) you need a commercial license. This is included with GitHub Advanced Security (GHAS), one of their commercial products and an add-on to GitHub Enterprise (SaaS or server), where you pay per user of the organization. If you’re not an existing GitHub customer or are trying to use it outside of the GitHub ecosystem, good luck.
They really don’t want to sell you a license, and unless you’re prepared to put up some serious money (hundreds of thousands of dollars) they don’t really even want to talk to you, even when you tell them you have budget approved and are trying to throw money at them. Their business model is mostly centered around selling GHAS and charging businesses per committing user for every repo that is being scanned, rather than providing commercial access to vulnerability researchers. I lead an offensive security team that regularly jumps in and out of repos for short periods, and the engineers who commit to those repos aren’t the ones using the tooling, so it’s just not a practical option for us. The pricing doesn’t align with our use case and isn’t realistic for most teams/businesses using it in this way.
After a lot of back and forth, some cold emailing, and harassing the right people, I was able to secure some licenses for non-open source work, but this is by far one of the most painful stumbling blocks to using CodeQL legally in a vulnerability research capacity. Depending on your use cases, this may or may not matter. Unless you are on an internal vulnerability research team or doing open box assessments at a security consulting firm, there don’t seem to be many situations where you’d have access to closed source code that you want to use CodeQL with.
Semgrep licensing is both simpler and more complex depending on what features you care about. R2C provides an open source version of Semgrep which is available for use with both open and closed source software. This OSS version is limited to intrafile, intraprocedural analysis. They are also selling a new Pro version which adds both intrafile interprocedural analysis and interfile interprocedural analysis. This is available at the “team” tier and above, which again is billed per dev per month. They also provide a “Is there a special pricing for security consultants / early stage startups?” FAQ on the pricing page, which is realistically the path that most researchers are going to take to get Pro. It’s unclear what licensing looks like with this option or how hard it’s going to be to get it if you want Pro. I’ve primarily been using OSS for my use cases thus far, and it’s done what I’ve needed for the narrow scope I’ve used it in. Aside from the interfile and interprocedural support, the other main difference between OSS and Pro is the addition of some Pro only language support. As of writing, the only one mentioned is Apex, a language used by Salesforce for customization.
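To make the intraprocedural versus interprocedural distinction a bit more concrete, here is a small sketch of code under analysis. Everything in it is hypothetical (the function names, the request_args dictionary, and the execute stand-in), and whether a particular engine reports either case depends on the rule and the engine, so treat it as an illustration of the terms rather than a statement about what any tool will flag.

```python
# Hypothetical code under analysis. All names are made up for illustration.

def execute(query: str) -> str:
    # Stand-in for a database call (the "sink"). It only echoes the query
    # so the example runs without a database.
    return query


def lookup_inline(request_args: dict) -> str:
    # Source and sink sit in the same function body, so an analysis that
    # never leaves the function (intraprocedural) can connect them.
    name = request_args["name"]                                    # source
    return execute("SELECT * FROM users WHERE name = '" + name + "'")  # sink


def run_lookup(name: str) -> str:
    # Looked at in isolation, nothing here says that 'name' is
    # attacker controlled.
    return execute("SELECT * FROM users WHERE name = '" + name + "'")  # sink


def lookup_via_helper(request_args: dict) -> str:
    # The tainted value only reaches the sink by crossing a call boundary,
    # which is the interprocedural case. If run_lookup lived in another
    # file, it would also be the interfile case.
    name = request_args["name"]                                    # source
    return run_lookup(name)


if __name__ == "__main__":
    payload = {"name": "x' OR '1'='1"}
    print(lookup_inline(payload))
    print(lookup_via_helper(payload))
```

Both functions build exactly the same injectable query. The only difference is how far the analysis has to follow the data to see it.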
Tooling
I’ve touched on tooling a little bit already in previous posts, specifically the CodeQL VS Code extension and the Semgrep playground. I’d like to dive a bit more into both of these and a few other tools that exist within the ecosystems. Given that Microsoft owns GitHub and GitHub owns CodeQL, it’s fairly obvious that they would choose VS Code to build on. I’m pretty vocal about disliking VS Code. I think the toaster popups are super irritating, a lot of the extensions have issues, and overall integrations are very hit and miss. That being said, the CodeQL extension is wonderful to work with. It truly makes the experience significantly better. Between code completion, quick queries, management of databases, integration into GitHub, AST visualization, query history, and interaction with query results, you really couldn’t ask for a better way to work with CodeQL. This is especially true if you’re using VS Code for your code exploration and auditing of a code base, because it lets you stay in one place with everything integrated. It’s also nice when you’re iterating on a query because it allows you to go step by step, especially as you start to refactor (such as turning freestanding queries into classes) or start to use more advanced features such as data flow or taint tracking.
There’s definitely room for improvement and I’ve given them some direct feedback on it. As someone who prefers JetBrains products it would be nice to see an official extension in their ecosystem but I’m not holding my breath for it. But that’s really about it on the CodeQL side. GitHub itself has a number of integrations and brings it in for “Code Scanning” which is their SAST offering that leverages CodeQL but generally speaking, that’s probably not going to be particularly interesting for vulnerability research.
On the Semgrep side, the story isn’t quite as good. There is a VS Code extension as well though it calls out specifically that it’s not actively maintained. I haven’t personally used it because of that. It’s open source but hasn’t been updated since July of 2022.
In terms of interactively working with findings, the best option I’ve found is leveraging the SARIF output that Semgrep can generate. If you’ve never worked with SARIF it’s an open standard file format that uses JSON and is designed to be an interchange format between static analysis tools. VS Code has an extension that adds support for SARIF files. It can load them and bring up an interactive pane that allows you to dig through the findings and take you to the locations in code that have been matched. It even works well with taint tracking based findings and will specifically annotate the code where different taint steps take place. It’s a little finicky getting the files loaded but overall works well enough. Unfortunately, I haven’t found a good extension for the JetBrains ecosystem so again, stuck in VS Code. If that’s what you’re used to, you should feel at home.
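If you’d rather triage SARIF output from a script than from an editor pane, the format is simple enough to walk by hand. Below is a minimal sketch, not anything Semgrep ships: the embedded document is a hypothetical, heavily trimmed SARIF 2.1.0 file, and the rule ID, paths, and helper names are all made up. The field names (runs, results, locations, physicalLocation, codeFlows, threadFlows) come from the SARIF 2.1.0 schema, but a given tool may not populate every one of them.

```python
import json

# Hypothetical SARIF 2.1.0 document, trimmed to the fields used below.
SAMPLE_SARIF = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "example-scanner"}},
    "results": [{
      "ruleId": "hypothetical.sql-injection",
      "message": {"text": "User input reaches a SQL query"},
      "locations": [{
        "physicalLocation": {
          "artifactLocation": {"uri": "app/db.py"},
          "region": {"startLine": 42}
        }
      }],
      "codeFlows": [{
        "threadFlows": [{
          "locations": [
            {"location": {"physicalLocation": {
              "artifactLocation": {"uri": "app/views.py"},
              "region": {"startLine": 10}}}},
            {"location": {"physicalLocation": {
              "artifactLocation": {"uri": "app/db.py"},
              "region": {"startLine": 42}}}}
          ]
        }]
      }]
    }]
  }]
}
""")


def _position(physical: dict) -> tuple:
    """Return (uri, start_line) from a physicalLocation object."""
    uri = physical.get("artifactLocation", {}).get("uri", "")
    line = physical.get("region", {}).get("startLine")
    return uri, line


def iter_findings(sarif: dict):
    """Yield (rule_id, uri, line, message) for every result location."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            message = result.get("message", {}).get("text", "")
            for location in result.get("locations", []):
                uri, line = _position(location.get("physicalLocation", {}))
                yield result.get("ruleId"), uri, line, message


def iter_flow_steps(result: dict):
    """Yield (uri, line) for each step of a result's code flows, if any."""
    for code_flow in result.get("codeFlows", []):
        for thread_flow in code_flow.get("threadFlows", []):
            for step in thread_flow.get("locations", []):
                physical = step.get("location", {}).get("physicalLocation", {})
                yield _position(physical)


if __name__ == "__main__":
    for rule_id, uri, line, message in iter_findings(SAMPLE_SARIF):
        print(f"{rule_id} {uri}:{line} {message}")
```

To use it on real output, replace SAMPLE_SARIF with json.load() on a file produced by running Semgrep with its --sarif flag. The code flow steps are where the taint path shows up, when the tool emits one.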
R2C also provides the Semgrep app. It’s really meant to integrate, manage findings, and handle management of the rules. Overall I haven’t felt it was particularly useful for my use cases. I tend to write my rules locally and manage running/distributing them myself. If you were farming out across a cluster, I could see it being helpful but overall my workflow has held up pretty well.
Language Support
Depending on the languages you work with, this may be one of the determining factors in which tool(s) you can use. Both tools cover a wide range of common languages, and both projects publish a support matrix. They both do a pretty good job of hitting a lot of the common languages, though CodeQL specifically does not support PHP, and C and C++ support in Semgrep is experimental. The other thing to note is that while support for various languages is listed, not all support is equal.
If you look through the issue tracker for both you’ll see numerous issues where specific languages aren’t behaving as intended. In my limited experience thus far, I’ve had fewer issues with CodeQL than I have with Semgrep in terms of bugs in language support. I can’t speak to objective quality or correctness but I can provide some examples of issues that I’ve run into. Before getting there, I think it makes sense to briefly look at how the two work as that will provide a bit of context.
My understanding of how CodeQL implements support for a language is that they build an extractor for a specific language. This typically leverages the compiler/interpreter for the language to model the code and create the necessary relationships that are stored in a database. Then a standard library needs to be built that models the language syntax and semantics which can be used to write queries. This standard library is written in CodeQL itself. In most cases the standard library is going to differ at least somewhat from language to language which means knowing one well won’t necessarily translate to another which can be frustrating but in my experience, the documentation has been pretty good and it hasn’t been too hard to go from one language to another. Most of the high level concepts are fairly similar.
My understanding is that the extractors themselves are open source and theoretically you could build your own though as someone who has done a fair bit of parser development and analysis work, this doesn’t seem super practical for a large language for an individual and I probably wouldn’t do it given the licensing issues discussed above. Ultimately what this means is that in some/most cases, the real parsers are used for parsing the code under analysis so if the language implementation thinks it’s valid code, CodeQL should too.
Semgrep on the other hand leverages Tree-sitter, a common parser generator library used in a lot of projects. With Tree-sitter and Semgrep OSS being open source, theoretically it may be possible to add support for a new language as well but it’s unclear how much work this would be and whether it would work with the Pro version of Semgrep. From GitHub issues it seems like they leverage the out of the box grammars for parsing languages which is sort of a double edged sword. It means they get lots of language support for free with minimal work but they are also limited to what exists, unless they want to write their own, and limited by the quality of the available grammars, unless they want to fix them. What this seems to result in is being able to offer support for more languages faster but those languages are not always the most correct.
An example of this is an issue I found in Semgrep’s PHP handling in which it treats function calls as case sensitive despite PHP not enforcing case sensitivity for functions. This results in false negative detections within Semgrep. In general, a lot of the pretty general stuff seems to work well across languages, with the exception of the occasional bug, but as you get into some of the more esoteric and lesser known capabilities within languages you start to see them handled incorrectly.
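To show why the PHP case sensitivity issue turns into a false negative, here is a toy model. It is not how Semgrep matches anything internally; the function names and the list of call names are hypothetical, and the only fact it relies on is that PHP resolves function names case-insensitively.

```python
# Toy model of matching call names against a rule target.
# This is NOT Semgrep's implementation, only an illustration of the effect.

# Hypothetical call names as they might appear in PHP source.
CALLS_IN_SOURCE = ["unserialize", "UNSERIALIZE", "Unserialize", "json_decode"]


def match_case_sensitive(calls: list, target: str) -> list:
    # Compares names byte for byte, so only the exact spelling matches.
    return [name for name in calls if name == target]


def match_php_semantics(calls: list, target: str) -> list:
    # PHP treats function names case-insensitively, so every spelling
    # of the same name refers to the same function.
    return [name for name in calls if name.lower() == target.lower()]


if __name__ == "__main__":
    print(match_case_sensitive(CALLS_IN_SOURCE, "unserialize"))
    print(match_php_semantics(CALLS_IN_SOURCE, "unserialize"))
```

The two calls that the case sensitive matcher skips are the false negatives: PHP would happily run them, but a rule written against the lowercase spelling never sees them.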
Automated Fixes
I wanted to call this out as something interesting that Semgrep provides. It may or may not be particularly useful for the vulnerability research use case but could be interesting to people leveraging these tools as part of a shift left strategy. Semgrep provides the ability to define automatic fixes within a rule. When a pattern matches, the fix can be applied to the source. Theoretically this could be used to automatically submit Pull Requests and not require the engineers themselves to write the fix. It doesn’t seem super flexible and probably is not the ideal way to handle disclosure, if that’s something you want to do, but it is an option.
If it’s something that you want to leverage, the usefulness is likely going to be very rule dependent (e.g. can you programmatically define how to fix the code in a correct way?). I don’t know how often there is going to be a clear path to this and there are likely to be negative second order effects. For instance:
- Does it break downstream software (e.g. API breakage)?
- Does it change serialization/data formats and corrupt/break persistent storage?
These are serious concerns. Until I had some real experience to back up its use, I wouldn’t want to depend on this for anything but the simplest issues.
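As a rough illustration of the data format concern, here is a sketch of a mechanical fix that is reasonable as a code change and still breaks stored data. It uses a plain regex substitution as a stand-in for a rule-defined fix, not Semgrep’s actual fix mechanism, and the source line, function name, and scenario are all hypothetical.

```python
import hashlib
import re

# Hypothetical line of code that a rule flags for using a weak hash.
ORIGINAL_SOURCE = "digest = hashlib.sha1(data).hexdigest()"


def apply_fix(source: str) -> str:
    # Regex stand-in for an automated fix: swap sha1 for sha256.
    return re.sub(r"\bhashlib\.sha1\(", "hashlib.sha256(", source)


if __name__ == "__main__":
    print(apply_fix(ORIGINAL_SOURCE))

    # Second order effect: anything persisted with the old algorithm no
    # longer matches, and the digest no longer fits a 40 character column.
    data = b"example record"
    stored_digest = hashlib.sha1(data).hexdigest()     # written before the fix
    new_digest = hashlib.sha256(data).hexdigest()      # computed after the fix
    print(len(stored_digest), len(new_digest))
    print(stored_digest == new_digest)
```

The rewritten line is correct in isolation. The breakage only shows up in whatever was already written to disk or sent to another service, which is exactly the kind of thing a pattern-level fix can’t see.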
Wrapping Up
I hope this was helpful in outlining some of the similarities and differences between CodeQL and Semgrep and really showing where their strengths and weaknesses lie. In the future we’ll dive into the queries and rules and do an in depth comparison of both.