Learning CodeQL // Going Beyond Grep

Unlike many SAST products, CodeQL is more than just a tool, and learning it means learning more than a single piece of software. It’s a programming language, a tool, and a supporting ecosystem that come together to create something extremely powerful, flexible, and unique. Let’s start with the language. Query Language, or QL, is an object-oriented logic programming language. If you don’t know exactly what that means, don’t worry; you will once you’ve dived into some of the content below.

One of the nice things about CodeQL is that there are so many resources and paths to learning it. This post focuses on aggregating that information and sharing what has been helpful to me, so that you can learn it quickly and effectively without having to figure out a path yourself.

Depending on your learning style, there are a few different options. Do you need an instant dopamine hit? Skipping ahead to the workshops is probably for you. They don’t go as deep, and you’ll eventually want to come back to the more traditional resources to learn things in depth, but they will get you going as fast as possible and let you see results almost immediately. If you’re good with the slow and steady approach and really want to learn the ins and outs as you go, stay the course and let’s start from the beginning.

Language Tutorials

CodeQL includes a number of language tutorials intended to teach you the QL language and get you familiar with its syntax and semantics. This is going to be critical down the line once you start building your own queries, libraries, and packs. You can find them here and go down the list one by one until you’ve completed them.

These tutorials are intended to introduce you to the language and teach you how to solve problems with QL. Presumably you have programming experience, and many aspects of it will look familiar to you. However, if you’ve never worked with a logic programming language before, I’d definitely recommend spending some time going through these, because some aspects may be foreign: quantification, how subclassing works, predicates vs. methods, and characteristic predicates.
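To make a few of those terms less abstract before you open the tutorials, here is a rough analogy in plain Python. It is not QL, and it glosses over a lot of what the real language does; the facts and function names are made up for illustration. The idea it tries to convey is that a predicate is a relation over a set of facts, quantification asks whether some value exists that satisfies it, and a class with a characteristic predicate is really just a subset of values.

```python
# Python analogy for a few QL ideas. This is NOT QL syntax, and every
# name below is hypothetical, chosen only for illustration.

# A tiny "database" of facts: each tuple means (child, parent).
PARENT_OF = {
    ("alice", "bob"),
    ("bob", "carol"),
    ("dave", "carol"),
}

# The universe of values we can range over.
PEOPLE = {name for pair in PARENT_OF for name in pair}


def parent_of(child, parent):
    """A predicate: holds when the tuple is in the relation."""
    return (child, parent) in PARENT_OF


def is_parent(person):
    """Quantification: 'there exists a c such that parent_of(c, person)'."""
    return any(parent_of(c, person) for c in PEOPLE)


def parents():
    """Roughly a class with a characteristic predicate:
    the subset of PEOPLE for which is_parent holds."""
    return {p for p in PEOPLE if is_parent(p)}


def grandparents():
    """Roughly a subclass: it narrows the parent set further,
    keeping only parents who have a child that is also a parent."""
    return {
        g for g in parents()
        if any(parent_of(c, g) and is_parent(c) for c in PEOPLE)
    }


if __name__ == "__main__":
    print(sorted(parents()))
    print(sorted(grandparents()))
```

The thing to notice is that nothing here describes how to walk the data step by step. Each function describes a set of values that satisfy a condition, which is much closer to how a QL query reads than an imperative loop is.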

Something to call out is that these tutorials were originally written to leverage LGTM.com, a now defunct service that was used for running queries. When I went through them between December 2022 and the beginning of January 2023, they all still referenced LGTM and expected the queries to be run in that environment. My understanding is that they have since been updated with instructions to use either GitHub Codespaces, GitHub’s hosted development environment in the browser, or VS Code. The Codespace template should include everything needed to get started. If you’d like to run locally with VS Code, which I’d recommend because it’s likely what you’ll be doing once you really get going, you can download the template they provide here, which will give you a working environment.

Overall, I thought the tutorial experience was fairly good, though maybe a bit slow, once I was able to figure out how to get a working environment together (the Codespace and template didn’t exist at the time). I do feel like I learned the language better than I would have if I had dived straight into the other resources covered below, and I still reference the tutorial material now that I’ve started to write my own rules.

The one downside of this approach is that it really divorces you from how you will use CodeQL in a real environment. It doesn’t teach you the necessary steps for getting CodeQL up and running with a code base or for doing program analysis. There is documentation for these parts elsewhere, but it can feel a little unproductive while you’re getting the hang of the language in isolation.

Public Workshops

Additionally, GitHub has released a number of public workshops that teach the basics of CodeQL but are more focused on program analysis and vulnerability discovery. You can find the repository here, which includes the workshops from various years of GitHub Universe, the GitHub conference. As of writing, there is one each for 2020, 2021, and 2022. Each one focuses on a different language (C/C++, Java, and Ruby) and a different vulnerability class. The README.md in each directory includes all the necessary information and walks through the steps of the workshop. The workshops for 2020 and 2021 have video recordings available here and here respectively. Aside from the differences in the workshops themselves, the general content of the videos is fairly equivalent, so if you’d like to watch them you can get away with just one. In addition, there is another workshop available here that covers finding a SQL injection vulnerability in a Java application.

In general, I think going through all the workshops is worthwhile. They teach you the approach to query writing and expose you to a variety of the languages that CodeQL supports for analysis, which is good because the standard library differs between them. While the general approach is largely the same, the queries are likely to look different due to the way each language is modeled, its implementation in QL, and differences in the languages themselves. At times it definitely feels like you’re being dragged along without knowing how the author knew to take a specific step, but I promise that once you’ve done a few and started to write your own queries, it will feel a lot more natural and that feeling will go away. At least it did for me, even a couple of weeks in.

Staging Workshops

In addition to the public workshops that are available, this staging workshops repository also exists. The README.md describes it as a staging area for “good enough” content that has been used. It seems to be a bit more skill focused but has a little more language coverage as well. I haven’t dug through this too much because by the time I got through the tutorials and workshops I felt that I was competent enough to start diving into my own projects. It seems like it might be a good starting point for picking up a new language though.

CTFs

If you’re looking for some practice that’s a little more hands-on and a little less guided than the workshops, there are a number of CTFs that the GitHub Security Lab runs. These are “competitions” where you’ll be given a target and walked through, with less hand-holding, the discovery of a vulnerability and how to write the query for it. They are also eligible for prizes if you complete them while the CTF is running. I’m not sure what the cadence is like, but there are a number from the past couple of years. At the time of writing, none are active.

Blogs

This covers most of the introductory content designed to teach you the language and tooling. What the previous options don’t show you much of is real-world scenarios and people using CodeQL outside of a tutorial experience. There are two blogs: the original GitHub Security Lab blog, which is no longer updated, and the GitHub blog with the “github-security-lab” tag, which is where all the new content is. These blog posts are useful for seeing how others use CodeQL to identify bugs in real products and libraries, and they give you insight into how to approach a code base.

GitHub Security Lab Slack

Lastly, the GitHub Security Lab has a Slack that’s worth joining. It’s a good place to learn about new features, talk with other people learning or using CodeQL, and ask questions if you run into any issues. It’s not the most active Slack out there but I’ve had pretty good experiences getting responses from people when I’ve had questions.

Where things could be better

As I said in the very first post introducing CodeQL, it requires some setup before you can run queries on a code base. The language is designed to run queries against a pre-generated database, not the code itself. This is covered in the documentation once you start digging deeper, and overall it is not so complicated to figure out, but there is a distinct lack of end-to-end instruction showing how to go from nothing to a fully set up, brand new code base.
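For a sense of what that setup looks like, here is a minimal sketch of the two CodeQL CLI steps involved: creating a database from source, then analyzing it with a query. It only builds the command lines rather than running them, so you can inspect them first. The paths, the Maven build command, and the query file name are placeholders I made up, and it assumes the codeql CLI is installed and on your PATH.

```python
import shlex


def create_database_cmd(db_dir, language, source_root, build_command=None):
    """Build the 'codeql database create' command line.

    Compiled languages generally need a build command so the extractor can
    observe the build; for interpreted languages it can usually be omitted.
    """
    cmd = [
        "codeql", "database", "create", db_dir,
        f"--language={language}",
        f"--source-root={source_root}",
    ]
    if build_command:
        cmd.append(f"--command={build_command}")
    return cmd


def analyze_database_cmd(db_dir, query, output_file):
    """Build the 'codeql database analyze' command line, writing SARIF."""
    return [
        "codeql", "database", "analyze", db_dir, query,
        "--format=sarif-latest",
        f"--output={output_file}",
    ]


if __name__ == "__main__":
    # All of these values are hypothetical placeholders.
    steps = [
        create_database_cmd("my-db", "java", "./my-project", "mvn clean package"),
        analyze_database_cmd("my-db", "./queries/MyQuery.ql", "results.sarif"),
    ]
    for step in steps:
        # Print a copy-pasteable shell command. To execute instead, pass the
        # list to subprocess.run(step, check=True).
        print(shlex.join(step))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems when a build command contains spaces, which is common.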

I think this is partly done to make things easier, letting the person learning focus on the language and tooling rather than on aspects that may not seem as important, and partly to build a more integrated ecosystem (for better or worse). The result is that you either depend on databases and projects that already exist or have to go back later and figure this out on your own.

In general, the availability of databases is hand-wavy and, I think, glossed over. This remains one of the real issues for people attempting to start looking for bugs in arbitrary projects. GitHub provides a significant number of databases across all supported languages. This is nice and convenient, but if you want to target a project that isn’t hosted on GitHub or isn’t one that GitHub is already building databases for, you’re on your own. If you want to target a lot of code bases that aren’t already built, you need to figure out how to do this at scale, which isn’t trivial.

My last complaint with the handling of databases is that, as far as I can tell, even if a database exists for the code base you’re interested in, there’s no way to pull one for a specific commit, release, branch, or version. It’s convenient that you can easily grab something if it exists, but if you want to target a specific version you’re on your own.
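If you do need a database for a specific version, the workaround is to build it yourself from a checkout of that version. A rough sketch of the steps follows: clone, check out the ref, then create the database from that working tree. As before, it only constructs the commands. The repository URL, ref, and directory layout are hypothetical, and it assumes both git and the codeql CLI are installed.

```python
import shlex


def commands_for_ref(repo_url, ref, workdir, language, build_command=None):
    """Return the commands to build a CodeQL database pinned to one git ref.

    'ref' can be a commit hash, tag, or branch name. The directory layout
    under 'workdir' is an arbitrary choice made for this sketch.
    """
    src = f"{workdir}/src"
    # Slashes in branch names would create nested directories, so flatten them.
    db = f"{workdir}/db-{ref.replace('/', '_')}"

    create = [
        "codeql", "database", "create", db,
        f"--language={language}",
        f"--source-root={src}",
    ]
    if build_command:
        create.append(f"--command={build_command}")

    return [
        ["git", "clone", repo_url, src],
        ["git", "-C", src, "checkout", ref],
        create,
    ]


if __name__ == "__main__":
    # Hypothetical project and tag, used only to show the shape of the output.
    for cmd in commands_for_ref(
        "https://example.com/org/project.git", "v1.2.3", "work", "python"
    ):
        print(shlex.join(cmd))
```

Looping this over a list of repositories and refs is the obvious starting point for doing it at scale, though the hard part in practice is working out a build command that succeeds for each project.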