Learning Semgrep // Going Beyond Grep

Edit: my understanding of the tweet about deep/pro was incorrect; it allows interprocedural and interfile analysis.

I’d like to say right off the bat that this is likely going to be significantly shorter than the post about learning CodeQL. There are just significantly fewer choices. However, I don’t think that’s necessarily a bad thing. To start, I personally think that Semgrep is easier to learn. I think this is one of its major strengths and one of the reasons so many companies are choosing to adopt it over other tools. With CodeQL you need to learn a whole language that is likely a new paradigm from anything you’ve worked with in the past, some tooling to really get going, and how to work within the ecosystem. You really have none of this with Semgrep.

Semgrep Tutorials

Let’s talk about resources. As of right now, there really is only one that I’m aware of, the Semgrep Tutorials. These are a series of walkthroughs that teach the fundamentals of how to write rules. What is a rule? Like I mentioned, there’s not really a language you need to learn. Patterns are the most basic component. These are more or less snippets of code in the language the rule targets, with some additional syntax that Semgrep introduces (such as metavariables and the ... ellipsis operator) to simplify and support matching. Patterns can be combined with logical operators to build complex rules that match more than a single pattern. This is all done in a YAML document, which also includes some additional metadata.
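To make the description of a rule concrete, here’s a minimal sketch of what one might look like. The rule id, message, and metadata are made up for illustration and don’t come from the tutorials. The YAML is wrapped in a small Python helper only so it can be written to disk. It combines two patterns with an OR (pattern-either) and then excludes a case with pattern-not.

```python
from pathlib import Path

# Hypothetical rule for illustration only; the id, message, and metadata are invented.
RULE_YAML = """\
rules:
  - id: example-subprocess-shell-true
    languages: [python]
    severity: WARNING
    message: subprocess call with shell=True and a non-literal command
    metadata:
      category: security
    patterns:
      # OR: match either call form. $CMD is a metavariable, ... skips other arguments.
      - pattern-either:
          - pattern: subprocess.run($CMD, ..., shell=True, ...)
          - pattern: subprocess.call($CMD, ..., shell=True, ...)
      # AND NOT: ignore calls whose command is a plain string literal.
      - pattern-not: subprocess.$FUNC("...", ...)
"""


def write_rule(directory: str) -> Path:
    """Write the example rule into `directory` and return the file path."""
    path = Path(directory) / "example-rule.yaml"
    path.write_text(RULE_YAML)
    return path


if __name__ == "__main__":
    print(write_rule("."))
```

Assuming Semgrep is installed, the resulting file could then be passed to the CLI with something like `semgrep --config example-rule.yaml path/to/code`.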

Documentation

R2C provides the full documentation for Semgrep. This is more of a reference, something you can use to look up how a specific part works, rather than a learning tool. One interesting area that is definitely worth checking out is the experiments section. Shout out to d0nutptr for pointing this out to me. It covers experimental features that are available but not considered GA. These are sort of a “your mileage may vary” situation: some features may not work in all languages, and those that do may have bugs. It’s a good way to see what’s changing and to start leveraging new features early. If you’re hitting the limits of Semgrep, keeping an eye here may help get you past those sticking points.

Playground

R2C provides the Playground, where you can write rules and test them without having to set up a local environment. You can type example code directly into the page and iterate on your rules in real time. I personally haven’t used this a ton because most of the times I’ve used Semgrep it’s been against a real code base while I was exploring or automating detection of a specific pattern. Where this is probably a really good option is if you need to test out a feature you haven’t used before, have a really small example you need to build a rule for, or want to demo something.

Semgrep Slack

R2C runs a Slack workspace for Semgrep. You can join by going here and signing up. There’s relatively low traffic and I haven’t attempted to get help here personally, though that’s mostly because I haven’t run into situations where I’ve needed it. The documentation has been good enough that I’ve been able to help myself without going to others. The conversations I’ve seen have been a little more focused on tech support and integrating into CI builds than on new rule development, vulnerability research, or strategies for either. Those topics don’t seem like they would be unwelcome; there just aren’t many people in there focused on them. Something I am curious about is whether there is a community where this is happening. There’s definitely some of it in the GitHub Security Lab Slack, though having somewhere a little more vendor neutral would be cool.

Where things could improve

Bugs in tutorials

While the tutorials were generally pretty approachable and pleasant to work through, I did run into a couple of cases where they wouldn’t accept my answers as correct despite matching what was intended. It wasn’t always clear if I was doing something wrong or if the tutorial was broken. This led to spending a bunch of time trying out slight variations and eventually just moving on. I filed an issue for one of them, which has since been fixed, and the others all seem to work now with my solutions.

Lack of depth of tutorials

Overall, while the tutorials were really good for getting up and going, they definitely left a bit to be desired. They don’t cover any of the data flow/taint tracking support that Semgrep offers, which is likely going to be a commonly used feature set if you’re attempting to identify security vulnerabilities. This is all covered within the documentation linked and is relatively straightforward to use, but it would have been nice to have some more depth and coverage in the more guided introductions.
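For a sense of what the taint tracking side looks like, here’s a rough sketch of a taint-mode rule. The rule id and message are invented, and the specific source, sink, and sanitizer patterns are just illustrative choices rather than anything taken from the tutorials or the registry. It’s kept as a string in Python so it can be printed or saved.

```python
# Hypothetical taint-mode rule; the id, message, and chosen patterns are assumptions.
TAINT_RULE_YAML = """\
rules:
  - id: example-command-injection
    mode: taint
    languages: [python]
    severity: ERROR
    message: request data reaches os.system without sanitization
    # Where untrusted data enters.
    pattern-sources:
      - pattern: flask.request.args.get(...)
    # Where untrusted data must not arrive.
    pattern-sinks:
      - pattern: os.system(...)
    # Calls that are treated as cleaning the data.
    pattern-sanitizers:
      - pattern: shlex.quote(...)
"""

if __name__ == "__main__":
    print(TAINT_RULE_YAML)
```

Instead of matching a single snippet, a rule like this asks Semgrep to report when a value from a source reaches a sink without passing through a sanitizer.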

Access to deep/pro

Semgrep comes in two flavors: “normal” and pro (formerly deep?). The standard open source community version performs only intraprocedural analysis, meaning that it analyzes within a single function. R2C introduced pro, which adds interprocedural analysis both within a single file (intrafile) and across files (interfile), meaning that it can follow code across functions in the same file or across functions in different files.
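The difference is easier to see on a toy target. In the sketch below, source() and sink() are hypothetical stand-ins for attacker-controlled input and a dangerous call; they are not real APIs. The first flow stays inside one function, which is the kind an intraprocedural engine would be expected to follow. In the second, the source and the sink live in different functions, so connecting them requires analysis across the function boundary.

```python
executed = []  # records what reached the sink, so the flows can be observed


def source() -> str:
    """Hypothetical stand-in for attacker-controlled input."""
    return "1 OR 1=1"


def sink(query: str) -> None:
    """Hypothetical stand-in for a dangerous call, e.g. a raw SQL execute."""
    executed.append(query)


def same_function() -> None:
    # Source and sink are both visible inside this one function body.
    user_id = source()
    sink("SELECT * FROM users WHERE id = " + user_id)


def run_query(query: str) -> None:
    # Looked at alone, this function has a sink but no known source.
    sink(query)


def across_functions() -> None:
    # Looked at alone, this function has a source but no sink.
    run_query("SELECT * FROM users WHERE id = " + source())


if __name__ == "__main__":
    same_function()
    across_functions()
    print(executed)
```

If run_query lived in a separate module, following the second flow would additionally require interfile analysis.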

Intraprocedural only, our proprietary offering would be intrafile interprocedural
— Lewis Ardern (@LewisArdern) January 31, 2023

This is a paid-only offering which is included in the Teams plan (I believe) and above. I’m unclear if this is out of beta yet and generally available. It’s something that I spoke to a lot of people about before I could get access. I had great conversations with R2C, but something that I kept hearing was that they wanted to have a partner to test/evaluate with rather than just turn it on and let someone run wild with it. I wasn’t really set up to do this, which led to a lot of back and forth. Eventually that seemed to have been relaxed and I was able to check it out and just sort of explore with it.

I don’t want to get too deep (pun intended) here, as I’m not sure if this is a generally available product yet, but it is something that’s available and exciting. Getting access wasn’t the most convenient, and there were a number of times where I lost access after upgrading versions, though that seems to be resolved now. The people I’ve spoken to have also been extremely responsive and helpful whenever I’ve run into issues which, if it’s not a rock solid mature product yet, is at least the best you can hope for.

Closing

Overall, getting up and running with Semgrep was a really positive experience and I was able to go from zero to finding bugs in a real target in almost no time at all. I definitely had to re-read some sections of the documentation to understand the rule syntax for some of the more complex directives and features, but it is fairly well written and Semgrep is not that complex. While there are definitely not a ton of options available for learning, I don’t feel like it was overly hard or time intensive to fill in the areas that the tutorials didn’t cover, especially as someone who was dedicated and had a goal in mind. A big shoutout to everyone at R2C who has reached out at various points to offer suggestions, request feedback, and discuss the product/use cases. It’s been a great example of how to interact with your users and an incredibly positive experience.