Introduction to Semgrep

The goal of this post is to give an introduction into how to create rules using Semgrep as well as getting a broad overview of what the tool is. The structure here will be to first explore the tool, then go into examples on how to use it with the User Interface(UI) to understand the logic, then finally go into how to actually code the rules on a more complex level.

What is Semgrep
Semgrep Basics
Introductory Examples
Coding Rules in Semgrep
Running Semgrep in the Commandline
Advanced Semgrep

What is Semgrep

“Semgrep is a customizable, lightweight, static analysis tool for finding bugs/enforcing code standards designed for security consultants and hackers”

It is useful because:

It is open source (free)
Supports many languages (go, python, java, javascript, php, and more)
There are about 1000 predefined rules you can use “out of the box”
Does NOT require buildable source code
There is no Domain Specific Language (DSL) to learn, you can make custom patterns to match the code you are targeting
Easy to use for any developer, does not require expertise

Assumptions Made by the Developers

Speed is important, scans should be completed in minutes, not hours/days
It’s better to have false negatives than false positives
Ease of use is important
- Prioritize first time users (as they are the average user)
- Should be accessible to developers, not just security professionals
Enforcing security standards is MORE important than finding every bug

Design Decisions

Focuses on single file/localized analysis
- Better for speed and enforcing security protocols
Has rules that look like source code
- Makes it easier for first time users

Other Features

Compatible with Regex
Confirming proper access given to files
- Map out application routes
Check for weak RSA keys
Check for calls on runtime objects
Match JSON
“Generic” mode to analyze code in languages not directly supported or any text
Autofix/suggest option to correct common mistakes
Software as a Service (SaaS) option when used on large scale (not open source)
- Can interact with GitHub, Jira, Slack, and more

The Basics

There are two operators that add functionality to Semgrep on top of whatever programming language you choose to use:

Ellipsis ... - the ellipsis operator is a filler for an unknown space with zero or more arguments regardless of what they are
Metavariables $VARIABLE - these are like capture groups, a fill in for a unknown variable you want to match prior to its a assignment of a value

To get a good introduction on these two operators, check out this site for a tutorial into Semgrep

Basic Examples

To test some of the functionalities of Semgrep, visit this site to play with the Semgrep Playground and follow along with the demos below:

Taking a look at the image below we can see the Semgrep Playground User Interface. The basic layout will have the top left dropdown allowing you to choose your language, the Rule Bar to enter your rule, the Test Code block to see the code that the rule will be tested on and any results from the query will show up on the right hand side Semgrep Playground Layout

Identifying a Function Call

Click Here to follow along with this example:

exec(...) will find all instances of the function exec() in your code. But it is smart, as you can see in the image below, the query will IGNORE comments and string literals while also IDENTIFYING the function even with varying syntax and aliases of the function call like safe_function

Cross Site Scripting (XSS)

Following this example, we can see the code is vulnerable to a XSS attack as the user inputted query is not being filtered for any potentially malicious input.

The benefit of Semgrep is we can search for whatever we want in out chosen language. So to start making a rule, we can just copy the source code and then begin to abstract it
Let’s replace /foo with ..., as we want to search for anytime a general string is used regardless of its value
We can add an $ in front of an all caps name to represent a Metavariable. In this case, we can replace name in req.query.name with $PARAM to represent an unknown parameter
- Similarly we can replace the variable resp with $VAR to represent an unknown variable being passed
This brings us to the rule of:
```
app.get('...', function (req, res) {
    var $VAR = req.query.$PARAM;

    res.write($VAR);
});
```
Which we can see, correctly identifies one of our XSS vulnerabilities, but not both

XSS Rule V1

We can see that the main differences between the two are the extra lines of code, and the extra arguments in functions. To take care of that, we can add some strategic ellipsis and arrows:
- What these arrow brackets are doing is saying that we know our Metavariable exists somewhere in the expression, but we don’t know exactly where it is

app.get('...', function (req, res) {
    var $VAR = req.query.$PARAM;
    ...
    res.write(<... $VAR ...>);
});

XSS Rule V2

Business Logic

Business Logic is when the code itself seems fine, but how the code works is fundamentally wrong (such as the order in which methods are called). Lets take a look at this example

Sometimes we have a need for conditionals in our rules, for example if we want to check for the order of, or presence of certain items. To start off, lets copy the first method (lines 9-13)
We don’t care about the return type, so lets replace void with $RETTYPE as our Metavariable
The function type and arguments don’t matter either so we can replace base_ok with $FUNC and Transaction t with ...
The main thing we are looking at here is the make_transaction() function, so we can add ellipsis above and below that statement as well as replacing the function’s argument leaving us with this:
```
public $RETTYPE $FUNC(...) {
...
    make_transaction(...);
...
}
```
If we run that, we get this result which locates everything that use the make_transaction() function
This is good but doesn’t actually check if they business logic is right, we need to add a conditional by clicking the plus button to the right of the rule block
We can see a dropdown of options, here we want and is not, and then copy the top rule into the bottom box
Above the make_transaction() line in our new rule section, lets add an ellipsis followed by verify_transaction(...);. What this does is make sure the verify function is never called prior to the make transaction function in any place the make transaction function exists, as we only want the transaction to be verified if it first has been made
```
public $RETTYPE $FUNC(...) {
...
    verify_transaction(...);
...
    make_transaction(...);
...
}
```
When we run, we can see that it works and returns every time the verify transaction function is called before the make transaction function, with one exception
In the very last method of the sample code, we see that it should be flagged as a vulnerability as the transaction that is verified is not the same one that is made
To fix this, we can replace the function ellipses in the and is not rule with a Metavariable $T

public $RETTYPE $FUNC(...) {
  ...
    verify_transaction($T);
  ...
    make_transaction($T);
  ...
}

And when we run this version we see it is able to catch all of the exceptions: Business Logic V3

Coding in YAML

What we were looking at before is the Semgrep UI, but as we make more complex rules it is easier to do so using Semgrep’s natural syntax language, YAML You can read up on the specifics of YAML here, but the main outline of a Semgrep command is as follows:

rules:
    - id: name-of-rule
        patterns:
        - pattern: $PATTERN_A
        message: |
            Write your message here
        languages: [languageName]
        severity: SEVERITYLEVEL

To break this down:

id - the name of the rule, this should be alphanumeric with dashes replacing spaces
patterns - this is where your specific rules are, more information below but make sure to have proper syntax
message - this is the message that will be displayed when something is flagged. Often it is meant to help developers by providing context to what needs to be changed
- Note the pipeline (|) in YAML is used to have multiple lines be compiled into single lines
language - a place where you can put the language that this rule applies to
severity - what kind of issue this code is, it can be either INFO, WARNING, or ERROR

Coded Patterns

There are five main pattern commands that can be used in combination with one another. They are very similar to boolean expressions. Here is a brief description of each one followed by an example:

pattern = search for “this” AND “that”

rules:
    - id: default-pattern-example
        patterns:
        - pattern: print(...)

This rule will search for any instance of the print() function being called, regardless of its arguments

pattern-either = search for “this” OR “that”

rules:
    - id: either-pattern-example
        patterns:
        - pattern-either:
            - pattern: if(...)...
            - pattern: print(...)

This rule will return any instances of an if() statement call OR a print() statement call

pattern-not = filters out patterns you do not want to match, ie. matches “this” AND NOT “that”

rules:
    - id: not-pattern-example
        patterns:
        - pattern: if (...)...
        - pattern-not: if ("..." == $VAR)...

This rule finds all instances of an if statement, but will filter out any instance where the first value is a string literal being compared to some other value

pattern-inside = allows you to search for patterns inside the pattern specified by pattern-inside. This means this command must be FIRST, and the following patterns will only search based on the set received from this command call

rules:
    - id: inside-pattern-example
        patterns:
        - pattern-inside: |
            func $FUNC(..., $VAR, ...) {
                ...
            }
        - pattern: print($VAR)

This rule will first narrow down its search to only functions that have at least one argument, and in those search for an instance of a print() statement that uses the same function argument

pattern-not-inside = filters out any matches inside the range defined by the pattern. This means this command can only come AFTER a previously defined pattern

rules:
    - id: not-inside-pattern-example
        patterns:
        - pattern: print(...)
        - pattern-not-inside: |
            if (...) ...
            ...

This rule will find all instances of print(), but then narrow down those results to only print() calls that occur BEFORE an if() statement

Running Locally

Installing

Run brew install semgrep to install semgrep locally on a Mac (you must have Homebrew already installed)
- For more details on installation on both Macs and other devices, look here

Environment

As Semgrep is now installed locally, you will be able to use semgrep as a command. Make sure that before running the command, your YAML rule file is in your current directory, which should be in the same directory that contains the directory/file you want to test. The general format is as follows:

semgrep --config ruleFile.yml --timeout 0 /directoryToTest

--config ruleFile.yml - this tells Semgrep which YAML rule file you want to use in your test
--timeout 0 - while this flag is technically optional, what it is doing is giving the time in seconds for how long Semgrep will wait for a single rule to run before it times out. When set to 0, it will never timeout, without the flag, the default is set to 30 seconds
/directoryToTest - this is the directory (or file) that you wish to run your rule(s) against

Shared Rulesets

While we know we have the ability to create our own rules, Semgrep has over 1000 pre-existing rules that have been vetted by the company for precision. Accessing these rulesets is pretty simple:

Find the ruleset you want to use on the Semgrep explore page
- Each ruleset will list all of it’s rules along with examples of how they work if you wish to explore them more in depth
Once you’ve selected your ruleset, in the top right corner there will be a section titled “Test and Run Locally” with a command you can copy as seen below
With this command copied, we can add our timeout 0 /directoryToTest to the end and the ruleset will be run against the selected directory
- Example: semgrep --config "p/r2c-security-audit" --timeout 0 /directoryToTest - this will run the r2c-security-audit ruleset on whatever directory you indicate

Advanced

While the instructions above will be able to guide you through ~80% of what you’ll need to know for Semgrep, there are several other useful rule syntaxes that are good to know. I will provide a basic description of each of them followed by an example showing how to use it. For a more in-depth explanation of each rule, check out the documentation

Regex Patterns

The five patterns we looked at before were all based on bitwise operators of the code. The following patterns are all based on bitwise operators of the regular expressions that can be found in the code, allowing for more specificity. While there are many different types of Regex syntax, Semgrep uses Perl Compatible Regular Expressions(PCRE):

pattern-regex = searches for a compatible regular expression (works best for Python)

rules:
    - id: regex-pattern-example
        patterns:
        - pattern-inside: print(...)
        - pattern-regex: \d

This rule will find all instances of print(), but then narrow down those results to only print() calls that include a decimal number in them

pattern-not-regex = will filter out compatible regular expressions (works best for Python)

rules:
    - id: not-regex-pattern-example
        patterns:
        - pattern-inside: print(...)
        - pattern-not-regex: \d

This rule will find all instances of print(), but then narrow down those results to only print() calls that DO NOT include a decimal number in them

Metavariable Rules

metavariable-regex = this works similar to pattern-regex, but instead of matching any instance of the regex in the code, this will only match regex patterns found in a given Metavariable

rules:
    - id: metavariable-regex-example
        patterns:
        - pattern: print($VAR)
        - metavariable-regex:
            metavariable: $VAR
            regex: (hello|hi|welcome)