
SummerAcademy on Debugging - 2.1.7 Static Code Analysis

Monday, August 31st, 2020

See the companion video here

Side Note: Static Code Analysis is our home turf. Forgive us if we are a bit emotional here :-)

Side Note: We at DeepCode are proud to sponsor Analysis Tools for Devs on Github. For us, it is the most complete list of available tools. Make sure to have a look.

You can argue that linters and compilers already fall into the Static Code Analysis realm (and you would be right). But there is much more to it than those tools.

Side Note: Generally, developers use the term Static Code Analysis. In academia, the term Static Program Analysis is used (because "code" is defined as something entirely different there). So, if you want to dive deep into the topic, make sure to also search for Static Program Analysis.

In fact, any tool that analyzes the source code without needing to run the software counts as static analysis. So, linters qualify, but unit tests do not.

Static analysis can work on several levels of abstraction:

  • Text level: the tool treats the source code as text and searches for simple patterns. For example - a short shoutout to our C friends - most companies ban the use of insecure functions such as strcpy(). A simple text search reveals their uses in the code (see the small sketch after this list for how a text-level check compares to a syntax-level one).
  • Syntax level: a language normally has a very clear definition of its reserved words and syntax (Ruby being an exception - literally, Ruby is whatever Matz's CRuby implementation does). Knowing which syntax to expect enables a tool to find syntactical errors.
  • Grammar level: on top of the syntax, a language has a clear definition of its grammar. By the way, most languages have a context-free grammar, meaning you can use constructions independently of their ‘history’, e.g. you can theoretically nest for loops as deeply as you like - but really, don’t.
  • Data flow: transforming the source code into an intermediate representation such as a tree helps to unveil more secrets. The best-known one is the Abstract Syntax Tree or AST (see the resources to have a play with it), but there are others which provide various advantages (and come with disadvantages). One thing a tool can do is follow the flow of data through your code. This is of major interest (as an example: JavaScript uses shallow copies for complex structures such as arrays or objects, so a typical issue is that props structures in Vue are unintentionally overwritten). Another example is tainted data - data that comes from an external source and is used inside your app in a way that could open the door to a slew of attacks such as SQL injection. Or think about null dereferences.
  • Control flow: again, after transforming the source code into an intermediate representation, a tool can follow the possible flows of control within your app. Interesting questions here are whether a piece of code is ever reached (dead code) or whether your routine ever leaves that loop…
  • Algorithmic level: certain algorithms produce characteristic patterns in the intermediate representations. Tools can look out for complete and optimal implementations.
  • Intention level: this is the golden ticket: the tool understands the intention of the developer and simply points out where intention and implementation fall apart. Obviously, this is easily said but tremendously hard in practice.
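
To make the first levels a bit more tangible, here is a minimal, purely illustrative sketch (nothing like a production engine): a text-level check greps the raw source for a banned call, while a syntax-level check parses the code and only flags real call sites. We use Python and its built-in ast module for convenience; the "banned" function is just an example we picked.

```python
import ast
import re

SOURCE = '''
import pickle

def load(blob):
    return pickle.loads(blob)  # imagine pickle.loads is on our ban list

def describe():
    print("this string merely mentions pickle.loads")
'''

# Text level: a plain regex over the raw source. Cheap, but it also matches
# the comment and the string literal - two false positives.
text_hits = re.findall(r"pickle\.loads", SOURCE)
print("text-level hits:", len(text_hits))

# Syntax level: parse the source into an AST and only flag actual call sites.
tree = ast.parse(SOURCE)
ast_hits = []
for node in ast.walk(tree):
    if isinstance(node, ast.Call):
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "loads"
                and isinstance(func.value, ast.Name) and func.value.id == "pickle"):
            ast_hits.append(node.lineno)
print("AST-level hits at lines:", ast_hits)
```

Real tools obviously go far beyond this, but the difference in precision between the two levels is already visible: three text matches versus one genuine call.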

So how does such a tool work internally?

As mentioned above, transformation into an intermediate representation is key. On top of this, rule solvers such as Souffle are used - think Prolog (or, in most cases, Datalog). The roots of this approach go back to the early days of artificial intelligence ("good old-fashioned AI"): Symbolic AI. There is also sub-symbolic AI, i.e. machine learning. The real power lies in the combination of both; we will see more of this later in this article.
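
To get a feeling for what such a rule solver does, here is a toy sketch in Python (a real Datalog engine like Souffle is vastly more capable): facts describe where data flows and which values are untrusted, and a single rule is applied until a fixpoint is reached to deduce which variables end up tainted. The "program" being analyzed and all its names are made up.

```python
# Facts, as a solver would see them after fact extraction from the code.
# flows_to(a, b): the value of a flows into b.
flows_to = {
    ("request_param", "user_id"),
    ("user_id", "query_string"),
    ("config_value", "timeout"),
}

# tainted(x): x holds data from an untrusted source.
tainted = {"request_param"}

# Rule (in Datalog style):  tainted(Y) :- tainted(X), flows_to(X, Y).
# Apply it repeatedly until nothing new can be deduced (a fixpoint).
changed = True
while changed:
    changed = False
    for src, dst in flows_to:
        if src in tainted and dst not in tainted:
            tainted.add(dst)
            changed = True

print(sorted(tainted))
# -> ['query_string', 'request_param', 'user_id']  (config_value and timeout stay clean)
```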

Recently, we saw a lot of discussion on using natural language processing or NLP. Applications of GPT-3 are making headlines these days. So, it is interesting to explore the two major AI models a bit more.

  • Machine Learning is based on a probabilistic model. Meaning: these models get stronger the more good-quality training data they see. They provide probabilistic answers based on past experience (aka the data) and predict outcomes based on observations. That is both their strength and their limitation.
  • Symbolic AI is based on rules. The system first gathers a view of the world as facts, then applies rules that combine facts to deduce an outcome (deduction). These rules can be provided without any prior examples (i.e. independently of data), but they need to be developed in the first place (induction).
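
A toy contrast, under obviously artificial assumptions: the "machine learning" side below just counts how often a (made-up) code pattern was fixed in historical data and turns that into a probability, while the symbolic side encodes the same knowledge as a hard rule. Neither resembles a real engine; the point is only the probabilistic-versus-deterministic nature of the answers and where each approach runs out of road.

```python
# Made-up "training data": how often a pattern was later fixed in history.
history = {
    "strcpy_call": {"seen": 1000, "fixed": 930},
    "nested_loop": {"seen": 500, "fixed": 15},
}

def ml_style_verdict(pattern: str) -> str:
    """Probabilistic: the answer is only as good as the data we have seen."""
    stats = history.get(pattern)
    if stats is None:
        return "no data, no opinion"
    ratio = stats["fixed"] / stats["seen"]
    return f"suspicious with probability {ratio:.0%}"

def symbolic_verdict(pattern: str) -> str:
    """Rule-based: needs no examples, but someone had to write the rule (induction)."""
    banned = {"strcpy_call"}
    return "violates rule: banned call" if pattern in banned else "ok"

for p in ("strcpy_call", "nested_loop", "gets_call"):
    print(p, "| ML:", ml_style_verdict(p), "| symbolic:", symbolic_verdict(p))
```

Note how the learned model has no opinion on the pattern it never saw, while the rule-based check confidently says "ok" for it because nobody wrote that rule yet - two very different failure modes.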

2.1.7.1 DeepCode

DeepCode works a bit differently than other tools on the market. As explained above, symbolic AI needs its rules to be built. At DeepCode, we use an augmented AI process. Let us explain a bit:

As input, we use billions of changes checked in to hundreds of thousands of open source repos, as well as written documentation on APIs where possible - not all documentation has the quality we need, but that is a story for another day. Having access to these open source repos today is one thing; DeepCode also had to overcome the challenge of unlocking this massive data lake.

Based on this input, we apply a slew of machine learning algorithms to search for typical bug patterns. Induction happens through a human-machine process (the machine proposes, the human engineer steers and refines). This builds a body of rules, with the twist that the rules live on a higher abstraction level. A good example is tainted data, as mentioned above: the list of possible taint source functions is built and maintained automatically. The difference from the past lies in the induction phase: where we classically had a linear scale (one engineer writes one rule at a time), we now see an exponential scale. Deduction then works by generating facts from the given code and applying the body of rules to them.
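
The deduction step can be pictured roughly like this: generate facts from the given code (which functions are called, and where their results end up) and then apply the automatically maintained taint-source list to those facts. The sketch below is hand-waving in Python; the source and sink names are hypothetical, and DeepCode's actual fact extraction and rules are far richer.

```python
import ast

# Lists like these are what the induction phase builds and maintains
# automatically (the names here are invented for the example).
TAINT_SOURCES = {"read_request_param"}
SINKS = {"run_sql"}

SNIPPET = '''
user = read_request_param("name")
greeting = "hello"
run_sql("SELECT * FROM users WHERE name = '" + user + "'")
run_sql("SELECT 1")
'''

# Step 1: generate facts from the code.
assigned_from = {}   # variable -> function whose result it was assigned from
calls = []           # (called function, variables used in the call, line number)
for node in ast.walk(ast.parse(SNIPPET)):
    if (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)):
        for target in node.targets:
            if isinstance(target, ast.Name):
                assigned_from[target.id] = node.value.func.id
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        used = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)} - {node.func.id}
        calls.append((node.func.id, used, node.lineno))

# Step 2: apply the rule "data from a taint source reaches a sink".
for callee, used_vars, line in calls:
    if callee in SINKS:
        for var in used_vars:
            if assigned_from.get(var) in TAINT_SOURCES:
                print(f"line {line}: possible injection - {var!r} comes from "
                      f"{assigned_from[var]} and reaches {callee}")
```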

When you give DeepCode a try, you will see that the engine explains in great detail why a rule was triggered. Here, again, come the benefits of Symbolic AI: we have all seen the discussion in the media about how hard it is to explain the reasoning behind machine learning decisions. For Symbolic AI, this is easy.

As a result of the machine learning phase, DeepCode can also show example fixes of similar bugs from open source repos, which help you understand the issue at hand and possible solutions.

Give DeepCode a try and see the difference for yourself. We have prepared some demo repos that you can easily study.

2.1.7.2 What SCA should I use?

OK, obviously the more suggestions and results, the better, right? No, not really. Every tool will provide its feedback, and someone on the team needs to act on it. While some projects have the resources (and the need) to address each and every suggestion, the reality for most projects is rather the opposite: we need to prioritize and focus. We also need to prevent developers from being overwhelmed and simply disregarding the feedback (the Boy Who Cried Wolf effect). In summary, we need to balance getting all feedback early in the process against grinding the process to a halt. So here are some best practices:

  • Safety and security first!! Any suggestions in this field should be addressed.
  • Try to have tools like DeepCode integrated into the IDE. They provide real-time actionable feedback. But keep it to a minimum and make sure it integrates flawlessly.
  • Add more tools to the check-in and CICD pipeline. But keep a clear view of the processing time: feedback needs to come in a practical timeframe (see the sketch after this list for one way to gate a build).
  • Final check for Release Candidates should include even more tools (and be prepared to invest some time to work through the suggestions).
  • Make sure to cover all languages you are using (and maybe even some frameworks you are using). DeepCode supports several languages like JavaScript, TypeScript, Java, Python, and even C/C++. And DeepCode comes with special rules trained on Vue and React repositories.
  • Check if it works with your CICD and - even better - integrates into your IDE.
  • Make sure to either use a software-as-a-service (SaaS) approach (where the service provider keeps the rule engine up to date) or - when you use local scan engines - update them regularly. In that case, also log the version of the scan engine used.
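
One way to follow the "security first, but keep the pipeline fast" advice is a small gate in the CICD step: run the scanner with a time budget, fail the build only on security findings, and merely report the rest. The script below is a generic, hypothetical sketch in Python - scan-tool and the report format are invented placeholders, not the CLI or output of DeepCode or any other real product.

```python
import json
import subprocess
import sys

TIME_BUDGET_SECONDS = 300  # feedback has to arrive in a practical timeframe
SCAN_COMMAND = ["scan-tool", "--output", "report.json"]  # hypothetical scanner CLI

try:
    subprocess.run(SCAN_COMMAND, timeout=TIME_BUDGET_SECONDS)
except subprocess.TimeoutExpired:
    print("scan exceeded its time budget - investigate, but do not block the team")
    sys.exit(0)

with open("report.json") as fh:
    findings = json.load(fh)  # assumed format: a list of {"severity", "message"}

for finding in findings:
    print(f'[{finding["severity"]}] {finding["message"]}')

security_findings = [f for f in findings if f["severity"] == "security"]
if security_findings:
    print(f"{len(security_findings)} security finding(s) - failing the build")
    sys.exit(1)
print("no blocking findings - the build may proceed")
```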

Key Take-Aways

Make sure to have static code analysis as part of your IDE and your CICD pipeline as it prevents bugs from happening. Also, make sure to run legacy code regularly through the latest version of your scanning engines.

Resources

  • ECMAScript standard ECMA 262
  • PMD ECMAScript
  • Analysis Tools for Devs on Github