See the companion video here
Side Note: Static Code Analysis is our home turf. Forgive us if we get a bit emotional here :-)
Side Note: We at DeepCode are proud sponsors of Analysis Tools for Devs on GitHub. For us, it is the most complete list of available tools. Make sure to have a look.
You can argue that linters and compilers already fall into the Static Code Analysis realm (and you are right). But there is much more to it than those tools.
Side Note: Developers generally use the term Static Code Analysis. In academia, the term is Static Program Analysis (because "code" is defined as something entirely different there). So, if you want to dive deep into the topic, make sure to also search for Static Program Analysis.
In fact, any tool that analyzes the source code without needing to run the software counts as static analysis. So, linters qualify, but unit tests do not.
Static analysis can work on several levels of abstraction:
At the lowest level, plain text: think of known insecure functions such as strcpy(). A simple text search already unveils the uses of those functions in the code.
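The text-search level can be sketched in a few lines. This is a minimal, hypothetical scanner (the deny-list and the `.c` glob are illustrative choices, not a real tool's configuration):

```python
import re
from pathlib import Path

# Hypothetical deny-list of C functions that are frequently misused.
DANGEROUS = ("strcpy", "strcat", "sprintf", "gets")
PATTERN = re.compile(r"\b(" + "|".join(DANGEROUS) + r")\s*\(")

def scan(path: str):
    """Yield (file, line number, function) for every textual match in *.c files."""
    for file in Path(path).rglob("*.c"):
        for lineno, line in enumerate(file.read_text(errors="ignore").splitlines(), 1):
            for match in PATTERN.finditer(line):
                yield (str(file), lineno, match.group(1))
```

Note that a pure text search cannot tell a real call from a comment or a string literal - that is exactly why the higher abstraction levels below exist.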
At a more semantic level, a tool can reason about data structures and data flow (for example, detecting when props structures in Vue are unintentionally overwritten). Another example is tainted data - data that comes from an external source and is used inside your app in a way that could open the door to a slew of attacks such as SQL injection. Or think about null dereference.
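To make the tainted-data case concrete, here is a hypothetical sketch in Python: `request_param` stands in for any external input. A taint analysis tracks such input from its source to a sensitive sink and would flag the first function, where it reaches the SQL string unsanitized, while the parameterized version is safe.

```python
import sqlite3

def find_user_unsafe(conn, request_param):
    # TAINTED: external input flows directly into the SQL string.
    # A taint analysis follows request_param from source to sink and flags this.
    return conn.execute(
        "SELECT * FROM users WHERE name = '" + request_param + "'"
    ).fetchall()

def find_user_safe(conn, request_param):
    # SAFE: the value is passed as a bound parameter; the driver escapes it.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (request_param,)
    ).fetchall()
```

With a payload like `x' OR '1'='1`, the unsafe variant returns every row in the table; the safe variant simply finds no user with that literal name.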
So how does such a tool work internally?
As mentioned above, transformation into an intermediate representation is key. On top of this representation, rule solvers such as Souffle are used. Think Prolog (or, in most cases, Datalog). The roots of this approach go back to the early days of artificial intelligence - so-called good old-fashioned AI, or Symbolic AI. There is also Sub-Symbolic AI, better known as machine learning. The real power lies in the combination of both; we will see more of this later in this article.
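The core idea behind a Datalog solver like Souffle can be sketched as a toy fixpoint computation (a hypothetical miniature, not how Souffle is implemented): given `tainted(X)` facts and `flows(X, Y)` edges, the rule `tainted(Y) :- tainted(X), flows(X, Y).` is applied until nothing new can be derived.

```python
def solve(tainted, flows):
    """Naive fixpoint evaluation of: tainted(Y) :- tainted(X), flows(X, Y)."""
    derived = set(tainted)
    changed = True
    while changed:
        changed = False
        for x, y in flows:
            # If x is already known tainted and taint flows from x to y,
            # derive the new fact tainted(y).
            if x in derived and y not in derived:
                derived.add(y)
                changed = True
    return derived
```

Real solvers use far more efficient evaluation strategies (such as semi-naive evaluation), but the declarative idea is the same: state facts and rules, let the engine derive all consequences.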
Recently, we have seen a lot of discussion around natural language processing, or NLP. Applications of GPT-3 are making headlines these days. So, it is interesting to explore the two major AI models a bit more.
DeepCode works a bit differently than other tools on the market. As explained above, Symbolic AI needs its rules to be built. At DeepCode, we use an augmented AI process. Let us explain:
As input, we use billions of changes checked into hundreds of thousands of open-source repos, as well as written API documentation where possible - not all documentation has the quality we need, but that is a story for another day. Having access to these open-source repos is one thing, but DeepCode also overcame the challenge of unlocking this massive data lake by inventing a
Based on this input, we apply a slew of machine learning algorithms to search for typical bug patterns. Induction happens in a human-machine process (the machine proposes; the human engineer steers and refines). This builds a body of rules, with the twist that the rules live on a higher abstraction level. A good example is tainted data, as mentioned above: the list of possible tainted-data source functions is built and maintained automatically. The difference from what we saw in the past lies in the induction phase: where we classically had a linear scale (one engineer can write one rule at a time), we now see an exponential scale. Deduction works by generating facts from the given code and applying the body of rules to them.
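The deduction side - "generating facts from the given code" - can be illustrated with a small, hypothetical fact extractor built on Python's standard `ast` module (DeepCode's actual intermediate representation is, of course, much richer):

```python
import ast

def call_facts(source):
    """Parse Python source and extract call(function_name, line) facts."""
    facts = []
    for node in ast.walk(ast.parse(source)):
        # Record every direct call to a named function.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            facts.append(("call", node.func.id, node.lineno))
    return facts
```

A rule such as "flag any call to eval" then becomes pure deduction: match the rule against the extracted facts, no execution of the analyzed program required.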
When you give DeepCode a try, you will see that the engine explains in great detail why a rule was triggered. Again, here come the benefits of Symbolic AI. We have all seen the discussion in the media about explaining Machine Learning decisions and how hard that is. For Symbolic AI, it is easy.
As a result of the machine learning phase, DeepCode can also show example fixes of similar bugs from open-source repos, which helps you understand the issue at hand and possible solutions.
Give DeepCode a try and see the difference for yourself. We have prepared some demo repos that you can easily study.
OK, obviously the more suggestions and results, the better, right? No, not really. Every tool will provide its feedback, and someone in the team needs to act on it. While some projects have the resources (and the need) to address each and every suggestion, the reality for most projects is the opposite: we need to prioritize and focus. We also need to prevent developers from being overwhelmed and simply disregarding the feedback (the Boy Who Cried Wolf effect). In summary, we need to balance getting all feedback early in the process against grinding the process to a halt. So here are some best practices:
Make sure to have static code analysis as part of your IDE and your CI/CD pipeline, as it prevents bugs from reaching production. Also, make sure to run legacy code regularly through the latest version of your scanning engines.