How we automated code evaluation for clean code

Krishnan Nair
Dec 10, 2019 · 5 min read
Codu high level architecture

“What? Automating for clean code?” followed by a look of skepticism is the usual reaction when we talk about building Codu.

Well, the best way to convince folks is to show rather than tell. So go ahead and apply for access if you’re interested in seeing whether it works. Check out my previous blog on what Codu is here.

But here’s how we did it

The first step is to define what clean code is, and while that can get complex, we tried to keep it simple: code that others can understand and modify is clean code.

This definition has been with us for a while, and it is how we have been manually evaluating code for the last 4 years. Over time, we’ve broken code review down into multiple parameters, each with a score and a shared understanding of what that score means.

This discipline has allowed us to have uniform data across the 1 million lines of code we’ve looked at over the years.

But of course, clean code is largely subjective, and the bar for what counts as clean varies. That’s why it was hard to automate. So we came up with this approach: trust the people who’ve been doing it manually for 4 years, and build a machine learning model that rates code the way they do.

So using our product will be like having senior techies in the team evaluate code for you.

Identifying features for our ML model

Overall, clean code and readable code are very closely related, and hence we picked Readability as the first parameter to predict.

The next step was to articulate what makes code readable and to capture those attributes as features: does the code have large functions, how well named are the variables and methods, what is the cyclomatic complexity (CCN), and so on. After racking our brains, we came up with a combination of 20+ features.
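To make that concrete, here’s a minimal sketch (in Python, with hypothetical field names; our real feature set is wider) of the kind of per-submission feature vector those questions turn into.

```python
from dataclasses import dataclass

@dataclass
class ReadabilityFeatures:
    """Illustrative subset of the 20+ features extracted per submission."""
    avg_method_length: float            # average lines per method
    max_method_length: int              # size of the biggest function
    avg_cyclomatic_complexity: float    # mean CCN across methods
    pct_well_named_identifiers: float   # share of variables/methods judged well named
    num_classes: int
    num_methods: int
    avg_parameters_per_method: float
```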

Creating data for these features

Creating data for these features across the 6 languages we support was the next hurdle. Since Geektrust supports multiple languages, we wanted to cover the ones that are commonly used. The idea was to parse the code and find meaningful chunks of it, which led us to Abstract Syntax Trees. We parsed all the code, created ASTs, and then mined them for data such as parameter names, method names, method size, classes, and so on.

Sample AST representation
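As an illustration of the idea (not our production pipeline, which uses language-specific parsers for all 6 languages), here’s what pulling method names, parameter names, and method sizes out of an AST could look like using Python’s built-in ast module.

```python
import ast

def extract_method_stats(source: str) -> list[dict]:
    """Walk a Python AST and collect the kind of raw data we look for:
    method names, parameter names, and method size in lines."""
    tree = ast.parse(source)
    stats = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            stats.append({
                "method_name": node.name,
                "parameter_names": [a.arg for a in node.args.args],
                "method_size": (node.end_lineno or node.lineno) - node.lineno + 1,
            })
    return stats
```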

Implementing this was hard, as ASTs are typically built for a single file, but in our case each code submission usually has multiple files. We also had to detect the language of the submission before parsing it. So we gathered all the source files (dropping build, test, and config files from the submission), built an AST for each, derived data from each of these ASTs, pinged another (NLP) service to check whether the variables and methods were named well, and produced one row of relational data with 20+ columns. So we now have one row of data per submission. The same task was repeated across all the code we have received to date.
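A rough sketch of how those per-file stats could be collapsed into the single row per submission described above. The naming_score_fn stand-in represents the NLP service we call for identifier names; the column names are illustrative, not our actual schema.

```python
from statistics import mean

def build_submission_row(source_files: list[str], naming_score_fn) -> dict:
    """Collapse per-file AST stats into one row of relational data per submission."""
    all_methods = []
    for source in source_files:
        all_methods.extend(extract_method_stats(source))  # from the sketch above

    identifiers = [m["method_name"] for m in all_methods] + [
        p for m in all_methods for p in m["parameter_names"]
    ]
    return {
        "num_methods": len(all_methods),
        "avg_method_size": mean(m["method_size"] for m in all_methods) if all_methods else 0,
        "max_method_size": max((m["method_size"] for m in all_methods), default=0),
        "pct_well_named": naming_score_fn(identifiers),
        # ...remaining columns, up to 20+
    }
```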

Then came the hardest part: cleaning up the data by hand to remove duplicates, invalid numbers, and so on. Extremely laborious, but crucial to our plans.
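The judgement calls needed a human, but the mechanical part of that cleanup can be pictured as a pandas pass like this (the file and column names are made up for the example).

```python
import pandas as pd

# One row per submission: 20+ feature columns plus the manual readability label.
df = pd.read_csv("submission_features.csv")

# Drop exact duplicates (e.g. the same submission processed twice).
df = df.drop_duplicates()

# Drop rows with missing or nonsensical values (NaNs, zero-sized methods, etc.).
df = df.dropna()
df = df[df["avg_method_size"] > 0]

df.to_csv("submission_features_clean.csv", index=False)
```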

Testing our models

The next step was to identify which features made the most sense for us. We started with a manual approach, trying to reason it out logically. Very soon we dropped this and moved to a combination of manual and automated processes, running predictions overnight so we had results to look at by morning.

Now, how would we know which features gave us the best results? We needed a test suite, or an equivalent, to compare them. In addition we had to identify the classification algorithm and the sampling mechanism. So we looked at these values to judge which prediction strategy was working well: Accuracy, Precision, Recall, Specificity, F1 Score, MCC, and Area Under Curve (a strategy being a combination of features, classification algorithm, and sampling mechanism).
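For each strategy we computed the same scorecard. Here’s a sketch of what that could look like with scikit-learn, assuming binary labels with “readable” as the positive class (this is an illustration, not our evaluation harness).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             confusion_matrix)

def evaluate_strategy(y_true, y_pred, y_score):
    """Score one strategy (feature set + classifier + sampling mechanism)
    on the metrics we compared across overnight runs."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "specificity": tn / (tn + fp),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```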

Sample test result from our early days of testing

Whichever model gave the best results across these parameters was the one we would go with. As ML noobs, everything here was new to us, and the most unintuitive part was understanding precision vs. recall, and why you can’t fully optimize for both.

Low precision means many false positives (non-readable code marked as readable), and low recall means many false negatives (readable code marked as non-readable). While we didn’t want bad code to be marked as good, we absolutely didn’t want good code to be marked as bad. So we optimized for high recall in our prediction model.
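In practice that means tuning the decision threshold to favour recall. This sketch (not our actual tuning code; the 0.95 recall floor is purely illustrative) shows the idea: among thresholds that keep recall above a floor, pick the most precise one.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def pick_threshold_for_recall(y_true, y_scores, min_recall=0.95):
    """Keep thresholds that meet the recall floor (don't mark readable code
    as non-readable), then take the one with the best precision."""
    best = None
    for t in np.linspace(0.1, 0.9, 81):
        y_pred = (y_scores >= t).astype(int)
        r = recall_score(y_true, y_pred)
        p = precision_score(y_true, y_pred, zero_division=0)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best  # (threshold, precision, recall), or None if the floor can't be met
```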

Testing in the real world

When we finally identified the model that worked best for us, we pulled around 100 code submissions from companies like ThoughtWorks and Gojek that were available on GitHub and ran our predictions against them to test how accurate we were. We also reviewed these submissions manually so that we could compare.

Since we optimized for high recall (to make sure we don’t mark good code as bad), code that was borderline readable would be marked as readable. We hit 95% accuracy on these submissions. In the future, we plan to give users the option to choose how high they want the readability bar to be.

What we’re working on now, and the future

We’re trying out a couple of changes in our model for Readability. In addition, we’re looking at more clean-code parameters: object-oriented design, maintainability, and so on. There’s a lot of trial and error involved, and we’re probably 6 weeks away from being ready with a new parameter.

A full-fledged product that lets you manage all your code submissions, have Codu run tests on those submissions, send invites to candidates to take your tests, and more is a work in progress. We currently have a playground ready. You can get access to check it out here.

We already have a generic engine to check for correctness of output. It’s used internally at Geektrust today and will be made available on Codu soon. A generic plagiarism check is also on the cards.

Needless to say, I am looking for interested companies who want to use Codu and give us feedback as we build this out.

Want to know more? Connect with us — codu@geektrust.in
