Scientifically Measuring the True Impact of Generative AI on Software Development
Generative AI enhances software development productivity by 90%… or is it 15%… or 10%?
Truth be told, all of those figures are likely accurate, but without context, it’s impossible to understand their implications. Software projects, roles, and tasks differ greatly. Depending on what you’re measuring and how you’re measuring it, you can get wildly different results.
You’d face a similar set of issues evaluating the performance of a soccer (football, for non-US readers) team.
You could measure the average soccer player’s running speed, pass precision, ball possession, or a host of other standalone activities. These isolated metrics can be improved individually but do not give you a great understanding of the impact on the whole.
Aside from standalone activities, you could also measure the end state or the end-to-end process. Here, your primary metrics would be things like goals scored, goals conceded, and how exactly they came about.
Imagine your favorite soccer club stumbled upon a magic method to increase shots on target by 20%. Undoubtedly, these improvements would impact the end state, but does it mean the team would score 20% more goals? Not necessarily.
Soccer games, like software development, are extremely sophisticated combinations of objective and subjective factors with complicated interdependencies that lead to different end states. Two teams playing two games under similar circumstances will rarely get the same final score. Add to this factors like playing home or away, weather, injuries, and suspensions, and you have an intensely complicated environment in which to measure performance.
When you hear that programming is accelerated by 40% with GenAI coding tools, there’s no doubt that it will affect the end state, but determining how is a science in itself.
This article examines the nuanced way that we’ve been measuring productivity in software development for the last two and a half decades and how these new GenAI coding tools can affect your critical business outcomes.
Not all standalone activities are created equal in GenAI software development
There are multiple research studies and reports about coding with GenAI from name-brand companies like McKinsey and reputable software firms like IBM that tout 10-90% productivity gains for standalone role-specific activities, otherwise known as tasks. These tasks fit into various stages of the software development lifecycle (requirements definition and elicitation, design, development, testing, integration, and deployment) and are executed by different specialists at each stage.
Developers don’t spend their whole day writing code; this doesn’t change with GenAI coding
Typical role-specific, standalone activities range from implementation tasks like code generation, refactoring, and documentation to review and maintenance tasks like debugging, bug fixing, and code review. All of these are great candidates for GenAI-driven productivity enhancement.
Logic would dictate that if you can improve productivity in each task, you could expect a proportionate improvement in overall productivity for the developer role, and if you applied this to other tasks and other roles, you’d get a proportionately improved end state. But unfortunately, it doesn’t really work like that.
Let’s look deeper into a standalone task, coding, which is a critical part of each developer’s job. We’ve seen credible reports of 20-40% productivity improvements in this area through coding with generative AI, and that is no laughing matter.
An average developer rarely exceeds a couple hundred lines of code per day. We know this because we track it religiously. In order to produce those lines of code, they do a combination of activities related to analysis, review, research, design, and experimentation. When the developer knows what lines need to be created, it takes less than 30 minutes to actually type them out. This means that the typing itself represents only a fraction of the work that goes into creating 200 lines of code. This is why developer productivity is never measured solely by lines of code, as it never reflects the real complexity and effort required to solve a problem.
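To see why, it helps to run a back-of-the-envelope calculation. Below is a minimal sketch, in our own Amdahl’s-law-style framing rather than a figure from the cited reports, that assumes an eight-hour day and the ~30 minutes of typing from the paragraph above:

```python
# Back-of-the-envelope: how much does faster typing speed up the whole day?
# Assumptions (illustrative, from the paragraph above): an 8-hour day and
# ~30 minutes of actual typing once the developer knows what to write.
day_minutes = 8 * 60                          # 480 minutes in a working day
typing_minutes = 30                           # time to type ~200 lines of code
typing_share = typing_minutes / day_minutes   # ~6% of the day is typing

# Amdahl's-law-style bound: speed up only the typing part by 40%
typing_speedup = 1.40
overall_speedup = 1 / ((1 - typing_share) + typing_share / typing_speedup)

print(f"Typing is {typing_share:.0%} of the day")
print(f"Overall gain from 40% faster typing alone: {overall_speedup - 1:.1%}")
```

Even making the typing itself 40% faster moves the whole day by only about 2%, which is exactly why lines of code alone never reflect the real effort behind them; the larger gains have to come from accelerating the surrounding analysis, research, and experimentation.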
Where there’s code…there are bugs…even with GenAI coding
Bug fixing is a process that usually involves multiple iterations of trial and error. It’s not a linear activity. A developer will deeply analyze the application codebase, data, and configuration and then look at where the expectations depart from the real outcomes. There are many ways to carry this out, but it might include introducing a change, observing whether the issue was fixed or was impacted by the change, and then making adjustments.
Add to this further complications, like bugs that cannot be consistently reproduced, which force developers to carry out an even deeper analysis of the application’s behavior instead of just fixing the bug in the code base. As critical as this part of the role is, we rarely see estimates of how GenAI software development impacts bug fixing, as it’s far more nuanced than generating new code.
Another thing a developer will do regularly is create and update technical documentation, which is another story altogether.
Documentation is usually created after solution design and after the code base has stabilized, so there’s likely to be less uncertainty at this phase. This accounts for part of the reason we’ve seen productivity gains of 70-90% in this task.
As if these examples weren’t divergent enough, there are other variables, like context switching, that dramatically affect productivity. Every time developers switch attention from one context to another, they lose focus and need time to restore it: about 9.5 minutes, according to Atlassian, so ten switches a day can cost more than an hour and a half. Focus is required to solve complex problems, so even when you have GenAI development assistance in the form of GitHub Copilot, you need to focus in order to actually receive that help.
In short, GenAI coding has a very different potential to improve productivity for different types of developer role-specific tasks. The evidence so far suggests that if the task is straightforward and relatively simple with few variables, GenAI software development makes a massive impact. If it’s not straightforward, the impact can be negligible.
Need your player to jump higher? No problem. Need them to figure out innovative ways to get around the opponents’ defense…not so simple.
It’s not just programmers who do GenAI dev; it’s everybody on the team
Product Owners often have to carry out the standalone task of generating user story descriptions. When the Product Owner knows every aspect of the story, all they need to do is type it out. Sounds straightforward, but in reality it’s another non-linear process. As the Product Owner commits thoughts to paper, the story takes shape through iterations: polishing the text, ensuring a smooth connection with other functionality, defining detailed acceptance criteria, etc.
On the other hand, tasks like backlog review for consistency and compliance with the definition of ready might be less related to functional design, far more straightforward, and can be accomplished with far fewer back-and-forth iterations.
When you improve the productivity of just one standalone role-specific task by 25-50% with GenAI software development, it looks amazing in a headline — Goalies are now 40% faster with their hands — but translating these gains into outcomes like advancing in the playoffs is not so straightforward.
Not all GenAI software development projects are created equal
Making a web application from scratch requires teams to gather new requirements, analyze them, code, test, and finally deploy. Compare this to a modernization project, where you have legacy software and a code base in place. When introducing GenAI into software development, the type of project really matters.
Modernization is more than translation
People could be excused for thinking that GenAI is a silver bullet for legacy platform modernization because it can quickly translate old code written in one language into new code written in another, much like LLMs have mastered translating German to English. But this isn’t really how modernization projects work.
It’s extremely rare for a modernization project to simply migrate what you have to a new language or framework. In most cases, the decision to modernize an app is driven by the need to make it more scalable, improve performance, and change the design to ensure extensibility or compliance and improve customer experience.
Considering the above, modernization usually takes the form of translating the legacy code base and documentation into an intermediate medium (usually requirements). The team then adjusts those requirements to the business needs that sparked the modernization effort and converts that revised medium into a code base built on modern technology.
Developer role-specific tasks will often focus on analysis of the legacy code base, testing the app behavior, browsing through the code base and configs to identify dependencies, and systematically identifying the core logic. You also have the added headache with legacy systems where the last person who understood how the solution actually works left the company three years ago.
And the differences don’t end there
You’ve also got support-heavy projects, where the team needs to focus on handling tickets coming from the users, ensuring application uptime, and carrying out active feature development in parallel.
There are also Machine Learning model development projects, where data analysis is given more attention than requirements analysis. The whole SDLC for this type of project is very different (Software 2.0 vs. Software 1.0): you need to work on data gathering, preparation, cleaning, splitting into training and test sets, developing model features, cycles of training and testing the model, and then deployment. MLOps exists to govern the specifics of handling this type of SDLC, which is very different by nature from the other types of projects mentioned above. This type of project will involve very similar basic role-specific tasks for developers, but in very different combinations, and the same will be true for all the project roles.
Project stages introduce more complexity
The initial project stage of a new web application project requires a lot of analysis and design effort followed by active coding. But when you reach the final project stages, you will usually find a lot of optimization, bug fixing, and documentation tasks rather than new code base creation.
Productivity gains with GenAI for code generation as a standalone role-specific task are undeniable, but the implications for overall end-to-end productivity depend heavily on the project specifics, with project stage and type having a lot of impact. The same goes for tech documentation: there might not be much need for it in the initial project phase, while in the last project stages, GenAI software development can save tons of effort.
Next time you hear about generative AI coding productivity gains, look at the context to understand what’s been measured, against what type and stage of project, and whether the connection between standalone role-specific tasks and overall team productivity is well thought through. Are we playing soccer, or is it a boxing match? If it’s boxing, are the fighters exhausted because it’s round 8? Context is key.
Metrics matter
In order to get a holistic understanding of team performance with GenAI coding, it’s best to track standalone, role-specific tasks as well as end-to-end metrics. This will tell you the number of lines of code produced by a developer, the time it takes a user story to go from the backlog to production (i.e., time to market), and throughput in the form of completed user stories within a fixed period of time (velocity).
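As a minimal sketch of what tracking the two end-to-end metrics looks like in practice (the data and field layout here are hypothetical, not our actual tracker), lead time and velocity can be computed from two timestamps and a sprint number per story:

```python
from datetime import date

# Hypothetical tracker export: (added to backlog, released to production,
# sprint in which the story was completed)
stories = [
    (date(2024, 1, 3),  date(2024, 1, 19), 1),
    (date(2024, 1, 5),  date(2024, 2, 2),  2),
    (date(2024, 1, 10), date(2024, 2, 9),  2),
]

# Time to market: days from backlog entry to production release
lead_times = [(released - added).days for added, released, _ in stories]
print(f"Average lead time: {sum(lead_times) / len(lead_times):.1f} days")

# Velocity: completed stories per sprint
velocity = {}
for *_, sprint in stories:
    velocity[sprint] = velocity.get(sprint, 0) + 1
print("Velocity by sprint:", velocity)
```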
End-to-end metrics are a great way of measuring productivity, as they focus on the end result, i.e., working software. Not that we’re trying to reinvent the wheel here: measuring the productivity of software development has been around for decades, long before the advent of generative AI. Adherence to best engineering practices in tracking and measuring is the foundation for a successful and realistic evaluation of the impact of GenAI on productivity.
It probably goes without saying, but having these metrics in place before introducing GenAI is the only way to truly understand productivity gains. How do you know your player is better at pass precision if you don’t know how precise their passes were to begin with?
Because the field is entirely new, there isn’t a tested and proven framework for introducing GenAI into software development. Hence, this stage calls for continuous experimentation, evaluation, and change.
Exadel’s first GenAI software development experiment
Our goal was quite simple: measure the true impact of GenAI on productivity in software development both in terms of standalone activities and end states.
Terms of the experiment
To start with, we carefully selected projects with stable team compositions, stable processes, and established approaches to measuring outcomes. To this end, we chose four project teams to run the experiment.
The experiment lasted 90 days, and we limited GenAI dev tools to ChatGPT (with GPT-4 Turbo) and GitHub Copilot. At the time of the experiment, the Copilot Enterprise version wasn’t available yet, and these were the best generative AI coding tools on the market.
Software engineers were given both generative AI coding tools, and specialists who do not code, like product owners and scrum masters, were only given ChatGPT.
We were very conscious of the fact that explaining too much about the objectives might skew the results; therefore, we designed the experiment so that team members had the tools available but were given no specific expectations about how to do GenAI coding, or even whether they had to use the tools at all. This allowed us to see the natural rate of adoption over the course of the experiment.
Team members were only told that we were looking for honest feedback about the new tools so that we could decide whether we wanted to use them in the future. To keep adoption natural, we did not train any team members on how to use the tools.
We applied best practices for measuring team productivity across role-specific standalone tasks, as well as projects and end-state outcomes. We specifically measured the following (a sketch of how two of these can be computed follows the list):
Code quality
- Code coverage, %
- Unit tests, qty
- Number of defects found on UAT/PROD
- Automation/Manual test cases ratio
Project-specific metrics
- Lead time for stories/features
- Throughput
- Number of work-in-progress items
- Velocity dynamics
- Backlog health
Team performance
- Predictability
- Stories per team size
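Exact definitions of these metrics vary from team to team. As one illustration, here is a minimal sketch, with hypothetical numbers, of computing two of the metrics above: predictability and the automation/manual test case ratio.

```python
# Hypothetical per-sprint numbers; definitions vary from team to team.
committed = [20, 22, 21, 24]   # story points committed at sprint planning
delivered = [18, 22, 19, 24]   # story points actually delivered

# Predictability: how closely delivery matches commitment
predictability = sum(delivered) / sum(committed)
print(f"Predictability: {predictability:.0%}")

# Automation/manual test case ratio
automated_cases, manual_cases = 340, 120
print(f"Automation/manual ratio: {automated_cases / manual_cases:.1f}:1")
```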
The results varied substantially from project to project due to the different types of projects, varied complexity, and amount of change that was taking place during the experiment. Metaphorically speaking, two of our teams were playing soccer, one hockey, and another was using robots to increase the efficiency of F1 engines.
GenAI software development gives very positive results
That said, the experiment results were quite positive across all four teams:
- Average time to market decreased by 10%-30%
- Sprint velocity increased by 11%-27%
- Tech debt went down 8%-20%
Here is an example of how we tracked end-to-end metrics using a flow diagram. Overall, the teams managed to increase output by 18% over the period of six two-week sprints, evidenced by the additional 181 issues that were resolved during this time period.
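For readers who want to build the same view, here is a minimal sketch of plotting that kind of flow diagram; the issue counts are hypothetical rather than our actual project data, and matplotlib is assumed:

```python
import matplotlib.pyplot as plt

sprints = list(range(1, 7))           # six two-week sprints
done    = [15, 33, 52, 74, 97, 121]   # cumulative completed issues
doing   = [10, 12, 11, 10, 9, 8]      # work in progress
todo    = [40, 35, 30, 24, 18, 10]    # remaining backlog

plt.stackplot(sprints, done, doing, todo,
              labels=["Done", "In progress", "To do"])
plt.xlabel("Sprint")
plt.ylabel("Issues")
plt.legend(loc="upper left")
plt.title("Flow of issues across six two-week sprints")
plt.show()
```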
Participants’ subjective evaluations of the productivity gains were quite positive, too.
According to this experiment, GenAI software development is extremely helpful for standalone tasks, end-to-end development, and project outcomes. But it’s still early days.
With every new model version, the potential productivity gains could be a little higher or exponentially higher for GenAI coding; we simply don’t know yet because things are moving so fast. Over the last couple of months, Google announced Gemini 1.5 Pro with a 10-million-token context window, OpenAI demonstrated its text-to-video model Sora, and Anthropic released a model that beats GPT-4 on many benchmarks.
Within the next year, there’ll be a new GenAI programming model that, along with other AI-assisted coding solutions, will further improve the productivity of software developers, and this section of the article will be antiquated. By the time we’re done measuring just how much faster it is… there’ll be another one.
Don’t wait to introduce GenAI into software development
If you’re at the beginning of your GenAI in software development journey, you may be thinking that there’s so much dust in the space that it’s better to wait until it settles and then pick the best approaches and tools after their efficiency has been proven multiple times by others. Until then, it might seem prudent to avoid the mess.
As logical as this is, we’re very concerned that this could cause you to lose your competitive edge.
If you have a solid process in place for measuring results, as well as a wealth of historical data about individual and team performance, you have the right foundation to start experimenting with GenAI coding and other ways to introduce GenAI in software development. If you know how fast your player runs and you can get them to run 20% faster, you should do it right now, even if we don’t know how many more goals they’ll score.
Early adoption will set you up for success, as you will be prepared for the release of every new groundbreaking generative AI model: you’ll simply need to plug it into your existing process and immediately cash in on the benefits. That’s what we are doing, and that’s what we recommend to anybody else in the space.