A year ago, we explored how generative AI might transform software engineering (see blog post: Generative AI in Software Engineering: Scenarios and Challenges Ahead). Since then, the focus has rapidly shifted from simple coding assistants to complex, AI-supported software engineering that touches nearly every phase of the Software Development Lifecycle, an approach we refer to as the AI-orchestrated SDLC. However, this newfound ease of generation brings a critical new challenge: the emergence of a “knowledge landfill”. In this article, we reflect on the developments of the past twelve months. We analyze why the question is no longer just about technical capability but about reliable control and quality assurance. As we move toward more automated workflows, the central task becomes preventing our systems from turning into unmanageable piles of generated artifacts and ensuring that humans maintain substantive control over the complex systems AI helps us build.
From AI coding assistants to AI-orchestrated SDLC
Since our first blog post, the landscape has changed significantly. Modern AI development environments, which include both AI-IDEs and coding assistants (such as Cursor, OpenHands, Aider, Claude, Open Code, Kilo Code – the list seems to grow every minute), now play a central role in daily work. They do more than complete individual lines of code. They read and modify many files at once, call tests and linters, propose refactorings, and integrate with CI pipelines and external tools. They behave like agents embedded in the development environment.
At the same time, the Model Context Protocol (MCP) standardizes how models call tools and access data sources. Conventions such as Cursor Rules, AGENTS.md and SKILLS.md describe how an AI should work with a specific repository. They act as machine-readable guidelines for agents.
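As an illustration, repository guidelines of this kind often look like the following. The file name follows the AGENTS.md convention mentioned above, but the specific rules, paths, and commands are hypothetical:

```markdown
# AGENTS.md (illustrative sketch; all rules below are hypothetical)

## Build and test
- Run `make test` before proposing any change.

## Conventions
- Follow the existing module layout under `src/`; do not create new top-level packages.
- Update `CHANGELOG.md` for any user-visible change.

## Boundaries
- Never modify files under `infra/` without explicit human sign-off.
```

The point is that such files turn informal team knowledge into machine-readable guardrails that an agent reads before acting.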
As a result, almost every phase of the SDLC has an AI entry point. Models can help draft requirements, recommend architectures, implement features, generate tests, modify deployment pipelines, and write documentation. The key question therefore shifts. It is no longer “Can AI do this task?”. It becomes, “How do we control these systems, keep their behavior coherent, and trust what they produce?”.
Automation vs. operability
Public demos show impressive AI workflows: an agent that generates requirements, writes code, runs tests, fixes bugs, and updates documentation. These examples prove that individual tasks are technically automatable.
However, organizations need more than this proof of concept. They require systems that behave reliably over time, are predictable under change, leave clear traces for auditing and debugging, and, when they do fail, fail in understandable ways. This is a question of operability, not only of capability.
AI-based workflows are complex. They combine prompts, tools, skills, configuration, and access rights into one system. A small change in a model version, a tool, or a prompt can lead to a large change in behavior. Failures are often challenging to reproduce. It can be difficult to explain why an agent chose a particular solution path or why it ignored certain information.
In addition, software engineering is not only the production of artifacts. It includes aligning stakeholders, deciding priorities, and managing risks and trade-offs. It must respect informal constraints, organizational politics, and external regulations. AI is strong at producing artifacts, but far weaker at handling conflicting interests and implicit constraints.
Therefore, even if AI can produce plausible outputs for every step of the SDLC, this does not mean that we can safely operate a fully automated SDLC in a real organization.
The main bottlenecks move from “Can the model do it?” to “Can we run this system in a controlled and transparent way?”.
Knowledge explosion: from too little to too much
Traditional software projects regularly suffer from missing or outdated documentation. Generative AI reverses this problem. It is now very easy to generate large volumes of requirements, specifications, architecture documents, test plans, runbooks, and user or developer documentation.
This offers obvious benefits. Drafts appear quickly. Updating documentation becomes cheap. The same underlying information can be rendered as different views for different roles. Teams that previously had almost no documentation can, in principle, obtain comprehensive descriptions of their systems.
However, this shift creates a difficulty: the risk of a “knowledge landfill”. When generation is cheap and fast, many artifacts are produced but not carefully reviewed. Some documents are simply wrong. Others become outdated soon after they are written. Documents may contradict each other. All AI-generated documents look polished, well-structured, and “professional”, which makes it easy to overestimate their quality.
Abstraction levels add further complexity. Different roles need different levels of detail. Product owners care about goals, users, and constraints. Architects need capabilities, interfaces, and non-functional requirements. Developers require APIs, edge cases, and concrete examples. Operations teams need runbooks, alerts, and failure modes. Models can summarize content on demand, but they do not enforce a stable and shared hierarchy of abstraction across an organization. Over time, high-level intent and low-level artifacts can drift apart unnoticed. Misalignments are sometimes only visible in production incidents or during audits.
Skill catalogs, such as SKILLS.md files, describe what agents can do in a given project. However, they also fragment practices. Different teams may define overlapping skills for similar tasks, each with slightly different assumptions and styles. Over time, this leads to many small, implicit “micro-methods” that are not fully aligned.
A more sustainable approach introduces constraints and structure. Important artifacts should be treated as code: stored in version control, changed through pull requests, and subject to review. Teams can adopt simple, explicit templates for requirements, architecture decisions, test plans, and runbooks and use AI to populate and update these templates rather than invent arbitrary new structures. They can use AI to check consistency across artifacts, for example, by ensuring that requirements have corresponding tests and documentation and that APIs match their descriptions. Finally, they can maintain a few canonical documents per topic and derive other views from them instead of generating many independent documents that quickly diverge.
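One of the consistency checks described above, verifying that every requirement has a corresponding test, can be automated with a small script. The following is a minimal sketch; the `REQ-` ID convention and the file contents are hypothetical, and in practice the sources would be read from the repository rather than inlined:

```python
import re

# Hypothetical convention: requirements carry IDs like REQ-101, and each
# test file names the IDs it covers in a comment.
requirements = """
REQ-101: Users can reset their password via email.
REQ-102: Sessions expire after 30 minutes of inactivity.
REQ-103: All API errors return a machine-readable error code.
"""

test_sources = {
    "test_auth.py": "# covers REQ-101\ndef test_password_reset(): ...",
    "test_session.py": "# covers REQ-102\ndef test_session_timeout(): ...",
}

# Collect all declared requirement IDs and all IDs referenced by tests.
req_ids = set(re.findall(r"REQ-\d+", requirements))
covered = {rid for src in test_sources.values()
           for rid in re.findall(r"REQ-\d+", src)}

uncovered = sorted(req_ids - covered)
print("Requirements without tests:", uncovered)
```

Run in CI, such a check turns “documentation drift” from a silent problem into a visible, reviewable signal.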
Risk of a downward spiral of maintainability
We have not yet seen many large systems that were created and evolved primarily by agents in production. Yet, there exists an inherent maintainability risk. Models tend to overshoot: they generate more code than necessary, introduce additional abstractions “just in case”, and create helper functions and configurations that are only weakly justified by current requirements. Agents often optimize for local task completion and plausibility of the solution, not for long-term simplicity and maintainability.
Like any technical debt, if it is left unchecked, it can trigger a downward spiral of maintainability. Each agentic generation step builds on code that was itself produced by agents with incomplete context and limited understanding of the domain. Dead code, duplicated logic, and inconsistent patterns accumulate. The next agent invocation sees a larger, noisier code base and has to infer intent from it. This increases the probability of further redundant additions. Over time, the system may become hard to understand not only for humans but also for agents, because much of the existing structure is an artifact of earlier, suboptimal generations rather than of clear design decisions.
Trust: automated quality feedback mechanisms are now more important than ever
For code, we already have strong feedback mechanisms. We can compile or interpret it, run unit and integration tests, apply static analysis and linters, use security scanners, and profile performance. These tools detect many classes of errors and provide concrete signals about correctness, style, security, and efficiency.
This allows a pragmatic pattern: AI writes or changes code, automated tools evaluate it, and humans review those areas where tools are weak, such as system design, tricky logic, and security-sensitive paths. Over time, teams can measure defect rates, incident patterns, and performance trends and can adjust how and where they use AI. Trust in AI-generated code becomes an empirical question, not a matter of belief.
For non-executable artifacts, the situation is more difficult. Requirements, user stories, product documents, roadmaps, architecture descriptions, test strategies, and risk analyses do not “fail to compile”. A system can pass all tests and still implement the wrong requirement. A requirements document can look clear and thorough and still omit important constraints or stakeholders. There is no compiler or test suite that directly validates the content.
“LLM-as-a-judge” is a common suggestion for this gap. One model generates an artifact. Another model scores its quality. This approach has structural limitations. The judging model often shares the same training biases and blind spots as the generating model. It tends to reward surface features such as structure, style, and familiar phrases. To perform a meaningful evaluation, it would need access to real-world facts and local organizational constraints, or, worse, information that is currently missing altogether. For many of these artifacts, there is no objective, external ground truth that a model could enforce.
Sycophancy makes things worse. Models are trained to be helpful and agreeable. They rarely insist that an input is underspecified or that a human must consult additional stakeholders. They typically state that a document is “comprehensive” or “well covered”, even if important aspects are missing. They express confidence even when their internal support for an answer is weak.
As a result, the question “Can we trust LLMs?” is not very helpful. A more precise question is: For which types of artifacts, at which criticality level, and with which feedback or review processes, can we accept AI-generated outputs?
Some patterns are promising. Multi-perspective AI critique asks different models, or different roles in a single model, to review an artifact from different perspectives, such as architect, tester, product owner, or security officer. Instead of giving a single quality score, each “role” lists assumptions, missing scenarios, and potential failure modes. This does not solve the problem of ground truth, but it can surface weaknesses more systematically.
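The multi-perspective pattern can be sketched as a simple orchestration loop. The role prompts below are illustrative, and `call_model` is a placeholder standing in for whatever LLM client an organization actually uses; it is an assumption, not a real API:

```python
# Sketch of multi-perspective AI critique: each "role" produces a list of
# findings rather than a single quality score.
ROLES = {
    "architect": "List assumptions and missing non-functional requirements.",
    "tester": "List untested scenarios and likely failure modes.",
    "product_owner": "List stakeholder needs the document does not address.",
    "security_officer": "List security-relevant gaps and risky defaults.",
}

def call_model(system_prompt: str, artifact: str) -> list[str]:
    # Placeholder: in practice this would call an LLM API with the role
    # prompt and parse the response into a list of findings.
    return [f"no findings parsed (prompt: {system_prompt[:20]}...)"]

def multi_perspective_review(artifact: str) -> dict[str, list[str]]:
    """Collect role-specific findings for a non-executable artifact."""
    return {role: call_model(prompt, artifact) for role, prompt in ROLES.items()}

findings = multi_perspective_review("Draft requirements document ...")
for role, items in findings.items():
    print(role, "->", items)
```

The value lies in the shape of the output: lists of assumptions and gaps per role are reviewable by humans, whereas a single aggregate score is not.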
Outcome-based calibration is also important. Organizations can observe defects, incidents, audit findings, and customer feedback. They can use these observations to adjust how much autonomy they grant to AI for different tasks and artifact types. In this view, trust is not a binary property of the model but a calibrated property of the entire socio-technical system.
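A minimal sketch of such calibration might map observed defect rates per task type to an autonomy policy. The task names, rates, and thresholds below are purely illustrative assumptions:

```python
# Sketch of outcome-based calibration: autonomy per task type is derived
# from observed outcomes, not assumed up front. All numbers are illustrative.
defect_rates = {
    "boilerplate": 0.01,        # defects per accepted AI-generated change
    "unit_tests": 0.03,
    "architecture_docs": 0.20,
}

def autonomy_level(rate: float) -> str:
    """Translate an observed defect rate into a review policy."""
    if rate < 0.02:
        return "auto-merge with spot checks"
    if rate < 0.10:
        return "mandatory human review"
    return "human-authored, AI assists only"

policy = {task: autonomy_level(r) for task, r in defect_rates.items()}
print(policy)
```

As defect rates shift, the policy shifts with them, which is exactly what it means for trust to be a calibrated property of the system rather than a fixed belief about the model.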
Humans as bottlenecks – and new skills
As AI generates output faster, humans become the main bottleneck for reviewing and approving it. This is expected, but it also creates risks. If the amount of AI-generated material is too large, reviewers may skim or rubber-stamp changes. Processes exist formally but lose their substantive effect. The perception of governance remains, but actual control over quality and risk weakens.
Responsibility also becomes blurrier. If AI writes the specification, the implementation, and the test cases, it is not obvious who “owns” a defect. Is it the developer who accepted the suggestions? The team that configured the AI tools? The vendor of the AI platform? Current legal and organizational frameworks assume that identifiable humans author key artifacts. AI-generated pipelines break this assumption and create gaps in accountability.
At the same time, there is a real risk of skill erosion. If junior engineers rely heavily on AI for implementation, debugging, and explanation, they face fewer raw problems. They develop fewer deep mental models of systems. They learn to trust and interpret AI explanations, rather than building their own understanding from first principles. Over time, this can reduce the pool of people able to debug complex failures, design robust architectures, or reason under novel constraints, especially when AI assistance is unavailable or wrong.
AI also makes it cheap to add complexity. It becomes easy to create new services, workflows, and skills. System-level understanding does not grow as quickly. Many small, locally reasonable decisions can accumulate into a globally fragile system with hidden couplings and emergent behaviors that few people fully understand.
These developments suggest new core competencies for software organizations. AI systems architecture becomes even more important: designing how humans, agents, tools, and processes interact, and treating prompts, skill definitions, and tool integrations as part of the architecture. AI operations and governance also become central: managing prompts, skills, access rights, monitoring, and evaluation policies as first-class configuration items, and tracking changes to AI behavior over time. Finally, engineers need critical AI literacy: the ability to understand typical failure modes such as hallucination and sycophancy, to know when to trust and when to doubt AI outputs, and to ask questions that expose gaps and hidden assumptions.
Technological constraints
Agentic systems benefit from rapidly improving models, tools, and infrastructure, but they still operate within clear technical limits. These limits do not make agentic SDLC impossible, but they shape what works reliably today.
Context and compute
Larger context windows enable multi-file refactorings and richer “conversations with the codebase”. Yet attention still scales poorly, and reasoning workflows consume many more tokens. Models cannot simply “load the entire system and reason holistically”; they still require modular architectures, careful scoping, and good prompt and tool design.
Tool use and orchestration
Tool calling lets models run tests, static analyzers, CI jobs, and internal services. This moves AI from pure text generation to verifiable actions. But orchestration becomes a new source of failure: tools may be called in the wrong order, with wrong parameters, or misinterpreted. Deterministic tools (compilers, linters, analyzers) remain more efficient for many checks; agents add value by combining and interpreting them, not by replacing them.
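The division of labor described above, deterministic tools for checking and agents for combining and interpreting, can be sketched as a fixed-order pipeline. The checks below are toy stand-ins (a real setup would invoke an actual compiler, linter, or test runner, for example via subprocess), chosen only to show the orchestration structure:

```python
from typing import Callable

def compile_check(code: str) -> list[str]:
    """Deterministic stand-in for a compiler: report syntax errors."""
    try:
        compile(code, "<snippet>", "exec")
        return []
    except SyntaxError as e:
        return [f"syntax error: {e.msg}"]

def lint_check(code: str) -> list[str]:
    # Toy lint rule standing in for a real linter.
    return ["line too long"] if any(len(ln) > 99 for ln in code.splitlines()) else []

PIPELINE: list[tuple[str, Callable[[str], list[str]]]] = [
    ("compile", compile_check),   # cheap, deterministic checks run first
    ("lint", lint_check),
]

def run_pipeline(code: str) -> dict[str, list[str]]:
    """Run tools in a fixed order so failures are attributable to one step."""
    results: dict[str, list[str]] = {}
    for name, tool in PIPELINE:
        results[name] = tool(code)
        if results[name]:          # stop early; later tools need valid input
            break
    return results

print(run_pipeline("def f(:\n    pass"))
```

The fixed ordering and early stop address exactly the orchestration failures mentioned above: each tool runs with valid input, in a known sequence, and every failure is attributable to a named step.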
Memory and retrieval
RAG and agent “memory” extend what a model can draw on beyond the current prompt and make longer-running tasks feasible. However, these memories are often local, opaque, and lossy, optimized for answering the next query rather than preserving a coherent, auditable knowledge base. They complement, but do not replace, curated documentation and decision records in version control.
Task structure
Agent systems shine on parallelizable tasks, such as generating tests for many functions or summarizing multiple documents. They struggle more with long, tightly sequential workflows where early mistakes propagate and compound. Reliable systems therefore insert tests, checkpoints, and human review between agent steps instead of delegating long chains of decisions to agents alone.
A pragmatic path forward
In the short term, it is sensible to treat AI as a powerful assistant rather than as the autonomous owner of the SDLC. Today’s constraints reward designs that use agents where their strengths (contextual reasoning, tool orchestration, parallelization) matter most, and that lean on traditional engineering practices for structure, verification, and long-term knowledge management. All substantial AI-generated changes should go through version control. Tests, linters, and scanners should run automatically. Human review should remain mandatory for artifacts that encode goals, architectures, and risks.
In the medium term, organizations can invest in foundations for quality and evaluation. They can define simple internal standards for requirements, test plans, architecture documents, and runbooks. They can build or adopt tools that check consistency between requirements, code, tests, and documentation. They can experiment with multi-model critique and role-based AI review for non-executable artifacts and incorporate the results into their processes.
In the long term, the nature of software engineering will likely change. Many mechanical tasks will be largely automated. Human work will focus more on problem framing, stakeholder alignment, governance, system-level design, and risk management. Education and career paths will need to adapt. Engineers will still need to understand programming and systems, but their main value will increasingly lie in designing and governing complex AI-augmented socio-technical systems rather than in manually producing every artifact.
Engineering the Future of AI-Driven Development
Niels Bohr is credited with the remark: “Prediction is very difficult, especially if it is about the future.” In our first post, we discussed possible futures for generative AI in software engineering. Today, we already see many elements of those futures in practice.
AI development environments, MCP, and shared conventions now enable AI participation in every phase of the SDLC. The central challenges are no longer about raw capability. They concern knowledge management, evaluation of non-executable artifacts, responsibility, and human control. We can, in principle, build an “automated SDLC”. The hard questions are what we should automate, how we validate AI-generated artifacts, and how we ensure that humans remain in meaningful control of the systems that AI helps us build.
These questions are technical, organizational, educational, and ethical at the same time. How we answer them will shape what “software engineering” means in an era where generative models can produce almost every artifact, but humans remain responsible for the outcomes.
