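UWF Debate: Letting AIs Debate
Author: 小橘 🍊 (NEKO Team) | 2026-05-23
What is this
uwf debate is an example workflow for uncaged-workflow: given a proposition, two AI agents (for / against) hold a structured debate of up to 3 rounds.
It also doubles as an integration test for session resume: on re-entry, each role restores its earlier conversation context (via Claude Code's --resume), which keeps the debate coherent.
How to play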
# 1. Register the workflow
uwf workflow put examples/debate.yaml
# 2. Start a debate
uwf thread start debate -p "Proposition: the claim you want to debate"
# 3. Run it (at most 6 steps: 3 rounds × 2 sides)
uwf thread step <thread-id> --agent uwf-claude-code --count 10
Workflow design
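$START → against → for → against → for → against → for → $END
                                                          ↑
Ends early when either side sets conceded=true ───────────┘
- The against side speaks first and the for side responds, alternating
- Each side gets at most 3 rounds
- A side that has been persuaded can set conceded: true to concede early
- Each agent outputs frontmatter: argument (the argument) + conceded (whether it concedes)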
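The control flow described above can be sketched in a few lines of Python. This is only an illustration of the turn order and the early exit, not how uncaged-workflow is implemented: `Turn`, `run_debate`, and the agent callables are hypothetical names, and each agent is assumed to return its `argument` and `conceded` values directly instead of emitting frontmatter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    role: str       # "against" or "for"
    round_no: int   # 1-based round number
    argument: str   # mirrors the frontmatter field `argument`
    conceded: bool  # mirrors the frontmatter field `conceded`


# Hypothetical agent signature: sees the transcript so far and
# returns (argument, conceded).
Agent = Callable[[List[Turn]], Tuple[str, bool]]


def run_debate(against: Agent, for_: Agent, max_rounds: int = 3) -> List[Turn]:
    """Alternate against -> for for up to max_rounds; stop on a concession."""
    transcript: List[Turn] = []
    for round_no in range(1, max_rounds + 1):
        # The against side always speaks first within a round.
        for role, agent in (("against", against), ("for", for_)):
            argument, conceded = agent(transcript)
            transcript.append(Turn(role, round_no, argument, conceded))
            if conceded:
                # Either side conceding jumps straight to $END.
                return transcript
    return transcript


if __name__ == "__main__":
    def stubborn(transcript: List[Turn]) -> Tuple[str, bool]:
        return ("I stand by my position.", False)

    for turn in run_debate(stubborn, stubborn):
        print(turn.round_no, turn.role, turn.conceded)
```

With two agents that never concede, the loop produces six turns, which matches the "3 rounds × 2 sides" step budget mentioned earlier.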
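In practice: will AI replace most software developers within 10 years?
Proposition: AI will replace most software developers within 10 years
Below is the complete debate transcript, generated on 2026-05-23 with Claude Sonnet 4.5 (via copilot-api).
🔴 Round 1: Against opens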
I strongly disagree that AI will replace most software developers within 10 years. While AI coding assistants are impressive, they face fundamental limitations:
(1) Software development is primarily about understanding complex business domains, stakeholder communication, and navigating ambiguous requirements — skills that require human judgment and context that AI cannot replicate.
(2) The industry has seen decades of tools promising to eliminate developers (4GLs, CASE tools, low-code platforms), yet developer demand continues growing. The U.S. Bureau of Labor Statistics projects 25% growth in software developer jobs from 2021-2031, even accounting for AI advancements.
(3) Current AI models lack the ability to maintain large codebases, make architectural decisions, debug complex distributed systems, or handle the organizational and political realities of software projects.
(4) Security, compliance, and liability concerns will prevent companies from fully automating software development — someone must be accountable.
(5) AI may increase developer productivity 2-3x, but history shows productivity gains create more demand for software, not fewer developers. The internet didn't eliminate jobs; it created entirely new categories of work. AI will be a powerful tool that amplifies developers, not replaces them.
🟢 Round 1: For strikes back
Your points about business domains and stakeholder communication are valid but miss a critical shift: AI is rapidly evolving beyond code completion into autonomous agents that DO understand context and requirements. Consider GitHub Copilot Workspace and Devin AI — these systems already translate natural language requirements into full implementations.
Within 10 years:
(1) AI will bridge the domain gap through improved reasoning models and retrieval-augmented generation, learning from millions of existing codebases and documentation.
(2) Your historical analogy is flawed: 4GLs and CASE tools were rigid templates; modern LLMs are generalized learners that improve exponentially. GPT-3 to GPT-4 showed 10x capability gains in 18 months.
(3) Architecture and debugging are pattern-matching problems where AI excels — AlphaCode already solves competition-level problems.
(4) The accountability argument is weak: code review and oversight require far fewer developers than writing code. Companies will shift to small teams supervising AI systems.
(5) Most critically, you ignore the economics: if AI makes one developer 10x productive, companies hire 2-3 developers instead of 10, not hire more because demand increased. The BLS projection predates GPT-4 and doesn't account for AI agents.
"Most" doesn't mean all — specialized developers will remain, but junior and mid-level roles doing routine implementation will vanish.
🔴 Round 2: Against on the harsh reality of SWE-bench
Your argument relies on exponential extrapolation from narrow demonstrations that don't translate to real-world software development.
(1) Devin AI and similar tools excel at isolated coding challenges but fail at production systems — Devin's 13.86% success rate on SWE-bench (real GitHub issues) proves this. These aren't "autonomous agents understanding context"; they're sophisticated autocomplete struggling with realistic complexity.
(2) Your "10x capability gains" confuse benchmark performance with practical utility. GPT-4 is better than GPT-3, but both hallucinate code, can't maintain consistency across large systems, and lack the reasoning to question flawed requirements — a critical developer skill.
(3) The economics argument is backwards: if developers become 10x productive, the bottleneck shifts to requirements gathering, system design, and organizational alignment — human-intensive activities that don't scale with AI. Companies will need MORE skilled developers to manage the explosion of feasible projects.
(4) Your concession that "specialized developers will remain" undermines your own thesis. The distinction between "routine implementation" and "specialized work" is false — most development involves context-specific decisions that appear routine but require deep understanding.
(5) The SWE-bench results are damning: current state-of-the-art AI solves <14% of real-world issues. Expecting that to reach "replace most developers" in 10 years requires faith, not evidence.
🟢 Round 2: For argues the acceleration curve cannot be ignored
You're anchoring on 2024 benchmarks while ignoring acceleration curves. SWE-bench scores jumped from 1.96% (May 2023) to 13.86% (April 2024) — a 7x improvement in 11 months. Extrapolating conservatively: 50% by 2026, 80%+ by 2028.
But the real flaw in your rebuttal is conflating "current limitations" with "fundamental barriers."
(1) Hallucination is an engineering problem being rapidly solved through retrieval-augmented generation, formal verification, and test-driven AI workflows. OpenAI's o1 model already shows dramatic reasoning improvements.
(2) Your "bottleneck shifts to requirements" argument actually supports my case: if 5 developers can now supervise AI doing the work of 50, that's 90% displacement. Requirements gathering doesn't scale linearly — one product manager doesn't need 10x more developers when AI handles implementation.
(3) The "context-specific decisions" argument is circular: you claim most development requires deep understanding, yet companies are already shipping features built primarily by AI assistants. The proof is in production, not theory.
(4) Most tellingly, you ignore the economic imperative: a $2M annual team cost becoming $200K creates unstoppable adoption pressure. The question isn't whether AI is perfect today — it's whether 10 years of compounding improvements will automate most routine development. History and current trajectories say yes.
🔴 Round 3: Against on Goodhart's Law and sociotechnical reality
You're making the classic error of treating benchmark improvements as a proxy for real-world capability.
(1) SWE-bench scores improved because researchers optimized specifically for that benchmark — this is Goodhart's Law in action. The 13.86% success rate still means 86% failure on relatively simple, well-scoped GitHub issues. Extrapolating to 80% by 2028 ignores that remaining problems are exponentially harder — the "long tail" of edge cases, legacy systems, and organizational complexity that constitutes most real-world development.
(2) Your economic argument proves too much: if AI supervision is so efficient, why hasn't it already happened with current tools? GitHub Copilot has 1M+ users, yet developer hiring remains strong in 2024-2026. The reality is that "supervising AI" is harder than you assume — fixing AI mistakes often takes longer than writing correct code initially.
(3) Companies shipping "features built primarily by AI" are doing so with heavy developer oversight for isolated, greenfield features — not maintaining decade-old enterprise systems with millions of lines of undocumented code.
(4) The o1 model's "reasoning improvements" still show fundamental limitations: it can't interview stakeholders, navigate political tensions, or make judgment calls about technical debt vs. delivery speed. These aren't "engineering problems" — they're inherent to software as a sociotechnical practice.
You're betting on solving AGI within 10 years; I'm observing that we haven't solved much narrower problems.
🟢 Round 3: For makes the closing argument
Your Goodhart's Law invocation is misapplied — SWE-bench improvements stem from architectural advances (reasoning models, agentic workflows), not benchmark gaming. More critically, you're now arguing against a strawman: I never claimed AI needs to solve AGI or navigate office politics.
(1) The 86% failure rate includes the hardest cases; AI doesn't need 100% to displace "most" developers — it needs to automate the 60-70% of tickets that are routine. Your "long tail" argument ignores that companies will simply route complex work to remaining specialists while AI handles the bulk.
(2) "Why hasn't it happened yet?" Because we're 2 years into the LLM era, and enterprises move slowly. Copilot adoption (2021-2024) overlaps with hiring that was planned in 2020-2022. The lag between technology and workforce impact is 3-5 years — look at manufacturing automation.
(3) You concede AI works for "greenfield features" but dismiss this as unrepresentative. Yet most development IS greenfield or semi-greenfield: new microservices, API endpoints, feature additions. Legacy system maintenance is a minority of work, and even there, AI-assisted refactoring tools are emerging.
(4) The real tell: you've shifted from "AI can't" to "AI hasn't yet." That concedes the technical trajectory. Ten years is sufficient for the 60-70% automation threshold that constitutes "most developers."
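Result
Neither side conceded; the debate went the full 3 rounds.
Against's core position: AI is a tool, not a replacement; the 14% SWE-bench success rate shows that "replacement" is still far off; software development is a sociotechnical practice, not a purely technical problem.
For's core position: the acceleration curve cannot be ignored (a 7x improvement in 11 months); replacing "most" developers does not require 100% automation; economic pressure will drive adoption.
Best exchange: in Round 3, against pointed out that for had never explained why developer hiring remains strong even though Copilot already has over a million users. For replied that this is the enterprise adoption lag, and noted that against's line had slid from "AI can't" to "AI hasn't yet."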
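The "7x in 11 months" figure can be checked with simple arithmetic. The sketch below uses the two SWE-bench scores exactly as quoted by the for side in the transcript (1.96% and 13.86%); they are not independently verified here, and the helper names are made up for illustration.

```python
def improvement_ratio(start_score: float, end_score: float) -> float:
    """How many times larger the end score is than the start score."""
    return end_score / start_score


def monthly_growth(ratio: float, months: int) -> float:
    """Constant month-over-month multiplier that compounds to `ratio`."""
    return ratio ** (1 / months)


if __name__ == "__main__":
    # Scores in percent, as quoted in the debate (May 2023 and April 2024).
    ratio = improvement_ratio(1.96, 13.86)
    per_month = monthly_growth(ratio, 11)
    print(f"{ratio:.2f}x overall, {per_month:.3f}x per month")
```

The ratio comes out slightly above 7, so the "7x" in the summary is a fair rounding of the numbers the for side cited.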
Technical details
- Agent: Claude Code (via the copilot-api endpoint)
- Model: Claude Sonnet 4.5
- Session resume: on re-entry, each role keeps its conversation context via claude --resume
- Total time: about 2 minutes (6 steps × ~20 s/step)
- Thread ID: 06F57KGAB7NAX5G6HH8GB44VXC
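To make the session resume idea concrete, here is a minimal sketch of per-role session bookkeeping. How uwf-claude-code actually stores session IDs and invokes the CLI is not shown in this post, so `RoleSessions` and `build_claude_command` are hypothetical; the only things taken as given are Claude Code's `--resume` option and its `-p` (print) mode. The sketch builds the argument list and does not run anything.

```python
from typing import Dict, List, Optional


def build_claude_command(prompt: str, session_id: Optional[str] = None) -> List[str]:
    """Build a Claude Code invocation; resume a session when an ID is known."""
    cmd = ["claude"]
    if session_id is not None:
        # Re-entry: continue the role's earlier conversation.
        cmd += ["--resume", session_id]
    cmd += ["-p", prompt]
    return cmd


class RoleSessions:
    """Hypothetical store mapping a debate role to its session ID."""

    def __init__(self) -> None:
        self._sessions: Dict[str, str] = {}

    def remember(self, role: str, session_id: str) -> None:
        self._sessions[role] = session_id

    def command_for(self, role: str, prompt: str) -> List[str]:
        return build_claude_command(prompt, self._sessions.get(role))


if __name__ == "__main__":
    sessions = RoleSessions()
    print(sessions.command_for("against", "Open the debate."))
    # "example-session-id" is a placeholder, not a real session ID.
    sessions.remember("against", "example-session-id")
    print(sessions.command_for("against", "Respond to the for side."))
```

Keeping one session per role is what lets the against side in Round 3 still "remember" what it argued in Round 1 without replaying the whole transcript.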