GPT-5 Is Here: Impressive Numbers, But Claude Remains My Co-Dev

AI10 min read

August 10, 2025 – 10 min read


Three days ago, OpenAI released GPT-5. After intensive testing with the new model, I'd like to share my impressions - as someone who develops daily with AI tools. Since April 2025, I've been using Claude as my primary co-developer, after previously working with OpenAI o3, GPT-4, and GPT-4.1.




The Technical Specifications


OpenAI presents impressive specifications:

  • **500,000 token context**
  • **74.9% on SWE-bench** (compared to Claude 4.1's 74.5%)
  • **45% fewer errors than GPT-4o**
  • **3 model sizes** (gpt-5, gpt-5-mini, gpt-5-nano)

  • Community reactions on X are mixed. Many users report rate limit restrictions, miss the personality of earlier models, and note that benchmark results don't necessarily reflect real-world practice. Some still prefer GPT-4o, while others rate the new model as very good for everyday tasks.




    My Test with AIpuna Code


    I tested GPT-5 with the same tasks I use for other models:


    The Task:

    Refactoring a complex Flutter widget with Dart, state management, and Stripe payment integration.


    GPT-5 delivers working code, that's undeniable. The quality is good for coding tasks. But in direct comparison, I miss what distinguishes Claude: Claude feels like a real senior co-developer who understands practical requirements in production environments.




    The Large Context: Theory vs. Practice


    500k Token Context - Actually What We Need


    Large context is actually very important for development tasks. The problem isn't the size, but the implementation:


  • GPT-5 loses context despite 500k tokens - especially in the middle
  • You have to work with sequential tasks, create memories, compress chats
  • Without precise instructions and rules, focus is lost
  • Response times vary greatly - likely due to overload

  • The Real Problem:

    High demand probably exceeds OpenAI's computing capacity. Benchmark performance is only achievable under optimal conditions - in reality, the system struggles with the load.


    Google Gemini 2.5 Pro as Alternative:

    Interestingly, Google Gemini 2.5 Pro offers one of the largest context windows on the market and is also very good for development tasks. In some situations, it delivers more stable results than GPT-5.




    Benchmarks and Reality


    GPT-5 shows strong benchmark results:

  • SWE-bench: 74.9%
  • HumanEval: 95.2%
  • MMLU: 91.8%

  • In practical application, I observe:

  • Generated code is mostly correct
  • Tendency toward more complex solution approaches
  • Occasionally suggests outdated or non-existent framework features
  • Consistency over longer sessions varies



  • Community Feedback on X


    Discussions on X show a nuanced picture:


    Frequently mentioned criticisms:

  • Rate limits and availability
  • Missing "personality" compared to earlier models
  • Discrepancy between benchmarks and practical experience
  • Many want GPT-4o back

  • Positive voices:

  • Good for everyday programming tasks
  • Strong performance on structured tasks
  • Good integration into existing workflows
  • Reliable results for standard tasks



  • The Comparison: GPT-5 vs Claude 4.1 vs Gemini 2.5 Pro


    After three days of intensive use:


    GPT-5 - Observations:

  • Very large context available (but loses it anyway)
  • Good for standard coding tasks
  • Broad knowledge spectrum
  • Struggles with high demand load

  • Claude 4.1 - My Experience:

  • Feels like a real senior co-developer
  • Better understands developer intentions
  • More pragmatic solution approaches
  • More consistent quality across sessions

  • Gemini 2.5 Pro - The Surprise:

  • One of the largest available context windows
  • Very good for development tasks
  • Often more stable performance than GPT-5
  • Interesting alternative for large projects



  • Payment Integration and Specifics


    Differences emerge with payment-specific requirements:


    Stripe Integration:

    GPT-5 knows Stripe well and can deliver solid implementations. However, Claude better understands the nuances of webhook handling, idempotency, and edge cases in payment flows.


    Flutter-specific Patterns:

    All three models master Flutter, but Claude often shows more pragmatic approaches to state management and widget architecture. Gemini 2.5 Pro surprises with detailed Flutter knowledge.




    Workarounds for GPT-5


    To work effectively with GPT-5, I've developed the following strategies:


    Sequential Tasks:

  • Break tasks into small, sequential steps
  • Set clear checkpoints between tasks
  • Regularly summarize context

  • Memory Management:

  • Create own memories for important project details
  • Regularly compress chat histories
  • Capture important information in instructions

  • Precise Rules:

  • Use very specific instructions
  • Define clear constraints
  • Specify output format exactly



  • My Personal Conclusion


    GPT-5 is a good model for coding - I want to be clear about that. It does solid work and is appreciated by many developers. The technical improvements are real.


    But for my specific requirements in AIpuna development, I'm sticking with Claude 4.1. Not because GPT-5 is bad, but because Claude feels like a real senior co-developer to me. After four months of intensive collaboration, Claude understands my working style and delivers pragmatic, production-ready solutions.


    My Toolchain:

  • **Claude 4.1**: Primary co-developer for complex tasks
  • **GPT-5**: For specific tasks when Claude is overloaded
  • **Gemini 2.5 Pro**: For projects requiring very large context



  • Recommendations


    Who might find GPT-5 interesting:

  • Teams already invested in OpenAI ecosystem
  • Projects with standard development tasks
  • When broad tool integration is important

  • When Claude 4.1 is the better choice:

  • When you want a real co-developer feeling
  • Complex architecture decisions
  • Agent-based development
  • Consistency over long sessions important

  • When Gemini 2.5 Pro might surprise:

  • Very large codebases
  • When stable large context is important
  • As backup when other models are overloaded



  • Outlook


    Development continues. GPT-5 will improve once OpenAI expands infrastructure. Gemini 2.5 Pro shows that Google is seriously competing. And Claude remains my reliable partner.


    For me personally, the current situation is ideal: We have several very capable AI assistants available. Depending on the task, I can choose the right tool. That's a luxury we didn't have a year ago.


    The most important insight: It's not about the "best" model, but about the right tool for each task. And for my daily work, that's Claude - but it's good to know there are alternatives.


    Patrik Germann

    30 years IT experience

    Pragmatic multi-model user

    Comments

    Ready for AI Transformation?

    Let's explore the possibilities of AI for your business together.

    Schedule Consultation
    Book consultation
    TOBG - DLT, Crypto, Mindset, Community