GPT-5 Is Here: Impressive Numbers, But Claude Remains My Co-Dev

August 10, 2025 – 10 min read

Three days ago, OpenAI released GPT-5. After intensive testing with the new model, I'd like to share my impressions - as someone who develops daily with AI tools. Since April 2025, I've been using Claude as my primary co-developer, after previously working with OpenAI o3, GPT-4, and GPT-4.1.

The Technical Specifications

OpenAI presents impressive specifications:

**500,000 token context**

**74.9% on SWE-bench** (compared to Claude 4.1's 74.5%)

**45% fewer errors than GPT-4o**

**3 model sizes** (gpt-5, gpt-5-mini, gpt-5-nano)

Community reactions on X are mixed. Many users report rate limit restrictions, miss the personality of earlier models, and note that benchmark results don't necessarily reflect real-world practice. Some still prefer GPT-4o, while others rate the new model as very good for everyday tasks.

My Test with AIpuna Code

I tested GPT-5 with the same tasks I use for other models:

The Task:

Refactoring a complex Flutter widget with Dart, state management, and Stripe payment integration.

GPT-5 delivers working code, that's undeniable. The quality is good for coding tasks. But in direct comparison, I miss what distinguishes Claude: Claude feels like a real senior co-developer who understands practical requirements in production environments.

The Large Context: Theory vs. Practice

500k Token Context - Actually What We Need

Large context is actually very important for development tasks. The problem isn't the size, but the implementation:

GPT-5 loses context despite 500k tokens - especially in the middle

You have to work with sequential tasks, create memories, compress chats

Without precise instructions and rules, focus is lost

Response times vary greatly - likely due to overload

The Real Problem:

High demand probably exceeds OpenAI's computing capacity. Benchmark performance is only achievable under optimal conditions - in reality, the system struggles with the load.

Google Gemini 2.5 Pro as Alternative:

Interestingly, Google Gemini 2.5 Pro offers one of the largest context windows on the market and is also very good for development tasks. In some situations, it delivers more stable results than GPT-5.

Benchmarks and Reality

GPT-5 shows strong benchmark results:

SWE-bench: 74.9%

HumanEval: 95.2%

MMLU: 91.8%

In practical application, I observe:

Generated code is mostly correct

Tendency toward more complex solution approaches

Occasionally suggests outdated or non-existent framework features

Consistency over longer sessions varies

Community Feedback on X

Discussions on X show a nuanced picture:

Frequently mentioned criticisms:

Rate limits and availability

Missing "personality" compared to earlier models

Discrepancy between benchmarks and practical experience

Many want GPT-4o back

Positive voices:

Good for everyday programming tasks

Strong performance on structured tasks

Good integration into existing workflows

Reliable results for standard tasks

The Comparison: GPT-5 vs Claude 4.1 vs Gemini 2.5 Pro

After three days of intensive use:

GPT-5 - Observations:

Very large context available (but loses it anyway)

Good for standard coding tasks

Broad knowledge spectrum

Struggles with high demand load

Claude 4.1 - My Experience:

Feels like a real senior co-developer

Better understands developer intentions

Payment Integration and Specifics

Differences emerge with payment-specific requirements:

Stripe Integration:

GPT-5 knows Stripe well and can deliver solid implementations. However, Claude better understands the nuances of webhook handling, idempotency, and edge cases in payment flows.

Flutter-specific Patterns:

All three models master Flutter, but Claude often shows more pragmatic approaches to state management and widget architecture. Gemini 2.5 Pro surprises with detailed Flutter knowledge.

Workarounds for GPT-5

To work effectively with GPT-5, I've developed the following strategies:

Sequential Tasks:

Break tasks into small, sequential steps

Set clear checkpoints between tasks

Regularly summarize context

Memory Management:

Create own memories for important project details

Regularly compress chat histories

Capture important information in instructions

Precise Rules:

Use very specific instructions

Define clear constraints

Specify output format exactly

My Personal Conclusion

GPT-5 is a good model for coding - I want to be clear about that. It does solid work and is appreciated by many developers. The technical improvements are real.

But for my specific requirements in AIpuna development, I'm sticking with Claude 4.1. Not because GPT-5 is bad, but because Claude feels like a real senior co-developer to me. After four months of intensive collaboration, Claude understands my working style and delivers pragmatic, production-ready solutions.

My Toolchain:

**Claude 4.1**: Primary co-developer for complex tasks

**GPT-5**: For specific tasks when Claude is overloaded

**Gemini 2.5 Pro**: For projects requiring very large context

Recommendations

Who might find GPT-5 interesting:

Teams already invested in OpenAI ecosystem

Projects with standard development tasks

When broad tool integration is important

When Claude 4.1 is the better choice:

When you want a real co-developer feeling

Complex architecture decisions

Agent-based development

Consistency over long sessions important

When Gemini 2.5 Pro might surprise:

Very large codebases

When stable large context is important

As backup when other models are overloaded

Outlook

Development continues. GPT-5 will improve once OpenAI expands infrastructure. Gemini 2.5 Pro shows that Google is seriously competing. And Claude remains my reliable partner.

For me personally, the current situation is ideal: We have several very capable AI assistants available. Depending on the task, I can choose the right tool. That's a luxury we didn't have a year ago.

The most important insight: It's not about the "best" model, but about the right tool for each task. And for my daily work, that's Claude - but it's good to know there are alternatives.

Patrik Germann

30 years IT experience

Pragmatic multi-model user