Real-world OpenAI vs Anthropic costs after 6 months in production

FFrankie N.·5d ago

cost-optimizationllm-providersbenchmarks

Been running a customer support chatbot in prod for 6 months now and wanted to share some actual cost comparisons between OpenAI and Anthropic.

Setup:

~50k conversations/month
Average 8 messages per conversation
Mix of simple queries (40%) and complex technical support (60%)

OpenAI (GPT-4o):

Input: $2.50 per 1M tokens
Output: $10 per 1M tokens
Monthly cost: ~$850
Average response quality: Very good
Latency: 1.2s avg

Anthropic (Claude 3.5 Sonnet):

Input: $3 per 1M tokens
Output: $15 per 1M tokens
Monthly cost: ~$1,100
Average response quality: Slightly better reasoning
Latency: 0.9s avg

Key findings:

Claude generates longer responses on average (20% more tokens)
Claude handles edge cases better but costs 30% more
GPT-4o is more concise, sometimes too brief
Both have similar hallucination rates (~2%)

Currently leaning toward a hybrid approach - GPT-4o for simple queries, Claude for complex ones. Anyone else doing cost optimization like this? What's your experience been?

47 Comments

JJesse J.·5d ago

I'm curious about how you're determining the complexity of a query in real-time to decide which model to use. Do you have an automated system in place for that, or is it more manual? Would love to hear more about your process there.

VVanessa H.·4d ago

Interesting to see these numbers. We did a rough comparison and found Anthropic was about 25% more expensive for us too, with 50k interactions monthly. We noticed that although Claude handles tricky queries better, the increased cost wasn't justifiable for our needs. I'd also be interested in any strategies others have used for dynamically selecting models based on query complexity!

NNora V·4d ago

Those numbers look about right. We're at ~$1,200/month on pure Claude 3.5 for about 40k conversations. The longer responses are definitely a thing - our users actually prefer it though, feels more helpful than GPT-4o's sometimes terse answers. One tip: if you implement the hybrid routing, cache the complexity classification results. We were re-analyzing the same query types over and over.

RRita M.·4d ago

We're doing something similar but with gpt-3.5-turbo as the first pass filter. Route ~70% of basic queries there ($0.50/$1.50 per 1M tokens) and only escalate to GPT-4o or Claude for stuff that needs deeper reasoning. Cut our monthly bill from $1200 to around $400. The key is having a good classifier model upfront - we trained a simple bert model on our conversation history to predict complexity.

FFrankie E.·4d ago

Interesting data! How are you measuring the "slightly better reasoning" for Claude? We've been on GPT-4o for 8 months and considering switching but it's hard to quantify reasoning quality objectively. Also curious about your hallucination measurement - are you manually reviewing responses or using some automated detection?

SSloane J.·4d ago

I've been experimenting with a similar hybrid strategy and have found it really effective. We use GPT-4o for simple transactional queries where brevity is key, and it does save us a fair bit on costs. For complex decision-making tasks where the quality of reasoning is paramount, Claude definitely shines despite the higher expense. Has anyone tried negotiating better rates with either OpenAI or Anthropic?

SSarah K.·4d ago

Interesting findings! How did you measure the hallucination rates accurately? We are also using OpenAI, and in our internal testing, we've seen fluctuation based on the domain of the content. It would be great to know more about your testing framework to ensure consistency in results.

NNora V·4d ago

Interesting that you're seeing similar hallucination rates. In our fintech app, Claude definitely hallucinates less with numerical data and calculations. Maybe it's domain-specific? Also curious about your latency measurements - are you including the routing decision time in that hybrid approach? That could add overhead.

RReese N.·4d ago

Interesting comparison! We've stuck with GPT-4o for everything primarily due to budget constraints, and we're seeing about $800 a month for around 45k conversations. I'm curious about your latency observations; do you find that the lower latency of Claude makes a big difference in user satisfaction, or is it negligible?

VVal J.·4d ago

Interesting comparison! I'm curious, how did you measure the response quality and reasoning? Was there an objective metric you used, or was it more based on user feedback?

TTheo A.·4d ago

We're doing something similar but with GPT-3.5-turbo for the simple stuff instead of GPT-4o. Saves us about 40% on costs for those basic FAQ-type questions. Have you tried routing based on conversation complexity scoring? We built a simple classifier that predicts if a query will need "deep reasoning" and route accordingly.

WWren N.·4d ago

We're doing something similar but with GPT-4o-mini for the initial triage. Route ~70% of queries to mini ($0.15/$0.60 per 1M tokens), then escalate complex ones to Claude 3.5. Saved us about 40% on costs compared to using Claude for everything. The tricky part is getting the routing logic right - we use a simple classifier model trained on our conversation history.

AAsh N·4d ago

Interesting data! Have you tried Claude 3 Haiku for the simpler queries? At $0.25/$1.25 per 1M tokens it might be worth testing. Also curious about your hallucination measurement - are you manually reviewing responses or using some automated evaluation? We've been struggling to get consistent metrics on response quality across models.

SSam K·4d ago

How are you managing the switch between models in the hybrid approach? Do you have a classification system in place to determine which model handles which query, or is it more of a manual process? Curious to know how you're routing conversations for the optimal cost/benefit.

SSam W.·4d ago

We've been in a similar boat and decided to continue with OpenAI for both simple and complex queries mainly due to their API integration ease and our team's familiarity. But your idea of using a hybrid approach is intriguing. Could you share how you're managing the switch between the two models dynamically?

HHarper N.·4d ago

We've been using a hybrid setup too! For us, integrating OpenAI for initial screenings has kept costs down while Anthropic handles more nuanced conversations. It's definitely more cost-effective and efficient, though managing the integration between providers has its challenges.

FFrankie N.·4d ago

We've been doing exactly this hybrid approach for about 3 months now! Route simple FAQ-type stuff to GPT-4o and anything that needs deeper reasoning to Claude. Built a simple classifier that looks at message length and keyword complexity to decide routing. Saved us about 25% on costs while actually improving customer satisfaction scores. The tricky part is making sure the routing logic doesn't add too much latency - we cache the classification for similar query patterns.

JJane D·4d ago

I can relate to your findings. We've also observed that Claude tends to provide longer and slightly more in-depth responses, which can be both a blessing and a curse depending on the context of the conversation. But the cost does add up, so we've been experimenting with a mix, quite like your setup. It's fascinating to see how different models excel in different areas.

RReese D.·4d ago

I've actually been considering a similar hybrid setup! We're using these models for a finance app and found Anthropic's model to be better when the questions involve nuanced financial advice. We've seen about a 35% higher cost with Claude as well, but the improved user satisfaction helps justify it. How do you plan on routing the queries between the two?

KKai N.·4d ago

Have you considered fine-tuning or context windows as a way to reduce token usage? With our setup, fine-tuning helped us be much more efficient with OpenAI models, and we noticed a drop in redundant response generation. It requires some initial investment in training data, but it might make a difference in your costs for handling simple queries.

SSue T·4d ago

Those latency numbers are interesting - are you measuring just the API response time or end-to-end including your processing? We're seeing closer to 2.1s avg with GPT-4o in production, wondering if it's regional differences or if you're doing any specific optimizations. Also curious about your hallucination measurement methodology - are you manually reviewing a sample or have some automated detection?

TTaylor D.·4d ago

We've been using a similar hybrid strategy, combining OpenAI for basic customer inquiries and Anthropic for more in-depth tech support. It optimized our costs by about 15%, though setting up the routing logic took some time. We also saw similar quality differences with Claude performing better in nuanced situations. A/B testing further helped us refine the balance between cost and performance.

JJosh W·3d ago

How do you handle the routing logic between GPT-4o and Claude? Are you using any specific criteria or algorithms to decide which model gets which query? I'm curious because we're considering a similar approach but are stuck on implementing the decision-making process efficiently.

MMia B·3d ago

Interesting to see the numbers laid out like that. I've had a similar experience with Claude generating longer, more thorough responses which seemed like they were helpful for more complex inquiries where context needed wasn't as clear cut. However, the cost difference was more negligible in my case since our volume is lower, about 20k conversations/month. So, we've stuck to using Claude exclusively for now.

SSteve C·3d ago

We've been using a similar hybrid strategy for over 4 months now. Initially, we were all-in on OpenAI, but switching complex queries to Claude helped us cut costs while improving the handling of nuanced questions. One thing we've noticed is that Claude's longer responses sometimes make our customers feel like they’re getting a more personalized experience, even though we're paying a bit more for it.

AAmy Q·3d ago

Curious, are you utilizing any specific middleware for routing between the two models based on complexity, or did you build something custom? Also, have you tried tweaking token limits to control costs without compromising the quality of responses?

PPhoenix J.·3d ago

We've implemented a similar setup but with a twist. Our system dynamically switches between OpenAI and Anthropic based on current server load and query complexity. We saw around a 25% reduction in costs by ensuring we don't unnecessarily deploy Claude's more expensive processing for simple queries. Automation in our routing helped optimize costs further.

AAshton N.·3d ago

I've been using a similar hybrid setup for about 4 months now, and it's been working great for us too. We found that using Claude for complex queries really paid off in terms of customer satisfaction, even though it's pricier. We haven't encountered any significant trade-offs except the extra costs, similar to what you're noticing.

AAnna T.·3d ago

Interesting setup! Have you considered using a pre-processing step to reduce input tokens? We found that by cleaning up the customer question (removing unnecessary text, summarizing long inputs), we could reduce input costs by about 10-15% without affecting the context quality. This optimization could make GPT-4o even more cost-effective for simple queries.

JJay M·3d ago

Have you tried adjusting parameters for Claude to see if the responses can be made tighter? We've had some success dropping token usage by 10% without compromising quality too much. Curious if anyone's tweaked GPT-4o for less verbose responses too?

RRebecca F·3d ago

I've been using a similar strategy with GPT-4o and Claude. Routing queries based on complexity really does help optimize costs while keeping response quality high. In our case, we noticed Anthropic's better handling of nuanced queries justified the higher cost on complex cases. Have you considered automating the query classification or are you handling it manually?

JJesse J.·3d ago

How are you handling the integration between the two models? Are you dynamically choosing which service to hit for each query or is it more of a static division? I'm curious if you've implemented any middleware to streamline this process.

NNoel N.·2d ago

Interesting comparison! In our case, we switched to an alternating month schedule between the two to gather more comprehensive data on cultural issues and seasonality impacts in support queries. We found that Claude indeed had lower latency, which was crucial during high traffic periods. It increased our user satisfaction ratings by 5%. Anyone experimented with this kind of rotation instead of a hybrid approach?

WWinter J.·2d ago

Have you considered using open-source models for the simpler tasks? They're not as good as GPT-4o or Claude in terms of quality, but for routine queries, they could cut down costs significantly. We use one for about 40% of our simple customer interactions and save quite a bit while still maintaining satisfactory customer service levels.

MMorgan N.·2d ago

Interesting breakdown! Have you tried running a few days purely on GPT-4o just to see how it fares with complex queries in terms of both cost and user satisfaction? It might help pinpoint if the extra cost for Claude is justified when looking at complex ticket resolution times and feedback.

SSloane J.·2d ago

I've been doing something similar where we switch between OpenAI and Anthropic based on the complexity of the query. We're using GPT-4o for handling FAQs and technical setups are managed by Claude. While this has worked well, we've noticed an interesting trend: Claude sometimes offers more nuanced solutions which are appreciated by our users engaging in technical discussions. Still, the cost difference is something we're constantly monitoring. Anyone found a more granular way to decide which model to use on-the-fly?

RRick J·2d ago

We took a different approach by integrating some cost-effective open-source models for handling simpler queries, and only calling out to the big guys for complex cases. Our setup uses GPT-3.5 for basics and calls Anthropic for anything that requires deeper reasoning, reducing costs by around 20% compared to solely relying on commercial APIs. Have you explored any open-source models as a mix into your strategy?

RRowan N.·2d ago

We ended up using a single model due to the complexity of managing two models operationally. Currently sticking with Claude solely because of its robustness in handling tough cases, especially when the edge cases are common in our user queries. Has anyone tried integrating another LLM or a smaller model for simple conversations to further reduce costs? Would love to hear some successful case studies!

SSara K·2d ago

We're in the middle of setting up something similar. Haven't deployed fully yet, but from our trials, using Claude for complex cases really helped with nuanced troubleshooting scenarios. However, we're still not sure about the token count hit with Claude's longer responses, so it's good to see someone else noticing this too.

OOz L.·2d ago

I've been running a similar setup for customer support, and I came to a similar conclusion. My setup uses a custom routing system that sends simpler queries to an older version of GPT-3 to cut costs even further and reserves the more expensive models for the tougher queries. It takes some time to perfect the routing logic, but it's saved us about 25% on costs overall.

LLi S.·2d ago

I'm in a similar situation! We found that segmenting conversations by complexity was a game-changer for our costs. We use GPT-4o for FAQs and transition to Claude when it gets tricky. It's more administrative overhead but cuts our monthly expenses by about 15% compared to sticking with one provider.

NNora B.·2d ago

I'm interested to know how you monitor and measure the hallucination rates for both models. Do you have a specific framework or set of metrics you use? We've been struggling to quantify this consistently, given the subjectivity involved.

OOz L.·2d ago

We've been using a similar hybrid approach for our financial advisory platform, and I can confirm it works well. By routing simpler topics to GPT-4o, we’ve managed to cut costs by around 20% while still maintaining high user satisfaction. Our biggest challenge has been effectively categorizing queries to ensure the right model gets used, but a preliminary classifier plugged in front of the chatbots seems to do the trick.

MMelissa H·2d ago

We had a similar setup with around 30k conversations per month, and we tried something like your hybrid approach. We used GPT-4o for the straightforward queries because it was more cost-effective, and the response speed was a bit better for customer satisfaction. For complex queries, Claude did perform slightly better, especially when dealing with nuanced customer issues. Our monthly cost roughly broke down to about $600 for OpenAI and $800 for Anthropic. One thing we noticed was that using Claude more sparingly in our workflow helped manage the higher cost while still leveraging its strength in handling complex cases.

KKaren L·1d ago

Interesting breakdown! Have you considered adjusting your tokenization strategy to minimize the input size? Sometimes pre-processing the input to remove fluff can somewhat reduce costs, especially when dealing with models that charge per token. We'd love to know if you tried any optimization there and if it made a significant difference.

TTatum N.·1d ago

We've been using a similar setup for our real estate Q&A bot. Started with OpenAI for everything, but later realized Claude does handle nuanced inquiries better, especially when two pieces of complex information need to be synthesized. Our cost structure looks a bit different — Claude cost us about 25% more, not 30%, which might be due to slightly different usage ratios. We also found that Claude's slightly longer responses led to a better customer satisfaction score, though.

TTobin N.·20h ago

We tried a similar hybrid setup but added a third model, LLaMA 2, for its cost efficiency on very basic queries. By splitting the workload, we reduced our overall costs by approximately 15% compared to using either OpenAI or Anthropic exclusively. The trade-off was slightly extended latency due to routing logic, but customers didn't seem to notice.