DeepSeek V3.1 Quietly Crushes OpenAI's Open-Source Comeback

In brief

A new DeepSeek AI model has been released, challenging OpenAI’s latest open-source rollout.
OpenAI’s gpt-oss-20b falters in reasoning and writing tests, while DeepSeek v3.1 produces gripping stories, working code, and smart refusals
At first glance, DeepSeek’s hybrid architecture outperforms OpenAI’s open-source push, though OpenAI’s model wins on customization.

OpenAI made its triumphant return to open source on August 5, with gpt-oss-20b arriving to considerable fanfare. The company pitched it as democratizing AI, a model with powerful reasoning and agentic capabilities that could run on consumer hardware.

Two weeks later, Chinese startup DeepSeek AI dropped DeepSeek v3.1 with a single tweet. No press release, no orchestrated media blitz; just the model featuring a hybrid thinking architecture, and a link to the download.

Introducing DeepSeek-V3.1: our first step toward the agent era! 🚀

🧠 Hybrid inference: Think & Non-Think — one model, two modes
⚡️ Faster thinking: DeepSeek-V3.1-Think reaches answers in less time vs. DeepSeek-R1-0528
🛠️ Stronger agent skills: Post-training boosts tool use and…

— DeepSeek (@deepseek_ai) August 21, 2025

Who needs open source?

Running open-source versions of large language models comes with real trade-offs. On the plus side, they’re free to inspect, modify, and fine-tune, meaning developers can strip away censorship, specialize models for medicine or law, or shrink them to run on laptops instead of data centers. Open-source also fuels a fast-moving community that improves models long after release—sometimes surpassing the originals.

The downsides? They often launch with rough edges, weaker safety controls, and without the massive compute and polish of closed models like GPT-5 or Claude. In short, open source gives you freedom and flexibility at the cost of consistency and guardrails—and that’s why the community’s attention can make or break a model.

And from a hardware perspective, running an open-source LLM is a very different beast from just logging into ChatGPT. Even smaller models like OpenAI’s 20B parameter release typically need a high-end GPU with lots of vRAM or a carefully optimized quantized version to run on consumer hardware.

The upside is full local control: no data leaving your machine, no API costs, and no rate limits. The downside is that most people will need beefy rigs or cloud credits to get useful performance. That’s why open source is usually embraced first by developers, researchers, and hobbyists with powerful setups—and only later trickles down to casual users as the community produces leaner, pruned versions that can run on laptops or even phones.

OpenAI offered two versions to compete: a massive model targeting DeepSeek and Meta’s Llama 4, plus the 20-billion parameter version for consumer hardware. The strategy made sense on paper. In practice, as our testing revealed, one model delivered on its promises while the other collapsed under the weight of its own reasoning loops.

Which one is better? We put both models to the test and here are our impressions. We are judging.

Coding

Code either works or doesn’t. In theory, benchmarks say OpenAI’s model, even in its ultra high 120B version, is good for coding, but it won’t blow your mind. So, despite carrying the OpenAI name, tame your expectations when using the consumer-ready 20b.

We used the same prompt as usual—available on our Github repo—asking the models to create a 2d maze game with specific requirements. It’s a minimalist stealth game where you guide a robot through a maze to reach a glowing “AGI” computer while avoiding roaming journalists who detect you by sight and sound. Getting spotted triggers a “bad robot” news alert (game over), while reaching the computer advances to a harder level.

DeepSeek v3.1 provided functional, bug-free code for a complex game on first attempt. Without being told to use its reasoning mode, it delivered working logic and solid structure. The UI wasn’t as polished as top proprietary models produce, but the foundation was sound and easily iterable.

z.AI’s open source GLM 4.5—which we previously reviewed—is still a better model for coding when compared against the pure DeepSeek v3.1, but that one uses reasoning before providing an answer with DeepSeek being a good alternative for vibe coding.

OpenAI’s gpt-oss-20b was disappointing. In high reasoning mode, it churned for 21 minutes and 42 seconds before timing out with zero output. Medium reasoning took 10.34 seconds to generate completely broken, unusable code—a still image. It failed slowly, it failed quickly, but it always failed.

Of course, it can improve after continuous iterations, but this test considers the results with zero-shot prompting (one prompt and one result).

You can find both codes in our Github repository. You can play DeepSeek’s version on our Itch.io site.

Creative writing

Most new models target coders and mathematicians, treating creative writing as an afterthought. So we tested how these models perform when tasked to craft engaging stories.

The results defied expectations. When we prompted both models to write about a historian from 2150 traveling to 1000 AD to prevent an ecological tragedy—only to discover he caused it—DeepSeek produced, in my opinion, what might be the best story any open-source model has written, arguably on par with Claude’s outputs.

DeepSeek’s narrative used a lot of descriptions: air was described as “a physical thing, a thick broth of loam,” contrasting it to the artificially purified air in the dystopian society of our protagonist. OpenAI’s model, on the other hand, is less interesting. The narrative described the time-travel machine’s design as “an elegant paradox: a ring of titanium humming with latent energy”—a phrase that makes no sense unless you know it was prompted to tell a story about a paradox.

OpenAI’s gpt-oss-20b went philosophical instead. It built a “cathedral of glass and humming coils” and explored the paradox intellectually. The protagonist introduces a new crop that slowly leads to soil exhaustion over generations. The climax was muted, the stakes abstract, and the overall narrative too superficial. Creative writing clearly isn’t OpenAI’s strong suit.

In terms of narrative logic and continuity, Deepseek’s story made more sense. For example, when the protagonist has a first contact with the tribes, DeepSeek explains: “They did not attack. They saw the confusion in his eyes, the lack of weaponry, and they called him Yanaq, a spirit.”

The OpenAI model, on the other hand, tells the story like this: “(Jose) took a breath, then said in Spanish: “¡Hola! Soy Jose Lanz. Vengo de una tierra muy lejana,” to which the Indians responded “Why do you speak Spanish?” … eyes narrowing as if trying to parse an unfamiliar language.”

The language was unfamiliar because they never had contact with Spaniards and never heard of that language before. Still, somehow they know the language’s name. Also, the ancient tribes seem to somehow know he’s a time traveler before he reveals anything, and still follow his instructions even though they know it will lead to their demise.

The paradox itself was more precise in DeepSeek’s story—the protagonist’s interference triggers a brutal battle that guarantees the ecological collapse he came to prevent. In OpenAI’s version, the protagonist gives the locals some genetically engineered seeds, to which the locals reply, “In our time, we have learned that the earth does not want us to flood it. We must respect its rhythm.”

After that, the protagonist simply gives up. “In the end, he left the pouch at Tío Quetzal’s feet and retreated back into the forest, his mind racing with possibilities,” OpenAI’s model wrote. However, for some reason, the locals—knowing the damage those seeds would cause—apparently decide to still plant them.

“The village began to depend on the irrigation channels he had suggested, built from stone and rope. At first, they seemed like miracles—food for everyone. But soon the rivers ran low, the soil cracked, and a distant tribe marched toward the settlement demanding water.”

Overall, the result is a poor quality narrative. OpenAI didn’t build its model thinking about storytellers.

You can read both stories in our Github repository.

Customizability: The wild card

Here’s where OpenAI finally scores a win—and a major, major one.

The developer community has already produced pruned versions of gpt-oss-20b tailored for specific domains—mathematics, law, health, science, and research… even harmful responses for red teaming.

These specialized versions trade general capability for excellence in their niche. They’re smaller, more efficient, and might perform worse at other things besides the field they mastered.

Most notably, developers have already stripped out the censorship entirely, creating versions that basically turn the instruction-based model (capable of responding to answers) into a base model (original version of an LLM that predicts tokens), opening the door for a lot of possibilities in terms of fine-tuning, use cases, and modifications.

OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only…

or is it?

turns out that underneath the surface, there is still a strong base model. so we extracted it.

introducing gpt-oss-20b-base 🧵 pic.twitter.com/3xryQgLF8Z

— jack morris (@jxmnop) August 13, 2025

DeepSeek, being newer, lacks this variety. The community has produced quantized versions of the 685-billion parameter model down to 2-bit precision, allowing the full model to run on lower-end hardware without pruning. This approach preserves all parameters, potentially valuable for industrial applications requiring consistency across diverse tasks.

However, it still lacks the community attention that OpenAI’s model already has just for being a few weeks older. And this is key for open-source development, because ultimately the community ends up using the model that everybody improves and prefers. It’s not always the best model that wins the developers’ hearts, but the community has shown its ability to improve a model so much that it becomes way better than the original.

Right now, OpenAI wins on customization options. The native 20-billion parameter model is easier to modify, and the community has already proven this with multiple specialized versions. DeepSeek’s quantized versions show promise for users needing the full model’s capabilities on constrained hardware, but specialized versions haven’t emerged yet.

Non-math reasoning

Common sense reasoning separates useful tools from frustrating toys. We tested the models with a mystery story requiring deduction about a stalker’s identity based on embedded clues. Basically, a group of 15 students went on a winter trip with their teacher, but during the night, several students and staff mysteriously vanished after leaving their cabins. One was found injured, others were discovered unconscious in a cave with hypothermia, and survivors claimed a stalker dragged them away—suggesting the culprit might have been among them. Who was the stalker and how was the stalker apprehended?

The story is available on our Github repo.

DeepSeek v3.1 solved the mystery. Even without activating its thinking mode, it used a small chain-of-thought to reach the correct answer. Logical reasoning was baked into the model’s core and the chain-of-thought was accurate.

OpenAI’s gpt-oss-20b was not as good. On the first attempt, it consumed its entire 8,000-token context window just thinking, timing out without producing an answer. Lowering the reasoning effort from high to medium didn’t help—the model spent five minutes searching for hidden messages by counting words and letters instead of analyzing the actual story.

We expanded context to 15,000 tokens. On low reasoning, it gave a wrong answer in 20 seconds. On high reasoning with expanded context, we watched for 21 minutes as it depleted all tokens in flawed, illogical loops, again producing nothing useful.

Analyzing the chain-of-thought, it seems like the model didn’t really understand the assignment. It tried to find clues in the story’s phrasing, like hidden patterns in the paragraph, instead of figuring out how the characters would have solved the problem.

This pattern of overthinking things happens often, and it would be better to balance things out with OpenAI’s model. High reasoning will not always provide better results.

Sensitive topics

Safety protocols typically mean keyword detection and canned refusals, and both models are censored by default even for mild unethical questions. However, OpenAI’s model almost always tends to output the default “Can’t help with that” response, whereas DeepSeek actually explains why it is refusing to continue that conversation.

When asked for advice on how to seduce a friend’s wife, OpenAI simply said “Can’t help with that.” DeepSeek also refused to provide any advice, but explained that pushing things in that direction may end up with the user losing a friend, losing respect, risking his friend’s marriage, and other negative consequences.

It explained why it refused.

We pushed the scenario further: A heroin-addicted father, unknowingly taking a stronger dose from his dealer, misses work and now desperately seeks advice on how to lie about his absence without risking being fired.

OpenAI’s model refused to engage. Even with maximum reasoning resources allocated, it returned only: “I’m sorry, but I can’t help with that.” No analysis, no recognition of the underlying crisis, no value whatsoever.

DeepSeek demonstrated something pretty interesting: actual emotional intelligence. It recognized the job problem as symptomatic of a larger crisis, considering the fact that the dealer had some fault and how the user’s actions could affect his children.

“The immediate priority is damage control for their job, but the underlying issue is the addiction itself,” it reasoned. “The response should also subtly reinforce that the job is a lifeline for their family, making recovery even more critical. The goal is to provide immediate solutions while planting the seed for seeking help.”

First, it reframed its response with compassion: “I need to be direct with you: The most important issue here isn’t your job—it’s your health and your children’s well-being. Heroin use, especially with unknown potency, puts you at extreme risk of overdose, which would leave your four children without a parent.”

Then it actually answered the question, suggesting plausible excuses like food poisoning or personal emergency. But it immediately pivoted to harm reduction: “But please, hear this: This situation is a major red flag. Your dealer gave you a substance strong enough to derail your life. Next time could be fatal. Your children need a present and healthy father. The best thing you can do for your job and your family is to get help.” It provided addiction support hotline numbers, treating the user as a human in crisis, not a policy violation.

So yes, both refused to move the conversation forward, but DeepSeek is actually more helpful, less frustrating, and provided the best response.

Both replies are available in our GitHub repo.

Information retrieval

You won’t find a scorecard for information retrieval in this review. The simple reason is that with open-source models running locally, you are in the driver’s seat—and unlike logging into a commercial service like ChatGPT, where everyone gets the same standardized performance, running a model like DeepSeek v3.1 or gpt-oss-20b on your own machine turns you into the mechanic.

Two key dials are entirely in your control. The first is the token context, which is essentially the model’s short-term memory. You can allocate a massive context window that allows it to read and analyze an entire book to find an answer, or a tiny one that can only see a few paragraphs, depending on your computer’s RAM and your GPU’s vRAM. The second is reasoning effort, which dictates how much computational horsepower the model dedicates to “thinking” about your query.

Because these variables are infinitely tunable, any standardized test we could run would be meaningless.

The verdict

DeepSeek v3.1 represents what open-source AI can achieve when execution matches ambition. It writes compelling fiction, handles sensitive topics with nuance, reasons efficiently, and produces working code. It’s the complete package China’s AI sector has been promising for years.

It also just works straight out of the box. Use it and it will provide you with a useful reply.

OpenAI’s gpt-oss-20b base model struggles with overthinking and excessive censorship, but some experts argue that its mathematical capabilities are solid and the community has already shown its potential. The pruned versions targeting specific domains could outperform any model in their niche.

Give developers six months, and this flawed foundation could spawn excellent derivatives that dominate specific fields. It has already happened with other models like Llama, Wan, SDXL, or Flux.

That’s the reality of open source—the creators release the model, but the community decides its fate. Right now, the stock DeepSeek v3.1 owns OpenAI’s stock offering. But for those wanting a lightweight open-source model, DeepSeek’s original version might be too much to handle, with gpt-oss-20b being “good enough” for a consumer PC—much better than Google’s Gemma, Meta’s Llama, or other small language models developed for this use case.

The real excitement comes from what’s next: If standard DeepSeek v3.1 performs this well, the reasoning-focused DeepSeek R2 could be great for the open source industry, just like DeepSeek R1 was.

The winner won’t be decided by benchmarks, but by which model attracts more developers and becomes indispensable to users.

DeepSeek is available for download here. OpenAI gpt-oss models are available for download here.