The Open Source Earthquake: DeepSeek V4 and What It Means for AI Pricing
Published 25 April 2026
Executive Summary
DeepSeek released preview versions of its V4 model family on 24 April 2026, one day after OpenAI shipped GPT-5.5 and in the same week Claude Opus 4.7 arrived. The timing was not coincidental. V4-Pro, the flagship, carries 1.6 trillion total parameters with 49 billion active per token, scores 80.6% on SWE-bench Verified against Claude Opus 4.6’s 80.8%, and costs $3.48 per million output tokens against Claude Opus 4.7’s $25. That is roughly a sevenfold price gap at near-identical coding benchmark performance. V4-Flash, the efficient sibling, runs at $0.28 per million output tokens and is cheaper than any frontier model currently available. Both are released under an MIT licence with open weights on Hugging Face, meaning any organisation can run, fine-tune, and deploy them without restriction.
The headline claim is that open source has effectively matched closed source at the frontier of software engineering for the first time. That claim is substantially, though not unconditionally, true. DeepSeek leads on LiveCodeBench, Codeforces competitive programming, and the Apex Shortlist Pass benchmark. Claude Opus 4.7 leads on SWE-bench Pro, Humanity’s Last Exam, and the most complex mathematical reasoning benchmarks. GPT-5.5 leads on agentic terminal execution and browsing. Gemini 3.1 Pro leads on factual world knowledge retrieval. No single model dominates every category, but the cost differential is large enough that the burden of proof has shifted. If you are currently routing high-volume inference through Claude or GPT at premium pricing for workloads that are primarily code generation or structured reasoning, the numbers now favour doing the infrastructure work to change that.
There are real constraints to navigate. Running V4-Pro at full quality requires, at minimum, an eight-GPU H200 cluster, which is data centre territory. V4-Flash, the more deployable of the two, fits on two H100 80GB cards in its native mixed precision and generates output at approximately 83 tokens per second through the hosted API. For teams not yet at the volume where self-hosting pays off, DeepSeek’s own hosted API at current pricing is dramatically cheaper than the closed lab alternatives. The data sovereignty and geopolitical dimensions require honest assessment: DeepSeek’s hosted API routes through Chinese infrastructure, the model was trained on Huawei Ascend chips, and Anthropic’s own congressional filing alleged industrial-scale distillation attacks from Chinese labs. Self-hosting on your own infrastructure neutralises the data routing concern but does not answer the provenance question.
The strategic conclusion for enterprise technology leaders is that this is not a binary choice between DeepSeek and Anthropic. It is a portfolio architecture question. The optimal approach for most teams combines open weight inference for high-volume commodity tasks with closed lab models for workloads requiring demonstrably superior knowledge retrieval, safety guarantees, or regulatory compliance. The organisations that move fastest to build that routing layer will extract the largest cost advantage from this week’s releases.
1. Context: What Just Happened and Why It Matters
DeepSeek first disrupted the AI market in December 2024 with V3, which demonstrated that a Chinese lab operating under US semiconductor export controls could compete with OpenAI and Anthropic at a fraction of the training cost. Its R1 reasoning model followed in early 2025, briefly crashing Nvidia’s stock when markets recognised that the assumption linking raw compute spending to model quality was not as solid as the hyperscalers had implied. V4 arrives in a substantially more crowded environment than R1 did. Chinese competitors including Alibaba’s Qwen series, Zhipu AI’s GLM line, MiniMax, and Moonshot AI’s Kimi models have all released strong open-weight models in 2026, and DeepSeek now positions these as direct competitors rather than framing its rivalry exclusively against American labs. That competitive framing is itself a signal of how much the Chinese open-source stack has matured.
The timing of the V4 release, one day after the White House OSTP accused Chinese entities of running industrial-scale campaigns to distil US frontier AI models through tens of thousands of proxy accounts, and one day after OpenAI shipped GPT-5.5, was almost certainly deliberate. DeepSeek needed a launch window where the economics of open-weight frontier AI would dominate the news cycle. The strategy worked. Ivan Su, senior equity analyst at Morningstar, told CNBC that V4 probably will not create the same market shock as R1 because investors have already priced in the reality that Chinese AI is competitive and cheaper, but that the competitive framing against domestic Chinese labs is itself a new dynamic that did not exist with R1.
2. Architecture and Technical Specifications
V4 is a two-model family, both using Mixture of Experts architecture. A MoE model maintains a large pool of specialised sub-networks, known as experts, but routes each token through only a small active subset rather than running the full parameter count on every forward pass. This is what makes the economics of V4-Pro tractable despite its scale.
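As a toy illustration of that routing step, the sketch below runs an input through only the top-k experts of a larger pool. The expert count, gating scheme, and dimensions are arbitrary stand-ins for exposition, not DeepSeek's architecture.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route input x through the top-k experts by gate score.

    x: (d,) input vector; gate_w: (n_experts, d) router weights;
    experts: list of callables, one per expert sub-network.
    Only k experts run per token, but all must stay resident in memory
    because the router may select any of them on the next token.
    """
    scores = gate_w @ x                      # affinity of x to each expert
    top = np.argsort(scores)[-k:]            # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # softmax over the selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 8 experts, each a simple linear map; only 2 run per token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, m=m: m @ x for m in expert_mats]
y = moe_forward(rng.normal(size=d), gate_w, experts)
```

The per-token compute scales with k, not with the pool size, which is the whole economic trick: capacity grows with total parameters while inference cost grows with active parameters.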
V4-Pro carries 1.6 trillion total parameters with 49 billion active per token. It was pre-trained on 33 trillion tokens and shipped in a native FP4 plus FP8 mixed precision format, with MoE expert weights in FP4 and most other parameters in FP8. Its stored weight size on Hugging Face is approximately 862GB in this mixed precision format, not the 3.2 terabyte figure that full BF16 precision would imply. V4-Flash carries 284 billion total parameters with 13 billion active, trained on 32 trillion tokens, and weighs approximately 158GB in the same mixed precision format. Both models support a one million token context window with a maximum output of 384,000 tokens. Both expose thinking and non-thinking modes, three reasoning effort levels (base, high, and max), tool calls, JSON output, and chat prefix completion. The API is OpenAI ChatCompletions compatible and also accepts the Anthropic API format, which significantly lowers the integration effort for teams already running Claude.
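For teams evaluating the API surface, a ChatCompletions-style request would look roughly like the sketch below. The model identifier and the `reasoning_effort` field name are assumptions pending DeepSeek's own API reference; treat them as placeholders to confirm.

```python
import json

# Sketch of a request to the ChatCompletions-compatible endpoint.
# Model name and the "reasoning_effort" field are placeholder guesses --
# confirm the exact identifiers against DeepSeek's API documentation.
payload = {
    "model": "deepseek-v4-flash",
    "messages": [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a function that reverses a linked list."},
    ],
    "reasoning_effort": "high",                  # article describes base / high / max
    "response_format": {"type": "json_object"},  # structured JSON output mode
    "max_tokens": 4096,
}
body = json.dumps(payload)
# POST `body` to the /v1/chat/completions route with an Authorization header.
# Because the format is OpenAI-compatible, any OpenAI SDK client pointed at
# DeepSeek's base URL sends this same shape.
```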
The architectural innovation that makes the one million token context window practical rather than merely theoretical is V4’s hybrid attention mechanism. Standard transformer attention scales quadratically with sequence length, making one million token contexts prohibitively expensive to serve at production volumes. V4 replaces this with a combination of Compressed Sparse Attention and Heavily Compressed Attention. CSA maintains a compressed KV cache plus a sparse top-k selector. HCA folds many more tokens into a single cache entry. Interleaving the two mechanisms produces the following result at one million token context length: V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache that V3.2 required for the same context. V4-Flash drops further to 10% of FLOPs and 7% of the KV cache. This is not a minor efficiency improvement. It is the difference between one million token context being a data centre problem and being a team-scale deployment problem.
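To make those ratios concrete, the arithmetic below converts them into absolute cache sizes under an assumed baseline of 160KB of KV cache per token for a V3.2-class model. That baseline is an illustrative figure, not one from the technical report; only the 10% and 7% ratios come from the document.

```python
def kv_cache_gb(context_tokens, kv_bytes_per_token, compression=1.0):
    """KV cache footprint in GB for one sequence at a given compression ratio."""
    return context_tokens * kv_bytes_per_token * compression / 1e9

# Assumed baseline: 160 KB of KV cache per token (illustrative only).
baseline = kv_cache_gb(1_000_000, 160_000)        # uncompressed V3.2-class cache
pro = kv_cache_gb(1_000_000, 160_000, 0.10)       # V4-Pro: 10% of the V3.2 cache
flash = kv_cache_gb(1_000_000, 160_000, 0.07)     # V4-Flash: 7%
print(f"baseline {baseline:.0f} GB -> Pro {pro:.0f} GB, Flash {flash:.1f} GB")
```

Under this assumption a one million token sequence goes from a cache that monopolises multiple GPUs to one that fits alongside the weights, which is the substance of the "team-scale deployment" claim.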
The post-training pipeline also diverges meaningfully from V3. Rather than a single large generalised reinforcement learning stage, V4 trains separate specialist experts for each domain including mathematics, coding, agentic tasks, and instruction following, each going through supervised fine-tuning followed by Group Relative Policy Optimisation with domain-specific reward models. These specialists are then unified into the final model via on-policy distillation. The result is a model with sharply defined capability profiles per domain rather than a uniform capability curve, which is why the benchmark story is more nuanced than any single headline number captures.
V4 was trained on Huawei Ascend 950PR chips, a deliberate strategic choice to reduce dependency on Nvidia hardware subject to US export controls. Reuters confirmed this on 4 April 2026. Huawei simultaneously announced that its Ascend Supernode platform provides full support for V4, establishing what Counterpoint Research describes as a meaningful step toward Chinese AI sovereignty across the full stack from model weights to inference silicon. The inference weights released on Hugging Face run on standard Nvidia H100 and H200 hardware, so the training hardware choice does not affect developers outside China who want to self-host. But the geopolitical implication is significant: US semiconductor export controls may have accelerated rather than impeded Chinese AI efficiency research, because operating under constrained compute forced architectural innovations that American labs with abundant Nvidia access had less incentive to pursue.
3. Benchmark Performance: The Honest Picture
Benchmarks from the DeepSeek technical report published on Hugging Face on 24 April 2026 place V4-Pro at the following positions against Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
| Benchmark | V4-Pro | Claude Opus 4.6/4.7 | GPT-5.4/5.5 | Gemini 3.1 Pro | Winner |
|---|---|---|---|---|---|
| LiveCodeBench | 93.5% | 88.8% | — | 91.7% | V4-Pro |
| Codeforces rating | 3206 | — | 3168 | 3052 | V4-Pro |
| Apex Shortlist Pass | 90.2% | 85.9% | 78.1% | 89.1% | V4-Pro |
| SWE-bench Verified | 80.6% | 80.8% | — | 80.6% | Tie |
| SWE-bench Pro | 55.4% | 64.3% | 58.6% | — | Claude |
| Terminal-Bench 2.0 | 67.9% | 65.4% / 69.4% | 82.7% | 68.5% | GPT-5.5 |
| BrowseComp | 83.4% | 79.3% | 84.4% | — | GPT-5.5 |
| GPQA Diamond | 90.1% | 94.2% | 93.6% | — | Claude |
| HMMT 2026 (maths) | 95.2% | 96.2% | 97.7% | — | GPT-5.4 |
| HLE (no tools) | 37.7% | 46.9% | 41.4% | 44.4% | Claude |
| SimpleQA-Verified | 57.9% | — | — | 75.6% | Gemini |
| IMOAnswerBench | 89.8% | 75.3% | 91.4% | 81.0% | GPT-5.4 |
The pattern that emerges is consistent and worth internalising before making any procurement decision. V4-Pro dominates on competitive coding, ties or comes within rounding error on practical software engineering (SWE-bench Verified), and trails the closed labs meaningfully on agentic terminal execution, expert-level cross-domain reasoning, factual world knowledge retrieval, and the hardest mathematical competition benchmarks.
The V4-Flash performance profile is worth noting separately because it matters for deployment decisions. Flash scores 79.0% on SWE-bench Verified against Pro’s 80.6%, and 91.6% on LiveCodeBench against Pro’s 93.5%. On most developer coding tasks the gap between Flash and Pro is functionally negligible, and Flash at max reasoning effort closes much of the remaining distance. Flash also generates output at approximately 83.8 tokens per second through the hosted API, faster than most comparable open-weight alternatives.
The independent benchmark comparison service BenchLM places Claude Opus 4.6 and DeepSeek V4-Pro Max in an overall tie on provisional scores, with the split falling along predictable lines: Claude leads on knowledge tasks averaging 76.2 against V4-Pro’s 66.1, while V4-Pro leads on coding averaging 75.9 against Claude’s 64.4. The practical choice depends on which category dominates your workload. DeepSeek itself acknowledges the knowledge and world model gaps directly in the V4 technical report, describing a developmental trajectory that trails frontier closed-source models by approximately three to six months on those specific dimensions.
4. Pricing: The Number That Changes Everything
The pricing comparison is stark. The table below uses standard non-cached rates as of 24 April 2026, with a simple 1M input plus 1M output token blended comparison as the reference point.
| Model | Input ($/M) | Output ($/M) | Blended 1M+1M | Multiple vs V4-Pro |
|---|---|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 | $0.42 | 0.08x |
| DeepSeek V4-Pro | $1.74 | $3.48 | $5.22 | 1x |
| Gemini 3.1 Flash | ~$0.35 | ~$1.05 | ~$1.40 | 0.27x |
| GPT-5.4 Mini | ~$0.40 | ~$1.60 | ~$2.00 | 0.38x |
| Claude Haiku 4.5 | ~$0.80 | ~$4.00 | ~$4.80 | 0.92x |
| Claude Sonnet 4.6 | ~$3.00 | ~$15.00 | ~$18.00 | 3.4x |
| GPT-5.4 | ~$5.00 | ~$20.00 | ~$25.00 | 4.8x |
| Claude Opus 4.7 | $5.00 | $25.00 | $30.00 | 5.7x |
| GPT-5.5 | $5.00 | $30.00 | $35.00 | 6.7x |
With cached input at $0.145 per million, the blended cost for V4-Pro drops further to $3.63, which VentureBeat describes as roughly one-eighth the cost of Claude Opus 4.7 at that mix. V4-Flash with caching sits at $0.308 blended, placing it at nearly 1/100th the cost of GPT-5.5 and Claude Opus 4.7, though with correspondingly lower capability on the hardest tasks. For high-volume pipelines where Flash’s performance is sufficient, the cost reduction against premium Claude or GPT tiers can approach 98%.
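The table's blended figures and multiples reduce to a few lines of arithmetic, which is worth keeping around so you can plug in your own traffic mix rather than the 1M+1M reference point:

```python
# $/M token rates from the comparison table; blended = 1M input + 1M output.
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro":   (1.74, 3.48),
    "claude-opus-4.7":   (5.00, 25.00),
    "gpt-5.5":           (5.00, 30.00),
}

def blended(model, m_in=1.0, m_out=1.0):
    """Dollar cost for m_in million input and m_out million output tokens."""
    p_in, p_out = PRICES[model]
    return m_in * p_in + m_out * p_out

ref = blended("deepseek-v4-pro")  # the $5.22 reference point
for model in PRICES:
    cost = blended(model)
    print(f"{model:18s} ${cost:6.2f}  {cost / ref:4.2f}x vs V4-Pro")
```

Real workloads are rarely 1:1 input to output; a retrieval-heavy pipeline with 10:1 input skew shifts the multiples further in DeepSeek's favour because its input pricing is proportionally even cheaper.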
Fortune reported that DeepSeek expects V4-Pro prices to fall further through 2026 as Huawei scales up production of its Ascend 950 processor. The absence of rate limits on self-hosted deployments is equally material. Production workloads built on the major closed APIs know what it means to hit throughput ceilings at scale, and that constraint disappears entirely when you run the model on your own infrastructure.
5. Hardware Requirements and Deployment Guide
This section covers the actual infrastructure required to run V4 yourself, organised by deployment tier. All numbers reflect V4’s native FP4 plus FP8 mixed precision format, which is what ships on Hugging Face and what is optimised for H100 and H200 inference.
5.1 Understanding the Memory Mathematics
The key planning insight is that all expert weights must remain resident in GPU memory even though only a fraction activate per token. The router selects different experts on each forward pass, so lazy loading produces unacceptable latency spikes. V4-Pro’s 862GB weight file means you need at least 862GB of aggregate GPU VRAM plus additional headroom for KV cache, activation buffers, and working memory. V4-Flash at 158GB is far more tractable. The KV cache requirement scales with context length and batch size and can dominate at the full one million token end of the window. For practical production serving at 64,000 to 256,000 token context lengths, which covers the majority of real-world use cases, the KV cache overhead is manageable on a well-configured cluster.
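A rough capacity-planning helper follows, under two simplifying assumptions that are mine rather than the document's: a fixed overhead budget for KV cache and activations, and a power-of-two card count because that is what tensor parallelism typically wants.

```python
import math

def min_gpus(weight_gb, gpu_gb, overhead_gb=60.0):
    """Smallest power-of-two GPU count whose aggregate VRAM holds the
    weights plus a KV-cache/activation budget. The 60GB default overhead
    is a planning assumption, not a measured figure."""
    raw = math.ceil((weight_gb + overhead_gb) / gpu_gb)
    return 2 ** math.ceil(math.log2(raw))

print(min_gpus(862, 141))                  # V4-Pro on H200 141GB  -> 8
print(min_gpus(862, 80))                   # V4-Pro on H100 80GB   -> 16
print(min_gpus(158, 80, overhead_gb=2.0))  # V4-Flash, near-zero headroom -> 2
```

The near-zero-overhead Flash case illustrates why a two-H100 deployment works but leaves almost nothing for the KV cache, matching the "practical minimum" caveat in the next subsection.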
5.2 V4-Pro: Enterprise Cluster Deployment
The minimum production configuration for V4-Pro in its native mixed precision is eight Nvidia H200 141GB SXM cards in NVLink configuration, providing 1,128GB aggregate VRAM with approximately 266GB remaining for KV cache after the 862GB of weights load. This handles production serving at moderate context lengths and small batch sizes. For comfortable production throughput with larger batches or longer contexts, sixteen H100 80GB cards across two nodes (1,280GB aggregate) running vLLM with tensor parallelism of 8 and pipeline parallelism of 2 provide adequate headroom. BF16 precision roughly quadruples the stored weight size to around 3.2 terabytes, pushing the card count well past forty H100 80GB GPUs; for most teams the native mixed precision is the right default, as the quality difference versus BF16 is negligible for production inference tasks. A100 80GB cards are viable but suboptimal because they lack the Hopper Transformer Engine required for hardware-accelerated FP8, meaning you need more cards and achieve lower throughput.
For cloud deployment, AWS P5e instances provide eight H200 SXM cards, matching the minimum configuration. Azure ND H200 v5 series and CoreWeave H200 clusters are the other primary options. At reserved pricing with sustained utilisation, the raw break-even against V4-Pro's hosted API arrives at roughly 10 billion output tokens per month (a ~$35,000 monthly reservation against $3.48 per million output tokens), though operational overhead and imperfect utilisation push the practical threshold several times higher.
5.3 V4-Flash: Team Scale Deployment
V4-Flash’s 158GB weight file is far more accessible. It does not quite fit on a single H200 141GB card at native precision, but a single H200 handles it at INT4 quantisation, and two H100 80GB cards hold the native mixed-precision weights with almost no KV cache headroom. A two-H100 configuration is therefore the practical minimum for production serving, with two H200s or four H100s the comfortable choice; a single H100 80GB runs Flash only at aggressive quantisation with a measurable but small quality reduction. Current GPU cloud pricing for a two to four H100 Flash configuration runs approximately $4,000 to $8,000 per month at reserved capacity. The raw break-even against Flash's already-inexpensive hosted API sits near 20 billion output tokens per month ($6,000 of reserved capacity against $0.28 per million output tokens), with the practical threshold several times higher once engineering time and sub-100% utilisation are counted.
The recommended serving stack is vLLM, which supports V4’s MoE expert parallelism, the hybrid CSA plus HCA attention architecture, and efficient KV cache management for long contexts. SGLang is a strong alternative that exposes cleaner function-calling and JSON-mode primitives. Both expose OpenAI-compatible API endpoints, which means migration from the hosted DeepSeek API to a self-hosted deployment is a single line change in your client configuration.
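Concretely, the single-line change looks like this with any OpenAI-compatible SDK. The internal host name is a placeholder, and the hosted endpoint URL should be confirmed against DeepSeek's documentation.

```python
def client_config(self_hosted):
    """Endpoint settings for an OpenAI-compatible SDK client. Moving from
    the hosted DeepSeek API to a self-hosted vLLM/SGLang server is only a
    base_url change; the request format is identical. The internal host
    name and hosted URL here are placeholders, not confirmed values."""
    if self_hosted:
        return {"base_url": "http://inference.internal:8000/v1", "api_key": "local"}
    return {"base_url": "https://api.deepseek.com/v1", "api_key": "sk-..."}

# e.g. OpenAI(**client_config(self_hosted=True)) with the openai SDK.
hosted = client_config(False)
local = client_config(True)
```

Because both vLLM and SGLang speak the same wire format, the rest of your application code, retries, streaming handlers, and prompt templates included, does not change.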
5.4 Local Development and Personal Use
Apple Silicon is a viable path for V4-Flash development and experimentation but not production deployment. A machine with 128GB unified memory, specifically an M3 Max or M4 Max at 128GB, can run Flash at 4-bit quantisation via llama.cpp or Ollama once the GGUF community quantisations appear, which typically takes days to weeks after a major open-weight release. Expected throughput on Apple Silicon at this configuration is roughly 15 to 22 tokens per second, adequate for personal development but not production. An M4 Ultra at 192GB or 512GB handles it more comfortably. The one million token context window is not practically accessible on Apple Silicon at any current configuration due to KV cache memory requirements.
For the fastest and cheapest path to evaluating V4 without building your own infrastructure, renting a single H100 on RunPod or Lambda Labs for a few hours to run Flash at INT4 and measure throughput against your actual prompts costs $10 to $30 and produces more useful data than a week of planning against published benchmarks.
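A minimal harness for that measurement: time each generation and average tokens per second over your own prompts. The stand-in generator below exists only so the harness runs without a GPU; replace it with your actual client call.

```python
import time

def tokens_per_second(token_count, started, finished):
    """Decode throughput for a single generation."""
    return token_count / (finished - started)

def benchmark(generate, prompts):
    """Average tokens/sec of `generate(prompt) -> completion token count`
    over your own prompts; `generate` wraps whatever client you are
    evaluating (Ollama, vLLM, or the hosted API)."""
    rates = []
    for prompt in prompts:
        t0 = time.perf_counter()
        n_tokens = generate(prompt)
        rates.append(tokens_per_second(n_tokens, t0, time.perf_counter()))
    return sum(rates) / len(rates)

# Stand-in generator so the harness runs without a GPU; swap in a real
# client call that returns the completion's token count.
fake_generate = lambda prompt: len(prompt.split()) * 10
mean_rate = benchmark(fake_generate, ["summarise this design doc", "refactor the parser"])
```

Measure with your real prompt lengths and concurrency, not toy inputs: long-context prefill and batch effects dominate production throughput and do not show up in single-prompt benchmarks.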
5.5 When Self-Hosting Beats the Hosted API
For V4-Flash, the raw arithmetic break-even for self-hosting on reserved GPU cloud infrastructure sits near 20 billion output tokens per month; for V4-Pro, nearer 10 billion, because the hosted per-token price is higher. Practical thresholds are several times higher once engineering time, redundancy, and imperfect utilisation are priced in. Below those thresholds, the hosted DeepSeek API is both cheaper and operationally simpler. Above them, and especially for teams with data sovereignty requirements that preclude routing through Chinese infrastructure, the self-hosting case is compelling.
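The raw break-even reduces to a single division, which makes it easy to rerun with your own numbers; the reservation costs below are illustrative inputs, not quotes.

```python
def breakeven_tokens_per_month(cluster_usd_month, hosted_usd_per_m_out):
    """Monthly output-token volume above which a reserved cluster beats the
    hosted API, counting output tokens only and ignoring ops overhead."""
    return cluster_usd_month / hosted_usd_per_m_out * 1e6

# Illustrative reservation costs -- substitute your own quotes.
flash = breakeven_tokens_per_month(6_000, 0.28)   # 2-4 H100 cards vs Flash API
pro = breakeven_tokens_per_month(35_000, 3.48)    # 16 H100 cards vs Pro API
print(f"Flash: {flash / 1e9:.0f}B tokens/month, Pro: {pro / 1e9:.0f}B tokens/month")
```

A realistic model would add input-token spend, cache hit rates, and an engineering-cost line; each of those raises the volume at which self-hosting actually wins.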
6. The Geopolitical and Compliance Dimension
This section is not optional reading for enterprise technology leaders in regulated industries.
DeepSeek’s hosted API routes through infrastructure based in China. The US House Select Committee on the Chinese Communist Party concluded in its December 2025 report that DeepSeek funnels American user data to the People’s Republic of China through backend infrastructure linked to a government-designated Chinese military company. Anthropic’s February 2026 congressional filing alleged that DeepSeek, Moonshot AI, and MiniMax collectively operated approximately 24,000 fraudulent accounts to conduct more than sixteen million interactions with Claude for the purpose of harvesting model outputs at industrial scale. China’s foreign ministry has described these accusations as groundless, and neither the distillation claims nor the data routing assertions have been independently and conclusively verified in public. However, for organisations with data residency requirements, financial services compliance obligations, or contractual commitments that restrict where customer data may be processed, using DeepSeek’s hosted API is a compliance question that requires a clear answer from legal and risk teams before any production deployment.
Self-hosting the MIT-licensed weights on your own cloud infrastructure is the direct answer to the data routing concern, because it places the model entirely within your controlled environment. It does not resolve the provenance question of whether the model’s capabilities derive partially from distilled outputs of closed-source American models. That question is unresolved and may remain so. Organisations for which provenance matters should track its development closely.
The Huawei Ascend training story carries a separate strategic implication. V4 was trained on Huawei’s Ascend 950PR chips, not Nvidia hardware. As Counterpoint Research analyst Wei Sun noted, V4’s ability to run natively on domestic Chinese chips could help Beijing achieve meaningful AI sovereignty and further reduce reliance on Nvidia for both training and inference at scale. The inverse implication for Western enterprise is that the assumption underpinning US export control policy, that semiconductor restrictions can meaningfully constrain Chinese AI capability at the frontier, is now visibly weakened.
For South African enterprises, the compliance picture is different from both the US and UK contexts. There is no direct regulatory prohibition on using DeepSeek’s API as of April 2026. The relevant considerations are POPIA data residency obligations for personal information, contractual data processing requirements with international counterparties, and internal risk appetite for routing commercially sensitive data through infrastructure in a geopolitically contested jurisdiction. Self-hosting or using a managed inference provider that runs the weights on locally controlled infrastructure addresses the first two concerns without requiring any position on the third.
7. What This Means for Enterprise AI Strategy
The closed lab value proposition has rested on two pillars: capability and trust. The capability pillar is now under serious, documented pressure for software engineering workloads, and the competitive data from LiveCodeBench and SWE-bench Verified is real rather than speculative. The trust pillar, covering data sovereignty, compliance audit, and alignment guarantees, is more complicated than it first appears for open weights, because the MIT licence gives you control over the data path but does not guarantee the same safety and alignment investment that Anthropic has made in Claude. The two concerns point in opposite directions depending on which trust dimension is primary for your organisation.
The honest enterprise calculus is a portfolio question rather than a binary choice. A well-designed routing layer that sends 60 to 70 percent of traffic to V4-Flash for high-volume commodity inference, escalates complex coding and agentic tasks to V4-Pro or Claude Opus 4.7 depending on the benchmark profile that matches the specific workload, and reserves GPT-5.5 for agentic desktop and knowledge-intensive tasks can reduce AI infrastructure costs by 40 to 60 percent against a single-model Claude approach while maintaining or improving output quality across most task types.
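A first cut at such a routing layer can be as simple as a lookup keyed on task type, with data-residency-constrained traffic pinned to self-hosted weights. The taxonomy and model names below are illustrative; a production router would classify requests with heuristics or a cheap model tuned on real traffic, and log every decision for cost attribution.

```python
# Model names, task taxonomy, and tiering are illustrative assumptions,
# not a prescribed configuration.
ROUTES = {
    "commodity": "deepseek-v4-flash",      # bulk generation, extraction, RAG
    "complex-coding": "deepseek-v4-pro",   # multi-file refactors, agent loops
    "hardest-swe": "claude-opus-4.7",      # SWE-bench-Pro-class reliability work
    "knowledge": "gpt-5.5",                # knowledge-heavy, agentic desktop
}

def route(task_type, requires_data_residency=False):
    """Pick a model per request. Residency-constrained traffic is pinned
    to self-hosted open weights regardless of task type."""
    if requires_data_residency:
        return "self-hosted/deepseek-v4-flash"
    return ROUTES.get(task_type, "deepseek-v4-flash")  # default to the cheap tier
```

The default-to-cheap fallback is the important design choice: unclassified traffic lands on the lowest-cost tier, and only demonstrated failure escalates it.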
Claude Sonnet 4.6 remains a strong choice for tasks requiring tool use reliability, safety properties, and the reasoning style that Anthropic has invested in. Claude Opus 4.7 leads on SWE-bench Pro at 64.3% against V4-Pro’s 55.4%, which matters for the most complex software engineering tasks and for agentic workloads where production reliability history counts. For high-volume code generation, document processing, and retrieval augmented generation pipelines at scale, V4-Flash at $0.28 per million output tokens changes the economics of what is worth automating. Tasks that looked too expensive on Claude Opus 4.7 or GPT-5.5 become viable at V4-Flash pricing, which changes the scope of what it is rational to build.
8. The Broader Pattern
What DeepSeek has done with V4 is less about one model release and more about demonstrating that architectural innovation can substitute for raw compute at the frontier, and that this substitution is now happening outside the Nvidia-dependent, US-lab-centred ecosystem that defined AI development through 2024. The 27% FLOPs reduction versus V3.2 at one million token context is not a benchmark curiosity. It is evidence of a genuinely independent research programme that no longer mirrors the methodological choices of American frontier laboratories.
The competitive programming benchmarks matter beyond their immediate domain. LiveCodeBench and Codeforces measure structured reasoning under constraint at the frontier of what current AI systems can do with code. Taking first place ahead of every closed-source model is a meaningful signal about generalised reasoning quality, not just narrow coding task performance. It is the domain where the open-weight gap was most stubbornly persistent, and it has now closed.
The closed labs are not going away. Anthropic’s reported annual recurring revenue grew from $9 billion to $30 billion in early 2026, which reflects real enterprise contract value that does not evaporate because a cheaper alternative exists. Distribution, enterprise relationships, safety reputations, and the alignment research programme that Anthropic is conducting have genuine value that DeepSeek cannot replicate quickly. But the days when premium pricing for frontier inference could be justified across all workload categories regardless of task type are numbered. The organisations that recognise this first and build the infrastructure to act on it will extract a structural cost advantage that compounds as inference volumes grow.
Free has beaten paid before. In AI infrastructure, as in enterprise software before it, the pattern tends to hold once the quality gap closes. The gap has effectively closed for coding. The rest of the capability curve will follow.
9. Hobby Users and Retail Hardware: What Is Actually Possible
Not everyone approaching this release is running an enterprise GPU budget. A meaningful portion of the people who will experiment with V4 are developers, researchers, and technically curious individuals working with retail consumer hardware, and the picture for that audience is more nuanced than either the enthusiast press or the enterprise guides acknowledge.
The honest summary is that V4-Flash is the only realistic target for consumer hardware, and even then your experience will vary significantly depending on what you have. The full V4-Pro model at 862GB is simply not runnable on any single consumer machine without extreme quantisation that degrades quality to the point where the comparison to the hosted API becomes unfavourable.
The table below covers the realistic consumer and prosumer hardware options and what you can expect from each.
| Hardware | Unified / VRAM | Model | Precision | Est. tokens/sec | Context limit | Verdict |
|---|---|---|---|---|---|---|
| Mac Mini M4 (16GB) | 16GB | Flash (distilled) | Q4 only | 2-4 | 8K | Not worth it |
| Mac Mini M4 Pro (24GB) | 24GB | Flash (distilled) | Q4 | 4-6 | 16K | Marginal |
| Mac Mini M4 Pro (48GB) | 48GB | Flash (distilled) | Q4/Q5 | 8-12 | 32-64K | Usable |
| Mac Studio M4 Max (64GB) | 64GB | Flash (distilled) | Q5/Q6 | 12-18 | 64-128K | Good |
| Mac Studio M4 Max (128GB) | 128GB | Flash | Q4 | 15-22 | 64-128K | Very good |
| Mac Studio M4 Ultra (192GB) | 192GB | Flash | Q5/Q6 | 18-28 | 128-256K | Excellent |
| Mac Pro M4 Ultra (512GB) | 512GB | Flash | Q8 | 18-28 | 256-512K | Excellent |
| Mac Pro M4 Ultra (512GB) | 512GB | Pro (2-bit quant) | Q2 | 5-10 | 32K | Possible but degraded |
| RTX 4090 (24GB) | 24GB | Flash (distilled) | Q4 | 15-25 | 32K | Good speed, small context |
| Dual RTX 4090 (48GB) | 48GB | Flash (distilled) | Q4/Q5 | 25-40 | 64K | Strong for the price |
| RTX 5090 (32GB) | 32GB | Flash (distilled) | Q4/Q5 | 30-50 | 32-64K | Best single consumer GPU |
A few things the table does not capture. Apple Silicon’s unified memory architecture means the GPU and CPU share the same memory pool, so a model that fits entirely in unified memory avoids the PCIe bottleneck that throttles discrete GPU setups once weights spill into system RAM. For inference on Ollama or llama.cpp, this is why Apple Silicon punches above its weight relative to discrete cards of similar aggregate capacity: a 128GB Mac Studio M4 Max running the full Flash model entirely in memory will generally outperform a dual RTX 4090 setup forced to offload layers over PCIe, despite the 4090s’ higher raw memory bandwidth. Where the model does fit fully in VRAM, the discrete cards win on speed.
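A rough mental model for these throughput numbers: decode speed is bounded by how fast the active parameters can be streamed from memory for every generated token. The bandwidth figures below are illustrative assumptions, not measurements, and real-world throughput lands well below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_b, bytes_per_weight):
    """Upper bound on decode tokens/sec: each generated token streams the
    active weights through the compute units once. Ignores KV cache reads
    and kernel overhead, so real throughput is substantially lower."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: 13B active parameters at ~0.5 bytes/weight (Q4-class).
m4_max = decode_ceiling_tok_s(546, 13, 0.5)      # M4 Max-class unified memory
rtx4090 = decode_ceiling_tok_s(1008, 13, 0.5)    # RTX 4090, model fully in VRAM
pcie_spill = decode_ceiling_tok_s(32, 13, 0.5)   # weights streamed over PCIe 4.0 x16
# The collapse in the third case is why fitting the whole model in fast
# memory matters far more than the peak bandwidth of the card itself.
```

This is why the table's verdicts track memory capacity first and raw bandwidth second.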
The minimum configuration where the experience is genuinely useful rather than merely possible is a Mac Mini M4 Pro at 48GB or a Mac Studio M4 Max at 64GB running a distilled variant; the full 284-billion-parameter Flash wants 128GB or more for 4-bit weights plus KV cache. Below 48GB, context limits and quantisation requirements combine to produce a model that handles short tasks adequately but struggles with the multi-file reasoning and extended context work that makes V4-Flash interesting in the first place. The 16GB base Mac Mini will run tiny distilled variants but nothing recognisably comparable to V4-Flash’s full capability profile.
For Windows PC builders, a single RTX 5090 at 32GB is the current sweet spot for the distilled variants: faster token generation than Apple Silicon on pure throughput benchmarks, good Q4/Q5 quality at up to 64K context, and a far lower entry price than a 128GB Mac Studio, at the cost of running a distilled model rather than full Flash. The trade-offs are power consumption (around 575W under load versus 60-80W for Apple Silicon) and inference tooling on Windows that is slightly behind macOS in one-command setup simplicity, though the CUDA backends for vLLM and llama.cpp remain more mature than their Metal counterparts.
The GGUF community quantisations for V4-Flash are expected within days to weeks of this writing. Once those land in the Ollama model registry, the setup barrier drops to a single command.
10. Apple Silicon Setup Script
The following script handles the complete setup for running DeepSeek V4-Flash on a Mac Studio M4 Max (128GB) or larger, from a clean macOS installation; owners of 48GB machines should wait for the distilled GGUF variants and substitute the appropriate model tag. It installs Ollama, pulls the quantised V4-Flash model, starts the inference server, and opens a local web UI so you can interact with it immediately in a browser. It also installs the Open WebUI container via Docker for a proper chat interface, which is a much better experience than the raw Ollama CLI for day-to-day use.
Run the whole thing as a single paste into Terminal. It will ask for your password once for the Homebrew and Docker steps.
cat > install_deepseek.sh << 'EOF'
#!/bin/bash
set -e
echo "=== DeepSeek V4-Flash on Apple Silicon ==="
echo "Estimated download: 100GB+ depending on quantisation. Ensure you have space and a fast connection."
echo ""
# 1. Install Homebrew if not present
if ! command -v brew &>/dev/null; then
echo "[1/6] Installing Homebrew..."
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Add brew to path for Apple Silicon
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
else
echo "[1/6] Homebrew already installed, skipping."
fi
# 2. Install Ollama
if ! command -v ollama &>/dev/null; then
echo "[2/6] Installing Ollama..."
brew install ollama
else
echo "[2/6] Ollama already installed, skipping."
fi
# 3. Start Ollama service in background
echo "[3/6] Starting Ollama service..."
brew services start ollama
sleep 3 # Give it a moment to start
# 4. Pull DeepSeek V4-Flash
# Model tag will be confirmed once GGUF quantisations land in the Ollama registry.
# Q4_K_M is the practical ceiling on 128GB systems once KV cache is counted;
# reserve Q5_K_M and above for 192GB+ machines.
echo "[4/6] Pulling DeepSeek V4-Flash (Q4_K_M)..."
echo " This will download on the order of 100GB. Go make coffee."
echo ""
# Primary tag - update once the official Ollama registry entry is confirmed
MODEL_TAG="deepseek-v4-flash:q4_k_m"
if ollama pull "$MODEL_TAG" 2>/dev/null; then
echo " Pulled $MODEL_TAG successfully."
else
echo " Primary tag not yet in registry. Trying fallback tag..."
# Fallback: community GGUF pulled directly from Hugging Face
MODEL_TAG="hf.co/bartowski/DeepSeek-V4-Flash-GGUF:Q4_K_M"
ollama pull "$MODEL_TAG"
fi
# 5. Quick smoke test
echo "[5/6] Running smoke test..."
RESPONSE=$(ollama run "$MODEL_TAG" "Reply with exactly: DeepSeek V4-Flash is running." 2>/dev/null)
echo " Model response: $RESPONSE"
# 6. Install Docker Desktop and Open WebUI (optional but recommended)
echo "[6/6] Setting up Open WebUI chat interface..."
if ! command -v docker &>/dev/null; then
    echo " Installing Docker via Homebrew Cask..."
    brew install --cask docker
    echo " Docker installed. Please open Docker Desktop from Applications and"
    echo " wait for it to finish starting before continuing."
    echo " Press Enter when Docker Desktop is running..."
    read -r
fi
# Pull and run Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker run -d \
    --name deepseek-webui \
    --restart unless-stopped \
    -p 3000:8080 \
    --add-host=host.docker.internal:host-gateway \
    -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
    -v open-webui:/app/backend/data \
    ghcr.io/open-webui/open-webui:main
echo ""
echo "=== Setup Complete ==="
echo ""
echo " Ollama API: http://localhost:11434"
echo " Chat UI: http://localhost:3000 (Open WebUI)"
echo " Model: $MODEL_TAG"
echo ""
echo "Open http://localhost:3000 in your browser to start chatting."
echo ""
echo "To use V4-Flash from the command line:"
echo " ollama run $MODEL_TAG"
echo ""
echo "To use the OpenAI-compatible API (works with any OpenAI SDK client):"
echo " Base URL: http://localhost:11434/v1"
echo " API key: ollama (any string works)"
echo " Model: $MODEL_TAG"
echo ""
echo "Expected throughput on M4 Pro 48GB: 8-12 tokens/sec"
echo "Expected throughput on M4 Max 128GB: 15-22 tokens/sec"
echo ""
echo "Context window note: at Q5_K_M on 48GB, practical context is ~32-64K tokens."
echo "For longer contexts, switch to Q4_K_M to free VRAM for the KV cache."
EOF
chmod +x install_deepseek.sh

A few notes on what the script does and does not do. The model tag in step 4 will need verification once the official Ollama registry entry for V4-Flash is confirmed; the GGUF quantisations from Bartowski and Unsloth typically appear within days of a major open-weight release, and the script includes a Hugging Face fallback for that reason. The Open WebUI step requires Docker Desktop, which you may already have. If you only want the raw Ollama CLI without the browser UI, you can skip step 6 entirely. The script sets the container to restart on reboot, so the chat interface comes back automatically.
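The context-window note at the end of the script comes down to KV-cache arithmetic: every token held in context costs a fixed amount of memory per layer, on top of the model weights themselves. A rough sizing sketch follows; the layer count, KV-head count, and head width below are illustrative placeholders, not published V4-Flash specs, and the cache is assumed to be FP16 (2 bytes per element):

```shell
# Rough KV-cache size: 2 tensors (K and V) per layer, per token.
# layers/kv_heads/head_dim are ILLUSTRATIVE placeholders, not V4-Flash specs.
kv_cache_gib() {
    local tokens=$1 layers=48 kv_heads=8 head_dim=128 bytes=2
    # 1073741824 = 1 GiB
    echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1073741824 ))
}

for ctx in 32768 65536 131072; do
    printf '%7d tokens -> ~%d GiB of KV cache\n' "$ctx" "$(kv_cache_gib "$ctx")"
done
```

Under these placeholder dimensions, doubling the context from 64K to 128K doubles the cache footprint, which is why dropping from Q5_K_M to Q4_K_M (smaller weights, more headroom) is the lever the script suggests for longer contexts.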
To stop everything cleanly:
# Stop the chat UI
docker stop deepseek-webui && docker rm deepseek-webui
# Stop Ollama service
brew services stop ollama

To update to a newer quantisation when one appears:
ollama pull deepseek-v4-flash:q8_0 # higher quality if you have the VRAM
ollama rm deepseek-v4-flash:q5_k_m # remove old version to reclaim space

References
- DeepSeek AI, DeepSeek-V4 Technical Report, Hugging Face, 24 April 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
- DeepSeek API Documentation, DeepSeek-V4 Preview Release Notes, 24 April 2026. https://api-docs.deepseek.com/news/news260424
- Simon Willison, DeepSeek V4: Almost on the Frontier, a Fraction of the Price, 24 April 2026. https://simonwillison.net/2026/Apr/24/deepseek-v4/
- TechCrunch, DeepSeek Previews New AI Model That “Closes the Gap” with Frontier Models, 24 April 2026. https://techcrunch.com/2026/04/24/deepseek-previews-new-ai-model-that-closes-the-gap-with-frontier-models/
- VentureBeat, DeepSeek-V4 Arrives with Near State-of-the-Art Intelligence at 1/6th the Cost of Opus 4.7 and GPT-5.5, 24 April 2026. https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5
- Fortune, DeepSeek Unveils V4 Model, With Rock-Bottom Prices and Close Integration with Huawei’s Chips, 24 April 2026. https://fortune.com/2026/04/24/deepseek-v4-ai-model-price-performance-china-open-source/
- CNBC, China’s DeepSeek Releases Preview of Long-Awaited V4 Model as AI Race Intensifies, 24 April 2026. https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open-source-ai-competition-china.html
- Build Fast With AI, DeepSeek V4-Pro Review: Benchmarks, Pricing and Architecture, 24 April 2026. https://www.buildfastwithai.com/blogs/deepseek-v4-pro-review-2026
- Lushbinary, Self-Hosting DeepSeek V4: vLLM, Hardware and Deployment Guide, 24 April 2026. https://lushbinary.com/blog/deepseek-v4-self-hosting-guide-vllm-hardware-deployment/
- Lushbinary, DeepSeek V4 vs Claude Opus 4.7 vs GPT-5.5: Benchmarks and Pricing, 24 April 2026. https://lushbinary.com/blog/deepseek-v4-vs-claude-opus-4-7-vs-gpt-5-5-comparison/
- Digital Applied, DeepSeek-V4 Preview Launch: 1M Context Efficiency, 24 April 2026. https://www.digitalapplied.com/blog/deepseek-v4-preview-launch-1m-context-efficiency
- WaveSpeed AI, DeepSeek V4 GPU Requirements: VRAM and Hardware Guide, April 2026. https://wavespeed.ai/blog/posts/deepseek-v4-gpu-vram-requirements/
- Spheron Network, Deploy DeepSeek V4 on GPU Cloud: MoE Inference with vLLM and Expert Parallelism, April 2026. https://www.spheron.network/blog/deploy-deepseek-v4-gpu-cloud/
- BenchLM, Claude Opus 4.6 vs DeepSeek V4 Pro Max: AI Benchmark Comparison 2026, 24 April 2026. https://benchlm.ai/compare/claude-opus-4-6-vs-deepseek-v4-pro-max
- Knowledge Hub Media, DeepSeek V4 Changes the AI Pricing Game: Full Breakdown and Analysis, 24 April 2026. https://knowledgehubmedia.com/deepseek-v4-changes-the-ai-pricing-game-full-breakdown-and-analysis/
- Office Chai, DeepSeek Releases V4-Pro and V4-Flash: GPT-5.4 and Opus 4.6 Level Performance at Fraction of the Price, 24 April 2026. https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/
- Resultsense, DeepSeek V4 Preview Tuned for Huawei Ascend Chips, 24 April 2026. https://www.resultsense.com/news/2026-04-24-deepseek-v4-huawei-ascend-chips
- Ogun Security, DeepSeek V4 Release: China’s Sovereign AI Stack and the Strategic Fracturing of US Technology Dominance, 24 April 2026. https://www.ogunsecurity.com/post/deepseek-v4-release-china-s-sovereign-ai-stack
- Apidog, How to Run DeepSeek V4 Locally, 24 April 2026. https://apidog.com/blog/how-to-run-deepseek-v4-locally/
- Coder Sera, Run DeepSeek V4 Flash Locally: Full 2026 Setup Guide, 24 April 2026. https://ghost.codersera.com/blog/run-deepseek-v4-flash-locally-full-2026-setup-guide/