Introduction

  • TL;DR: Meta released Llama 4 Scout and Llama 4 Maverick on 2025-04-05. Scout targets ultra-long context (10M tokens) with 17B activated / 109B total params, while Maverick offers 1M tokens with 17B activated / 400B total params.
  • Both are natively multimodal (text+image inputs) and use a Mixture-of-Experts (MoE) design.
  • Benchmarks shared by Hugging Face show strong gains vs earlier Llama generations, but leaderboard integrity and “variant mismatch” issues mean you should validate on your own workloads.
  • The “Llama 4 Community License” includes practical obligations and a major threshold clause (700M MAU) you must review before production use.

In this post, we’ll focus on what’s verifiable from public artifacts (model cards, the license text, and release notes), then translate it into an engineer-friendly decision checklist.

Why it matters: Choosing an LLM is not just “best benchmark wins.” Context length, deployment cost, and licensing constraints often dominate real-world outcomes.

1) What Meta Actually Released: Scout vs Maverick

1-1. Specs from the official model card

Below is a concise comparison based on Meta’s Hugging Face model card.

| Item | Llama 4 Scout (17Bx16E) | Llama 4 Maverick (17Bx128E) |
| --- | --- | --- |
| Activated params | 17B | 17B |
| Total params | 109B | 400B |
| Modalities | Text + image in, text/code out | Text + image in, text/code out |
| Context length | 10M tokens | 1M tokens |
| Knowledge cutoff | Aug 2024 | Aug 2024 |
| Supported languages (explicit) | 12 | 12 |
| Release date | 2025-04-05 | 2025-04-05 |

Reuters also described Llama 4 as a multimodal system spanning multiple data types, reinforcing Meta’s “multimodal-first” positioning.

Why it matters: Ultra-long context (1M–10M) can simplify certain pipelines, but it changes infra economics and test strategy.

2) Key Technical Ideas: MoE + Native Multimodality + Ultra-Long Context

2-1. MoE in practice

Scout and Maverick are MoE models; the model card explicitly calls out 16 experts (Scout) and 128 experts (Maverick). Hugging Face notes Maverick alternates MoE and dense layers, applying experts in roughly half the layers.
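
To make the routing idea concrete, here is a toy sketch of a top-k routed MoE feed-forward layer. The dimensions, expert count, and routing details below are illustrative placeholders, not Llama 4's actual architecture:

```python
# Toy sketch of a top-k routed Mixture-of-Experts feed-forward layer.
# Dimensions and expert count are made up for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=128, d_ff=512, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each token per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)    # routing probabilities
        topw, topi = weights.topk(self.top_k, dim=-1)  # top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, k, None] * expert(x[mask])
        return out
```

The point of the design is that only the routed experts run per token, so activated parameters (17B) stay far below total parameters (109B/400B).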

2-2. Native multimodality

The model card describes “early fusion for native multimodality,” and positions the models for text+image understanding.
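
As a rough intuition for what "early fusion" means, the toy sketch below simply concatenates projected image-patch embeddings with text token embeddings into one sequence before a shared transformer stack would process them. All shapes and projections here are made up for illustration, not Llama 4's internals:

```python
# Toy illustration of early fusion: image patches and text tokens share one sequence.
import torch
import torch.nn as nn

d_model = 128
text_embed = nn.Embedding(32000, d_model)     # toy text vocabulary
patch_proj = nn.Linear(3 * 16 * 16, d_model)  # toy 16x16 RGB patch projector

text_ids = torch.randint(0, 32000, (1, 12))   # 12 text tokens
patches = torch.randn(1, 64, 3 * 16 * 16)     # 64 flattened image patches

fused = torch.cat([patch_proj(patches), text_embed(text_ids)], dim=1)
# fused: (1, 76, d_model) -- one joint sequence the transformer attends over.
```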

2-3. How they push context length

Hugging Face's release write-up explains that pretraining used a 256K context, while the Instruct variants extend it to 1M tokens (Maverick) and 10M tokens (Scout). It also discusses NoPE layers, chunked attention, and related design choices.

Why it matters: Long context is a capability, not a free lunch. You’ll still need regression tests for “lost-in-the-middle,” cost spikes, and failure modes at extreme sequence lengths.
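
One concrete starting point for those regression tests is a small "needle in a haystack" probe that plants a fact at different depths in a long filler context and checks retrieval. The `generate` callable and prompt format below are assumptions; wire them to your own inference stack:

```python
# Minimal "lost in the middle" probe. `generate` is a placeholder for your
# own inference call (local model, API client, etc.).
from typing import Callable

def build_prompt(needle: str, depth: float, n_filler: int = 2000) -> str:
    filler = ["The sky was clear and the meeting ran long."] * n_filler
    filler.insert(int(depth * n_filler), needle)       # plant the fact at a given depth
    context = " ".join(filler)
    return f"{context}\n\nQuestion: What is the secret code mentioned above? Answer:"

def needle_test(generate: Callable[[str], str],
                depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    needle = "The secret code is 7412."
    results = {}
    for d in depths:
        answer = generate(build_prompt(needle, d))
        results[d] = "7412" in answer                  # flags mid-context failures
    return results
```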

3) Benchmarks: Use Them, Don’t Worship Them

3-1. Public scores (selected)

Hugging Face published evaluation tables. For example, instruction-tuned Maverick reports MMLU Pro 80.5 and GPQA Diamond 69.8, with LiveCodeBench results over a specific date range.

3-2. Benchmark integrity and “variant mismatch”

There were public reports around leaderboard submissions using an “experimental chat version,” highlighting why reproducibility matters.

Practical approach:

  • Re-run with the exact public checkpoint and your prompt templates
  • Evaluate on your domain corpus and real task harness
  • Track cost/latency alongside quality (a minimal harness sketch follows this list)
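
A minimal sketch of such a harness, recording quality and latency together, might look like this. The dataset format, scoring rule, and `generate` callable are assumptions to adapt to your own setup:

```python
# Tiny eval loop that reports accuracy plus p50/p95 latency for one model setup.
import statistics
import time
from typing import Callable, Iterable, Tuple

def run_eval(generate: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> dict:
    cases = list(cases)                                # (prompt, expected answer) pairs
    latencies, correct = [], 0
    for prompt, expected in cases:
        t0 = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        correct += int(expected.lower() in output.lower())  # naive containment match
    latencies.sort()
    return {
        "accuracy": correct / len(cases),
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
    }
```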

Why it matters: Your production bottleneck is rarely “raw benchmark rank.” It’s stability, governance, and predictable infra cost.

4) Licensing: “Open Source” vs “Open Weights”

4-1. The clause you must not miss (700M MAU)

The Llama 4 Community License states that if your (and affiliates’) products exceed 700 million monthly active users, you must request a separate license from Meta and are not authorized to exercise rights until granted.

It also includes redistribution and attribution obligations, such as displaying “Built with Llama,” and naming requirements for distributed derivative models.

4-2. Why terminology gets contentious

OSI has argued that Meta’s Llama licenses (notably Llama 2’s) do not meet the Open Source Definition due to restrictions and an Acceptable Use Policy. This is useful context when you communicate internally about compliance and risk.

Why it matters: Licensing issues become outages when a product ships. Treat license review as part of your model selection gate, not an afterthought.

5) Getting Started Quickly (Transformers)

Hugging Face states Llama 4 is integrated with transformers (v4.51.0), supports TGI, and offers quantization paths (on-the-fly int4 for Scout; FP8 weights for Maverick).

pip install -U "transformers>=4.51.0" "huggingface_hub[hf_xet]"

A multimodal example using AutoProcessor and Llama4ForConditionalGeneration is also provided in the release post.
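
For reference, a condensed sketch in the spirit of that example is shown below. The repo id and image URL are illustrative placeholders; check the release post and model card for the exact, up-to-date snippet:

```python
# Condensed sketch of text+image inference with transformers >= 4.51.0.
# Repo id and image URL are placeholders, not verified values.
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/your_image.png"},  # placeholder
        {"type": "text", "text": "Describe this image in two sentences."},
    ]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```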

Why it matters: Fast PoCs are great—but ultra-long context and multimodality can amplify infra cost. Start with realistic context limits and scale up with measurement.

Conclusion

  • Llama 4 Scout and Maverick (released 2025-04-05) emphasize MoE efficiency, native multimodality, and ultra-long context (10M/1M).
  • Public benchmark tables look strong, but you should validate using the exact public checkpoint and your own task harness.
  • The Llama 4 Community License includes real operational obligations and a major 700M MAU threshold clause—review it before shipping.
  • The HF ecosystem support (Transformers v4.51.0, TGI, quantization notes) makes experimentation straightforward.

Summary

  • Scout: 10M context, 17B activated / 109B total (16E)
  • Maverick: 1M context, 17B activated / 400B total (128E)
  • License: review attribution/redistribution requirements + 700M MAU clause
  • Benchmarks: use as inputs, prioritize reproducible evaluation on your workload

References