
How Lantern Uses Adaptive Machine Learning to Outsmart Censors

The Problem

Lantern operates thousands of proxy servers across dozens of regions worldwide. When a user in Iran or China connects, we need to decide which proxy to assign them. The wrong choice means their connection gets blocked by censors. The right choice means they get through — fast and reliably.

Traditionally, Lantern used static assignment: hash the user's ID, pick a proxy, done. This worked when censors moved slowly. But modern censorship infrastructure can detect and block proxy IPs within hours. We needed a system that learns in real time which proxies work for which networks, and adapts faster than censors can block.

Enter the Multi-Armed Bandit

We chose the EXP3.S adversarial multi-armed bandit algorithm — specifically designed for environments where the "opponent" (the censor) is actively working against you.

The Metaphor

Imagine you're in a casino with K slot machines (arms). Each pull gives a random reward. You want to maximize your total reward, but you don't know which machine is best. You have to balance exploitation (pulling the machine that's been paying best) with exploration (trying other machines in case they're better).

In our case:

  • Each arm is a (region, protocol) combination — e.g., "Frankfurt + samizdat" or "Tokyo + hysteria2"
  • Each pull is a config fetch from a client
  • The reward is whether the proxy successfully connected (measured via callback URLs)
  • The adversary is the censor, who can block arms at any time

Why EXP3.S?

Most bandit algorithms (UCB, Thompson Sampling) assume the reward distribution is stationary — the best arm stays the best. Censorship is the opposite: the best arm today might be blocked tomorrow. EXP3.S handles this with two mechanisms:

  1. Importance-weighted updates: Arms selected with low probability get amplified rewards, preventing the algorithm from ignoring rarely-tried alternatives
  2. Weight decay (α shift): All weights slowly drift toward uniform, so a previously-blocked arm can recover when unblocked

Our parameters:

  • γ = 0.20: 20% of selections are purely random exploration
  • α = 0.01: Weights decay toward uniform, preventing runaway dominance from early luck
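As a concrete sketch, here is the textbook EXP3.S selection and update with these parameters, in Python. This illustrates the algorithm itself, not Lantern's implementation; the three-arm weight vector at the bottom is hypothetical.

```python
import math
import random

GAMMA = 0.20  # exploration rate: 20% of probability mass is uniform
ALPHA = 0.01  # drift rate: pulls every weight back toward uniform

def probabilities(weights):
    """Selection probabilities: weight-proportional, mixed with uniform."""
    k, total = len(weights), sum(weights)
    return [(1 - GAMMA) * w / total + GAMMA / k for w in weights]

def select(weights):
    """Draw one arm index according to its selection probability."""
    return random.choices(range(len(weights)), weights=probabilities(weights))[0]

def update(weights, chosen, reward):
    """EXP3.S update for a reward in [0, 1]."""
    k = len(weights)
    p = probabilities(weights)
    # Importance weighting: an arm picked with low probability gets an
    # amplified reward estimate, so rarely-tried arms are not ignored.
    x_hat = reward / p[chosen]
    weights[chosen] *= math.exp(GAMMA * x_hat / k)
    # Weight decay (the alpha shift): every arm gains a share of the
    # total weight, so a previously-blocked arm can recover later.
    total = sum(weights)
    for i in range(k):
        weights[i] += math.e * ALPHA * total / k

# Hypothetical arms: e.g. (Frankfurt, samizdat), (Tokyo, hysteria2), ...
weights = [1.0, 1.0, 1.0]
arm = select(weights)
update(weights, arm, reward=1.0)
```

Note how the decay term touches every arm on every update: even an arm whose weight was crushed by failures keeps drifting back toward the mean, which is exactly what lets an unblocked proxy re-enter rotation.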

Architecture

The Feedback Loop

The key insight: we learn per-ISP. A proxy that works on Comcast in the US might be blocked on MCI in Tehran. Each ISP (identified by its Autonomous System Number, or ASN) gets its own weight vector, so the bandit learns separate preferences for each network.

Probes and Callbacks

When a client fetches its config, the server:

  1. Loads the EXP3 weights for this ASN
  2. Selects 3 arms probabilistically (weighted by past performance)
  3. Picks 2 proxy IPs per arm (for redundancy)
  4. Embeds unique callback URLs in each proxy's URL test configuration
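The four steps above can be sketched as follows. The helper names, the AS number in the test data, and the callback.example.com URL scheme are all illustrative assumptions, not Lantern's actual API.

```python
import random

def build_config(asn, arm_weights, proxies_by_arm, n_arms=3, ips_per_arm=2):
    """Sketch of the config-fetch flow for one client on one ASN:
    pick arms with probability proportional to their EXP3 weights,
    pick IPs per arm for redundancy, attach unique callback URLs."""
    arms = list(arm_weights)
    total = sum(arm_weights.values())
    probs = [arm_weights[a] / total for a in arms]

    # Step 2: select distinct arms probabilistically.
    chosen = set()
    while len(chosen) < min(n_arms, len(arms)):
        chosen.add(random.choices(arms, weights=probs)[0])

    # Steps 3-4: pick IPs per arm and embed a unique callback URL in each.
    config = []
    for arm in sorted(chosen):
        for ip in random.sample(proxies_by_arm[arm], ips_per_arm):
            token = f"{asn}-{arm}-{ip}"  # unique per assignment
            config.append({"arm": arm, "ip": ip,
                           "callback_url": f"https://callback.example.com/{token}"})
    return config
```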

The client's proxy tunnel periodically tests each assigned proxy by hitting its callback URL. When the callback arrives at the server, we know:

  • The proxy is reachable from this ISP
  • The round-trip latency
  • The device ID (for unique user counting)

If a callback never arrives within a 30-second timeout, a background reaper process records a failure. This is how we detect blocking without the client needing to explicitly report it.
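A minimal sketch of that timeout logic. The 30-second constant comes from the text; everything else, including the token-keyed pending map, is an assumption about the bookkeeping:

```python
TIMEOUT_SECONDS = 30  # how long the reaper waits for a callback

def reap(pending, now, record_failure):
    """Sketch of one reaper pass: `pending` maps callback tokens to the
    time the probe was issued. Anything older than the timeout is
    recorded as a failure (the proxy is presumed blocked or down)."""
    still_pending = {}
    for token, issued_at in pending.items():
        if now - issued_at > TIMEOUT_SECONDS:
            record_failure(token)  # feeds a zero reward to the bandit
        else:
            still_pending[token] = issued_at
    return still_pending
```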

Relative Latency Rewards

Early in development, we used an absolute latency sigmoid: 500ms = good, 2000ms = bad. This worked for US users but broke for users in Iran, where every proxy has 2000ms+ latency. The absolute sigmoid gave near-zero rewards for all arms, leaving the bandit unable to differentiate.

The fix: relative rewards computed inside Redis. A Lua script maintains an exponential moving average (EMA) latency per arm per ASN. On each callback, it:

  1. Updates this arm's EMA: ema = 0.3 × new + 0.7 × old
  2. Reads ALL arms' EMA latencies for this ASN
  3. Computes a percentile rank: reward = (arms with worse latency) / (total arms - 1)

Now 2000ms is "excellent" if everything else is 3000ms+. The bandit learns "which arm is best for THIS network" regardless of absolute performance.
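The same logic, transliterated from the Lua script into plain Python as a sketch. The 0.3 smoothing factor is from the text; the dict-based state stands in for the Redis keys:

```python
EMA_ALPHA = 0.3  # weight given to the newest latency sample

def relative_reward(emas, arm, new_latency_ms):
    """Update one arm's EMA latency for an ASN, then return its
    percentile rank against all other arms seen for that ASN."""
    old = emas.get(arm, new_latency_ms)
    emas[arm] = EMA_ALPHA * new_latency_ms + (1 - EMA_ALPHA) * old
    others = [v for a, v in emas.items() if a != arm]
    if not others:
        return 1.0  # only arm observed so far; treat as best
    worse = sum(1 for v in others if v > emas[arm])
    return worse / len(others)
```

With peers sitting at 3000 ms and 3500 ms, a 2000 ms arm scores a full reward of 1.0 — exactly the relative-not-absolute behavior described above.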

Four Levels of Blocking Detection

Censorship manifests at different granularities. A single proxy IP might be blocked on one ISP but not another. A protocol might be blocked nationally. We detect each level independently:

Level 1: Per-ASN + Protocol (20 samples, 1h window)

"Is samizdat blocked on MCI?"

Tracks success/failure per ISP per protocol track. When success rate drops below 15%, the bandit penalizes this arm's weight by 99% for this ASN. The arm can still be selected (exploration), but it's heavily deprioritized.

Level 2: Per-Country + Protocol (100 samples, 24h window)

"Is samizdat blocked everywhere in Iran?"

Aggregates across all ASNs in a country. Needs 100 samples for confidence — avoids false positives from a single flaky ISP. Same 15% blocking threshold.

Level 3: Per-Route Global (100 samples, 2h window)

"Is this specific IP (1.2.3.4) burned everywhere?"

Tracks at the individual proxy IP level. When an IP is globally blocked (success rate < 10% across all users), it triggers a deprecation process: 1-hour grace period, then the IP is destroyed and a fresh one provisioned.

Level 4: Per-Route + Country (50 samples, 24h window)

"Is this IP blocked in Iran but working in the US?"

The most granular level. A proxy IP might be blocked by Iranian censors but working perfectly for US users. This signal tells the catalog to exclude this specific IP for Iranian users while keeping it available for everyone else. No IP is wasted.
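All four levels share one shape: a sample floor, a time window, and a success-rate threshold. Here is a sketch using the numbers from the text; the struct and the window handling are illustrative (a real implementation would expire samples outside the window):

```python
from collections import namedtuple

Level = namedtuple("Level", "name min_samples window_hours threshold")

LEVELS = [
    Level("asn+protocol",     20,  1,  0.15),
    Level("country+protocol", 100, 24, 0.15),
    Level("route-global",     100, 2,  0.10),
    Level("route+country",    50,  24, 0.15),
]

def is_blocked(level, successes, failures):
    """A level fires only once it has enough samples in its window
    AND the success rate falls below its threshold."""
    total = successes + failures
    if total < level.min_samples:
        return False  # not enough evidence yet
    return successes / total < level.threshold
```

The sample floor is what keeps a single flaky ISP from tripping a country-wide verdict: 12 failures out of 12 samples says nothing at Level 2, because 12 is far below its 100-sample floor.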

VPS Pool Management

Each arm's proxy pool is managed automatically:

  • Base pool size: Configurable per protocol track (e.g., 2 routes per location)
  • Capacity scaling: When unique device count (via HyperLogLog) exceeds 70% of max_clients × running_routes, new proxies are provisioned to bring utilization to ~50%
  • Deprecation replacement: When a globally-blocked proxy is destroyed, the pool worker sees the deficit and provisions a fresh IP

Device counting uses Redis HyperLogLog — an approximate unique counter that uses only 12KB of memory per route regardless of how many devices connect. We use multi-key PFCOUNT to accurately count unique devices across multiple routes in the same arm.
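The scaling rule itself is simple arithmetic. A sketch, assuming the 70% trigger and ~50% target from the text; the function name and exact rounding are mine:

```python
import math

def routes_to_add(unique_devices, max_clients, running_routes,
                  trigger=0.70, target=0.50):
    """When HyperLogLog-counted unique devices exceed 70% of
    max_clients * running_routes, return how many new routes would
    bring utilization back down to roughly the 50% target."""
    capacity = max_clients * running_routes
    if unique_devices <= trigger * capacity:
        return 0  # comfortably under the trigger
    desired_routes = math.ceil(unique_devices / (target * max_clients))
    return max(0, desired_routes - running_routes)
```

For example, 150 devices on 2 routes of 100 clients each is 75% utilization: over the trigger, so one more route brings it back to 50%.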

Adaptive Polling

The server recommends how often the client should re-fetch its config, based on how confident the bandit is:

Confidence Level               Poll Interval   When
New ASN (< 10 observations)    60 seconds      Learning fast — need data
High uncertainty               3 minutes       Still exploring
Moderate confidence            5 minutes       Settling
Good confidence                10 minutes      Mostly converged
Fully converged                15 minutes      Stable assignment

This means a user on a new ISP gets optimal proxy assignment within minutes, while a user on a well-known ISP doesn't waste bandwidth re-fetching configs they already have.
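As a sketch, the table collapses to one small function. The 10-observation cutoff is from the text; the confidence score and its thresholds are hypothetical placeholders for however convergence is actually measured:

```python
def poll_interval_seconds(observations, confidence):
    """Map bandit confidence for an ASN to a recommended poll interval.
    `confidence` is a hypothetical 0..1 convergence score."""
    if observations < 10:
        return 60          # new ASN: learn fast, need data
    if confidence < 0.4:
        return 3 * 60      # high uncertainty: still exploring
    if confidence < 0.6:
        return 5 * 60      # moderate confidence: settling
    if confidence < 0.8:
        return 10 * 60     # good confidence: mostly converged
    return 15 * 60         # fully converged: stable assignment
```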

Lessons Learned

The Positive Feedback Loop

Frankfurt was dominating the selection for a user on Comcast despite a closer region being available. EXP3's importance-weighted updates amplified early luck: a successful callback from a randomly-explored arm gave a disproportionate weight boost. Combined with a flat absolute latency sigmoid (1000ms and 700ms looked nearly identical) and slow weight decay (α=0.002), early luck compounded into persistent dominance.

The fix: relative latency rewards (so 700ms clearly beats 1000ms for this ASN), faster weight decay (α=0.01), and more exploration (γ=0.20).

Counting Users, Not Polls

We initially tracked "assignment count" — a lifetime counter incremented every config poll. With 60-second polling, one user generated 1,440 "assignments" per day per route. The capacity scaling system saw this as 1,440 users and over-provisioned wildly.

The fix: HyperLogLog unique device counting via Redis PFCOUNT, with multi-key union for per-arm totals.

What's Next

The bandit system is preparing for production deployment. Key upcoming work:

  • More protocols: Adding hysteria2, VLESS, and other protocols as additional arms
  • Production scaling: Optimizing for 2M+ active users across 100K+ ASNs

The bandit doesn't just assign proxies — it wages an automated, adaptive campaign against censorship, learning from every connection attempt and continuously shifting traffic to stay one step ahead of the censors.