Behind the Scenes

How the photo-to-monster magic actually works

A behind-the-scenes look at the MonsterCam capture pipeline: scene analysis, spawn weighting, AI image composite, and why it takes 5–7 seconds.

The short version

When you tap the shutter, three things happen in sequence: a vision model reads your photo's scene, a weighted random selection picks which Gen 1 monster fits best, and an AI image model composites that monster into your original photo. Total time: 5–7 seconds.
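
If it helps to see the shape, here's a minimal Python sketch of that sequence, with stubs standing in for the three real services described in the steps below. Every name and return value here is hypothetical, not our actual API.

    def analyze_scene(photo: bytes) -> list[str]:
        # Step 1 stub: the real version calls a vision model.
        return ["indoor", "kitchen", "sink", "night"]

    def pick_monster(tags: list[str]) -> str:
        # Step 2 stub: the real version does a weighted random pull.
        return "Glubglug"

    def composite(photo: bytes, monster: str) -> bytes:
        # Step 3 stub: the real version calls an AI image model.
        return photo

    def handle_capture(photo: bytes) -> bytes:
        tags = analyze_scene(photo)       # read the scene
        monster = pick_monster(tags)      # pick the best-fitting Gen 1 monster
        return composite(photo, monster)  # composite it into the original photo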

Step 1: scene analysis

The server receives your photo and sends it to a vision model, which returns scene tags such as indoor, kitchen, sink, and night. These tags are what the spawn engine uses to weight monsters.
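
As a rough Python sketch, the call looks something like this. The endpoint URL and response shape are illustrative stand-ins, not our actual vision provider.

    import requests

    def analyze_scene(photo_bytes: bytes) -> list[str]:
        # Hypothetical vision endpoint: photo in, JSON scene tags out.
        resp = requests.post(
            "https://vision.example.com/v1/scene-tags",
            files={"photo": ("capture.jpg", photo_bytes, "image/jpeg")},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["tags"]  # e.g. ["indoor", "kitchen", "sink", "night"]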

Step 2: spawn weighting

Each of the 200 Gen 1 monsters has a base pull rate, a primary habitat, secondary habitats, scene tags, time windows, and indoor/outdoor preference. The spawn engine multiplies these together for every monster:

weight = base_pull_rate × scene_match × habitat_match × time_match × indoor_outdoor_match

Then it draws one monster at random, weighted by those scores. About 15% of catches use an "off-scene" fallback, where the scene didn't match any monster strongly: you still get a wild pull, weighted by global rarity.
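
To make that concrete, here's a simplified Python sketch of the weighting and the pull. The field names, multiplier values, and off-scene cutoff are illustrative; the real pull tables are tuned per monster.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Monster:
        name: str
        base_pull_rate: float  # global rarity
        scene_tags: set[str] = field(default_factory=set)
        habitats: set[str] = field(default_factory=set)
        time_windows: set[str] = field(default_factory=set)  # e.g. {"night"}
        indoor: bool = True

    OFF_SCENE_CUTOFF = 3.0  # illustrative: below this, nothing matched strongly

    def boost(wanted: set[str], seen: set[str], factor: float = 3.0) -> float:
        # Illustrative multiplier: boosted when any wanted tag is in the scene.
        return factor if wanted & seen else 1.0

    def spawn(monsters: list[Monster], tags: set[str],
              is_indoor: bool, time_tag: str) -> Monster:
        # weight = base_pull_rate × scene_match × habitat_match
        #          × time_match × indoor_outdoor_match
        mults = [
            boost(m.scene_tags, tags)
            * boost(m.habitats, tags)
            * (3.0 if time_tag in m.time_windows else 1.0)
            * (2.0 if m.indoor == is_indoor else 1.0)
            for m in monsters
        ]
        if max(mults) < OFF_SCENE_CUTOFF:
            weights = [m.base_pull_rate for m in monsters]  # off-scene fallback
        else:
            weights = [m.base_pull_rate * k for m, k in zip(monsters, mults)]
        return random.choices(monsters, weights=weights, k=1)[0]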

Step 3: AI image composite

The selected monster has a pregenerated reference image stored on our server (kept server-side so trainers can't pre-scout the Dex). Both your original photo and the reference go to our AI image model, which composites the monster into your photo while preserving everything else: same perspective, same lighting, same depth of field, same grain.
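
In rough Python terms, that exchange looks something like this. The endpoint, field names, and instruction text are illustrative, not our actual image-model provider.

    import requests

    def composite(photo_bytes: bytes, reference_bytes: bytes) -> bytes:
        # Hypothetical image-editing endpoint: original photo plus the
        # canonical reference in, composited catch card out.
        resp = requests.post(
            "https://images.example.com/v1/composite",
            files={
                "photo": ("capture.jpg", photo_bytes, "image/jpeg"),
                "reference": ("monster.png", reference_bytes, "image/png"),
            },
            data={"instructions": "Insert the reference creature into the photo; "
                                  "preserve perspective, lighting, depth of field, and grain."},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.content  # composited image bytes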

Why does it take 5–7 seconds?

The composite step is the bottleneck. We've optimized aggressively:

  • Provisioned compute so we don't pay cold-start latency on most catches.
  • Aspect-aware sizing: portrait photos get a portrait composite and landscape photos get a landscape one, so the model doesn't waste pixels.
  • JPEG re-encoding at quality 78 with mozjpeg, so the final image is 80–150KB instead of 2–6MB and the Reveal screen loads in under a second over cellular (see the sketch after this list).
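
For the curious, the re-encode step looks roughly like this, assuming mozjpeg's djpeg and cjpeg binaries are on the PATH; the file paths are illustrative.

    import subprocess

    def shrink(src: str, dst: str) -> None:
        # Decode the model's output JPEG to PNM with djpeg, then re-encode
        # at quality 78 with mozjpeg's cjpeg (both ship with mozjpeg).
        pnm = subprocess.run(["djpeg", src],
                             check=True, capture_output=True).stdout
        jpg = subprocess.run(["cjpeg", "-quality", "78"], input=pnm,
                             check=True, capture_output=True).stdout
        with open(dst, "wb") as f:
            f.write(jpg)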

Why we use pregenerated references instead of generating monsters fresh each time

Two reasons. First, generating a monster fresh on each catch would be slow and inconsistent: you'd get a slightly different Glubglug every time. The pregenerated reference is canonical: Glubglug always looks like Glubglug. Second, by keeping reference art server-side, we prevent people from reverse-engineering the Dex by inspecting the app bundle. The 3 Secret monsters in particular stay completely hidden until someone catches one in the wild.

Why your composite is unique even though there are only 200 monsters

The monster is canonical, but your photo isn't. The composite is the monster + your specific scene, lit and framed the way you took it. Two people catching Glubglug get two completely different cards: yours might have Glubglug peeking out of your bathroom sink at sunset; someone else might have Glubglug perched on their kitchen counter under fluorescent light. The catalog is finite. The composites are not.

Start your Dex.

Open MonsterCam, take a photo of where you are right now, and see what shows up.