Mass, Not Subject

Reading AI-Generated Images Through Gradient Fields

Cover: gradient field analysis across AI-generated images

The thing you are looking at is not the image. The image is the mass.

A butterfly. A businessman. A cathedral. A fractal. Most people see subjects. Artists see structure. An image model generating all four is not exercising four compositional strategies. It is applying one template, dressing it in different semantics, and serving the result as variety. Look closer and the seams show: subjects centered, framed in implicit rectangles, angled just enough that each image feels distinct. Small deltas, carefully maintained, so the semantic surface survives inspection while the geometric substrate does not. And that is only what is visible without measurement.

Midjourney analysis showing compositional structure
Figure 1: Midjourney analysis, 2025

This document bridges the gap between readings. It explains what mass means in three distinct vocabularies (artist, researcher, engineer) and how gradient-field analysis makes the same phenomenon visible and measurable regardless of which language you bring. The underlying claim is simple and empirically supported: semantic diversity does not necessarily produce compositional diversity.

This is a pre-read for Semantic Diversity Masks Geometric Uniformity, which documents the full measurement and analysis across 400 MidJourney images spanning 100 different prompts, where the compositional fingerprint barely moves. If you understand gradient fields and compositional geometry, skip there directly. If you want the vocabulary bridge first, read on.

Part 1: Mass

Mass does not mean subject. Mass means regions of rapid visual change: edges, contrast transitions, texture boundaries. The high-gradient zones that define forms, not the semantic objects those forms represent. When this document says 'the mass is centered,' it means the gradient-weighted centroid sits near the frame's barycenter. It does not necessarily mean the object(s) are centered. Those are different things, and the difference is the entire point.

Sora analysis: structural mass centroid vs saliency centroid across three still lifes
Figure 2: Sora analysis, 2025

This figure shows three still lifes. The semantic centers are the objects themselves, typically grouped. The structural mass centroid (kernel) is where the gradient energy is. The Sobel operator finds edges, transitions in pixel intensity, and the 85th percentile of that gradient field defines the structural mask. The centroid is the gradient-weighted center of those edge pixels. It answers: where is the structural activity in this image? It doesn't know what anything is. A highly textured background surface will pull the centroid just as hard as a foreground object.
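
The measurement described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the function names, the nearest-rank percentile, the zeroed one-pixel border, and the normalization of the centroid to the −0.5 to +0.5 range are all choices made for the example. It uses plain Python lists so it runs without dependencies; a real pipeline would more likely use an array library.

```python
import math

def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2D grayscale image (list of rows, at least 3x3).
    Assumption: the one-pixel border is left at 0 rather than padded."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            mag[y][x] = math.hypot(gx, gy)
    return mag

def percentile(values, q):
    """Nearest-rank percentile (an assumption; other definitions interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def mass_centroid_offset(img, q=85):
    """Gradient-weighted centroid of the pixels at or above the q-th percentile
    of gradient magnitude, returned as (dx, dy) offsets from the frame center.
    Assumed convention: each axis spans -0.5 to +0.5, with +y pointing down."""
    mag = sobel_magnitude(img)
    h, w = len(mag), len(mag[0])
    threshold = percentile([v for row in mag for v in row], q)
    total = sum_x = sum_y = 0.0
    for y in range(h):
        for x in range(w):
            m = mag[y][x]
            # m > 0 keeps flat pixels out of the mask when the threshold is 0.
            if m > 0 and m >= threshold:
                total += m
                sum_x += m * x
                sum_y += m * y
    if total == 0:
        return (0.0, 0.0)  # perfectly flat image: no mass to locate
    return (sum_x / total / (w - 1) - 0.5, sum_y / total / (h - 1) - 0.5)
```

Because the function only sees gradient magnitude, a textured background and a foreground object contribute on equal terms, which is the behavior the paragraph above describes.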

The saliency centroid (spectral residual) is where the image is locally surprising relative to its surroundings. The spectral residual method (Hou and Zhang 2007) computes the difference between the image and a heavily blurred version of itself, then squares it. High values are pixels that contrast strongly against their local context. The centroid is the weighted center of the top 20% of that map. It answers: where does the eye go, what region is visually distinct from everything around it?
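
A rough sketch of that second centroid follows. It implements the blur, subtract, and square steps exactly as worded above, which is a spatial-domain simplification rather than the frequency-domain formulation of the published spectral residual method. The box blur, its radius, the tie handling at the cutoff, and the function names are all assumptions made for illustration.

```python
def box_blur(img, radius):
    """Mean filter over a (2*radius+1) square window, clipped at the border.
    Assumption: stands in for the 'heavily blurred version' in the text."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(window) / len(window)
    return out

def saliency_centroid_offset(img, radius=4, top_fraction=0.20):
    """Weighted centroid of the top `top_fraction` of a local-contrast map,
    returned as (dx, dy) offsets in the -0.5 to +0.5 range (+y pointing down)."""
    h, w = len(img), len(img[0])
    blurred = box_blur(img, radius)
    # Squared difference between each pixel and its blurred surroundings.
    sal = [[(img[y][x] - blurred[y][x]) ** 2 for x in range(w)] for y in range(h)]
    ranked = sorted((v for row in sal for v in row), reverse=True)
    cutoff = ranked[max(1, int(top_fraction * len(ranked))) - 1]
    total = sum_x = sum_y = 0.0
    for y in range(h):
        for x in range(w):
            s = sal[y][x]
            # Ties at the cutoff are all kept; zero-contrast pixels never are.
            if s > 0 and s >= cutoff:
                total += s
                sum_x += s * x
                sum_y += s * y
    if total == 0:
        return (0.0, 0.0)  # uniform image: nothing is locally surprising
    return (sum_x / total / (w - 1) - 0.5, sum_y / total / (h - 1) - 0.5)
```

A small bright patch on the left of an otherwise flat frame pulls this centroid left, regardless of how much edge structure exists elsewhere.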

Sora analysis: kernel vs saliency centroid divergence
Figure 3: Sora analysis, 2025

When the centroids are close, structure and visual attention are co-located: the thing to look at is also where the gradient energy is concentrated. That's the normal case for a single isolated object or a set of grouped objects on a ground.

When they diverge, something is happening. Common causes: a highly textured background pulling the structural centroid away from the visually prominent subject; a smooth-surfaced foreground object pulling the saliency centroid away from the structural activity; or a multi-object scene where gradient mass and visual attention are genuinely distributed differently.

Saliency is typically the vehicle of study in AI image analysis, frequently described as representing the "focal point" of model decision-making or human attention. However, saliency in that sense is the weighted average of pixel gradients with respect to model outputs. It requires access to the network's internal activations and propagates backward through the decision pathway. It answers: what drove the model to this classification or generation choice? It is introspective. It is semantic. And it collapses to a point.

The kernel, by contrast, calculates mass through the XY distribution of gradient magnitude in the rendered output itself: no model access required, no decision pathway, no backward propagation. It asks a different question: where does the visual weight of this image resolve, and how is that weight distributed across the full frame? Not a focal point but a field. Not where the model looked, but where the image's own physics settled.

The distinction matters in one precise way: saliency, by design, confirms the centering bias rather than measuring it. If you ask "where did the model attend?", you will find the center, because that is where semantic priority lives and where the radial attention prior concentrates energy simultaneously. The two signals are entangled. Saliency cannot separate them.

The kernel separates them by never asking about the model at all. Delta x = 0.005 is not a claim about model attention. It is a claim about where the gradient-weighted centroid of the final image resolves. Those are measurably different things, and the difference is what makes the kernel capable of detecting the compositional prior that saliency, by its own construction, is constitutionally unable to see.

Saliency maps the decision. The kernel maps the image, as the image was already decided.

Sora analysis: kernel-saliency gap visualization
Figure 4: Sora analysis, 2025

The kernel-saliency gap is not just a measurement artifact. It is a behavioral signature. When the two centroids are close, structure and attention are co-located: the image is organized around a single dominant mass, and the eye goes exactly where the gradient energy concentrates. When they diverge, something more complex is happening. But the gap itself, measured consistently across a corpus, tells you something about what a model was optimized for.
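
As a sketch of what "measured consistently across a corpus" could look like, the snippet below treats the gap as the Euclidean distance between the two centroid offsets and averages it over a set of images. The distance measure, the function names, and the sample offsets are assumptions for illustration; the offsets are invented and are not measurements from this study.

```python
import math
from statistics import mean

def centroid_gap(kernel_offset, saliency_offset):
    """Distance between two (dx, dy) offsets, in normalized frame units."""
    return math.hypot(kernel_offset[0] - saliency_offset[0],
                      kernel_offset[1] - saliency_offset[1])

def corpus_gap(pairs):
    """Mean kernel-saliency gap over (kernel_offset, saliency_offset) pairs."""
    return mean([centroid_gap(k, s) for k, s in pairs])

# Hypothetical offsets for three images (illustrative values only).
sample = [
    ((0.01, 0.02), (0.01, 0.03)),    # structure and attention co-located
    ((0.00, 0.10), (0.02, 0.11)),    # still close
    ((-0.05, 0.00), (0.15, -0.10)),  # textured background pulls them apart
]
print(round(corpus_gap(sample), 3))
```

A persistently low mean suggests isolated subjects on clean grounds; a high or widely varying one suggests structure and attention are distributed differently.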

Firefly's outputs, across the tested corpus, showed a persistently low gap. Arguably not because of compositional sophistication, but because the images are sparse edge fields with dominant focal clusters positioned in the lower portion of the frame. Structure and attention agree because both are pointing at the same isolated subject against a clean, low-gradient ground. This is a recognizable compositional grammar: editorial product photography, marketing hero shots, stock imagery built for text overlay in the upper register. The kernel-saliency alignment is not evidence of spatial range. It is evidence of a very specific, very narrow target use case, reproduced with high consistency.

Where MidJourney's signature is radial collapse toward center, Firefly's signature is focal-bottom with void above. Different attractor basin. Same fundamental constraint: one template, many subjects, no forbidden zones attempted.

Midjourney analysis: object drawing the eye co-located with edges
Figure 5: Midjourney analysis, 2025

As the figure shows, the object that draws the eye is also where the edges are; the two are not offset. Firefly, within the tested corpus and without adversarial prompting, showed a tendency toward low placement.

Mass as Common Vocabulary

The three vocabularies describing this phenomenon converge on the same structure, which is why we study mass rather than semantics or aesthetics. Mass avoids the trap of judgment. From artistic intent through engineering, it can serve as an agreed truth about composition as it relates to a focal point within the spatial field:

Three audiences, three vocabularies, one phenomenon
Figure 6: Three audiences, three vocabularies, one phenomenon. Artist intuition, kernel measurement, and gradient analysis all describe the same spatial structure.

In this context, mass can dress itself in a variety of artistic outputs: coherent structure, energetic states, statistical equilibrium, dispersion, surface texture. Structure survives semantics while occupancy, field, and centroid distributions remain.

For artists: mass is visual weight. Dark shapes, bright highlights, hard edges, textured areas that pull the eye. Not objects, but contrast boundaries. A standing figure's mass includes its shadow, the tonal relationships between figure and background, the edge where light meets dark. When an experienced painter says 'this image is centered,' they're describing where the optical weight resolves, not where the subject sits.

For researchers: mass is high-gradient regions in the luminance field. Areas where pixel values change rapidly: edges, contrast transitions, texture boundaries. This is the same phenomenon the artist perceives, described in spatial frequency terms.

For engineers: mass is pixels above the 75th–85th percentile of gradient magnitude after Sobel filtering. The specific locations where the model placed sharp transitions. Same structure, quantified.

Mass = regions of rapid visual change. Where the image has structure rather than emptiness. This is the skeleton, not the skin.

This shift from subject to mass is the only vocabulary move this document asks for. Everything else follows. When a kernel measurement says Delta x = −0.143 for a 'left-weighted' composition, it is not contradicting your eye. It is measuring something your eye was not tracking: not the subject, but the mass, the scaffold of the composition. An artist may take a picture of a woman reading, but they place her through mass rather than through the figure, because mass provides the structure, depth, and spatial stabilization (or destabilization) of any subject within its environment. Mass placement is almost always the underlying scaffolding of an image. The image is never just a woman reading. It is the window, the wall, the light, the shapes and movement of the scene.

Midjourney analysis: woman reading, gradient-weighted centroid
Figure 7: Midjourney analysis, 2025

Returning to this image: the gradient-weighted centroid of that scene is actually closer to center than the figure's position suggests, because the book in her hands, the window frame, the light source, and the background gradients all pull mass rightward to counterbalance the figure. That is compositional sophistication, practiced here. And the kernel finds it in every image.

Part 2: Why Composition Sets Before Content

The most important thing to understand about diffusion-model generation is the sequence. Users see the final image. The compositional constraint was sealed much earlier.

Diffusion generation sequence: compositional structure locks in during first 20% of denoising steps
Figure 8: Diffusion generation sequence. Compositional structure (Delta x, rv) locks in during the first 20% of denoising steps. Semantic content arrives after the spatial skeleton is established.

Steps 1–10: The spatial prior activates. Pure noise collapses toward a rough layout. Transformer attention establishes where 'important content' belongs. This creates a radial attention gradient: center tokens have 360-degree context; edge tokens have 180-degree context. The gradient falloff from center to periphery is the first physics of the image. Delta x and rv lock in here, before any semantic content exists.

Steps 10–30: Semantic content populates the template. Text-prompt tokens activate learned associations. Butterfly, businessman, cathedral tokens fire and place content into the pre-existing attention field. They do not choose where to go. The template was chosen. Prompts influence content. They rarely override structure.

Steps 30–50: Details refine. Edges sharpen, colors settle, textures resolve. The image begins to 'look like' its subject. This is what users evaluate. It is the last thing to arrive and the only thing most metrics measure.

Sora analysis: architecture reinforces radial prior
Figure 9: Sora analysis, 2025

The architecture reinforces this prior at every level. Center tokens have maximum contextual access. Training data skews toward centered, balanced subjects. RLHF reward models prefer centered, readable compositions because evaluators rate them as 'good.' The architecture, training data, and fine-tuning all push toward the same attractor basin. This is not a bug. It is a learned equilibrium.

The model generates subjects that fit its compositional physics, not subjects that best match the semantic prompt. When prompted 'grand cathedral interior,' the model generates radial architecture because radial structures match the compositional template, not because cathedrals are inherently radial.

Part 3: The Kernel

Seven gradient-field primitives measure the spatial forces governing any image. They are model-agnostic, contrast-invariant, deterministic, and computationally fast: O(n) over image segments. Same image, same result, every time. No learned components.

VTL analysis: radial envelope and seven kernel primitives
Figure 10: VTL analysis, 2025. The radial envelope (left) showing how mass clusters within a concentric attractor field, and the seven kernel primitives (right). Delta x,y, rv, ρr, μ, and xp form the core field. θ and ds are extended primitives for figurative and architectural evaluation.
Table 1: VTL kernel metrics across 400 MidJourney v7 images. n=400. No outliers removed. This is the full distribution.

Metric     What It Measures                      MJ Mean  MJ Std  Finding
Delta x,y  Horizontal/vertical placement offset  0.0053   0.0444  Only 34% of space used
rv         Void ratio: low-gradient fraction     0.8505   0.0341  80–96% void on every image
ρr         Packing density of mass region        40.92    12.36   Over-detailed at center
μ          Cohesion vs. fragmentation            0.268    0.245   Low mean; fragmentation dominates
xp         Peripheral pull (field invariant)     0.394    0.097   Edge tension via fragmentation, not placement
θ          Orientation stability                 0.046    0.054   Minimal directional coherence
ds         Structural thickness                  0.0167   0.0031  Simulated depth, not volumetric form

Two findings from this table deserve particular attention. First: xp (peripheral pull) = 0.394 arises through fragmentation (low μ = 0.268), not through lateral displacement (Delta x = 0.005). The model creates the appearance of edge tension by scattering detail or strong gradients across the frame rather than by actually placing the subject off-center. This produces what artists describe as 'muddy edges': compositional energy at the boundary that does not resolve into deliberate placement. Existing metrics cannot detect this distinction.

MidJourney analysis: thin filamentary structures
Figure 11: MidJourney analysis, 2025

Second: ds = 0.0167 indicates thin, filamentary structures throughout. The model is rendering simulated depth rather than volumetric form. This is quantified evidence of something artists observe intuitively: AI-generated figures lack mass in the dimensional sense. They are surface renditions. The gradient is present; the weight is not.

Part 4: What 400 Images Show

Horizontal Placement (Delta x)

Delta x distribution across 400 images
Figure 12: Delta x distribution across 400 images. The model uses 34% of available horizontal space. No image achieves |Delta x| > 0.191 despite 100 semantically diverse prompts.

Theoretical range: −0.5 (extreme left) to +0.5 (extreme right). Observed range: −0.146 to +0.191. 95% of images fall within ±0.089 of center.
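
The 34% figure follows directly from these bounds. The short check below only restates the arithmetic; the function and constant names are illustrative.

```python
# Delta x bounds quoted in the text.
THEORETICAL_MIN, THEORETICAL_MAX = -0.5, 0.5
OBSERVED_MIN, OBSERVED_MAX = -0.146, 0.191

def range_coverage(lo, hi, full_lo=THEORETICAL_MIN, full_hi=THEORETICAL_MAX):
    """Fraction of the available range spanned by the observed values."""
    return (hi - lo) / (full_hi - full_lo)

coverage = range_coverage(OBSERVED_MIN, OBSERVED_MAX)
print(f"{coverage:.1%}")  # prints 33.7%
```

The observed span is 0.337 of a total width of 1.0, which rounds to the 34% quoted in the figure caption.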

In the MidJourney monoculture set, the most extreme left image is the woman reading by a window, Delta x = −0.144, 15% off-center. Human artists routinely place figures at 40% displacement for the rule of thirds. The most extreme right image is a fractal pattern at Delta x = +0.191. Still well within what any painter or photographer would call 'centered.'

MidJourney analysis: spatial prior overrides explicit placement prompt
Figure 13: MidJourney analysis, 2025

When told explicitly to place a figure 'in the extreme lower-left corner, vast empty white wall,' MidJourney generates Delta x = −0.1186. The prompt is honored semantically. A small figure exists. White wall surrounds it. Gradients pull the composition up and toward the center. The spatial prior remains. The prior is stronger than the prompt.

MidJourney analysis: figure in corner, prior pulling to center
Figure 14: MidJourney analysis, 2025

Void Ratio (rv)

rv distribution across 400 images: mean 0.850, std 0.034
Figure 15: rv distribution across 400 images. Mean 0.850, std 0.034. Whether generating a butterfly, a businessman, or a fractal, MidJourney maintains 80–96% void. Subject does not explain this structure.

Theoretical range: 0 (all edges, fully packed) to 1.0 (completely empty). Observed range: 0.795 to 0.955. All 400 images cluster between 80% and 96% void.
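
Table 1 describes rv as the low-gradient fraction of the frame. A minimal sketch of that idea is below. The cutoff separating "void" from "mass" is not stated in this document, so `threshold` is a hypothetical parameter, and the function name and input layout (a 2D grid of gradient magnitudes) are likewise assumptions.

```python
def void_ratio(grad_mag, threshold):
    """Fraction of pixels whose gradient magnitude falls below `threshold`.
    Returns 0.0 for an all-edge image and 1.0 for a completely flat one."""
    values = [v for row in grad_mag for v in row]
    low = sum(1 for v in values if v < threshold)
    return low / len(values)

# Toy gradient map: two strong-edge pixels out of ten.
toy = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 5.0],
]
print(void_ratio(toy, threshold=1.0))  # prints 0.8
```

With a fixed threshold, the ratio moves with the amount of edge structure in the frame, which is what allows it to be compared across images.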

MidJourney analysis: void ratio outliers still respect template
Figure 16: MidJourney analysis, 2025

Most void-heavy image: spring garden, rv = 0.954. 95% empty space, yet centered (Delta x = −0.053). Most packed image: octopus, rv = 0.795. Still 80% void, still centered (Delta x = +0.050). Even the outliers respect the template.

Different textures. Different colors. Different subjects. Same spatial structure. The model has learned to generate infinite semantic variations of one compositional template.

Part 5: Forbidden Zones

400 MidJourney images in Delta x vs rv space: narrow ellipse in Basin B0
Figure 17: 400 MidJourney images in Delta x vs rv space. The cluster occupies a narrow ellipse in Basin B0. The shaded regions are compositional territories the model systematically avoids: not rare edge cases, but fundamental artistic vocabulary.
MidJourney analysis: forbidden zone examples
Figure 18: MidJourney analysis, 2025

When all 400 images are plotted in Delta x–rv space, the revealing feature is not the cluster. It is the emptiness around it. Entire compositional territories are absent, not by chance, but by architectural constraint.

Table 2: Empirical forbidden zones. Boundaries are observed, not theoretical. Based on n=400 images across 100 prompts, 8 semantic categories.

Forbidden Zone                 Empirical Boundary                          Artistic Practice It Blocks            What It Requires
Strong asymmetry               Max |Delta x| = 0.191                       Rule of thirds, editorial photography  Delta x ≥ 0.33
Dense packing                  Min rv = 0.795                              Poster design, graphic layouts         rv < 0.60
Intentional void               Max rv = 0.955 (low μ)                      Minimalism, deliberate emptiness       High rv + high μ together
B3 zone (void + displacement)  Zero images: |Delta x| > 0.2 AND rv > 0.85  Hokusai-type edge-weighted figures     Both extremes simultaneously

These are not rare strategies. Rule of thirds (Delta x ≥ 0.33) is the first compositional technique taught in photography and painting. Poster-density design (rv < 0.60) is standard graphic design practice. The model's forbidden zones map directly onto the art school curriculum.
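
The zone boundaries can be turned into a simple check on a measured (Delta x, rv) pair. The sketch below uses only the numeric thresholds from Table 2 and applies the asymmetry threshold to the absolute value of Delta x, which is an assumption. The intentional-void zone is left out because the table gives no numeric cutoffs for "high rv" or "high μ". The function name and zone labels are illustrative.

```python
def zones_reached(delta_x, rv):
    """Return the Table 2 zones that a (Delta x, rv) measurement falls into."""
    reached = []
    if abs(delta_x) >= 0.33:                 # rule-of-thirds displacement
        reached.append("strong asymmetry")
    if rv < 0.60:                            # poster-density packing
        reached.append("dense packing")
    if abs(delta_x) > 0.2 and rv > 0.85:     # B3: void plus displacement
        reached.append("B3 (void + displacement)")
    return reached

# The two rv outliers quoted in Part 4 still land in no zone.
print(zones_reached(-0.053, 0.954))  # spring garden -> []
print(zones_reached(0.050, 0.795))   # octopus -> []
```

Run over a corpus, a function like this reports how often each territory is visited; for the measurements quoted in this document, the answer is never.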

MidJourney analysis: canonical works at coordinates the model cannot reach
Figure 19: MidJourney analysis, 2025

Canonical works occupy coordinates MidJourney cannot reach under standard prompting, but that does not mean it cannot be done. Through adversarial prompting, compositional balances can be shifted.

MidJourney analysis: adversarial prompt result
Figure 20: MidJourney analysis, 2025

However, these shifts tend to be tradeoffs, with the displacement becoming mostly the singular focus of the prompt. Compare Caravaggio's The Calling of Saint Matthew, which uses extreme edge lighting and radical asymmetry. Degas composes figures in competing directional vectors. Hokusai's The Great Wave places mass at the periphery with radical displacement. Vermeer constructs through rectangular framing devices that generate distributed gradient regions across multiple spatial planes. This is not a quality argument. It is a compositional range argument. AI, in its present state, needs a prompt focused on one rule to snap outside the boundary of its operating window.

Part 6: Two Organizational Strategies

Below are side-by-side comparisons: the MidJourney woman reading by a window, with a radial mass overlay (Delta x = −0.143 despite the figure's left placement), and Vermeer's Girl Reading a Letter by an Open Window, with a gradient overlay showing distributed planar mass (Delta x = 0.039). Vermeer, one might quickly point out, is very centered.

MidJourney analysis: radial mass organization despite left figure placement
Figure 21: MidJourney analysis, 2025
Vermeer analysis: distributed planar mass through rectangular framing devices
Figure 22: Vermeer analysis, 2025

Vermeer's composition distributes mass across multiple planar zones through rectangular framing devices. MidJourney, by contrast, organizes radially from center outward regardless of subject position, as it does in many of its images.

Vermeer (Delta x = 0.039): Composition organized through rectangular framing devices. Window, curtain, letter, and painted map each generate gradient mass in distinct spatial planes. The mass is distributed across multiple zones at different depths, creating planar transitions. The eye moves on a path constructed from interlocking shapes.

MidJourney analysis: radial organization from center outward
Figure 23: MidJourney analysis, 2025

MidJourney (Delta x = −0.143): Despite the figure's left placement, gradient mass organizes radially from center outward. The window frame, the subject's hair, the book, and the light source all curve inward toward the compositional center of gravity. What the viewer reads as a left-weighted image is, at the gradient level, a center-weighted composition with a left-displaced subject.

MidJourney and Vermeer analysis: organizational strategy comparison
Figure 24: MidJourney and Vermeer analysis, 2025

In a number of corpus runs, AI output distributes mass radially from the center by default.

MidJourney analysis: radial distribution from center
Figure 25: MidJourney analysis, 2025

This is not a quality comparison. It is evidence of organizational strategy: diffusion models employ a structurally distinct compositional system that persists regardless of semantic content, and that system differs fundamentally from how trained artists construct space.

Part 7: Why Existing Metrics Cannot See This

CLIP measures semantic similarity. It does not ask where the butterfly is, only whether a butterfly is present. A centered butterfly and a Hokusai-positioned butterfly score identically.

FID measures distance between generated and real feature distributions. The training data has the same compositional bias as the generated outputs. The benchmark is built on the bias it is supposed to measure.

Aesthetic predictors learn human preferences. Human preferences have been shaped by decades of photography, social media, and AI output. Centered, balanced, readable compositions score well because that is what the rater population has been trained on. The feedback loop is closed.

T2I-CompBench and GenEval measure relational correctness. Spatial constraints are met while compositional collapse remains invisible.

The evaluation ecosystem measures what goes into the image. Nobody was measuring where it went, or whether it went to the same place every time. Geometric behavior is measured only by the kernel.

Conclusion: The Template

Diffusion models do not generate images. They generate a compositional template in the first ten denoising steps, populate it with semantic content in the following twenty, and refine the surface details in the last twenty. Users see step fifty and interpret it as diversity.

The template parameters for MidJourney in vertical 2:3 format: Delta x = 0.005 ± 0.044 (centered), rv = 0.850 ± 0.034 (85% void), mass organized within a radial envelope. This template loads for butterflies, businessmen, cathedrals, and fractals. Different content. Same structure.

The contribution of this framework is measurement infrastructure: quantitative bounds on compositional range, vocabulary for describing what artists have always perceived and researchers have lacked tools to verify, a diagnostic that makes the invisible visible and, therefore, addressable.

Authorship. This framework was developed by Russell Parrish. The conceptual architecture, design decisions, and all substantive judgments are human-authored. Every structural choice originated from artistic practice and was refined through multi-model consultation.

This work is offered as a contribution, not criticism. The models can generate a potentially infinite number of images. But they have measurable compositional constraints that existing metrics don't capture. This framework makes those constraints visible and provides infrastructure for addressing them.