Robin Sloan
the lab
October 2021

The cutouts

This blog post is for approximately twelve people, but those twelve people are going to love it!

This year, there’s been a proliferation of wonderful, zine-y Colab notebooks that generate images from text prompts, all using OpenAI’s profound CLIP model alongside different kinds of image generators.

Reading these notebooks, I kept running into a chunk of code like this:

cutouts = []
for _ in range(self.cutn):
    size = int(torch.rand([])**self.cut_pow * (max_size - min_size) + min_size)
    offsetx = torch.randint(0, sideX - size + 1, ())
    offsety = torch.randint(0, sideY - size + 1, ())
    cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
    cutouts.append(F.adaptive_avg_pool2d(cutout, self.cut_size))
return torch.cat(cutouts)

I read this as: “from the image under consideration, generate a bunch of views of smaller portions, their locations randomized.” But I had no idea why you’d want to do that.
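To make that reading concrete, here is a minimal, self-contained sketch of the same loop as a standalone function. The fragment above doesn't show where sideX, sideY, max_size, and min_size come from, so the definitions below are assumptions (largest crop: the shorter image side; smallest crop: the target cut size). The name make_cutouts is made up for this sketch, not taken from any particular notebook.

```python
import torch
import torch.nn.functional as F


def make_cutouts(image, cut_size=224, cutn=32, cut_pow=1.0):
    """Return cutn random square crops of image, each resized to cut_size.

    image: float tensor of shape (1, channels, height, width).
    """
    side_y, side_x = image.shape[2:4]
    # Assumption: the largest crop is the shorter image side, and the
    # smallest is cut_size (or the shorter side, if the image is tiny).
    max_size = min(side_x, side_y)
    min_size = min(side_x, side_y, cut_size)

    cutouts = []
    for _ in range(cutn):
        # Random crop size; cut_pow > 1 skews the draw toward small crops.
        size = int(torch.rand([]) ** cut_pow * (max_size - min_size) + min_size)
        # Random top-left corner, chosen so the crop stays inside the image.
        offset_x = int(torch.randint(0, side_x - size + 1, ()))
        offset_y = int(torch.randint(0, side_y - size + 1, ()))
        cutout = image[:, :, offset_y:offset_y + size, offset_x:offset_x + size]
        # Shrink every crop to one common resolution.
        cutouts.append(F.adaptive_avg_pool2d(cutout, cut_size))
    # One batch of shape (cutn, channels, cut_size, cut_size).
    return torch.cat(cutouts)


image = torch.rand(1, 3, 400, 600)  # stand-in for the image being generated
batch = make_cutouts(image, cut_size=224, cutn=8)
print(batch.shape)  # torch.Size([8, 3, 224, 224])
```

Each entry in the returned batch is one "view": the same image, cropped somewhere at random and shrunk to a common size.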

The pixray project from @dribnet goes further; it doesn’t only offset the views, but warps their perspective and shifts their colors!

augmentations.append(K.CenterCrop(size=self.cut_size, cropping_mode='resample', p=1.0, return_transform=True))
augmentations.append(K.RandomPerspective(distortion_scale=0.20, p=0.7, return_transform=True))
augmentations.append(K.ColorJitter(hue=0.1, saturation=0.1, p=0.8, return_transform=True))

The notebooks discuss the number of these “cutouts” as a key determinant of both quality and memory consumption. Well, I am always interested in both quality AND memory consumption, so I wanted to figure out what the cutouts were actually doing, and the code alone wasn’t forthcoming.

Characteristically, it was a tweet from the prolific Mario Klingemann, discovered on page three of a Google search, that provided the answer:

CLIP can only process patches of 224x224 so the typical way of evolving an image involves making a batch random crops at different scales so it can work on the whole image as well as details at the same time. [ … ]

I also got an intimation of this from Ryan Murdock, who basically instigated all of these zines, in an interview on Derrick Schultz’s YouTube channel; he locates the technique’s discovery precisely in his experiments with these cutouts.
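The "different scales" in that tweet map onto the random size in the cutout loop. Here is a small sketch, in plain Python with no PyTorch needed, of how the cut_pow exponent shifts the mix of wide views and close-ups. The 400-pixel image side and the helper name sample_sizes are assumptions for illustration; 224 is the patch size from the tweet.

```python
import random


def sample_sizes(cut_pow, n=10_000, min_size=224, max_size=400, seed=0):
    """Draw n crop sizes with the same formula as the cutout loop."""
    rng = random.Random(seed)
    return [
        int(rng.random() ** cut_pow * (max_size - min_size) + min_size)
        for _ in range(n)
    ]


for cut_pow in (0.5, 1.0, 2.0):
    sizes = sample_sizes(cut_pow)
    mean = sum(sizes) / len(sizes)
    print(f"cut_pow={cut_pow}: mean crop side is about {mean:.0f} px")
```

Because the random draw sits between 0 and 1, raising it to a power above 1 pulls it toward 0, so larger cut_pow values mean more small, detail-level crops, while values below 1 favor crops closer to the whole image.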

Here’s how I understand it now: the cutouts are different “ways of looking” at the image being generated. Maybe one cutout centers on the shadow fleeing down the corridor, while another looks closely at the pool of blood on the marble floor, and a third frames both details together. The potential for the cutouts to overlap and aggregate feels important to me; they don’t represent a grid-like decomposition, but rather a stream of glances. They feel very much like the way you’d absorb a painting at a museum, honestly.

(Here is your reminder that the eyes doing the glancing are CLIP’s, which then reports the numeric version of: “Eh, looks to me like somebody spilled some ketchup, but, if the fleeing shadow had a knife, then I might call it a murder … ” To which the image generator replies: “Got it! Adding a knife!”)

In most of the notebooks I’ve encountered, there are between 32 and 40 cutouts. That number is mostly determined by memory constraints, but I wonder if, even granted infinite VRAM, there’s a cutout sweet spot? Often, systems of this kind thrive on restrictions.
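For a rough sense of scale, here is the arithmetic for the batch of cutouts itself, assuming 3-channel float32 values at CLIP's 224x224 input size (the helper name is made up for this sketch). The batch on its own is small; what fills VRAM is presumably everything CLIP has to hold onto while processing that batch so the result can be backpropagated, and that grows in step with the number of cutouts.

```python
def input_batch_mib(cutn, cut_size=224, channels=3, bytes_per_value=4):
    """Size in MiB of one batch of cutouts stored as float32."""
    return cutn * channels * cut_size * cut_size * bytes_per_value / 2**20


for cutn in (2, 10, 20, 30, 32, 40):
    print(f"{cutn:>2} cutouts: {input_batch_mib(cutn):.2f} MiB")
```

At 32 cutouts the batch comes to about 18.4 MiB, so doubling the cutouts doubles a modest number here, and doubles a much larger one inside the model.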

I imagine 32 periscopes peeking up from the water, swiveling to find their targets, trying to make sense of the panorama of the world.

I believe these cutouts are newly randomized on each step of the generator’s march towards a satisfactory image, so it’s not only an aggregation of views across space, but also over time, as more and more “ways of looking” are evaluated.
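Here is a toy sketch of that loop, under loud assumptions: the "encoder" below is NOT CLIP (it only measures each cutout's average color), the "prompt" is a hand-picked reddish color vector, the "generator" is just the raw pixels being optimized directly, and the per-cutout scores are simply averaged. Cutout sizes are drawn uniformly to keep it short. Only the shape of the loop is the point: draw fresh cutouts, score each one, average, nudge the image, repeat.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

CUT_SIZE = 16  # CLIP wants 224; kept tiny here so the toy runs fast
CUTN = 8


def random_cutouts(image, cutn, cut_size):
    """Draw a fresh set of random square crops, pooled to cut_size."""
    side_y, side_x = image.shape[2:4]
    cutouts = []
    for _ in range(cutn):
        size = int(torch.randint(cut_size, min(side_x, side_y) + 1, ()))
        x = int(torch.randint(0, side_x - size + 1, ()))
        y = int(torch.randint(0, side_y - size + 1, ()))
        crop = image[:, :, y:y + size, x:x + size]
        cutouts.append(F.adaptive_avg_pool2d(crop, cut_size))
    return torch.cat(cutouts)


def toy_encoder(batch):
    """Stand-in for CLIP (NOT CLIP): embeds each cutout as its mean color."""
    return F.normalize(batch.mean(dim=(2, 3)), dim=-1)


# Stand-in for the embedded text prompt: "something reddish".
target = F.normalize(torch.tensor([[1.0, 0.2, 0.2]]), dim=-1)

# Stand-in for the generator: the pixels themselves, optimized directly.
image = torch.rand(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

losses = []
for step in range(50):
    batch = random_cutouts(image, CUTN, CUT_SIZE)  # new glances every step
    similarity = (toy_encoder(batch) * target).sum(dim=-1)  # one per cutout
    loss = (1 - similarity).mean()  # average verdict across the cutouts
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

No single step sees the whole picture, but because the crops land somewhere new each time, every region of the image ends up being looked at, and corrected, over the course of the run.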

Here’s a quick example: four images, all generated using the same prompt, settings, and random seed. The only difference is the number of cutouts, which decreases from the upper-left, clockwise: 30 to 20 to 10 to 2.

[Image: a grid of four images; as the number of cutouts decreases, they get a bit blurrier, more “general.”]

This is just my interpretation, but, as the number of cutouts decreases, I think I see the images getting both fuzzier and more “general”; there is perhaps a sense of CLIP squinting at the whole thing (“sure, that’s dark queen-ish”) rather than attending to particular details.

I’m not sure any of the n = 30, 20, or 10 images is “better” or “worse,” though, which is interesting and, if you ask me, heartening.

Okay, I hope you twelve people got something out of this!

October 2021, Berkeley