Interpreting CLIP scores

I've finally understood why CLIP vectors have unexpectedly low dot products.

In high dimensions, two randomly picked vectors will be almost orthogonal. If you work through the math, you can see this has nothing to do with the "weirdness" of high-dimensional spaces: the dot product is a sum of d independent terms, so it concentrates near zero for the same reason a random walk stays near its origin. This does presuppose finite variance, but sampling uniformly from the sphere satisfies that.

This fact has a precise rendition: in a d-dimensional space, the dot product of two uniformly randomly picked unit vectors will typically be about one standard deviation in magnitude, and that standard deviation is 1/√d. (These are two separate facts: first, that the typical dot product is on the order of the standard deviation of the distribution we're drawing from, and second, that this standard deviation is 1/√d. Both are easy to see if you work through the math.)
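A quick simulation checks both facts for d = 512 (the sample size of 10,000 pairs is my choice, not anything from CLIP):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512        # CLIP's embedding dimension
n = 10_000     # number of random pairs to sample

# Uniform random unit vectors: normalize Gaussian draws.
a = rng.standard_normal((n, d))
b = rng.standard_normal((n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b /= np.linalg.norm(b, axis=1, keepdims=True)

dots = np.sum(a * b, axis=1)
print(f"mean dot product:    {dots.mean():+.4f}")  # near 0
print(f"empirical std:       {dots.std():.4f}")
print(f"predicted 1/sqrt(d): {1 / np.sqrt(d):.4f}")
```

The empirical standard deviation lands right on 1/√512 ≈ 0.044, and the mean is essentially zero.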

This establishes a noise floor. Training starts with such almost-orthogonal vectors, and then each time two vectors co-occur in a context, training "fire together, wires together" them, pulling their dot product ever so slightly closer.

During training, each of these shifts is very slight, because bigger ones risk breaking the constraints to other vectors that have already been encoded in the vector's direction. So nudge slightly, do it many times, and eventually vectors that are similar in a particular context end up with a dot product above the noise floor.
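As a toy illustration of that dynamic (this is not CLIP's actual contrastive loss, just the "nudge slightly, many times" idea with a made-up step size), repeatedly pull two random unit vectors a tiny step toward each other and renormalize:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512
lr = 0.001  # hypothetical tiny step; real training uses a contrastive loss

u = rng.standard_normal(d)
v = rng.standard_normal(d)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

print(f"start: {u @ v:+.3f}")  # near the noise floor, |dot| ~ 1/sqrt(d)
for _ in range(150):
    # Nudge each vector slightly toward the other, then renormalize
    # so we stay on the unit sphere.
    u, v = u + lr * v, v + lr * u
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)
print(f"after 150 tiny nudges: {u @ v:+.3f}")
```

No single step moves the pair much, yet the accumulated nudges lift the dot product well clear of the noise floor, to roughly the 0.2–0.3 range discussed below.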

Since the shifts are slight, we almost never end up with dot products matching our linear intuition of high values like 0.8 or 0.9. Instead, numbers like 0.2 end up representing similarity, with 0.3 representing very similar.

When interpreted in terms of the standard deviation, this makes sense. CLIP vectors have 512 dimensions, so the standard deviation is 1/√512 ≈ 0.044, which puts 0.3 nearly 7 standard deviations above the noise floor: a decisive signal in this regime.
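The arithmetic, spelled out:

```python
import math

d = 512
sigma = 1 / math.sqrt(d)  # noise-floor std for random unit vectors
print(f"sigma = {sigma:.4f}")
for cos in (0.2, 0.3):
    # z-score: how many noise-floor standard deviations above zero.
    print(f"cosine {cos}: {cos / sigma:.1f} sigma above the noise floor")
```

So 0.2 is already about 4.5 sigma out, and 0.3 is about 6.8 sigma: both far beyond anything random vectors would produce.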