Visual Memory
How do we store incoming visual information from the world around us? It is easy to take for granted just how much information our minds hold onto from one moment to the next. Every time you move your eyes—which happens around 100,000 times a day—your brain briefly stores what you just saw so it can compare it with what you are looking at next. The cognitive systems that handle this incredible task are known collectively as visual working memory.
Scientists have studied visual working memory for decades, but the vast majority of that research relies on simplified artificial arrays, such as a few colored squares on a gray screen. These traditional experiments are highly controlled, yet they don’t capture the messy, continuous, beautifully complex nature of the real world. In real-world scenes, colors fade gradually, shapes are irregular, and objects blend into one another.
To bridge this gap, Steve and I developed a new way to model how the human brain stores real-world visual scenes. Our model proposes that the working memory representation of a scene is simply a weaker, noisier echo of the brain’s original perceptual representation in the ventral visual processing pathway. To test this, we used CORnet-S, a deep neural network (DNN) explicitly engineered to mirror the anatomical stages of the human visual cortex (from basic features in V1 to complex, abstract objects in IT).
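To make the "weaker, noisier echo" idea concrete, here is a minimal sketch in plain Python. It is not our actual implementation and does not call CORnet-S: the feature vectors, the `gain` and `noise_sd` parameters, and the use of cosine distance as the change signal are all assumptions chosen for illustration.

```python
import math
import random

def noisy_echo(features, gain=0.5, noise_sd=0.1, rng=None):
    """Return a weaker (scaled by `gain`) and noisier copy of a feature vector.

    `gain` and `noise_sd` are hypothetical parameters, not fitted values.
    """
    rng = rng or random.Random(0)
    return [gain * f + rng.gauss(0.0, noise_sd) for f in features]

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical layer activations standing in for a network's response to a scene.
original = [0.9, 0.1, 0.4, 0.7]
small_change = [0.9, 0.1, 0.5, 0.7]   # scene altered slightly
large_change = [0.1, 0.9, 0.7, 0.2]   # scene altered substantially

# The remembered scene is a degraded copy of the original percept.
memory = noisy_echo(original, gain=0.5, noise_sd=0.05, rng=random.Random(1))

# The change signal is the mismatch between the memory and the new percept.
d_small = cosine_distance(memory, small_change)
d_large = cosine_distance(memory, large_change)
print(f"small change: {d_small:.3f}, large change: {d_large:.3f}")
```

In this toy version, a larger mismatch between the remembered and the newly perceived features would correspond to a change that is easier to detect.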
The results were striking. In some of our experiments, our model accounted for a massive 76% of the variance in human response times across different scene pairs. It vastly outperformed a baseline based on pixel-level differences (which explained only 29%) and came remarkably close to matching the “ground truth” from explicit human ratings of visual similarity (78%).
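For readers unfamiliar with "variance explained", one common way to compute it for a single predictor is the squared Pearson correlation between the predictor and the outcome. The sketch below assumes that definition; the numbers are invented for illustration and are not data from our experiments.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def variance_explained(predictor, outcome):
    """Squared correlation: the share of outcome variance a linear fit captures."""
    r = pearson(predictor, outcome)
    return r * r

# Hypothetical values, one per scene pair (NOT real data):
# a model-derived distance between the two scenes, and a mean response time.
model_distance = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
response_time_ms = [980, 900, 870, 760, 720, 650]

r2 = variance_explained(model_distance, response_time_ms)
print(f"variance explained: {r2:.2f}")
```

The same function can be applied to any competing predictor (for example, a pixel-level difference score) so that the resulting percentages are directly comparable.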
We even tested it on focal changes, where an entire object was added to or deleted from a photograph using Photoshop. After modifying the network to account for “center bias” (the human tendency to focus on the center of an image), the model continued to successfully predict how well a human subject would detect specific object changes.
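One simple way to picture a center bias is to weight a spatial map of changes by a Gaussian that peaks at the image center. The sketch below is only an illustration of that idea: this post does not describe how the network was actually modified, and the function names, the grid size, and the `sigma` value are all hypothetical.

```python
import math

def center_bias_weights(height, width, sigma=0.3):
    """Gaussian weights peaking at the image center.

    `sigma` is in units of image size and is a hypothetical choice.
    """
    weights = []
    for row in range(height):
        y = (row + 0.5) / height - 0.5   # vertical offset from center, in [-0.5, 0.5]
        weight_row = []
        for col in range(width):
            x = (col + 0.5) / width - 0.5   # horizontal offset from center
            weight_row.append(math.exp(-(x * x + y * y) / (2 * sigma ** 2)))
        weights.append(weight_row)
    return weights

def weighted_change(change_map, weights):
    """Weighted average of a change map, so central changes count for more."""
    total = sum(w for row in weights for w in row)
    score = sum(c * w
                for c_row, w_row in zip(change_map, weights)
                for c, w in zip(c_row, w_row))
    return score / total

def single_change(height, width, row, col):
    """A toy change map with one changed location."""
    return [[1.0 if (r, c) == (row, col) else 0.0 for c in range(width)]
            for r in range(height)]

weights = center_bias_weights(5, 5)
center_score = weighted_change(single_change(5, 5, 2, 2), weights)
corner_score = weighted_change(single_change(5, 5, 0, 0), weights)
print(f"center: {center_score:.3f}, corner: {corner_score:.3f}")
```

Under this weighting, an identical change counts for more near the center of the image than in a corner.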
Predicting behavior is one thing, but does the model capture how the human brain actually does any of this? To answer this, we recorded electrical brain activity using EEG while participants held a real-world scene in their working memory.
Using a technique called Representational Similarity Analysis (RSA), we compared the geometric relationships among the neural patterns recorded from the scalp directly to those among the population vectors inside the neural network. Right on cue, during the crucial delay period when the image was absent from the screen, the patterns of voltage across the scalp closely matched the representational geometry of the model’s visual areas, indicating that there is indeed a link between the two representational spaces.
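The core logic of RSA can be sketched in a few lines. For each system, build a representational dissimilarity matrix (RDM) that records how different the patterns for every pair of scenes are, then correlate the two RDMs. The version below is a simplified illustration, not our analysis pipeline: the patterns are made up, dissimilarity is taken as 1 minus the Pearson correlation, and the RDMs are compared with Pearson correlation for brevity (rank correlations are often used in practice).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rdm_upper(patterns):
    """Upper triangle of the RDM: 1 - r for every pair of conditions."""
    n = len(patterns)
    return [1.0 - pearson(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)]

def rsa_score(patterns_a, patterns_b):
    """Correlate the two RDMs; the systems may have different dimensionality."""
    return pearson(rdm_upper(patterns_a), rdm_upper(patterns_b))

# Hypothetical patterns for the same four scenes (NOT real data).
# Each row is one scene; columns are model units or EEG channels.
model_patterns = [
    [1.0, 0.2, 0.1, 0.8, 0.5],
    [0.9, 0.3, 0.2, 0.7, 0.4],
    [0.1, 0.9, 0.8, 0.2, 0.6],
    [0.2, 0.8, 0.9, 0.1, 0.3],
]
eeg_patterns = [
    [0.5, -0.2, 0.1],
    [0.4, -0.1, 0.2],
    [-0.3, 0.6, 0.0],
    [-0.2, 0.5, -0.1],
]

score = rsa_score(model_patterns, eeg_patterns)
print(f"RDM correlation: {score:.2f}")
```

Because only the pairwise dissimilarities are compared, the two systems never need to share units: a handful of scalp electrodes and thousands of network units can be placed in the same analysis.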
