Goal-Directed Active Vision System

A goal-directed active vision agent that learns where to look — trained with Inverse Reinforcement Learning on human gaze data, outperforming passive scan baselines on the COCO-Search18 benchmark.

Overview

Humans don't scan scenes randomly. They fixate strategically — driven by a goal. This project models that behavior.

Given a target object category, the agent learns a fixation policy that mimics human search behavior using IRL, guided by semantic features from CLIP and spatial priors from real eye-tracking data.

Results

| Method | SS (↑) | TFP (↑) |
| --- | --- | --- |
| Passive Baseline (random scan) | 0.31 | 0.29 |
| Goal-Directed Agent (ours) | 0.58 | 0.54 |

The goal-directed agent improves on the passive baseline on both metrics: Search Score (SS) rises from 0.31 to 0.58, and Target Fixation Proportion (TFP) rises from 0.29 to 0.54.
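This README does not spell out how the metrics are computed. As a rough sketch of one plausible reading, the snippet below treats Target Fixation Proportion as the fraction of search trials in which at least one fixation lands inside the target's bounding box within a fixation budget. That definition, the function names, and the `max_fixations` parameter are assumptions made for illustration, not the project's evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x, y, width, height), COCO-style


def hits_target(fixation: Point, box: Box) -> bool:
    """True if the fixation point lies inside the bounding box."""
    x, y = fixation
    bx, by, bw, bh = box
    return bx <= x <= bx + bw and by <= y <= by + bh


def target_fixation_proportion(
    scanpaths: Sequence[Sequence[Point]],
    boxes: Sequence[Box],
    max_fixations: int = 6,  # hypothetical fixation budget
) -> float:
    """Assumed definition: share of trials whose first `max_fixations`
    fixations include at least one hit on the target box."""
    if not scanpaths:
        return 0.0
    found = 0
    for path, box in zip(scanpaths, boxes):
        if any(hits_target(f, box) for f in path[:max_fixations]):
            found += 1
    return found / len(scanpaths)
```

Under this reading, a random-scan baseline and the trained agent can be compared by passing each one's scanpaths through the same function with the same budget.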

Architecture

Input Image (COCO scene)
        │
        ▼
   CLIP ViT-L/14
   (semantic patch embeddings)
        │
        ▼
   Fixation Policy Network (PyTorch)
   trained via IRL on COCO-Search18 gaze data
        │
        ▼
   Sequential Fixation Sequence
   (goal-conditioned, human-like)
        │
        ▼
   Target Found / Search Terminated
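The data flow in the diagram can be sketched as a rollout loop. The version below is a deliberately simplified stand-in, written in plain Python to stay dependency-free: it scores each patch by the dot product between its embedding and a goal embedding, masks already-visited patches, and greedily picks the most probable patch until the target is hit or the step budget runs out. The dot-product scoring replaces the trained Fixation Policy Network, and `rollout`, `target_patches`, and `max_steps` are hypothetical names, so treat this as an illustration of the loop rather than the project's model.

```python
import math
from typing import List, Sequence, Set


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; -inf scores get probability 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(
    patch_feats: Sequence[Sequence[float]],
    goal_feat: Sequence[float],
    target_patches: Set[int],
    max_steps: int = 6,
) -> List[int]:
    """Return a greedy scanpath as a list of patch indices."""
    visited: Set[int] = set()
    path: List[int] = []
    # Never take more steps than there are patches to visit.
    for _ in range(min(max_steps, len(patch_feats))):
        # Stand-in policy: goal-conditioned score per patch,
        # with visited patches masked out.
        scores = [
            -math.inf if i in visited
            else sum(p * g for p, g in zip(feat, goal_feat))
            for i, feat in enumerate(patch_feats)
        ]
        probs = softmax(scores)
        nxt = max(range(len(probs)), key=probs.__getitem__)
        path.append(nxt)
        visited.add(nxt)
        if nxt in target_patches:  # "Target Found"
            break
    return path  # ends with a hit or after the budget: "Search Terminated"
```

Sampling from `probs` instead of taking the argmax would give a stochastic policy, which is closer to how human scanpaths vary between observers.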

Stack

Component	Tool
Vision backbone	CLIP ViT-L/14 (OpenAI)
Deep learning	PyTorch
Computer vision	OpenCV
IRL training	Custom reward learning loop
Dataset	COCO-Search18 (human gaze sequences)

Dataset

COCO-Search18 covers 18 target object categories, with human eye-tracking fixation sequences recorded during goal-directed visual search tasks. The project uses these sequences to learn a reward function via IRL that captures human search behavior.
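Before gaze data can supervise a patch-level policy, pixel fixations have to be mapped onto the patch grid. The sketch below shows one way to do that. It assumes each fixation record is a dict with parallel `X` and `Y` coordinate lists, which matches my recollection of the released COCO-Search18 JSON but should be checked against the actual files. The grid size and image size are parameters because this README does not state them, and the sample values in the usage note are made up.

```python
from typing import Dict, List


def fixations_to_cells(
    record: Dict,
    img_w: float,
    img_h: float,
    grid_w: int,
    grid_h: int,
) -> List[int]:
    """Map pixel fixations to row-major grid cell indices.

    Assumes record["X"] and record["Y"] are equal-length lists of
    pixel coordinates (field names are an assumption).
    """
    cells = []
    for x, y in zip(record["X"], record["Y"]):
        # Scale to grid units, then clamp so points on or past the
        # image border still fall in a valid cell.
        col = min(max(int(x / img_w * grid_w), 0), grid_w - 1)
        row = min(max(int(y / img_h * grid_h), 0), grid_h - 1)
        cells.append(row * grid_w + col)
    return cells
```

For example, with a hypothetical record `{"task": "bottle", "X": [840.0, 100.0], "Y": [525.0, 50.0]}` on a 1680x1050 image and a 32x20 grid, the first fixation falls in column 16 of row 10 and the second in column 1 of row 0.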

How It Works

1. Feature Extraction: each image patch is encoded using CLIP ViT-L/14 to get rich semantic representations.
2. Reward Learning (IRL): a reward function is learned from human fixation sequences in COCO-Search18, capturing what makes a fixation "good" given a target goal.
3. Policy Training: a fixation policy is trained to maximize the learned reward, producing goal-conditioned sequential fixations.
4. Evaluation: the agent is evaluated against a passive baseline using Search Score (SS) and Target Fixation Proportion (TFP).
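The reward-learning step can be illustrated with a one-step simplification of maximum-entropy IRL (the formulation cited in the references). Here the reward is assumed linear in the patch features, the policy is a softmax over patch rewards, and each update moves the weights toward the features of the human-chosen patch and away from the features the current policy expects. The project's "custom reward learning loop" is not described in this README, so this is a sketch of the general technique under those assumptions, not the author's implementation; `maxent_irl_step` and `fit_reward` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Demo = Tuple[Sequence[Sequence[float]], int]  # (patch features, expert patch index)


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def maxent_irl_step(
    w: Sequence[float],
    patch_feats: Sequence[Sequence[float]],
    expert_choice: int,
    lr: float = 0.5,
) -> List[float]:
    """One gradient-ascent step on the log-likelihood of the expert fixation.

    Gradient = expert features - expected features under the current policy.
    """
    rewards = [sum(wi * fi for wi, fi in zip(w, f)) for f in patch_feats]
    probs = softmax(rewards)
    dim = len(w)
    expected = [
        sum(p * f[d] for p, f in zip(probs, patch_feats)) for d in range(dim)
    ]
    expert = patch_feats[expert_choice]
    return [w[d] + lr * (expert[d] - expected[d]) for d in range(dim)]


def fit_reward(demos: Sequence[Demo], dim: int, epochs: int = 200,
               lr: float = 0.5) -> List[float]:
    """Fit linear reward weights to a set of one-step demonstrations."""
    w = [0.0] * dim
    for _ in range(epochs):
        for patch_feats, choice in demos:
            w = maxent_irl_step(w, patch_feats, choice, lr)
    return w
```

In the full sequential setting the expected features would come from rolling out the policy over whole scanpaths rather than a single choice, and the linear reward would typically be replaced by a network trained in PyTorch on CLIP features.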

References

COCO-Search18 — Yang et al., 2020
CLIP — Radford et al., 2021
Inverse Reinforcement Learning — Ziebart et al., 2008

Author

Arnav — @https_arnav · GitHub · Portfolio
