SAI RAJESWAR
I am a Senior Research Scientist at ServiceNow Research in Montreal, engaged in fundamental AI research. My current focus is on the effective learning of vision-language generative models that form the backbone of Foundation Multimodal Models. My work also spans building generalist agent models by effectively learning vision-language-action models. In summary, I aim to integrate perception and action to improve real-world applicability, always with an eye towards responsible impact on society at large.
I obtained my Ph.D. at MILA, University of Montreal, supervised by Prof. Aaron Courville, where I was included in the Dean’s Honor List for the graduating year 2022-23. During my Ph.D., I had the opportunity to work as a Research Scientist Intern at Google DeepMind and Google Research. Previously, I obtained a master's degree in computer science from IIT Delhi, where I was a recipient of the Prof. A.K. Sinha Best Student Award for the graduating year 2015-16.
See my Google Scholar page for more research.
- If my line of work aligns with your interests, I am open to collaborations.
Recent Research (To be Updated)
MULTIMODAL FOUNDATION WORLD MODELS FOR GENERALIST EMBODIED AGENTS
GenRL allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain’s dynamics, and learn the corresponding behaviors in imagination. It exhibits strong multi-task generalization in locomotion and manipulation domains. Read more.
REPRESENTING POSITIONAL INFORMATION IN GENERATIVE WORLD MODELS FOR OBJECT MANIPULATION
We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two variants of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. Read more.
EQUIVARIANT ADAPTATION OF LARGE PRETRAINED MODELS
Equivariant networks are specifically designed to ensure consistent behavior with respect to a set of input transformations, leading to higher sample efficiency and more accurate and robust predictions. Read more
THE UNSOLVED CHALLENGES OF LLMS AS GENERALIST WEB AGENTS: A CASE STUDY
In this work, we investigate the challenges associated with developing goal-driven AI agents capable of performing novel tasks in a web environment using zero-shot learning. Our primary focus is on harnessing the capabilities of large language models (LLMs) as generalist web agents interacting with HTML-based user interfaces (UIs). Read more
MASTERING THE UNSUPERVISED REINFORCEMENT LEARNING BENCHMARK FROM PIXELS
In this work, we study the Unsupervised Reinforcement Learning Benchmark (URLB) and propose a new method to solve it, using unsupervised model-based RL to pre-train the agent and a task-aware fine-tuning strategy, combined with a newly proposed hybrid planner, Dyna-MPC, to adapt the agent to downstream tasks. Read more
CHOREOGRAPHER: LEARNING & ADAPTING SKILLS IN IMAGINATION
We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Read more
EFFICIENT DYNAMICS MODELING IN INTERACTIVE ENVIRONMENTS WITH KOOPMAN THEORY
We approach efficient dynamics modeling through the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution while accounting for the agent’s action at every time step. Read more