Weijian Deng

Student Recruitment

I am looking for highly motivated students interested in ambitious research at the intersection of multimodal intelligence, physical 3D vision, and diffusion models. The directions below are active starting points; I also welcome strong proposals that connect naturally to these themes.

Self-Evaluating and Self-Improving MLLMs

This research extends my prior PhD work on unsupervised model evaluation to large multimodal models. The goal is to build AI agents that can understand and evaluate their own strategies, identify failure modes, and take the right actions to improve in new environments, starting with spatial intelligence.

Physical 3D Vision and World Models

Physical 3D vision and world models ground self-improving AI agents in the physical world. These projects study how agents explore, perceive, understand, manipulate, and interact with 3D environments. Generated videos, geometry, and scenes should be useful, not merely plausible, so we also study physical consistency, geometry, object behavior, world models, and aircraft-design applications.

Understanding and Steering Diffusion Models

We analyze which properties of diffusion models lead to specific behaviors, then use those insights to make generation more useful. Example questions include which generated images improve generalization, how multi-view generation helps models understand the observed world, and how latent spaces can be explored or post-trained.

I. Predicting Model Generalization

Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels

MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts

What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

A Bag-of-Prototypes Representation for Dataset-Level Applications

Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation

AutoEval: Are Labels Always Necessary for Classifier Accuracy Evaluation?

What Does Rotation Prediction Tell Us about Classifier Accuracy under Varying Testing Environments?

Are Labels Always Necessary for Classifier Accuracy Evaluation?

II. Monitoring Model Reliability

Toward a Holistic Evaluation of Robustness in CLIP Models

An Empirical Study into What Matters for Calibrating Vision-Language Models

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)

Adaptive Calibrator Ensemble: Navigating Test Set Difficulty in Out-of-Distribution Scenarios

On the Strong Correlation Between Model Invariance and Generalization

III. 3D Modeling & Generation

Pos3R: 6D Pose Estimation for Unseen Objects Made Easy

Unsupervised Decomposition of 3D Shapes into Expressive and Editable Extruded Profile Primitives

Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions

Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

3D-GPT: Procedural 3D Modeling with Large Language Models

Ray Deformation Networks for Novel View Synthesis of Refractive Objects

Differentiable Neural Surface Refinement for Transparent Objects

IV. Enhancing Model Generalization

Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy

Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification

Similarity-preserving Image-image Domain Adaptation for Person Re-identification

Domain Alignment with Triplets

Rethinking Triplet Loss for Domain Adaptation

Fine-grained Classification via Categorical Memory Networks

Split to Learn: Gradient Split for Multi-Task Human Image Analysis

Ranking Models in Unlabeled New Environments

SVDNet for Pedestrian Retrieval