Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks

University of Maryland, Baltimore County  ·  University of Notre Dame  ·  UNC Chapel Hill

Overview of Survey

Multimodal unlearning requires identifying effective intervention points within the model pipeline. Figure 2 illustrates methods spanning data-side, training-time, architecture-constrained, and decoding-time stages, producing an updated model (MFM′). Training-free approaches instead apply direct parameter or representation edits (Δ).

Figure 2: System-level intervention points for multimodal unlearning across the model pipeline.
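The distinction between optimization-based stages and training-free edits can be made concrete with a toy, single-parameter sketch (illustrative only, not code from any surveyed method): a training-time edit runs gradient *ascent* on a forget loss, while a training-free edit applies a direct parameter delta Δ (here, naive task-vector negation).

```python
# Toy illustration of two unlearning pathways on a 1-parameter "model".
# Training-time edit: gradient ascent on the forget loss.
# Training-free edit: subtract a weight delta directly (no optimization loop).

def forget_loss(w, x, y):
    # Squared error on the forget example; unlearning aims to *raise* this.
    return (w * x - y) ** 2

def grad_forget_loss(w, x, y):
    return 2 * x * (w * x - y)

def training_time_unlearn(w, x, y, lr=0.1, steps=5):
    """Gradient ascent on the forget loss (a common training-time edit)."""
    for _ in range(steps):
        w = w + lr * grad_forget_loss(w, x, y)  # ascend, not descend
    return w

def training_free_unlearn(w, w_pretrained):
    """Task-vector negation: remove the weight delta attributed to the
    forgotten data, i.e. a direct parameter edit (the Delta pathway)."""
    delta = w - w_pretrained
    return w - delta  # recovers w_pretrained in this toy single-task case

w_finetuned = 1.8   # nearly fit to the forget example (x=1, y=2)
w_pre = 0.5
w_tt = training_time_unlearn(w_finetuned, x=1.0, y=2.0)
w_tf = training_free_unlearn(w_finetuned, w_pre)

# The training-time edit increases loss on the forget example;
# the training-free edit restores the pre-finetuning weight.
assert forget_loss(w_tt, 1.0, 2.0) > forget_loss(w_finetuned, 1.0, 2.0)
assert w_tf == w_pre
```

Real multimodal foundation models apply the same ideas over high-dimensional parameter tensors, typically with retain-set regularization to preserve utility; this sketch only contrasts the two control pathways.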

We organize multimodal unlearning via a system-first taxonomy across five intervention stages: Data-Side Interventions (Section 3.1); Training-Time Edits (Section 3.2); Architecture-Constrained Unlearning (Section 3.3); Training-Free Unlearning (Section 3.4); Decoding-Time Unlearning (Section 3.5).

Figure 1: Taxonomy of multimodal unlearning by intervention stage and control pathway.

Evaluation Metrics

Evaluation relies on metric suites that assess forgetting efficacy, utility retention, robustness, and efficiency, as summarized in Figure 3. Detailed metric definitions and evaluation protocols are deferred to Appendix C.

Figure 3: Evaluation dimensions and representative metrics for multimodal unlearning.

Applications of Multimodal Unlearning

Multimodal unlearning enables the selective removal of specific identities, attributes, or concepts without full retraining, while preserving overall capability and stability. Detailed use cases and representative studies are provided in Appendix F.

Figure 4: Core application scenarios of multimodal unlearning.

Contact

This repository is actively maintained and continuously updated 🚀.
If you notice any issues or would like your work included, please open an issue or contact us:

📧 smsarwar96@gmail.com

BibTeX

@article{sarwar2026mm-unlearning-survey,
  title = {{Multimodal Unlearning Across Vision, Language, Video, and Audio: Survey of Methods, Datasets, and Benchmarks}},
  author = {Sarwar, Nobin and Roy Dipta, Shubhashis and Liu, Zheyuan and Patil, Vaidehi},
  year = {2026},
  doi = {10.36227/techrxiv.176945748.88280394/v1},
  url = {https://doi.org/10.36227/techrxiv.176945748.88280394/v1},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  month = jan
}