I am currently a URECA student at MMLab@NTU, where I am fortunate to be supervised by Prof. Ziwei Liu. I am also grateful for the extensive help I have received from Bo Li. My research interests currently lie in large-scale multi-modality models.
During my time at Shaoxing No.1 High School, I actively participated in the Olympiad in Informatics. I have also competed in the ICPC, earning a gold medal at the 2022 Kunming Asia Regional Contest and a silver medal at the 2022 Nanjing Asia Regional Contest.
Advances in large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework covering over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short of achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH, which draws on continuously updated news and online forums to assess models' generalization abilities in the wild, offering a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions for navigating the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain the LIVEBENCH leaderboard at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.
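For readers who want to try the toolkit, the sketch below shows roughly how an evaluation run can be launched from Python by invoking the released command-line interface. The model backend, checkpoint, task name, and flags are illustrative and may differ across versions, so please consult the repository README for the exact options.

```python
# Illustrative sketch only: launches the lmms-eval CLI on one model/task pair.
# The backend name, checkpoint, and task below are examples, not a fixed recipe.
import subprocess

cmd = [
    "python", "-m", "lmms_eval",
    "--model", "llava",                                      # model backend to evaluate
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",   # example checkpoint
    "--tasks", "mme",                                        # one of the supported tasks
    "--batch_size", "1",
    "--log_samples",                                         # keep per-sample outputs for inspection
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)
```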
@article{zhang2024lmms,
  title             = {LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models},
  author            = {Zhang, Kaichen and Li, Bo and Zhang, Peiyuan and Pu, Fanyi and Cahyono, Joshua Adrian and Hu, Kairui and Liu, Shuai and Zhang, Yuanhan and Yang, Jingkang and Li, Chunyuan and Liu, Ziwei},
  year              = {2024},
  eprint            = {2407.12772},
  archiveprefix     = {arXiv},
  primaryclass      = {cs.CV},
  google_scholar_id = {qjMakFHDy7sC},
}
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is imperative for tuning vision-language models (VLMs). Nevertheless, the current availability of vision-language instruction-response pairs remains limited in quantity, diversity, and creativity, posing challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multi-modal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning. Human evaluation reveals that it effectively aligns with the user's intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.
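To make the in-context structure concrete, the hypothetical snippet below sketches what one multimodal instruction-response pair and its accompanying in-context examples might look like. The field names are made up for exposition and do not reflect the exact schema of the released files.

```python
# Hypothetical illustration of an in-context instruction-response pair;
# field names are invented for exposition, not the released MIMIC-IT schema.
example = {
    "instruction": "What is the person in the image about to do?",
    "response": "They are reaching for the door handle, so they are about to open the door.",
    "images": ["frame_000.jpg"],      # visual input(s) tied to this pair
    "in_context": [                   # related pairs that form the conversational context
        {
            "instruction": "Describe the scene.",
            "response": "A person stands in a hallway facing a closed door.",
            "images": ["frame_000.jpg"],
        }
    ],
}
```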
@article{li2023mimicit,
  title             = {MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
  author            = {Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Pu, Fanyi and Yang, Jingkang and Li, Chunyuan and Liu, Ziwei},
  year              = {2023},
  eprint            = {2306.05425},
  archiveprefix     = {arXiv},
  primaryclass      = {cs.CV},
  google_scholar_id = {u5HHmVD_uO8C},
}