# Artificial Intelligence
# text-image
# multi-modality
# long-read
Published On: February 8, 2024 (Last updated on: April 15, 2024)
1198 words · 6 min
For a long time, machine learning models (deep learning models) could not understand more than one modality: a model either knew how to perform text-based tasks or how to work with images, but not both. Since natural intelligence is not limited to a single modality, AI researchers want models that can handle multimodal data, so that an AI can read and write text while also seeing images, watching videos, and hearing audio.