Multi-Modal AI

Home »  » Multi-Modal AI

Natural language processing of audio files has used quite often in the last decade as the quality has continued to scale with computing power. In 2023, several leading AI models began incorporating visual input interfaces.  Essentially, you can point a smartphone camera at an image and ask the AI to interface with it. Visual recognition technology has been utilized in several disciplinary applications (like plant identification). However, direct incorporation of visual inputs into frontier Large language models (LLM) like ChatGPT and Google Gemini have increased the likelihood of these tools intersecting with teaching and learning in higher education.

How Does It Work?

Most LLMs use a few common recognition algorithms. There are two common ways for these LLMs to analyze visual material. First, through the chatbot or API interface, users can upload an image, video file directly to the software. Then, they can use the prompting system to respond to their request. For example, Google Gemini’s Ultra Pro software (available mid-2024) can analyze an hour of video or 11 hours of audio. Second, app interfaces allow direct uploading of pictures from smartphones. With the release of GPT-4o in May of 2024, robust and general visual analysis is available to all users of frontier LLM models.

What Does This Mean for Teaching and Learning?

This depends greatly on your context and perspectives on AI use. It is safe to assume that some students in your courses are willing to pay a subscription for access to leading frontier models. There will be an increasing likelihood that out of class activities and assignments that use visual media cannot be assumed to be free of AI assistance.  For example, even a quick picture of a biochemistry problem on a computer screen is sufficient for ChatGPT to provide a correct solution. LLMs are also able to solve Parsons Problems, a type of programming puzzle where students must assemble mixed pieces of code into a logical order. LLMs can also view pictures and translate across languages. While current models struggle with certain physics and math concepts, there are multiple, publicly available visual scientific concepts that can be solved by taking a simple photograph with a phone. This complicates significantly any out of class visual analysis in courses that prioritize such skills and practice.

If you are concerned about your course’s specific content and learning outcomes and how multi-modal AI might allow learners to avoid parts of visual identification process, you can reach out to and we will be happy to explore how multi-modal AI might influence the structure of your learning activities.

Are There Creative Uses for This Technology?

While not yet widely available, the paid frontier models can be used for demonstrative purposes in you have access to them.  Historian Benjamin Breen has used them to translate historical manuscripts, guess redacted words in documents and draw social and political trends from historic advertisements. You might consider modeling the decision-making process that accompanies image identification and then compare that to a multi-modal LLM.  Multi-modal AI are also capable translators of written work notes on whiteboards to typed text in document form.  Students can save their progress on group and team work and ask an LLM to transcribe their work, and possibly even summarize. You might also try this as a critical learning exercise. If you use substantive written text during your in-class activities and you feel comfortable integrating student AI-use into your courses, try having a visual AI transcribe, analyze and summarize the written content. Then, you can collectively evaluate the output and determine what information may have been misconstrued, what mirrored your own classes analysis, and what additional information might be needed for additional evaluation.

Module Navigation

Leave Your Feedback