The Next Generation of ChatBot - Visual ChatGPT



ChatGPT has been public for roughly six months since its release in November 2022, and in that time it has acquired approximately 100 million users worldwide according to two sources [1] [2], reaching that scale faster than any previous product from the tech giants. Still, ChatGPT lacks one of the biggest human needs in digital conversation and messaging: the capability to process visual information.
ChatGPT is a language model that excels at text-based conversations and reasoning. In contrast, Stable Diffusion is a foundation model that specializes in generating images from textual descriptions or prompts. By combining the capabilities of these two models, it would be possible to create an AI product that can handle both textual and visual information, thereby meeting a wide range of human needs.
Recently, a few researchers from Microsoft Research Asia released a paper proposing exactly such a solution: Visual ChatGPT.
Visual ChatGPT
It enables you to send, receive, and edit images during chatting.

Visual ChatGPT is a system, not a model, that can handle both textual and visual information.
Could we build a ChatGPT-like system that also supports image understanding and generation? One intuitive idea is to train a multi-modal conversational model. However, building such a system would consume a large amount of data and computational resources. Besides, another challenge comes that what if we want to incorporate modalities beyond languages and images, like videos or voices? Would it be necessary to train a totally new multi-modality model every time when it comes to new modalities or functions? - Microsoft Research Asia
In other words, training a new multi-modal model from scratch would be very costly and time-consuming, and it would not scale if the system later needs to handle modalities beyond text and images.
How Does It Work?
Visual ChatGPT takes existing Visual Foundation Models (VFMs) and combines them with ChatGPT to build a system that can process textual as well as visual information. But how do you glue them together and make them work as one? Meet the manager: the Prompt Manager.
Prompt Manager
The sole purpose of the Prompt Manager is, of course, to manage prompts. It uses prompts to communicate with the VFMs, and uses the VFMs to generate or process visual information. On the other side, the Prompt Manager feeds prompts to, and receives prompts generated by, the most intelligent assistant on the planet, which is ChatGPT!

One of the beautiful things about Visual ChatGPT is its use of prompts. It shows that prompts may well become "the first-class AI programming language".
According to the paper, the Prompt Manager's supported functions are:
• Explicitly tells ChatGPT the capability of each VFM and specifies the input-output formats.
• Converts different visual information, for instance, PNG images, depth images, and mask matrices, to language format to help ChatGPT understand.
• Handles the histories, priorities, and conflicts of different Visual Foundation Models.
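The first of these functions can be pictured as a registry of tool descriptions that gets rendered into ChatGPT's system prompt. The sketch below is a simplified illustration, not the project's actual code; the class and tool names are hypothetical, though the convention of referring to images by file name does come from the paper.

```python
from dataclasses import dataclass

@dataclass
class VFMTool:
    """Metadata the Prompt Manager keeps for one Visual Foundation Model."""
    name: str
    description: str   # what the model does, stated in plain language for ChatGPT
    inputs: str        # expected input format (e.g. an image file path)
    outputs: str       # what the tool returns

def build_system_prompt(tools: list[VFMTool]) -> str:
    """Render every tool's capability and input-output format into one prompt."""
    lines = ["You can use the following visual tools:"]
    for t in tools:
        lines.append(f"- {t.name}: {t.description} "
                     f"Input: {t.inputs}. Output: {t.outputs}.")
    lines.append("Images are referred to by file name, e.g. image/abc123.png.")
    return "\n".join(lines)

tools = [
    VFMTool("ImageCaptioning", "Describes the content of an image.",
            "an image file path", "a text caption"),
    VFMTool("Text2Image", "Generates an image from a text prompt.",
            "a text description", "an image file path"),
]
prompt = build_system_prompt(tools)
```

Because ChatGPT only sees these textual descriptions, how distinguishable and precise they are directly determines how reliably the right VFM gets picked.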
Multi-modal Dialogue

The screenshot above shows us how Visual ChatGPT handles multi-modal dialogue.
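Under the hood, each turn of such a dialogue is a loop: ChatGPT either answers directly or asks for a VFM, the Prompt Manager runs that tool, and the result is appended to the conversation for ChatGPT to reason over. The following is a minimal, hedged sketch of that loop in the ReAct "Action / Observation" style the project builds on; the `llm` callable and the exact reply format are stand-ins, not the real API.

```python
import re

def run_turn(user_msg: str, llm, tools: dict) -> str:
    """One conversational turn: the language model may answer directly or
    request a visual tool; tool results are fed back as observations."""
    scratchpad = f"User: {user_msg}\n"
    for _ in range(5):                       # cap the number of tool calls
        reply = llm(scratchpad)
        m = re.search(r"Action: (\w+)\nAction Input: (.+)", reply)
        if not m:                            # no tool requested: final answer
            return reply
        tool, arg = m.group(1), m.group(2)
        result = tools[tool](arg)            # run the VFM, e.g. write an image file
        scratchpad += f"{reply}\nObservation: {result}\n"
    return "Sorry, I could not finish that request."
```

Because images are passed around as file names inside the scratchpad text, ChatGPT never "sees" pixels; it only chains tools that do.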
If you want to dig deeper into the implementation, I suggest exploring the source code on GitHub. The code is written in Python, a recommended language if you want to learn AI professionally.
Limitations
No system is perfect, and neither is Visual ChatGPT. Yes, it is a promising solution, but for now there are a few limitations you should be aware of:
• Dependence on ChatGPT and VFMs
Visual ChatGPT relies heavily on ChatGPT to assign tasks and on VFMs to execute them. The performance of Visual ChatGPT is thus heavily influenced by the accuracy and effectiveness of these models.
• Heavy Prompt Engineering
Visual ChatGPT requires a significant amount of prompt engineering to convert VFMs into language and make these model descriptions distinguishable. This process can be time-consuming and requires expertise in both computer vision and natural language processing.
• Limited Real-time Capabilities
Visual ChatGPT is designed to be general. It tries to decompose a complex task into several subtasks automatically. Thus, when handling a specific task, Visual ChatGPT may invoke multiple VFMs, resulting in limited real-time capabilities compared to expert models specifically trained for a particular task.
• Token Length Limitation
The maximum token length in ChatGPT may limit the number of foundation models that can be used. If there are thousands or millions of foundation models, a pre-filter module may be necessary to limit the VFMs fed to ChatGPT.
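Such a pre-filter could be as simple as ranking tool descriptions by relevance to the user's query and keeping only the top few. The sketch below scores relevance by naive word overlap purely for illustration; a real system would more likely use embedding similarity, and the paper only suggests that some pre-filter "may be necessary".

```python
def prefilter_tools(query: str, descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Keep the k tool names whose descriptions best match the user query,
    scored here by word overlap (a stand-in for embedding similarity)."""
    q = set(query.lower().split())
    scored = sorted(descriptions.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

Only the surviving k descriptions would then be placed into ChatGPT's prompt, keeping it under the token limit regardless of how many VFMs are registered.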
• Security and Privacy
The ability to easily plug and unplug foundation models may raise security and privacy concerns, particularly for remote models accessed via APIs. Careful consideration and automatic checks are needed to ensure that sensitive data is not exposed or compromised.