The AI Enthusiast

The Next Generation of ChatBot - Visual ChatGPT

Equan P.

ChatGPT has been public for almost six months since its release in November 2022, and it has already acquired approximately 100 million users worldwide, according to two sources [1] [2]. Yet even though it gained so many users in record time compared with earlier products from tech giants, ChatGPT still lacks one of the biggest human needs in digital conversation and messaging: the capability to process visual information.

ChatGPT is a language model that excels at text-based conversations and reasoning. In contrast, Stable Diffusion is a foundation model that specializes in generating images from textual descriptions or prompts. By combining the capabilities of these two models, it would be possible to create an AI product that can handle both textual and visual information, thereby meeting a wide range of human needs.

Logically, if we combined these two models' strengths, we would get an AI product capable of handling both text and visual information. Recently, a few researchers from Microsoft Research Asia released a paper proposing exactly such a combination: Visual ChatGPT.

Visual ChatGPT

It enables you to send, receive, and edit images while chatting.

Visual ChatGPT is a system, not a model, that can handle both textual and visual information.

Could we build a ChatGPT-like system that also supports image understanding and generation? One intuitive idea is to train a multi-modal conversational model. However, building such a system would consume a large amount of data and computational resources. Besides, another challenge comes that what if we want to incorporate modalities beyond languages and images, like videos or voices? Would it be necessary to train a totally new multi-modality model every time when it comes to new modalities or functions? - Microsoft Research Asia

In other words, training a new model would be very costly and time-consuming; besides, it is not practical if we later need the model to handle information beyond text and images.

How Does It Work?

Visual ChatGPT takes existing Visual Foundation Models, or VFMs, and combines them with ChatGPT to make a system that can process text as well as visual information. But how do we glue them together and make it all work? Meet the manager: the Prompt Manager.

Prompt Manager

The sole purpose of the Prompt Manager is, of course, to manage prompts. It uses prompts to communicate with the VFMs, and uses the VFMs to generate or process visual information. On the other side, the Prompt Manager feeds prompts to, and receives prompts from, the most intelligent assistant on the planet: ChatGPT!

One of the beautiful things about Visual ChatGPT is its use of prompts. It suggests that the prompt may well become "the first-class AI programming language".

According to the paper, the Prompt Manager's supported functions are:
• Explicitly tells ChatGPT the capability of each VFM and specifies the input-output formats.
• Converts different visual information, for instance, PNG images, depth images, and mask matrices, to language format to help ChatGPT understand.
• Handles the histories, priorities, and conflicts of different Visual Foundation Models.
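The first two functions above can be pictured as a small tool registry: each VFM is described to ChatGPT in plain language along with its input and output format. Below is a minimal sketch of that idea in Python. All class and tool names here are hypothetical illustrations, not the actual Visual ChatGPT code.

```python
# Minimal sketch of a Prompt-Manager-style registry (hypothetical names,
# not the real Visual ChatGPT implementation).

class VFMTool:
    """Describes one Visual Foundation Model so ChatGPT can choose it."""
    def __init__(self, name, description, inputs, outputs, run):
        self.name = name
        self.description = description  # capability, told to ChatGPT
        self.inputs = inputs            # expected input format
        self.outputs = outputs          # produced output format
        self.run = run                  # callable that invokes the model

class PromptManager:
    def __init__(self):
        self.tools = {}

    def register(self, tool):
        self.tools[tool.name] = tool

    def system_prompt(self):
        """Render every tool's capability and I/O format as text for ChatGPT."""
        lines = ["You can use the following visual tools:"]
        for t in self.tools.values():
            lines.append(f"- {t.name}: {t.description} "
                         f"(input: {t.inputs}, output: {t.outputs})")
        return "\n".join(lines)

# Example: register a stand-in image-generation tool.
pm = PromptManager()
pm.register(VFMTool(
    name="Text2Image",
    description="Generate an image from an English text prompt",
    inputs="text prompt",
    outputs="path to a PNG file",
    run=lambda prompt: "image/generated_0001.png",  # stand-in for a real VFM
))
print(pm.system_prompt())
```

The rendered system prompt is what "explicitly tells ChatGPT the capability of each VFM"; the `run` callable is where a real model such as Stable Diffusion would be invoked.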

Multi-modal Dialogue

The screenshot above shows us how Visual ChatGPT handles multi-modal dialogue.

If you want to dig deeper into its implementation, I suggest exploring the source code on GitHub. The code is written in Python, a recommended language if you want to learn AI professionally.


No system is perfect, and neither is Visual ChatGPT. Yes, it is a promising solution, but for now there are a few limitations you should be aware of:

Dependence on ChatGPT and VFMs
Visual ChatGPT relies heavily on ChatGPT to assign tasks and on VFMs to execute them. The performance of Visual ChatGPT is thus heavily influenced by the accuracy and effectiveness of these models.

Heavy Prompt Engineering
Visual ChatGPT requires a significant amount of prompt engineering to convert VFMs into language and make these model descriptions distinguishable. This process can be time-consuming and requires expertise in both computer vision and natural language processing.

Limited Real-time Capabilities
Visual ChatGPT is designed to be general. It tries to decompose a complex task into several subtasks automatically. Thus, when handling a specific task, Visual ChatGPT may invoke multiple VFMs, resulting in limited real-time capabilities compared to expert models specifically trained for a particular task.
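To see why invoking several VFMs limits real-time performance, it helps to picture the decomposition as a sequential pipeline: each subtask's output feeds the next model, so latency accumulates per call. The sketch below is purely illustrative; the tool names and chain are hypothetical, not taken from the paper.

```python
# Illustrative only: a decomposed task runs several stand-in "VFM" calls
# in sequence, so total latency is the sum of the per-model latencies.

def run_pipeline(steps, initial_input):
    """Run a list of (tool_name, fn) subtasks sequentially."""
    result = initial_input
    for name, fn in steps:
        result = fn(result)  # each call would be one VFM inference
    return result

# Hypothetical chain, e.g. detect -> segment -> inpaint.
steps = [
    ("Detect",  lambda x: f"boxes({x})"),
    ("Segment", lambda x: f"mask({x})"),
    ("Inpaint", lambda x: f"edited({x})"),
]
print(run_pipeline(steps, "photo.png"))  # → edited(mask(boxes(photo.png)))
```

An expert model trained end-to-end for the same task would replace this whole chain with a single inference, which is why it can be much faster.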

Token Length Limitation
The maximum token length in ChatGPT may limit the number of foundation models that can be used. If there are thousands or millions of foundation models, a pre-filter module may be necessary to limit the VFMs fed to ChatGPT.
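Such a pre-filter could be as simple as scoring each VFM's description against the user's request and keeping only the best matches that fit a token budget. The sketch below uses naive word overlap purely to illustrate the idea; it is a hypothetical module, not something described in the paper.

```python
# Hypothetical pre-filter: rank VFM descriptions by word overlap with the
# request and keep the top matches under a rough word budget.

def prefilter(request, tool_descriptions, budget_words=50):
    """tool_descriptions: dict mapping tool name -> one-line description."""
    request_words = set(request.lower().split())

    def score(item):
        _, desc = item
        return len(request_words & set(desc.lower().split()))

    selected, used = [], 0
    for name, desc in sorted(tool_descriptions.items(), key=score, reverse=True):
        cost = len(desc.split())  # crude stand-in for a token count
        if used + cost > budget_words:
            break
        selected.append(name)
        used += cost
    return selected

tools = {
    "Text2Image": "generate an image from a text prompt",
    "ImageCaption": "describe the content of an image in text",
    "Depth2Image": "generate an image from a depth map",
}
print(prefilter("please generate an image of a dog from this text", tools,
                budget_words=16))  # → ['Text2Image', 'Depth2Image']
```

A production version would use embeddings and a real tokenizer instead of word overlap, but the shape is the same: only the shortlisted descriptions are placed into ChatGPT's prompt.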

Security and Privacy
The ability to easily plug and unplug foundation models may raise security and privacy concerns, particularly for remote models accessed via APIs. Careful consideration and automatic checks are needed to ensure that sensitive data is not exposed or compromised.

Insight Corner


Auto-GPT, The Future of Autonomous AI Agent

The Do-It-All Machine: a to-do list that completes itself. Yes, you read that correctly. Some people say that Auto-GPT is an example of AGI, or Artificial General Intelligence, a type of artificial intelligence that can understand, learn, and apply knowledge across a wide range of tasks and domains, much like a human. It contrasts with narrow or specialized AI, which is designed to perform specific tasks or solve particular problems. An AGI system would be capable of learning any intellectual task that a human can, adapting to new situations, and exhibiting a high level of autonomy.


Directing AI Model with Prompt Engineering

Prompt engineering is the process of crafting a prompt (a question or command) that directs a language model along a specific path, facilitating specific tasks and limiting the scope of the model's output. The goal is to ensure that the language model understands the question and answers it correctly and relevantly.
