Microsoft has open-sourced a visual version of ChatGPT

In the last two days, Microsoft has open-sourced a visual version of ChatGPT that can handle both text and images. Its repository picked up 6.8K stars on GitHub within two days, which shows how popular it is. Even more striking is the official demo, so let's look at it specifically. For example, if you ask it through text input to generate a picture of a cat, it really generates a cat picture. Based on that cat image, you can ask it to do edge detection, or to replace the cat with a dog, and all of this is controlled through text. You can also input a picture and ask it to extract the content: ask what color the motorcycle in the image is, and it says black; then ask it to remove the motorcycle, and it really removes it for you. Judging from these two examples, the results are quite impressive. But many people assume this open-sourced visual version of ChatGPT is a multimodal model trained from scratch, and that is wrong. This visual version of ChatGPT corresponds to a Microsoft paper, so let me briefly explain how it actually works.

The core innovation of the paper is a prompt manager, similar to the dialogue manager in a dialogue system, which integrates the text-only ChatGPT with 22 visual foundation models. You enter a piece of text; ChatGPT understands what the text is asking for and extracts the task category — for example, that you want to generate a picture of a beautiful girl, or do style transfer, and so on. Once the task type has been extracted, the manager calls the corresponding visual model that is already in place.
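As a rough sketch of how such a manager might route a request, the snippet below extracts task types from text via keyword matching and looks up a model for each. The keywords, task names, and model registry are all illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a prompt manager: inspect the user's text,
# extract task types, and dispatch to matching visual models.
# Keywords, task names, and models below are illustrative only.

def detect_tasks(text: str) -> list[str]:
    """Naive task extraction via keyword matching (an assumption)."""
    keyword_to_task = {
        "depth": "depth_estimation",
        "cartoon": "style_transfer",
        "style": "style_transfer",
        "generate": "text_to_image",
        "remove": "object_removal",
    }
    tasks = []
    for keyword, task in keyword_to_task.items():
        if keyword in text.lower() and task not in tasks:
            tasks.append(task)
    return tasks

# Registry mapping task types to (hypothetical) visual foundation models.
MODEL_REGISTRY = {
    "depth_estimation": "MiDaS-style depth model",
    "style_transfer": "Stable Diffusion (image-to-image)",
    "text_to_image": "Stable Diffusion (text-to-image)",
    "object_removal": "inpainting model",
}

def route(text: str) -> list[str]:
    """Return the models the manager would invoke for this request."""
    return [MODEL_REGISTRY[t] for t in detect_tasks(text)]

print(route("Generate a red flower under the depth condition, then make it cartoon style"))
```

A real system would of course let the language model itself do the task extraction rather than keyword matching, but the dispatch structure is the same.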

Let’s take a look at the corresponding example in the paper and walk through its process. How many parts is the figure divided into? The top-left corner is the input part. The top-right corner, which takes up most of the figure, is the visual model library — the 22 visual foundation models. Between them sits the prompt manager; below are the generated intermediate results, which form the process; and finally there is an output part.

Let’s look at it part by part. What is this example trying to do? It turns a yellow flower into a cartoon-style red flower. What are the approximate steps? The first step is the input section in the top-left corner, which is the query. In the query you enter a picture of a yellow flower, and then a piece of text in the right half. That text is what we generally call a prompt template, and prompt templates are very important in ChatGPT. The text roughly says: please generate a red flower, conditioned on the depth map of the original image, and then turn that red image into a cartoon style. Notice that it describes two operations, and the text is quite verbose. It does not simply tell the dialogue manager "please generate a cartoon picture of a red flower"; instead it uses a long passage to spell out what you want to do. As I said at the beginning, the manager needs to understand this text, extract the different task types from it, and then call a different visual model for each task type. If you just say "generate a cartoon picture of a red flower", it is very hard for the manager to understand. That is why we expand what we want step by step, so the dialogue manager can more easily extract the corresponding task types from it. This verbose style of prompting has a technical name in the AI field: chain-of-thought reasoning. A while ago, Amazon published the first multimodal chain-of-thought work; anyone who is interested can go take a look.
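To make the contrast concrete, here is a terse prompt next to a step-by-step one. The wording is illustrative, not the paper's exact template; the point is that an explicitly enumerated prompt maps cleanly onto one task per step:

```python
# A terse prompt leaves the manager guessing which models to chain:
terse = "Generate a cartoon picture of a red flower."

# A step-by-step prompt spells out each intermediate operation
# (illustrative wording, not the paper's actual template):
verbose = (
    "Step 1: estimate the depth map of the input yellow-flower image. "
    "Step 2: generate a red flower conditioned on that depth map. "
    "Step 3: convert the resulting red-flower image into a cartoon style."
)

# Each "Step N:" clause maps onto exactly one task type:
steps = [s.strip() for s in verbose.split("Step") if s.strip()]
print(len(steps))  # → 3, one entry per dispatchable task
```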

After the manager sees this text, two task types are extracted: depth estimation and style transfer. Having extracted these two task types, it picks the mature model corresponding to each task from the right half, which is the visual foundation model library. For example, Stable Diffusion specializes in generation and style transfer, while the BLIP model specializes in vision-language understanding and generation — it can do both. ControlNet is a famous new generative model that came out in 2023: a text-to-image model that adds extra conditions to generate better-controlled images. The library collects 22 such good-quality models for the corresponding tasks. The chosen models are then executed step by step in the process: first a depth map is obtained; then a red image carrying the depth information, i.e., a red flower; and after getting the red flower, it is converted to a cartoon style. This is done step by step, and that is the overall process. It is not a multimodal model trained from scratch, but rather it leverages models that are already mature in vision. The news is that GPT-4, which is a multimodal model, will be released soon, and it should basically cover all the visual tasks this visual ChatGPT handles. Officially there is also a point about linking it to video, so let’s wait and see.
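The three-stage flow above can be sketched as a chain of model calls. The functions below are hypothetical stand-ins that just record what they did so the chain is visible; a real pipeline would pass image tensors through ControlNet- and Stable-Diffusion-style models, not strings:

```python
# Hypothetical stand-ins for the visual foundation models; each one
# records what it did so the chain of calls is visible.

def estimate_depth(image: str) -> str:
    """Stand-in for a depth-estimation model."""
    return f"depth_map({image})"

def generate_with_depth(depth_map: str, prompt: str) -> str:
    """Stand-in for a depth-conditioned text-to-image model (ControlNet-style)."""
    return f"image('{prompt}' conditioned on {depth_map})"

def to_cartoon_style(image: str) -> str:
    """Stand-in for an image-to-image style-transfer model."""
    return f"cartoon({image})"

# Step-by-step execution, mirroring the flow in the paper's example:
depth = estimate_depth("yellow_flower.png")
red = generate_with_depth(depth, "a red flower")
result = to_cartoon_style(red)
print(result)
# → cartoon(image('a red flower' conditioned on depth_map(yellow_flower.png)))
```

Chaining the outputs like this — rather than training one end-to-end model — is exactly why the system can be assembled from 22 off-the-shelf models.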
