
Tagging performance of GPT-4o under in-context learning:
We can see that the accuracy is very high and there are no unnecessary, annoying "hallucinated" tags; its performance is far superior to the traditional WD1.4 tagger.
Similarly, Gemini 1.5 Flash achieves comparable accuracy.
——————————————————————————————————
Update_2024_10_02
1. Added an "Enable time interval" option. Users can now choose whether to enable a time interval between image transmissions; the default is "off". (Note: some service providers, such as OpenRouter, may limit request frequency based on your account balance, and some also have mechanisms for detecting automated scripts. If the script suddenly starts receiving frequent errors without the "time interval" enabled, it may have triggered such a mechanism.)
***2. If you're using the free Gemini API, you must enable the "time interval": exceeding 15 requests per minute will result in errors. If you're using a paid Gemini API, this limit doesn't apply, and you can safely disable the "time interval". (A minimal rate-limiting sketch follows after these notes.)
3. It is recommended to switch from "gemini-1.5-flash-latest" to "gemini-1.5-flash-002". Based on my tests, this new Gemini Flash model shows a significant improvement in vision capability, surpassing GPT-4V and approaching the level of GPT-4o.
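For reference, here is a minimal sketch of what the "time interval" option amounts to, assuming the free-tier limit of 15 requests per minute; the function and variable names are illustrative, not the script's actual code:

```python
import time

REQUESTS_PER_MINUTE = 15                    # free Gemini tier limit
INTERVAL = 60.0 / REQUESTS_PER_MINUTE       # 4 seconds between requests

def tag_images(image_paths, tag_one_image, enable_time_interval=True):
    """Tag each image, optionally sleeping between requests to stay under the rate limit."""
    for path in image_paths:
        tag_one_image(path)                 # hypothetical function that sends one request
        if enable_time_interval:
            time.sleep(INTERVAL)
```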
__________________________________________________________________________________
Update_2024_09_30
I've updated a great instruction that enables the model to consistently generate detailed tags in danbooru format. This instruction is derived from the tagger created by LEOSAM:
jiayev/GPT4V-Image-Captioner (github.com)
Many thanks to the author for his hard work.
_______________________________________________________________________________
User Guide:
1. Tagger_All-in-One is an image tagger, similar to the renowned wd14 tagger. It can automatically annotate your dataset using the most powerful LLMs currently available (GPT-4o, Claude 3.5, Gemini 1.5 Pro). This not only significantly reduces manual labor (especially when dealing with hundreds, thousands, or even tens of thousands of training images) but, through optimized instructions and in-context learning techniques, may even surpass human annotation quality.
2. The usage is straightforward. If Python is not installed on your computer, please download and install it first (Welcome to Python.org); Python 3.10 or above is recommended. (After installation, open the command prompt and enter 'python --version' to check whether Python was installed successfully.) Then download Tagger_All-in-One and extract it to a folder of your choice. In the address bar of that folder, type 'cmd' to open a command prompt there, then enter 'python gui.py' to launch the script. The first launch may take longer because it installs certain dependencies.
3. The GUI contains four pages: "Gemini (without In-context Learning)", "Gemini (In-context Learning)", "OpenAI (without In-context Learning)", and "OpenAI (In-context Learning)". The two Gemini pages can use the free Gemini API, which I'll explain how to obtain at the end. The two OpenAI pages can use OpenAI's API as well as any OpenAI-compatible API (such as DeepBricks and OpenRouter).
4. Now, I'll explain the parameters in OpenAI (without In-context Learning):
a. First, fill in your API URL in the "api url" field. For example, if you're using OpenAI, the base URL is https://api.openai.com, so you should enter https://api.openai.com/v1/chat/completions. For any OpenAI-compatible API, you need to append /v1/chat/completions to the base URL.
b. Then, click the "Browse" button to select your dataset.
c. In the "Temperature (0-2)" field, enter the temperature value. Lower values make the LLM's output more stable and adherent to your commands but also more rigid; higher values make the output less stable and less adherent but more creative. I usually set it to 1.0. (Based on my tests, a temperature of 0.85 works better for Gemini Flash 1.5)
d. In the "Model" field, you can input the model you want to use. Note that the model name must be the standard name recognizable by the service provider. For instance, if you're using OpenAI's API and want to use GPT4o for tagging, you must enter gpt-4o, not GPT4o, Gpt4o, or GPT4O, which the service provider might not recognize. Even for the same model, the standard name may differ across service providers. You can find the standard model names from your API service provider.
e. Enter the API key you obtained from the model service provider in the "API key" field; you only need to enter it the first time you use the script. (***It's advisable to change your API key frequently and revoke used API keys on the model service provider's website to prevent unauthorized use.)
f. Since most current LLMs accept images of at most about 1 million pixels, and the cost of using LLMs is closely tied to the resolution of the input images, you must compress your images if their resolution is too high. I use LANCZOS for image compression. You can specify the target pixel count in the "Image Pixels" field; the valid range is [400000, 1000000], and values outside this range will cause an error. Generally, for GPT-4o and Claude 3.5, images of 500,000 pixels are enough to preserve their image recognition capabilities. (The script creates a temp folder to temporarily store the compressed images; the compression process does not affect your original dataset. A compression sketch appears after this list.)
***If you don't fill in anything here, the script will upload the images from your dataset to the model as-is. If the images in your dataset are too large (over 1 million pixels), the model may return no data or incorrect data.
g. Then, you need to fill in the instruction. Simply put, this step tells the LLM how to describe your images. Generally, more complex and clearer instructions lead to better results, but many models struggle to execute overly complex instructions. This brings us to the in-context learning technique mentioned at the beginning. In short, it is a way of "brainwashing" the LLM: it forces the model to answer according to a template you have prepared. It greatly improves the model's ability to follow complex instructions, but it also increases cost. If you value annotation quality highly and want the model to perform at its best, I strongly recommend using the two "In-context Learning" pages (a few-shot message sketch appears after this list). For specific usage, you can refer to this article:
自动打标器的重大更新 || The major update of "Automated tagger with Openai-Compatible API" | Civitai
h. After filling in all the parameters, click "Run", and the script will start using the selected model to automatically annotate the images in your dataset. For each processed image, the script creates a .txt file with the same name in your dataset folder and stores the tags in that file. (Illustrative code sketches for the compression, request, and in-context learning steps follow below.)
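As a point of reference for step f, here is a minimal sketch of LANCZOS downscaling with Pillow; the folder name and the target-pixel logic are illustrative, not the script's exact code:

```python
from PIL import Image
import math
import os

def compress_image(src_path, dst_dir="temp", max_pixels=500_000):
    """Downscale an image with LANCZOS so its total pixel count stays under max_pixels."""
    img = Image.open(src_path)
    w, h = img.size
    if w * h > max_pixels:
        scale = math.sqrt(max_pixels / (w * h))
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = os.path.join(dst_dir, os.path.basename(src_path))
    img.save(dst_path)            # the original file is left untouched
    return dst_path
```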
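Here is a minimal sketch of what a single request against an OpenAI-compatible /v1/chat/completions endpoint looks like (steps a, c, d, e, and h), assuming the requests library; the instruction text and file names are placeholders:

```python
import base64
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"   # base URL + /v1/chat/completions
API_KEY = "sk-..."                                        # your API key
MODEL = "gpt-4o"                                          # the provider's standard model name
INSTRUCTION = "Describe this image with detailed danbooru-style tags, separated by commas."

def tag_one_image(image_path, temperature=1.0):
    """Send one image to the model and save the returned tags to a same-named .txt file."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": MODEL,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": INSTRUCTION},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    resp = requests.post(API_URL, headers={"Authorization": f"Bearer {API_KEY}"}, json=payload)
    resp.raise_for_status()
    tags = resp.json()["choices"][0]["message"]["content"]
    with open(os.path.splitext(image_path)[0] + ".txt", "w", encoding="utf-8") as f:
        f.write(tags)
```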
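And a sketch of how in-context learning (step g) shapes the conversation: an example image and a hand-written answer in the exact danbooru format are inserted before the real image, so the model imitates that template. The example tags shown in the comment are purely illustrative:

```python
def build_icl_messages(instruction, example_image_b64, example_answer, target_image_b64):
    """Build a few-shot message list: instruction + example Q/A pair, then the real image."""
    return [
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{example_image_b64}"}},
        ]},
        # The assistant turn is the answer template the model is pushed to imitate,
        # e.g. "1girl, solo, long hair, school uniform, outdoors, cherry blossoms"
        {"role": "assistant", "content": example_answer},
        {"role": "user", "content": [
            {"type": "text", "text": instruction},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{target_image_b64}"}},
        ]},
    ]
```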
————————————————————————————————
Introduction to Gemini's free API and how to obtain it:
1. There are currently two Gemini models: Gemini 1.5 Flash and Gemini 1.5 Pro. Gemini 1.5 Pro's visual capabilities are about 90% of GPT-4o's and Claude 3.5 Sonnet's, while Gemini 1.5 Flash is a lite version of Gemini 1.5 Pro, with visual capabilities around 85-90% of the Pro model; it is stronger than Claude 3 Haiku and rarely makes mistakes when describing images. For the free API, Google's stated limits for Gemini 1.5 Flash are up to 15 requests per minute and 1,500 requests per day, which is more than enough for image tagging. For Gemini 1.5 Pro, the free API only allows two requests per minute, which is practically unusable, so we're not considering it for now.
2. You can get a Gemini API key here: https://ai.google.dev/aistudio.
Click on "Get your API key" in the red circle, then follow the prompts to easily obtain your free API. (Note: Gemini API services may not be available in some countries and regions. Check Google's official website for specifics.)
Description:
Automate Tagger with OpenAI-Compatible API (few-shot learning version)
Name: automatedTaggerWithFreeGeminiAPI_fewShotLearningV12.zip
Size (KB): 356
Type: Archive
Pickle Scan Result: Success
Pickle Scan Message: No Pickle imports
Virus Scan Result: Success