
This is mostly a mirror of what I have on GitHub for v1, plus the new v2 filtering. Idk when I will post v2 to GitHub, as GitHub has a policy that doesn't permit links to suggestive content, and some of the deprecated Danbooru tags are, apparently, f*cking links. Come on, Danbooru taggers, wtf are you doing... I don't feel like clearing them out, and I'll need to pull new lists anyway, as some tags are still missing, even though there are already roughly 780k tags.
P.S. I am not responsible for whatever active and deprecated tags you'll find in those big lists, as I didn't edit them, but it's fun to open Notepad++ and search for some words, "meme" for example, and see what memes are out there. The only editing I've done is adding the wanted_tags_2 list to the end of it, so if any important tags are missing from that lossy pull, they won't get deleted.
Please read the version descriptions, as I will add instructions there.
https://ko-fi.com/anzhc | https://www.patreon.com/anzhc
Q: What is that?
A: This is a script to filter out tags that you don't want to see in training. V1 showed improved editability and quality when an autotagged dataset was filtered through the unwanted_tags list.
Q: What's the difference between v1 and v2?
A: V1 uses the WD1.4 Tagger tag list, which consists of only 11k tags that I went through manually. It is provided in full so you can make your own lists; you're likely gonna dislike my tag decisions, as I don't know most Japanese tags (but dw, serafuku is intact).
V2 uses Danbooru tags as its base, not the whole set, but as much as I could pull at the time. It consists of roughly 780k tags, which gives high-quality filtering for datasets made of Danbooru images and Danbooru tags. It can be compatible with some other boorus, but, for example, rule34 uses 1girls instead of 1girl, which will not be matched, and you might end up filtering it out at stage 3. Please add variations of tags from your favorite sites to the lists if you see that they are missing.
Q: How would I know if tags I want are missing from the lists?
A: V2 works in 3 stages:
Stage 1 - main filtering - it goes through all tags in the lists you've specified (you get to choose what to filter out: general, copyright, artist, meta, character tags) and removes those that it finds in the lists.
Stage 2 - additional filtering - this step uses the V1 list unwanted_tags_2, which removes tags that I deem destructive for training, as they imply action, are not representable in an image, are redundant, or are outright useless. (But again, you might not think so, or might want to add something. Feel free to change your local list, or hit me up so I can put it up for others to download.)
Stage 3 - outlier filtering - after stage 2 is done (this can take some time, depending on the size of the dataset), you are given a choice to delete tags that were not found in any tag list. This is usually good practice, and will take care of new artists, characters, meta, copyright and other tags that I was not able to pull at the time, or that were created after I pulled the tags. This is unlikely to affect important tags.
Stage 3 also shows you which tags are classified as missing, so you can see them and add them to the lists if you see fit.
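The three stages above can be sketched roughly like this. This is only an illustration of the idea, not the actual script: the function names, the in-memory lists and the example tags (including what sits in the stage 2 list) are all made up for the sketch, and it assumes comma-separated captions.

```python
def load_tags(lines):
    """Normalize tag list entries: trim, lowercase, underscores to spaces."""
    return {line.strip().lower().replace("_", " ") for line in lines if line.strip()}


def filter_caption(caption, category_lists, extra_unwanted, known_tags, drop_unknown=False):
    """Run one comma-separated caption through the three stages.

    Returns (kept_tags, unknown_tags). All parameter names are hypothetical.
    """
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Stage 1 block list: union of the category lists the user chose to filter out.
    stage1_blocklist = set().union(*category_lists)

    kept, unknown = [], []
    for tag in tags:
        key = tag.lower().replace("_", " ")
        if key in stage1_blocklist:   # Stage 1: main filtering
            continue
        if key in extra_unwanted:     # Stage 2: additional filtering
            continue
        if key not in known_tags:     # Stage 3: outliers, not found in any list
            unknown.append(tag)
            if drop_unknown:
                continue
        kept.append(tag)
    return kept, unknown


# Tiny made-up lists standing in for the real category files.
artists = load_tags(["some_artist"])
meta = load_tags(["highres"])
general = load_tags(["1girl", "smile", "serafuku"])
unwanted_2 = load_tags(["looking_at_viewer"])
known = artists | meta | general | unwanted_2

kept, unknown = filter_caption(
    "1girl, smile, highres, some artist, looking at viewer, 1girls",
    category_lists=[artists, meta],
    extra_unwanted=unwanted_2,
    known_tags=known,
    drop_unknown=True,
)
print(kept)     # tags that survived all three stages
print(unknown)  # tags reported as missing from every list
```

Here 1girls ends up in the unknown list, which is the rule34 case mentioned above: adding it to one of your lists would keep it from being dropped at stage 3.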
Q: That's cool, but any examples?
A: Sure, i guess
f1 is v1, f2 is v2. The only tags removed manually are the character appearance ones.
The dataset for Sona was gathered using CherryScraper, which I also recently released v2 of; it added an automatic mode and support for some more sites.
Q: How feasible is it to use for large datasets? Is it fast?
A: V1 is extremely fast and will process tens of thousands of images in a couple of minutes at most.
V2 uses a ~75 times larger tag base and does take much longer, but I wouldn't worry about that for datasets of under a couple thousand images; it'll just take a few minutes. I don't think I implemented multi-threading, so that's a possible optimization for the future.
In general, it saves much more time than it wastes, imho.
Q: Why did you pull so many tags?
A: I am stupid, and decided to pull deprecated tags too, because it didn't cost extra time to do so. But I guess that improves compatibility with tag lists from other boorus, as it accounts for lots of misspellings and different ways to name a single tag on Danbooru. Pulling tag lists excluding deprecated ones would make a good optimization.
Q: Why didn't you just use DeepDanbooru tag list?
A: Because it is not segmented. The DeepDanbooru tag list is a single file with ~110k tags, while I pulled tags by category, which makes them much more flexible in how you can use them. All I could do with the DeepDanbooru list is check whether there are tags that are not in it and delete them, that's about it.
But I see why you might think it would've been a good pick for V1, but... I'm just a human, it would take me a month to go over that list manually, so I went with the compromise option. Not to mention that I would obviously fuck up even more tags I don't know the meaning of that way.
Github link - https://github.com/Anzhc/Anzhc-s-Dataset-Processing-Tools
Though, as I said, v2 is not there yet; I need to filter out the links first.
Description:
Just put your .txt files into the input tags folder
Launch the .bat file
Take your tags from the output folder
launch_unwanted
Tags that are listed in unwanted_tags.txt will be removed from the prompt. Feel free to add and remove any you want.
Overlapping tags like "shirt" and "white shirt" will be filtered: the simpler variation is removed. Tags of similar complexity that merely differ will not be touched.
It will not take care of duplicate tags.
It is supposed to work with underscores in unwanted tags, but I don't think I handled the input the same way, so replace all underscores with spaces in Dataset Tag Editor first, just in case.
Unwanted mode can help when dataset is autotagged. Not a panacea, but it helps.
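As a rough illustration of what unwanted mode does, here is a minimal sketch. It is not the script itself: the function names are invented, it assumes comma-separated captions, and it treats a tag as a "simpler variation" when all of its words also appear in a longer tag, which is only one possible reading of the rule. Unlike the note above about underscores, this sketch normalizes underscores on both the list side and the input side.

```python
def normalize(tag):
    """Compare tags case-insensitively, treating underscores as spaces."""
    return tag.strip().lower().replace("_", " ")


def filter_unwanted(caption, unwanted):
    """Drop unwanted tags, then drop tags fully covered by a more specific tag."""
    unwanted_norm = {normalize(t) for t in unwanted}
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    tags = [t for t in tags if normalize(t) not in unwanted_norm]

    # "shirt" is redundant next to "white shirt": its words are a proper subset.
    word_sets = [set(normalize(t).split()) for t in tags]
    kept = []
    for i, tag in enumerate(tags):
        redundant = any(
            word_sets[i] < other for j, other in enumerate(word_sets) if j != i
        )
        if not redundant:
            kept.append(tag)
    # Exact duplicates are equal sets, not proper subsets, so they are left alone.
    return ", ".join(kept)


result = filter_unwanted(
    "1girl, shirt, white_shirt, black shirt, simple background",
    unwanted=["simple_background"],
)
print(result)
```

In this example "white_shirt" and "black shirt" both stay, since neither is a simpler form of the other, while the plain "shirt" goes.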
launch_wanted
Launch wanted tags, and ONLY the tags specified in the respective text files will be left in your dataset. I find it useful for tags that were pulled from Danbooru or another site directly.
Please back up your text files in case anything goes wrong.
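Wanted mode boils down to a whitelist. A minimal sketch, assuming comma-separated captions in .txt files; the function names and the folder arguments are placeholders for illustration, not what the script actually uses. Writing results to a separate output folder leaves the input files untouched, which doubles as a cheap backup.

```python
from pathlib import Path


def normalize(tag):
    """Compare tags case-insensitively, treating underscores as spaces."""
    return tag.strip().lower().replace("_", " ")


def keep_wanted(caption, wanted):
    """Keep only the tags that appear in the wanted list."""
    wanted_norm = {normalize(w) for w in wanted if w.strip()}
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    return ", ".join(t for t in tags if normalize(t) in wanted_norm)


def process_folder(input_dir, output_dir, wanted):
    """Filter every .txt caption in input_dir and write the result to output_dir."""
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(input_dir).glob("*.txt")):
        caption = src.read_text(encoding="utf-8")
        (output / src.name).write_text(keep_wanted(caption, wanted), encoding="utf-8")


print(keep_wanted("1girl, smile, some_artist, highres", ["1girl", "smile"]))
```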
Trigger words:
Name: tagFilteringV1V2_v1.zip
Size (KB): 162
Type: Archive
Pickle scan result: Success
Pickle scan message: No Pickle imports
Virus scan result: Success