My tunes have been ripped, over the years, with a variety of tools, ranging from grip (on Linux) through AudioGrabber and EAC (Windows) and most recently iTunes (Mac). I've definitely butted heads with iTunes' XML file when I wanted to copy part of my library from my Mac to a laptop for DJing.

Here's where things get tricky. I ship off a stack of CDs to get ripped. Most are new to the MP3 collection. Some are repeats. However, the tagging on the repeats will vary in subtle ways: typos, capitalization differences, extra metadata that I might have added, and differences among the ripping tools I've used over the years. As a result, there's no easy, sure-fire way to tell that an old file and a new file came from the same CD.

I suppose I could hack together some sort of tool that tosses all of the words together (i.e., word vectors, which count how many times a word occurs anywhere in the tags) and takes dot products of the word vectors. That's a cheesy but possibly effective similarity metric that might help me cluster things, and then I'd have to work out how to merge the metadata from both ID3 and the iTunes XML. Getting this right is starting to sound like more than a weekend hack, so I was (perhaps vainly) hoping somebody else had already hacked together something along these lines.
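
To make the word-vector idea concrete, here's a rough sketch in Python. It assumes each track's tags have already been read into a plain dict; the function names and the example tags are made up for illustration, and nothing here reads ID3 or the iTunes XML. I've also divided the dot product by the vector lengths (cosine similarity), which is my own addition so that tracks with long tags don't automatically score higher.

```python
import math
import re
from collections import Counter


def word_vector(tags):
    """Count every word that appears anywhere in a track's tag values."""
    text = " ".join(str(value) for value in tags.values())
    # Lowercasing and splitting on non-alphanumerics hides capitalization
    # and punctuation differences between ripping tools.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def dot(a, b):
    """Dot product of two sparse word-count vectors."""
    return sum(count * b[word] for word, count in a.items())


def similarity(tags_a, tags_b):
    """Dot product scaled to the range 0..1 (cosine similarity)."""
    a, b = word_vector(tags_a), word_vector(tags_b)
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0


# Hypothetical tags: the same track as two tools might have tagged it,
# plus an unrelated track.
old = {"artist": "Miles Davis", "album": "Kind of Blue", "title": "So What"}
new = {"artist": "miles davis", "album": "Kind Of Blue (Remastered)", "title": "So What"}
other = {"artist": "John Coltrane", "album": "Giant Steps", "title": "Naima"}

print(similarity(old, new))    # high: nearly all words are shared
print(similarity(old, other))  # 0.0: no words in common
```

Pairs scoring above some cutoff would be candidates for "same source", but the cutoff would have to be tuned by hand, and a plain word count won't catch typos inside a word.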

The possibility of sending them tag info from what I've already ripped is intriguing, but presumably these people are all about high volume and their trained monkeys don't know how to do something custom along these lines. I'm probably better off doing it myself, but the problem is figuring out how to do it properly.