On the MIR Research blog, Elias points out that the team that organizes MIREX scored highest in a number of categories. Elias suggests that having insider knowledge of the specifics of an evaluation can give a submission an unfair advantage. I can see how this can happen, even unintentionally. A system that is built and tuned against a particular collection of music can overfit that collection. If you bring the overfit system to a new collection, its performance will tend to drop, since it is not tuned for the new collection; the overfitting hurts your system. However, if you are testing, training and tuning against a collection that happens to overlap the evaluation collection, then the overfitting can actually help your system in the final evaluation. So, for example, if the IMIRSEL system happened to be built and tuned against the same collection that was used in the MIREX evaluations, this may give that system an advantage. Knowing how difficult it is, in this age of copyright litigation, to build up a sizable test collection, it is not hard to imagine the IMIRSEL team using the MIREX evaluation collection to tune their system.
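To make the concern concrete, here is a toy sketch of the effect, using synthetic data and a scikit-learn nearest-neighbor classifier (nothing to do with any actual MIREX system, submission, or dataset): when the collection used to pick a parameter overlaps the evaluation collection, the reported evaluation score can come out inflated compared to tuning on disjoint data.

```python
# Toy illustration only: synthetic data, not MIREX audio or any real submission.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=900, n_features=20, n_informative=5, random_state=0)
# Three disjoint collections: one for training, one for tuning, one for evaluation.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
X_tune, X_eval, y_tune, y_eval = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def pick_k(tune_X, tune_y):
    """Pick the k that scores best on the given tuning collection."""
    return max(range(1, 20, 2),
               key=lambda k: KNeighborsClassifier(k).fit(X_train, y_train).score(tune_X, tune_y))

k_disjoint = pick_k(X_tune, y_tune)   # tuned on data disjoint from the evaluation collection
k_overlap = pick_k(X_eval, y_eval)    # tuned directly against the evaluation collection

for label, k in [("tuned on disjoint data", k_disjoint), ("tuned on eval data", k_overlap)]:
    acc = KNeighborsClassifier(k).fit(X_train, y_train).score(X_eval, y_eval)
    print(f"{label}: k={k}, evaluation accuracy={acc:.3f}")
```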

Now, I have worked closely with just about all of the members of the IMIRSEL team, and I know that they are an extremely talented set of individuals and are quite capable of creating a winning system. I also know that these folks are diligent, methodical and careful in how they design the MIREX evaluation to ensure the utmost in fairness. I have little doubt that the MIREX results accurately reflect the state of the art. However, not everyone has had the opportunity to work with the IMIRSEL team, and so not everyone may share the same confidence in the MIREX results. To alleviate these concerns, I suggest that the IMIRSEL team detail the datasets used in training and tuning the IMIRSEL submission, highlighting any overlaps with the data used in the MIREX evaluation. Such a disclosure would help address any concerns some folks may have about the fairness of MIREX. For MIREX to be a viable long-term evaluation, MIR researchers have to have high confidence that it is fair.

Comments:

Please find below the response from Cameron Jones of the IMIRSEL team - originally posted to the evalfest and music-ir lists (I hope he doesn't mind me reposting it):

The recent discussions on Elias Pampalk's
(http://mir-research.blogspot.com/2007/09/mirex-results-online.html)
and Paul Lamere's (http://blogs.sun.com/plamere/entry/is_mirex_fair)
blogs on whether or not it is fair for the IMIRSEL team to participate
in the MIREX tasks have upset several members of IMIRSEL, who always
strive for the utmost in fairness, thoroughness, and accuracy. As a
member of IMIRSEL, I can think of at least 4 reasons why IMIRSEL's
submission is legitimate and above board. It is my hope to dispel
any hint of impropriety that may have been cast by the recent
discussions.

1. The features selected were based on the features used by other
labs in previous MIREXes, and other publicly available research.
2. The feature extractors we used were not developed by someone with
direct access to the data.
3. The classifiers we used were standard WEKA implementations.
4. IMIRSEL is the "Systems EVALUATION Lab" not the "Systems
Development Lab" and therefore does not engage in large-scale MIR
system-building activities of any sustained length, meaning our
submission could not have been "tuned" as has been claimed.

1. Features were based on previous MIREX submissions.

The feature sets used in the IMIRSEL submission were based on a set of
features developed and used by Mandel and Ellis in MIREX 2005. Mandel
and Ellis had a submission which performed well in both the Audio
Artist Identification and Audio Genre Classification tasks in MIREX
2005. The IMIRSEL submission used several additional psychoacoustic
features based on the dissertation research of Kris West. To me, this
kind of submission embodies the goal of MIREX: the publication of
algorithms and the meaningful comparison of their performance, thus
allowing MIR researchers to make informed, justifiable decisions about
what algorithms to use. IMIRSEL did not systematically compare
possible feature sets against the data. We built one working feature
set and reported the findings. The EXACT SAME feature extractor was
used on ALL of our submissions (for all tasks in which IMIRSEL
participated). Our decision to use this feature set was based on
publicly available information and utilized no insider knowledge.
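For readers unfamiliar with that line of work: the Mandel and Ellis system was, roughly, song-level MFCC statistics fed to a support vector machine. Below is a minimal sketch of that general style of track-level feature, purely for illustration; it uses librosa rather than the M2K-based extractors we actually ran, and it is not the exact feature set of our submission.

```python
# Illustrative sketch of song-level MFCC statistics in the style of Mandel & Ellis (2005).
# Uses librosa for convenience; the actual IMIRSEL submission used M2K-based extractors.
import numpy as np
import librosa

def track_features(path, n_mfcc=20):
    """Summarize a track as the mean and upper-triangular covariance of its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    mean = mfcc.mean(axis=1)
    cov = np.cov(mfcc)                                       # shape: (n_mfcc, n_mfcc)
    iu = np.triu_indices(n_mfcc)                             # avoid duplicate covariance entries
    return np.concatenate([mean, cov[iu]])

# features = np.vstack([track_features(p) for p in track_paths])  # one row per track
```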

2. Feature extractors were developed by Kris West

The feature extractors used in the IMIRSEL submission were not
developed by anyone with direct access to this year's submission data.
In total, 2 feature extractors were written using M2K. The first
feature extractor, developed by Andreas Ehmann (a member of the
IMIRSEL lab), had a bug and crashed our servers, and thus did not
generate any meaningful features that could be used. Kris West (an
associated member of IMIRSEL, but not resident in Illinois) developed
the second feature extractor for the IMIRSEL submission. Kris did not
have any knowledge of, or direct access to IMIRSEL's databases when
building the extractors, beyond what was available on the MIREX Wiki
pages (public knowledge of the task definitions). Although Kris is
affiliated with IMIRSEL, he is pursuing his own independent research
agenda, and is not active in the day-to-day operations of the lab, nor
decisions about data management, beyond those posted to the public
MIREX Wiki.

3. Classifiers were not "tuned"

The classifiers we used were all standard Weka packages. We used
Weka's KNN and Poly SMO classifiers. The Poly SMO classifier was used
with default parameters. The KNN submission was likewise run with
minimal configuration; I believe K was set to 9 because, when we used
the default value of 10, it crashed on one of the splits of the
N-fold validation. The belief that we iteratively tuned and optimized
our submission is simply wrong.
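For concreteness, the setup was essentially off-the-shelf classifiers with default parameters run under N-fold cross-validation. The sketch below shows roughly what that looks like, using scikit-learn stand-ins and synthetic data rather than the actual Weka classes and extracted audio features we used:

```python
# Rough analogue only: scikit-learn stand-ins for Weka's IBk (KNN) and SMO (poly-kernel SVM),
# run on synthetic data in place of the extracted audio features.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           n_classes=4, random_state=0)

classifiers = {
    "KNN (k=9)": KNeighborsClassifier(n_neighbors=9),  # K set to 9, as described above
    "Poly SVM (defaults)": SVC(kernel="poly"),         # polynomial kernel, default settings
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)          # N-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} folds")
```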

4. Our passion is evaluation.

Overall, IMIRSEL is about evaluation, not algorithm development. While
it is true that IMIRSEL is responsible for the development of M2K, we
do not usually spend our days thinking about how to develop new
algorithms, approaches, or feature sets. What we do spend a lot of
time working on is improving the design of MIREX: the design and
selection of tasks, the evaluation metrics we use, and the validity
of the results we report. We spend much of our time looking
over past MIREX results data and interpreting it, looking for patterns
and anomalies, and overall trying to make sure that MIREX is being
executed to the best of our abilities.

Because of this, we did not have an "in house" submission lying around
that we could have submitted. We were not working on our submission
for months beforehand, carefully selecting the feature sets, tuning
the classifier parameters, etc. We were too busy preparing for, and
then running MIREX. Rather, our submission this year was an attempt to
demonstrate to the community the power and flexibility of some new M2K
modules which integrate existing music and data mining toolkits, like
Weka and Marsyas. M2K presents a robust, high-speed development
environment for end-to-end MIR algorithm creation. IMIRSEL's
submission was supposed to be an also-ran, developed in response to a
challenge from Professor Downie to see "what could be hacked together
in M2K, QUICKLY!!!! We do not have a lot of time to fuss." In
reality, the IMIRSEL submission was built in one evening.

Finally, as has been stated repeatedly, MIREX is not a competition and
there are no "winners". So, rather than wasting time arguing about
what is fair or not, we should be using this opportunity to learn
something. Why is it that IMIRSEL's algorithm performed as well as it
did (which is not, keep in mind, necessarily a statistically
significant performance difference from the next highest scores)?

M. Cameron Jones

Graduate Research Assistant
International Music Information Retrieval Systems Evaluation Lab

PhD Student
Graduate School of Library and Information Science

Posted by Kris West on September 19, 2007 at 06:34 AM EDT #

Cameron and all:

Thanks for the detailed response. This is probably a bad week to have to spend time responding to gripes from the bleacher seats. I apologize if any of the issues I've raised have upset members of the IMIRSEL team - I really do appreciate all of the effort the team puts into running MIREX. I realize that it is an incredible amount of work, and that you have spent many late nights over the last few weeks getting all the submissions to run and all of the results tallied and posted. Having to fend off criticism at this late hour is not something I would wish on anyone. Let me stress that I don't believe there was anything improper about the IMIRSEL submission. As I said in my blog post, I know the folks on the IMIRSEL team, have worked with many of them, and have an extremely high opinion of the quality of their work, their methodical approach, and their passion for evaluation.

My concerns are all about transparency and process. An outsider looking at MIREX sees a closed evaluation, where only a select few can see behind the curtain as to the details of the evaluation. When the team that organizes MIREX, the team that can peek behind the curtain, submits a system, an outsider may raise an eyebrow. When that team's system scores best in many of the evaluations, the outsider may raise two eyebrows. It's as if the Republicans were put in charge of counting all the votes for an election and when they posted the voting results, the Republicans won all of the races. Even if the votes were accurately counted, there will be many raised eyebrows.
The same thing can happen here with MIREX. Without care, people will lose confidence in the evaluation. I think the IMIRSEL team can keep that confidence high by ensuring that researchers who are privy to evaluation details not known to all other submitters do not submit systems. Making sure that all submitters have access to the same information is just common sense for a blind evaluation like MIREX. A submission from team 'IMIRSEL' gives the impression that those with privileged access to evaluation details are submitting a system (even if, as you point out, this is not actually the case). In my mind, this reflects badly on MIREX. That is the crux of my concern.

I really appreciate the effort that the IMIRSEL team makes in organizing and executing MIREX. MIREX is an important part of the MIR community and it will be a key driver of progress for many MIR tasks. I'm excited to see MIREX continue to grow every year. I hope you interpret my comments as an attempt to help make MIREX better.

Posted by Paul on September 19, 2007 at 06:37 AM EDT #

Hi Paul -

I did want to respond to your comment about transparency in MIREX (avoiding the mess I seem to have started on the lists).

In my mind, the evaluation process for MIREX is completely transparent. All of the raw results for all algorithm submissions are posted to the wiki, and anyone is free to confirm that the results we've computed are indeed correct. Case in point: we recently received a message from a member of the MIR community about the accuracy of the MIREX 2006 Audio Beat Tracking results. It turns out that a mathematical error on our part resulted in incorrect evaluation statistics being posted to the wiki (and subsequently in the 2006 MIREX posters, participant abstracts, and possibly other derivative publications). While it is disheartening to hear that we erred, the fact that someone was able to double-check our calculations speaks to the transparency we strive towards.

At the moment, the only "black box" in MIREX is the actual running of the code. However, as I hope MIREX participants will confirm, this process is hardly "fire-and-forget" for participants. We are continually in contact with participants about the details of their code and its execution. In the past, we have even given participants access to our cluster in order to make sure that their code is compiled and working properly. While this is hardly "glass box" transparency, we are in the continual process of improving the design of MIREX. Our recent efforts to develop a web-service based Do-It-Yourself (DIY) framework are specifically designed to give participants greater access to our datasets and computational resources. As a part of the MIREX DIY Service we are also granting participants greater access to the metadata of our audio and symbolic collections through web databases.

- Cameron

Posted by Cameron Jones on September 19, 2007 at 12:12 PM EDT #
