$\begingroup$

My goal is to detect musical instruments with AI (machine learning).

I'm currently using the YAMNet model for inference, but it covers a very wide range of categories, for example "Growling", "Printer", and "Piano". I wonder whether that makes it less precise at detecting instruments, since instrument classes are only a fraction of the total.

The description of the YAMNet model on Kaggle states:

You should expect to do some amount of fine-tuning and calibration to make YAMNet usable in any system that you build.

There is also NSynth, which comes with a large dataset of musical instrument samples, but it is used for synthesizing new sounds rather than for classifying or detecting instruments.

Would fine-tuning YAMNet on the NSynth dataset make sense in that case?

$\endgroup$

1 Answer

$\begingroup$

Fine-tuning YAMNet with a dataset like NSynth makes sense for your use case.

Why Fine-Tuning YAMNet is Beneficial

YAMNet is a pre-trained, general-purpose audio event classifier. It predicts 521 AudioSet classes, covering everything from musical instruments to environmental sounds. That breadth is useful, but it can dilute precision for a narrow group of categories such as musical instruments.

Challenges with YAMNet:

  • Wide Range of Classes: As you mentioned, YAMNet covers many kinds of sound, which can make it less precise for a specific group such as musical instruments. The model's capacity is spread across many unrelated sound types rather than focused on telling instruments apart.
  • Class Imbalance: Instrument sounds may be underrepresented in the training data compared to other categories, which can lower detection accuracy for those classes.
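As a quick baseline before any fine-tuning, you can ignore the non-instrument classes and rank only the instrument scores. The sketch below is a minimal illustration in plain Python. The helper name `instrument_scores`, the five-class toy list, and the score values are all hypothetical; real YAMNet output has 521 classes and per-frame scores that you would average over the clip first.

```python
def instrument_scores(clip_scores, class_names, instrument_names):
    """Keep only the scores whose class name is in `instrument_names`."""
    wanted = set(instrument_names)
    return {
        name: score
        for name, score in zip(class_names, clip_scores)
        if name in wanted
    }


# Hypothetical per-clip scores (already averaged over frames) for a toy
# five-class subset; these numbers are made up for illustration.
class_names = ["Growling", "Printer", "Piano", "Guitar", "Flute"]
clip_scores = [0.05, 0.40, 0.35, 0.10, 0.02]

scores = instrument_scores(clip_scores, class_names, ["Piano", "Guitar", "Flute"])
best = max(scores, key=scores.get)
print(best, scores[best])  # Piano 0.35
```

Here "Printer" has the highest raw score, but it is dropped because it is not an instrument class. This does not make the underlying model any better at instruments; it only removes irrelevant classes from the ranking.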

Using NSynth for Fine-Tuning

NSynth is a large dataset of musical instrument sounds: roughly 300,000 four-second notes, each labeled with its instrument family and sampled at 16 kHz, which matches the sample rate YAMNet expects. Fine-tuning YAMNet with NSynth can help in the following ways:

  1. Focused Learning: Training on a dataset that contains only musical instrument sounds makes the model more sensitive to the differences between instruments.
  2. Improved Accuracy: Fine-tuning adjusts the model's weights specifically for musical instruments, potentially improving precision and recall for these classes.
  3. Better Generalization: Fine-tuning on NSynth can help the model learn the nuances of different instruments. Because NSynth consists of isolated single notes, validate on real-world recordings to confirm that the improvement carries over.
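One common way to do this is to keep YAMNet frozen and train a small classifier on its 1024-dimensional embeddings, which is transfer learning rather than full fine-tuning of every weight. The sketch below assumes TensorFlow and TensorFlow Hub are installed and that you have already loaded NSynth clips as mono float32 waveforms at 16 kHz. The function names, layer sizes, and hyperparameters are illustrative choices, not a tested recipe.

```python
"""Sketch: train an instrument-family head on frozen YAMNet embeddings."""

# The 11 instrument families used as labels in NSynth.
NSYNTH_FAMILIES = [
    "bass", "brass", "flute", "guitar", "keyboard", "mallet",
    "organ", "reed", "string", "synth_lead", "vocal",
]
YAMNET_URL = "https://tfhub.dev/google/yamnet/1"
EMBEDDING_DIM = 1024  # size of each YAMNet frame embedding


def build_head(num_classes=len(NSYNTH_FAMILIES)):
    """Small classifier trained on top of frozen YAMNet embeddings."""
    import tensorflow as tf  # imported lazily so the module loads without TF

    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(EMBEDDING_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),  # assumed size
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes),  # logits, one per family
    ])


def embed_clip(yamnet, waveform):
    """Average YAMNet's per-frame embeddings into one vector per clip.

    `waveform` is assumed to be mono float32 at 16 kHz in [-1, 1].
    """
    import tensorflow as tf

    _scores, embeddings, _spectrogram = yamnet(waveform)
    return tf.reduce_mean(embeddings, axis=0)


def train(clips, labels, epochs=10):
    """`clips`: list of 1-D waveforms; `labels`: indices into NSYNTH_FAMILIES."""
    import tensorflow as tf
    import tensorflow_hub as hub

    yamnet = hub.load(YAMNET_URL)
    features = tf.stack([embed_clip(yamnet, clip) for clip in clips])
    head = build_head()
    head.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    head.fit(features, tf.constant(labels), epochs=epochs, batch_size=32)
    return head
```

Averaging the frame embeddings gives one vector per note, which suits NSynth's short single-note clips. For longer real-world audio you would more likely classify each frame and aggregate the predictions afterwards.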
$\endgroup$
