My goal is to detect musical instruments with machine learning.
I'm currently using the YAMNet model for inference, but it covers a very wide range of categories, for example "Growling", "Printer", and "Piano". I wonder whether that makes it less precise at detecting instruments, since instrument classes are only a small fraction of its 521 AudioSet classes.
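One workaround I've tried is to keep YAMNet as-is but restrict the prediction to an instrument subset at inference time. This is only a sketch: the scores are random stand-ins for real YAMNet output, and the class indices are placeholders (the real ones come from YAMNet's class map CSV).

```python
import numpy as np

# Hypothetical per-frame class scores from YAMNet: shape (num_frames, 521).
# Random numbers stand in for real model output here.
rng = np.random.default_rng(0)
scores = rng.random((10, 521))

# Indices of the classes I care about (placeholder values; the real indices
# would be looked up in YAMNet's class map, e.g. "Piano", "Guitar", ...).
instrument_idx = np.array([132, 137, 141])

# Average scores over time, then argmax only within the instrument subset,
# ignoring all non-instrument classes.
mean_scores = scores.mean(axis=0)
best = instrument_idx[np.argmax(mean_scores[instrument_idx])]
print(best)
```

This avoids retraining, but it only masks the output; it doesn't make the underlying model any better at telling instruments apart.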
The YAMNet model description on Kaggle states:
"You should expect to do some amount of fine-tuning and calibration to make YAMNet usable in any system that you build."
There is another project, NSynth, which provides a large dataset of musical instrument samples, but its model is built for synthesizing new sounds rather than for classifying or detecting instruments.
Would fine-tuning YAMNet on the NSynth dataset make sense in that case?
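What I have in mind is the usual transfer-learning recipe: freeze YAMNet, take its 1024-dimensional embeddings, and train a small classifier head on labeled clips (e.g. NSynth instrument families). Below is a minimal sketch of that idea; the embeddings and labels are synthetic stand-ins, and the head is a plain softmax classifier trained with gradient descent rather than anything YAMNet-specific.

```python
import numpy as np

# Synthetic stand-ins: in the real setup, X would be YAMNet embeddings
# (1024-d per clip) and y would be NSynth instrument-family labels.
rng = np.random.default_rng(1)
n, dim, classes = 200, 1024, 4
X = rng.normal(size=(n, dim))
y = rng.integers(0, classes, size=n)

# One linear layer with softmax, trained by batch gradient descent.
W = np.zeros((dim, classes))
onehot = np.eye(classes)[y]
for _ in range(100):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
    W -= 0.1 * X.T @ (p - onehot) / n        # cross-entropy gradient step

preds = (X @ W).argmax(axis=1)
acc = (preds == y).mean()
```

The appeal of this approach is that YAMNet's audio front end stays untouched, so only the small head needs NSynth data; my question is whether that is a sensible pairing given that NSynth consists of isolated single notes rather than real-world recordings.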