OK, so you need to validate a one-factor model with mean reversion. Here are the questions I would ask myself.
For calibration
Is the model capable of calibrating to both coterminals and caplets? If not, how do I intend to calibrate it when pricing Callable Cap Floaters?
Is the model capable of calibrating to skew? If not, how do I intend to calibrate it when pricing Callable Range Accruals and Collars?
Am I going to price zero-coupon swaptions? What kind of calibration am I going to use?
Am I going to calibrate mean reversion? If so, how? If not, what number do I use?
Is the calibration stable? I would try running through the end of 2008 and see how many failures I get.
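To make the stability run concrete, here is a minimal sketch of a harness that replays a calibration over a history of dates and counts the failures. Everything in it is an assumption for illustration: the model is taken to be Hull-White-like with constant volatility and a fixed mean reversion `a`, the bisection in `calibrate_sigma` is only a stand-in for a real optimiser, and the target numbers are made up, not market data.

```python
import math

def hw_stdev(sigma, a, t):
    """Std dev of the short rate at time t for constant sigma and mean reversion a
    (Hull-White-type dynamics, assumed here for illustration)."""
    if abs(a) < 1e-12:
        return sigma * math.sqrt(t)
    return sigma * math.sqrt((1.0 - math.exp(-2.0 * a * t)) / (2.0 * a))

def calibrate_sigma(target_stdev, a, t, lo=1e-6, hi=1.0, tol=1e-12):
    """Toy calibration: bisect on sigma until the model std dev hits the target.
    Raises ValueError when the target cannot be bracketed (NaN, too large, ...)."""
    if not (hw_stdev(lo, a, t) <= target_stdev <= hw_stdev(hi, a, t)):
        raise ValueError("target not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hw_stdev(mid, a, t) < target_stdev:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def run_history(targets, a, t):
    """Calibrate date by date, collecting results and failures separately."""
    results, failures = {}, []
    for date, target in targets.items():
        try:
            results[date] = calibrate_sigma(target, a, t)
        except ValueError as exc:
            failures.append((date, str(exc)))
    return results, failures

if __name__ == "__main__":
    # Hypothetical inputs: labels and numbers are invented for the example.
    targets = {"day-1": 0.010, "day-2": 5.0, "day-3": float("nan")}
    results, failures = run_history(targets, a=0.05, t=5.0)
    print("calibrated:", results)
    print("failed:", failures)
```

In a real run I would also store the calibrated parameters per date, since a calibration that never fails but jumps around from one day to the next is a stability problem too.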
For pricing
Is the integration routine accurate? I would try pricing structures with OTM strikes and see when it breaks down. In particular, can I achieve convergence within a reasonable CPU time? Changing integration parameters (grid spacing, time steps) should give me an idea of the issues.
(Connected to the previous point) What is the implied distribution of Libor rates? Does it have fat tails? If so, does the integration routine cover a sufficient range (number of standard deviations)?
Is the calibration appropriate? I would try calibrating an OTM Bermudan to both ATM coterminals and at-the-strike coterminals. Do I get different prices? Why? What is the appropriate strike to be used?
What kind of forward vols/skews does the model imply? Do they make sense?
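The integration checks above can be tried on a toy problem before touching the real pricer. The sketch below is only an illustration under simple assumptions: a standard normal state variable, an OTM call-like payoff with a known closed form, and a plain trapezoidal rule on a uniform grid. The function names are mine, not from any library. It shows the two separate failure modes: a grid that does not extend far enough in standard deviations, and a grid that is too coarse.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact_call(strike):
    """Closed form for E[max(X - K, 0)] with X ~ N(0, 1)."""
    return normal_pdf(strike) - strike * (1.0 - normal_cdf(strike))

def grid_call(strike, n_std, n_points):
    """Same expectation by the trapezoidal rule on [-n_std, n_std]."""
    h = 2.0 * n_std / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        x = -n_std + i * h
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * max(x - strike, 0.0) * normal_pdf(x)
    return total * h

if __name__ == "__main__":
    strike = 3.0  # three standard deviations out of the money
    reference = exact_call(strike)
    for n_std, n_points in [(3.0, 201), (8.0, 17), (8.0, 1601)]:
        value = grid_call(strike, n_std, n_points)
        print(f"range=+/-{n_std} std, points={n_points}: "
              f"value={value:.3e}, abs error={abs(value - reference):.3e}")
```

With a fat-tailed distribution the required range grows, so the same experiment is worth repeating with the model's actual implied density rather than a Gaussian.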
For greeks
Are the greeks stable? Again, I would try backtesting the model through turbulent periods. And again, I would try changing the integration parameters and see if I get oscillations.
How does the integration routine cope with discontinuities in the payoff? Is anything sophisticated used to avoid spurious oscillations? Here is a paper with a standard technique for eliminating noise:
http://www.risk.net/data/risk/pdf/technical/risk_1106_Wackertapp.pdf
Is anything like this employed in the integration routine?
How do I intend to calculate greeks? Am I planning to use model greeks or smile greeks (i.e. adjusted for movements in the vol surface)? Which ones are better?
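To see why discontinuous payoffs hurt the greeks, here is a small sketch under toy assumptions: a digital payoff on a standard normal state variable, integrated on a grid that is centred on the current state, so the grid moves when the state is bumped. The smoothing shown replaces the payoff at each node by its average over the surrounding grid cell. This is one common approach and not necessarily the technique in the paper linked above. All names are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def digital_value(x0, strike, h, n_std=8.0, smooth=False):
    """E[1{x0 + X > strike}], X ~ N(0, 1), on a grid centred at x0 with spacing h."""
    n = int(round(n_std / h))
    total = 0.0
    for j in range(-n, n + 1):
        y = x0 + j * h
        if smooth:
            # Fraction of the cell [y - h/2, y + h/2] lying above the strike
            payoff = min(max((y + 0.5 * h - strike) / h, 0.0), 1.0)
        else:
            payoff = 1.0 if y > strike else 0.0
        total += payoff * normal_pdf(j * h) * h
    return total

def bump_delta(x0, strike, h, bump=1e-3, smooth=False):
    """Central-difference delta by bump and reprice."""
    up = digital_value(x0 + bump, strike, h, smooth=smooth)
    down = digital_value(x0 - bump, strike, h, smooth=smooth)
    return (up - down) / (2.0 * bump)

if __name__ == "__main__":
    strike, h = 0.5, 0.1
    for x0 in (0.0, 0.02, 0.05):
        exact = normal_pdf(strike - x0)  # analytic delta of the digital
        raw = bump_delta(x0, strike, h)
        smoothed = bump_delta(x0, strike, h, smooth=True)
        print(f"x0={x0:.2f}: exact={exact:.4f}, raw={raw:.4f}, smoothed={smoothed:.4f}")
```

The unsmoothed delta is either zero or a large spike, depending on whether a grid node crosses the strike inside the bump, while the smoothed one stays close to the analytic value. Repeating this with different grid spacings is a quick way to spot the oscillations.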
Not being a model validator myself, I must have missed some important points. As for hedging performance, I've spent some time in the past implementing a backtesting framework for one-factor interest rate models, and I think that's the only way to check hedging performance. Plus, it allows you to find answers to a lot of the questions above. Beware, however: it takes a lot of time and effort to implement one.