Alright, I dug through some more resources, and it appears that for CPU cache hit/miss counters we have to go for tracing based on an individual process, pid, or tid. In other words, perf and OProfile.
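For example, perf stat gives this (the default event set; cache events have to be requested explicitly, e.g. with -e cache-references,cache-misses):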
    Performance counter stats for 'ls':

              3.905621 task-clock                #    0.831 CPUs utilized
                     1 context-switches          #    0.000 M/sec
                     0 CPU-migrations            #    0.000 M/sec
                   267 page-faults               #    0.068 M/sec
               379,003 cycles                    #    0.097 GHz                     [24.55%]
             1,332,419 stalled-cycles-frontend   #  351.56% frontend cycles idle    [36.65%]
         <not counted> stalled-cycles-backend
               833,177 instructions              #    2.20  insns per cycle
                                                 #    1.60  stalled cycles per insn
               580,745 branches                  #  148.695 M/sec                   [95.65%]
                37,799 branch-misses             #    6.51% of all branches         [71.09%]

           0.004697863 seconds time elapsed
OProfile gives similar output, but perf is pretty awesome, imo.
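If you want to script this, here is a minimal Python sketch of the two steps: building a perf stat command line that attaches to one pid for a fixed time, and turning the text output into numbers. The function names, the pid 1234, and the 5-second window are my own placeholders, not anything perf defines. The parser assumes the older output layout shown above (one count, one event name, optional `#` comment); newer perf versions add a unit column for some rows, so treat it as a starting point. It is demonstrated on the branch counters from the output above, but the same ratio applies to cache-misses over cache-references when those events are available on your hardware.

```python
import shlex

# A subset of lines copied from the perf stat output above.
PERF_SAMPLE = """\
Performance counter stats for 'ls':
3.905621 task-clock # 0.831 CPUs utilized
379,003 cycles # 0.097 GHz [24.55%]
<not counted> stalled-cycles-backend
833,177 instructions # 2.20 insns per cycle
580,745 branches # 148.695 M/sec [95.65%]
37,799 branch-misses # 6.51% of all branches [71.09%]
0.004697863 seconds time elapsed
"""


def perf_stat_command(pid, seconds=5,
                      events=("cache-references", "cache-misses")):
    """Build an argv that counts `events` on one pid while `sleep` runs.

    pid and seconds are hypothetical values chosen by the caller.
    """
    return ["perf", "stat", "-e", ",".join(events),
            "-p", str(pid), "sleep", str(seconds)]


def parse_perf_stat(text):
    """Return {event_name: count} for rows shaped like '<count> <event>'."""
    counts = {}
    for line in text.splitlines():
        # Drop the trailing '# ...' annotation, then expect two fields.
        fields = line.split("#")[0].split()
        if len(fields) != 2:
            continue  # header, '<not counted>' rows, elapsed-time line
        raw, event = fields
        try:
            value = float(raw.replace(",", ""))
        except ValueError:
            continue
        counts[event] = value if "." in raw else int(value)
    return counts


def miss_rate(counts, misses, total):
    """Percentage of `total` events that were `misses`."""
    return 100.0 * counts[misses] / counts[total]


counts = parse_perf_stat(PERF_SAMPLE)
print(shlex.join(perf_stat_command(1234)))
print(round(miss_rate(counts, "branch-misses", "branches"), 2))  # 6.51
```

Keep in mind the bracketed percentages in the perf output: when counters are multiplexed, the counts are scaled estimates, so ratios derived from them are approximate.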
The other thing is that, for memory banks (NUMA nodes), numastat gives you another level of detail.
    $ numastat
                               node0
    numa_hit                74263001
    numa_miss                      0
    numa_foreign                   0
    interleave_hit             15459
    local_node              74263001
    other_node                     0
Yeah, this is a single-node system, which is why numa_miss, numa_foreign, and other_node are all zero.
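To track these counters over time rather than eyeballing them, a small parser is enough. The sketch below is only an illustration: the function names are mine, and it assumes the plain table layout shown above (a header row of node names, then one row per counter with one value per node). It computes, per node, the share of allocations that landed where they were intended (numa_hit out of numa_hit + numa_miss).

```python
# The numastat output from above, without the shell prompt line.
NUMASTAT_SAMPLE = """\
node0
numa_hit 74263001
numa_miss 0
numa_foreign 0
interleave_hit 15459
local_node 74263001
other_node 0
"""


def parse_numastat(text):
    """Return {node_name: {counter_name: value}} from numastat's table."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    nodes = rows[0]  # header row: one column per node
    stats = {node: {} for node in nodes}
    for name, *values in rows[1:]:
        for node, value in zip(nodes, values):
            stats[node][name] = int(value)
    return stats


def hit_fraction(node_stats):
    """Fraction of this node's allocations that were intended for it."""
    total = node_stats["numa_hit"] + node_stats["numa_miss"]
    return node_stats["numa_hit"] / total if total else 0.0


numa = parse_numastat(NUMASTAT_SAMPLE)
for node, node_stats in numa.items():
    print(node, hit_fraction(node_stats))
```

On a multi-node box the header row lists node0, node1, and so on, and the same code gives one fraction per node. A fraction that drifts well below 1.0 suggests memory is being allocated away from the node the process prefers.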