48

How to normalize a histogram such that the area under the probability density function is equal to 1?


7 Answers

123

My answer to this is the same as in an answer to your earlier question. For a probability density function, the integral over the entire space is 1. Dividing by the sum will not give you the correct density. To get the right density, you must divide by the area. To illustrate my point, try the following example.

[f, x] = hist(randn(10000, 1), 50); % Create histogram from a normal distribution.
g = 1 / sqrt(2 * pi) * exp(-0.5 * x .^ 2); % pdf of the normal distribution

% METHOD 1: DIVIDE BY SUM
figure(1)
bar(x, f / sum(f)); hold on
plot(x, g, 'r'); hold off

% METHOD 2: DIVIDE BY AREA
figure(2)
bar(x, f / trapz(x, f)); hold on
plot(x, g, 'r'); hold off

You can see for yourself which method agrees with the correct answer (red curve).


Another method (more straightforward than method 2) to normalize the histogram is to divide by sum(f * dx) which expresses the integral of the probability density function, i.e.

% METHOD 3: DIVIDE BY AREA USING sum()
figure(3)
dx = diff(x(1:2))
bar(x, f / sum(f * dx)); hold on
plot(x, g, 'r'); hold off
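
As a quick numerical check of the three methods above, the following sketch computes the area under the bars for each normalization instead of comparing the plots by eye. The variable names (fSum, fTrapz, fDx) and the fixed seed are illustrative assumptions, not part of the original answer.

```matlab
rng(0);                                  % fixed seed (an assumption, for repeatability)
[f, x] = hist(randn(10000, 1), 50);      % counts and bin centers, as above
dx = diff(x(1:2));                       % bin width (hist uses equally spaced bins)

fSum   = f / sum(f);                     % METHOD 1: heights sum to 1
fTrapz = f / trapz(x, f);                % METHOD 2: trapezoidal area is 1
fDx    = f / sum(f * dx);                % METHOD 3: bar area is 1

areaSum   = sum(fSum * dx);              % equals dx, not 1
areaTrapz = trapz(x, fTrapz);            % equals 1 up to rounding
areaDx    = sum(fDx * dx);               % equals 1 up to rounding

fprintf('area, divide by sum : %.4f\n', areaSum);
fprintf('area, trapz         : %.4f\n', areaTrapz);
fprintf('area, sum(f * dx)   : %.4f\n', areaDx);
```

With method 1 the area comes out equal to the bin width, which is why those bars sit below the red curve by exactly that factor.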
5
  • 2
    The sum of the "Divide by area figure" doesn't equal 1. I see at least 10 bar plot points greater than 0.3. 0.3*10 = 3.0 Wouldn't a simpler solution be to divide f by the # of samples? In this case, 10000.
    – Rich
    Commented Mar 3, 2014 at 23:39
  • 9
    @Rich The bars are thinner than 1, so your calculation is wrong. Consider the triangle under the curve from (-2,0) to (0, 0.4) to (2, 0) to estimate the area. This triangle has an area of 0.5*4*0.4 = 0.8 < 1.0
    – neingeist
    Commented May 21, 2014 at 9:50
  • 2
    to get the sum equal to 1, you need to multiply the new sum of bins by the width of the bin
    – alperovich
    Commented Nov 7, 2015 at 18:56
  • @abcd: But this article says, we can divide by the sum for normalizing: itl.nist.gov/div898/handbook/eda/section3/histogra.htm
    – kmario23
    Commented Oct 20, 2016 at 3:01
  • 1
    How to do this using histcounts instead of hist?
    – Pedro77
    Commented Jun 30, 2017 at 0:00
24

Since R2014b, MATLAB has had these normalization routines built into the histogram function (see the help file for the 6 normalization options this function offers). Here is an example using the PDF normalization (the sum of the bar areas is 1).

data = 2*randn(5000,1) + 5;             % generate normal random (m=5, std=2)
h = histogram(data,'Normalization','pdf')   % PDF normalization

The corresponding PDF is

Nbins = h.NumBins;
edges = h.BinEdges; 
x = zeros(1,Nbins);
for counter=1:Nbins
    midPointShift = abs(edges(counter)-edges(counter+1))/2;
    x(counter) = edges(counter)+midPointShift;
end

mu = mean(data);
sigma = std(data);

f = exp(-(x-mu).^2./(2*sigma^2))./(sigma*sqrt(2*pi));

The two together give

hold on;
plot(x,f,'LineWidth',1.5)


An improvement that might very well be due to the success of the actual question and accepted answer!
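
For readers who want the normalized values without drawing a histogram object, here is a minimal sketch using histcounts, the non-plotting counterpart of histogram. It assumes automatic binning (so every sample falls into some bin) and uses a vectorized midpoint computation in place of the loop above; the names centers, area, and pdfFit and the fixed seed are illustrative, not taken from the answer.

```matlab
rng(0);                                   % fixed seed (an assumption, for repeatability)
data = 2 * randn(5000, 1) + 5;            % same kind of sample as above

% Density heights and bin edges, without creating a histogram object
[p, edges] = histcounts(data, 'Normalization', 'pdf');

centers = edges(1:end-1) + diff(edges) / 2;   % vectorized bin midpoints
area = sum(p .* diff(edges));                 % sum of the bar areas

mu = mean(data);
sigma = std(data);
pdfFit = exp(-(centers - mu).^2 ./ (2 * sigma^2)) ./ (sigma * sqrt(2 * pi));

bar(centers, p, 1); hold on               % relative width 1 makes the bars touch
plot(centers, pdfFit, 'r', 'LineWidth', 1.5); hold off
fprintf('sum of bar areas: %.4f\n', area);
```

The printed area should be 1 up to rounding, which is the property that pdf normalization is meant to provide.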


EDIT - The use of hist and histc is not recommended now, and histogram should be used instead. Beware that none of the 6 ways of creating bins with this new function will produce the bins hist and histc produce. There is a Matlab script to update former code to fit the way histogram is called (bin edges instead of bin centers - link). By doing so, one can compare the pdf normalization methods of @abcd (trapz and sum) and Matlab (pdf).

The 3 pdf normalization methods give nearly identical results (within the range of eps).

TEST:

A = randn(10000,1);
centers = -6:0.5:6;
d = diff(centers)/2;
edges = [centers(1)-d(1), centers(1:end-1)+d, centers(end)+d(end)];
edges(2:end) = edges(2:end)+eps(edges(2:end));

figure;
subplot(2,2,1);
hist(A,centers);
title('HIST not normalized');

subplot(2,2,2);
h = histogram(A,edges);
title('HISTOGRAM not normalized');

subplot(2,2,3)
[counts, centers] = hist(A,centers); %get the count with hist
bar(centers,counts/trapz(centers,counts))
title('HIST with PDF normalization');


subplot(2,2,4)
h = histogram(A,edges,'Normalization','pdf')
title('HISTOGRAM with PDF normalization');

dx = diff(centers(1:2))
normalization_difference_trapz = abs(counts/trapz(centers,counts) - h.Values);
normalization_difference_sum = abs(counts/sum(counts*dx) - h.Values);

max(normalization_difference_trapz)
max(normalization_difference_sum)


The maximum difference between the new PDF normalization and the former one is 5.5511e-17.

1
  • The area under the PDFs is not one in your histograms, which is impossible in probability theory. See the answer stackoverflow.com/a/38813376/54964 for some corrections. To make the area under the pdf equal to one, you should set the normalization to probability, not pdf. Commented Aug 8, 2016 at 11:01
11

hist can not only plot a histogram but also return the count of elements in each bin, so you can get that count, normalize it by dividing each bin by the total, and plot the result using bar. Example:

Y = rand(10,1);
C = hist(Y);
C = C ./ sum(C);
bar(C)

or if you want a one-liner:

bar(hist(Y) ./ sum(hist(Y)))


Edit: This solution answers the question of how to make the sum of all bins equal to 1. This approximation is valid only if your bin size is small relative to the variance of your data. The sum used here corresponds to a simple quadrature formula; more complex ones, such as trapz as proposed by R. M., can be used.
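
The difference between "bins sum to 1" and "area equals 1" can be made concrete with a small sketch. It assumes histcounts with its 'probability' and 'pdf' normalizations and reuses the automatically chosen edges for both calls; the scaled sample and the variable names are illustrative only.

```matlab
rng(0);                                   % fixed seed (an assumption, for repeatability)
Y = 3 * randn(10000, 1);                  % scaled so the bin width is clearly not 1

[pProb, edges] = histcounts(Y, 'Normalization', 'probability');
pPdf = histcounts(Y, edges, 'Normalization', 'pdf');   % same bins, density heights
widths = diff(edges);

sumOfHeights = sum(pProb);                % 1: the bin heights sum to one
areaOfPdf = sum(pPdf .* widths);          % 1: the bar areas sum to one
maxGap = max(abs(pProb ./ widths - pPdf)); % density = probability / bin width

fprintf('sum of probability heights: %.4f\n', sumOfHeights);
fprintf('area under pdf bars       : %.4f\n', areaOfPdf);
```

Dividing the sum-normalized heights by the bin width therefore turns them into a density estimate.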

5
[f,x]=hist(data)

The area of each individual bar is height*width. Since MATLAB chooses equidistant points for the bars, the width is:

delta_x = x(2) - x(1)

Now if we sum up all the individual bars the total area will come out as

A=sum(f)*delta_x

So the correctly scaled plot is obtained by

bar(x, f/sum(f)/(x(2)-x(1)))
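
The same height*width reasoning carries over to bins of unequal width, where a single delta_x no longer exists. The sketch below is an illustration under assumptions: the edges vector is hypothetical, and histcounts is used to count into explicit edges.

```matlab
rng(0);                                   % fixed seed (an assumption, for repeatability)
data = randn(10000, 1);
edges = [-5 -2 -1 -0.5 0 0.5 1 2 5];      % hypothetical unequal bin edges

counts = histcounts(data, edges);         % raw counts per bin
widths = diff(edges);                     % one width per bin
density = counts ./ (sum(counts) .* widths);   % height = count / (total * width)

area = sum(density .* widths);            % 1 up to rounding
fprintf('area with unequal bins: %.4f\n', area);
```

Samples outside the outermost edges are not counted, so here the area is 1 with respect to the binned samples only.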
3

The area of abcd's PDF is not one, which is impossible, as pointed out in many comments. Assumptions made in many answers here:

  1. Assume constant distance between consecutive edges.
  2. Probability under pdf should be 1. The normalization should be done as Normalization with probability, not as Normalization with pdf, in histogram() and hist().

Fig. 1 Output of hist() approach, Fig. 2 Output of histogram() approach


The maximum amplitude differs between the two approaches, which suggests that there is some mistake in hist()'s approach, because histogram()'s approach uses the standard normalization. I assume the mistake in hist()'s approach here is that the normalization is done partially as pdf, not completely as probability.

Code with hist() [deprecated]

Some remarks

  1. First check: sum(f)/N gives 1 if Nbins manually set.
  2. pdf requires the width of the bin (dx) in the graph g

Code

%http://stackoverflow.com/a/5321546/54964
N=10000;
Nbins=50;
[f,x]=hist(randn(N,1),Nbins); % create histogram from ND

%METHOD 4: Count Densities, not Sums!
figure(3)
dx=diff(x(1:2)); % width of bin
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND with dx
% 1.0000
bar(x, f/sum(f));hold on
plot(x,g,'r');hold off

Output is in Fig. 1.

Code with histogram()

Some remarks

  1. First check: a) sum(f) is 1 if Nbins adjusted with histogram()'s Normalization as probability, b) sum(f)/N is 1 if Nbins is manually set without normalization.
  2. pdf requires the width of the bin (dx) in the graph g

Code

%%METHOD 5: with histogram()
% http://stackoverflow.com/a/38809232/54964
N=10000;

figure(4);
h = histogram(randn(N,1), 'Normalization', 'probability') % hist() deprecated!
Nbins=h.NumBins;
edges=h.BinEdges; 
x=zeros(1,Nbins);
f=h.Values;
for counter=1:Nbins
    midPointShift=abs(edges(counter)-edges(counter+1))/2; % same constant for all
    x(counter)=edges(counter)+midPointShift;
end
dx=diff(x(1:2)); % constant for all
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND
% Use if Nbins manually set
%new_area=sum(f)/N % diff of consecutive edges constant
% Use if histogram() Normalization probability
new_area=sum(f)
% 1.0000
% No bar() needed here with histogram() Normalization probability
hold on;
plot(x,g,'r');hold off

Output in Fig. 2 and expected output is met: area 1.0000.

Matlab: 2016a
System: Linux Ubuntu 16.04 64 bit
Linux kernel 4.6

6
  • I am confused, why does the MATLAB documentation say to use pdf instead of probability to have the bar areas sum to one? When you use sum(h.values) aren't you summing just the bin heights rather than the bin areas?
    – ITA
    Commented Feb 19, 2017 at 21:01
  • I had the same question as the OP and what confused me is that you are saying the exact opposite of MATLAB documentation. Please check mathworks.com/help/matlab/ref/… It clearly says to use pdf to have the bar areas sum to one and not probability. Moreover you are using sum(f) where f=h.Values to show that area is one. h.Values correspond to bin heights, so as per definition of probability normalization that will sum to one but that is not the same as bar areas.
    – ITA
    Commented Feb 20, 2017 at 22:10
  • 1
    "Code with histogram()": If you multiply randn(N,1) by some constant, the red line will not match the data anymore.
    – Pedro77
    Commented Jun 30, 2017 at 13:26
  • I'm using @marsei's answer. When my histogram is not "very" normal, I fit a spline to h.Values.
    – Pedro77
    Commented Jun 30, 2017 at 22:34
  • 1
    For non normal: [curve, goodness, output] = fit(x(:),h.Values(:),'smoothingspline','SmoothingParam',0.9999999); lPlot = plot(x(:),curve(x));. For normal just look @marsei answer.
    – Pedro77
    Commented Jul 1, 2017 at 12:47
1

For some distributions (Cauchy, I think), I have found that trapz overestimates the area, so the pdf changes depending on the number of bins you select. In that case I do

[N,h]=hist(q_f./theta,30000); % there is a large range but most of the bins will be empty
plot(h,N/(sum(N)*mean(diff(h))),'+r')
1
  • Hi! Is the quantity mean(diff(h)) supposed to be the width of the bins? Commented Mar 12, 2021 at 17:15
1

There is an excellent three part guide for Histogram Adjustments in MATLAB (broken original link, archive.org link), the first part is on Histogram Stretching.
