1

I have a lot of different pandas.Series that look like this:

my_series:
0.0         10490405.0
1.0          3334931.0
2.0          2770406.0
3.0          2286555.0
4.0          1998229.0
5.0          1636747.0
6.0          1449938.0
7.0          1180900.0
8.0          1054964.0
9.0           869783.0
10.0          773747.0
11.0          653608.0
12.0          595688.0
...
682603.0           1.0
734265.0           1.0
783295.0           1.0
868135.0           1.0

These are the frequencies of my data: there are 10490405 zeros in my data, 3334931 ones, and so on. I want to plot a histogram. I know I can do it using plt.bar:

plt.bar(my_series.index, my_series.values)

But this works badly because of the large number of unique values in my_series (there can be thousands!). The bars on the plot are too narrow and become invisible. So I really want to use hist, where I can set the number of bins manually, and so on. But I can't use my_series.hist(), because the series does not contain that many zeros; it holds just one value for the label zero!


Code to reproduce the problem:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

val = np.round([1000000/el**2 for el in range(1,1000)])
ind = [el*10+np.random.randint(10) for el in range(1,1000)]
my_series = pd.Series(val, ind)

plt.bar(my_series.index, my_series.values)

enter image description here


As I already have a close vote and a wrong answer, I guess my problem description is really bad. I want to add an example:

val1 = [100, 50, 25, 10, 10, 10]
ind1 =  [0, 1, 2, 3, 4, 5]
my_series1 = pd.Series(val1, ind1)
my_series1.hist()

enter image description here

This is just hist() on the series values! So we can see that 10 has value 3 (because there are three of them in the series) and all the others have value 1 on the hist. What I want to get:

enter image description here

Label 0 has value 100, label 1 has value 50, and so on.
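
One possible way to get this picture directly from the frequencies, as a minimal sketch: treat the index labels as the data and pass the stored counts through the weights argument of hist. The names my_series1 and the bin count of 6 come from the example above; everything else is only illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

my_series1 = pd.Series([100, 50, 25, 10, 10, 10], index=[0, 1, 2, 3, 4, 5])

fig, ax = plt.subplots()
# Each index label is counted as many times as its stored frequency.
n, edges, _ = ax.hist(
    my_series1.index.to_numpy(),
    bins=6,
    weights=my_series1.to_numpy(),
)
ax.set_xlabel("label")
ax.set_ylabel("frequency")
print(n)  # bar heights, one per bin
```

Call plt.show() afterwards when running interactively. With fewer bins than unique labels, the frequencies of all labels falling into one bin are summed.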

4
  • myseries.hist() and myseries.hist(bins=your_bins)? Commented Sep 20, 2019 at 13:36
  • @QuangHoang you didn't get my question. This is my fault; I edited the question and added a graphic example. Commented Sep 20, 2019 at 14:06
  • Now I got your problem, misunderstood the title. See my answer if it helps. Commented Sep 20, 2019 at 14:07
  • or you could change plt.bar to plt.fill_between(my_series.index, 0, my_series.values). In your case it works well because there's barely any difference between that and simply plt.plot :) Commented Sep 20, 2019 at 14:36

3 Answers 3

1

You can group by index values and plot bar:

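# change bins as needed
bins = np.linspace(my_series.index[0], my_series.index[-1], 25)

# include_lowest=True keeps the first index value, which sits on the lowest bin edge
my_series.groupby(pd.cut(my_series.index, bins, include_lowest=True)).sum().plot.bar()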

# your data is very skewed, so log scale helps.
plt.yscale('log');

output:

enter image description here
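
To see what the groupby/cut combination does, here is a small sketch on the six-value example from the question. The bin edges are hand-picked for illustration (an assumption, not part of the answer), and the name freq is hypothetical.

```python
import pandas as pd

freq = pd.Series([100, 50, 25, 10, 10, 10], index=[0, 1, 2, 3, 4, 5])

# Hand-picked edges so that every label falls strictly inside a bin.
bins = [-0.5, 1.5, 3.5, 5.5]

# pd.cut assigns each index label to an interval; groupby then sums
# the frequencies of all labels that share an interval.
groups = pd.cut(freq.index.to_numpy(), bins)
binned = freq.groupby(groups, observed=False).sum()
print(binned)
```

Labels 0 and 1 land in the first interval, 2 and 3 in the second, 4 and 5 in the third, so the bar heights are the per-interval sums of the original frequencies.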

1
  • Yep! This is a good one! Awesome one-line solution, I needed some time to figure it out :) Thank you, and please take a look at my solution too. Commented Sep 20, 2019 at 16:20
0

Taken from https://matplotlib.org/3.1.1/gallery/statistics/hist.html :

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

# Fixing random state for reproducibility
np.random.seed(19680801)

N_points = 100000
n_bins = 20

# Generate a normal distribution, center at x=0 and y=5
x = np.random.randn(N_points)
y = .4 * x + np.random.randn(100000) + 5

fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)

# We can set the number of bins with the `bins` kwarg
axs[0].hist(x, bins=n_bins)
axs[1].hist(y, bins=n_bins)

You can adjust the number of bins to fit your data. Please upload your data so we can help in more detail.

1
  • Thank you for your response! This will not work :) You use 100000 points, so 100000 values for x and y. But I have only their frequencies! It is as if collections.Counter() had been applied to your x and y! You can reproduce my data using the "code to reproduce the problem" section of my question. Commented Sep 20, 2019 at 13:53
0

I found one more inefficient solution :) but it looks the way I wanted, so:

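import itertools

# repeat each index label as many times as its count: [label] * count
func = lambda x, y: x * y
all_data = list(map(func, [[el] for el in my_series.index], [int(el) for el in my_series.values]))
# flatten the list of lists into one list holding the full data
merged = list(itertools.chain(*all_data))

plt.hist(merged, bins=6)
plt.show()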

enter image description here

The idea here is:

  1. pack each index label into a one-element list: [[el] for el in my_series.index]
  2. convert the counts to int: [int(el) for el in my_series.values]
  3. multiply each list by its count and so restore the full data: list(map(func, ...))
  4. now we have all the data and can use hist().

This is obviously inefficient, but in my task I need to calculate a lot of different parameters such as mean, std, etc. Otherwise I would need to write my own function for each of them. So I found a faster way: just restore the data and then use the built-ins.
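
As a hedged alternative sketch (not the code used above): np.repeat can restore the full data in one call, and np.average with weights gives the mean and std without restoring anything. The small series and the names labels, counts, restored, mean_w and std_w are only illustrative.

```python
import numpy as np
import pandas as pd

my_series1 = pd.Series([100, 50, 25, 10, 10, 10], index=[0, 1, 2, 3, 4, 5])

labels = my_series1.index.to_numpy()
counts = my_series1.to_numpy().astype(int)

# Restore the full data: each label repeated as many times as its count.
restored = np.repeat(labels, counts)

# Weighted mean and (population) std computed from the frequencies alone.
mean_w = np.average(labels, weights=counts)
std_w = np.sqrt(np.average((labels - mean_w) ** 2, weights=counts))

print(len(restored), mean_w, std_w)
```

The restored array can be passed to any built-in such as restored.mean() or restored.std(); the weighted versions give the same numbers while using far less memory when the counts are in the millions.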
