To encode an event occurring with probability $p$ you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
So under an optimal encoding, the average length of an encoded message is
$$
\sum_i p_i \log_2(\tfrac{1}{p_i}),
$$
that is, Shannon entropy of the original probability distribution.
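As a sketch, this formula can be computed directly (the helper name `entropy` is mine; zero-probability terms are skipped, since $p \log_2(1/p) \to 0$ as $p \to 0$):

```python
import math

def entropy(p):
    """Shannon entropy in bits: sum_i p_i * log2(1 / p_i), skipping p_i = 0."""
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

# A fair coin needs 1 bit per outcome on average.
print(entropy([0.5, 0.5]))  # 1.0
```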
However, if for a probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is
$$
\sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}),
$$
which is the cross entropy. It is always at least the Shannon entropy $\sum_i p_i \log_2(\tfrac{1}{p_i})$, with equality only when $Q = P$.
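A sketch of the cross-entropy formula, under the same conventions as above (the helper name `cross_entropy` is mine; terms with $p_i = 0$ contribute nothing and are skipped):

```python
import math

def cross_entropy(p, q):
    """Average code length in bits when events with true probabilities p
    are encoded with a code that is optimal for distribution q."""
    return sum(p_i * math.log2(1.0 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.5, 0.0, 0.0]
q = [0.25, 0.25, 0.25, 0.25]
print(cross_entropy(p, q))  # 2.0 -- one bit more than the entropy of p
print(cross_entropy(p, p))  # 1.0 -- matching q to p recovers the entropy
```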
As an example, consider an alphabet of four letters (A, B, C, D), where A and B appear with the same frequency and C and D do not appear at all. So the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then if we want to encode it optimally, we encode A as 0 and B as 1, and we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we keep the same probability distribution $P$ and encode it according to a distribution where all letters are equally probable, $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10, and D as 11).
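The two codes above can be compared directly by weighting each codeword's length by the true frequencies in $P$ (the dictionaries here just spell out the codes from the example):

```python
# True letter frequencies from the example.
P = {"A": 0.5, "B": 0.5, "C": 0.0, "D": 0.0}

code_for_P = {"A": "0", "B": "1"}                        # optimal for P
code_for_Q = {"A": "00", "B": "01", "C": "10", "D": "11"}  # optimal for uniform Q

def avg_length(freqs, code):
    """Average codeword length in bits, weighted by the true frequencies."""
    return sum(p * len(code[letter]) for letter, p in freqs.items() if p > 0)

print(avg_length(P, code_for_P))  # 1.0 bit per letter (the entropy of P)
print(avg_length(P, code_for_Q))  # 2.0 bits per letter (the cross entropy)
```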