You are not alone in having unrealistic expectations of a technology labelled as "AI" that appears to have very general capabilities with text. After all, it will always respond to any input, and the tone of responses from most fine-tuned LLMs (of which Bard is one) gives an air of confidence.
The latest releases of chatbots and related technology are all based on large language models (LLMs), where the core training process is to predict the next word, or more precisely the next token (which can be just part of a word), given all the text that has already been seen. This trained model is then leveraged to emulate chatting or other text-generating processes. For each use case, the model can be fine-tuned for the type of content and utility, and can also be "pre-prompted" to establish the nature of the content. The fine-tuning and pre-prompting do not radically change the core nature of an LLM as a text prediction engine, though.
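To make the "text prediction engine" idea concrete, here is a minimal sketch of greedy next-token prediction. It assumes the Hugging Face transformers library, with GPT-2 as a small public stand-in for the much larger models behind products like Bard, and the prompt is purely illustrative:

```python
# Minimal sketch: an LLM generates text one token at a time, each step
# scoring every token in its vocabulary and (here) appending the most
# likely one. GPT-2 is an assumption - any causal LM would behave the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"  # illustrative prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits   # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()   # greedy choice of next token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

# Prints the prompt plus the model's most likely continuation.
print(tokenizer.decode(input_ids[0]))
```

A chatbot adds sampling, fine-tuning and a pre-prompt on top of this loop, but the loop itself is all the underlying model ever does.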
During training, typically on amounts of text far larger than any single person could read in a lifetime, the LLM does learn some approximations of logic and facts, insofar as they help it predict the next token. Some of these learned capabilities are robust, useful and effective. Some are less so. The more complex a real-world constraint or fact is, the less likely the LLM is to have modelled it internally with an accurate approximate function.
The riddle problem you set in the question exposes the limitations of those approximate internal models. The LLM is capable of generating a grammatically correct riddle, plus an explanation, and of meeting the constraint of placing a specific answer in the text. It is not capable of analysing its own output and figuring out that it has made an incorrect statement. That is, it has either no model, or only a weak approximate model, of the constraints on the mathematical statements in the riddle. From what you have written I would say "weak model" here, because the number produced by resolving the puzzle is still the correct order of magnitude, and that is probably not a coincidence.
Expect future iterations of AI chatbots and assistants to get better at this kind of thing; in my opinion this will happen purely because it is an expectation many people have of "useful AI". Simply scaling the LLMs will only help a little, so future work will probably look at adding further models or integrations that can fact-check outputs, or that better understand and model requests (beyond representing them as streams of tokens).
In the meantime, the standard recommendation is to fact-check results from LLMs. Do not assume any output is factually correct; always validate anything where you need it to be correct or true.