Questions tagged [floating-point]
Mathematical questions concerning floating point numbers, a finite approximation of the real numbers used in computing.
466
questions
4
votes
2
answers
110
views
pow and its relative error
Investigating the floating-point implementation of $\operatorname{pow}(x,b)=x^b$ with $x,b\in\Bbb R$ in some library implementations, I found that some pow ...
6
votes
0
answers
143
views
Algebraic Structures involving NaN (absorbing element).
IEEE 754 floating point numbers contain the concept of NaN (not a number), which "dominates" arithmetical operations ($+,-,\cdot,\div$ will return ...
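The absorbing behaviour the question describes is easy to observe directly; a minimal Python sketch:

```python
import math

# NaN propagates ("absorbs") through the basic arithmetic operations
nan = float("nan")
print(math.isnan(nan + 1.0))   # True
print(math.isnan(nan * 0.0))   # True: even multiplication by zero yields NaN
print(nan == nan)              # False: NaN compares unequal to everything, itself included
```

The last line is why NaN breaks the usual axioms for an absorbing element: the structure is not even a set with a well-behaved equality under `==`.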
3
votes
0
answers
53
views
Solve $10^{10^z} = 10^{10^x}+10^{10^y}$ for $z$ with floating point accuracy
In the following equation
$$10^{10^z} = 10^{10^x}+10^{10^y}$$
I want to find an algorithm that computes $z$ in a floating point accurate manner given any values of $x$ and $y$ (e.g. $x=y=2000$). The ...
1
vote
2
answers
64
views
How to transform this expression to a numerically stable form?
I have this function
$$f(x, t)=\frac{\left(1+x\right)^{1-t}-1}{1-t}$$
Where $x \ge 0$ and $t \ge 0$.
I want to use it in neural network, and thus need it to be differentiable.
While it has a ...
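One standard way to stabilize expressions of this shape (a sketch, not necessarily what any answer proposes) is to route them through `log1p` and `expm1`, which stay accurate where $(1+x)^{1-t}$ and the subtraction of $1$ would cancel:

```python
import math

def f(x, t):
    """((1+x)^(1-t) - 1) / (1-t), evaluated via expm1/log1p.

    For a = 1-t, the expression is expm1(a*log1p(x))/a, which is stable
    for small a; at a == 0 it continuously becomes log1p(x).
    """
    a = 1.0 - t
    if a == 0.0:
        return math.log1p(x)   # the t -> 1 limit of the expression
    return math.expm1(a * math.log1p(x)) / a
```

Note `f(x, 1.0)` returns $\ln(1+x)$ exactly as the limiting case, so the function is continuous in `t`, which matters for the differentiability requirement in the question.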
1
vote
0
answers
49
views
Proof that $\epsilon_{mach} \leq \frac{1}{2} b^{1-n}$
I have a question about the proof of the following statement:
For each set of machine numbers $F(b, n, E_{min}, E_{max})$ with $E_{min} < E_{max}$ the following inequality holds: $\epsilon_{mach} \...
1
vote
0
answers
58
views
Why does TI-84 show scientific notation for zeros sometimes but not others?
When graphing a function and then going through the process to calculate the zeroes (left bound, right bound, guess), is there a reason that sometimes it shows y = 0, but there are other times when it ...
2
votes
1
answer
73
views
Numerically stable way to compute ugly double fraction
I am looking for a numerically stable version of this (ugly) equation
$$
s^2=\frac{1}{\frac{1}{\beta_1}+\frac{1}{\beta_2}W}
$$
where
$$
\beta_1 = c_1-c_2m+(m-c_2)b\\
\beta_2 = \frac{1}{2}\left((a-m)^2-...
1
vote
0
answers
83
views
Fundamental Axiom of Floating Point Arithmetic for Complex Numbers Multiplication
I am trying to prove the fundamental axiom of floating point arithmetic also applies to complex number multiplication.
First, let $fl$ be a function that maps a number to its closest floating point ...
1
vote
1
answer
140
views
How do calculators represent floating points (somewhat) perfectly?
If you ask a programming language to calculate 0.6 + 0.7, you'll get something like 1.2999998, and that's because of how floating point numbers are represented in computers. But if you ask a ...
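One common explanation is that calculators carry more internal digits than they display, so rounding the output hides the representation error. The same display trick in Python:

```python
# binary64 cannot represent 0.6, 0.7, or 1.3 exactly
x = 0.6 + 0.7
print(repr(x))          # 1.2999999999999998 -- the full stored value
print(f"{x:.10g}")      # 1.3 -- rounding the display to 10 digits hides the error
```

A calculator that computes to, say, 13 digits but shows 10 behaves like the second line: the error is still there, it just never reaches the screen.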
0
votes
0
answers
29
views
Calculating coordinates of vertices, given dimensions in an architectural floorplan
So, one of my friends is trying to learn AutoCAD. They were given a floorplan with its dimensions, and they were asked to find the coordinates of all the vertices of the plan. So we ...
3
votes
2
answers
87
views
Proof that $\frac 1{10}$ has no finite binary float representation
I am supposed to prove that $\frac 1{10}$ is not representable as a finite binary float. I tried proving this via induction, but that did not seem to work; now I am out of ideas. Thank you
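A non-inductive route (a sketch of the usual argument, not necessarily the intended one): a real number has a finite binary expansion iff it is of the form $n/2^k$, and $\tfrac1{10}$ has a factor $5$ in its lowest-terms denominator. Python's exact rationals make both the criterion and its consequence visible:

```python
from fractions import Fraction

# finite binary floats are exactly the dyadic rationals n / 2^k;
# 1/10 in lowest terms has denominator 10 = 2 * 5, not a power of two
q = Fraction(1, 10)
print(q.denominator)    # 10

# consequently the stored double 0.1 is only the nearest dyadic rational
print(Fraction(0.1))    # 3602879701896397/36028797018963968  (denominator 2^55)
```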
1
vote
2
answers
64
views
Absolute difference between largest IEEE754 number and its predecessor
In simple precision format, the largest possible positive number is
$A = 0 ~~~ 11111110 ~~~ 111\ldots 111$
Its predecessor is
$B = 0 ~~~ 11111110 ~~~ 111 \ldots 110$
But what is the absolute ...
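The gap is one ulp at the top binade: with unbiased exponent $127$ and $23$ fraction bits, consecutive singles differ by $2^{127-23} = 2^{104}$. A sketch that checks this by reinterpreting the two bit patterns from the question:

```python
import struct

def f32(bits):
    # reinterpret a 32-bit pattern as an IEEE 754 single precision value
    return struct.unpack("<f", struct.pack("<I", bits))[0]

A = f32(0x7F7FFFFF)   # 0 | 11111110 | 111...111: largest finite single
B = f32(0x7F7FFFFE)   # 0 | 11111110 | 111...110: its predecessor
print(A - B)          # one ulp at exponent 127: 2^(127-23) = 2^104
```

The subtraction is performed in double precision, where both values and their difference are exactly representable, so the printed gap is exact.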
0
votes
1
answer
59
views
Proof of `TWOSUM` implementation in "double-double" arithmetic
"double-double" / "compensated" arithmetic uses unevaluated sums of floating point numbers to obtain higher precision.
One of the basic algorithms is ...
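The algorithm the excerpt cuts off is presumably Knuth's branch-free TwoSum; a sketch of the standard formulation:

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and
    a + b = s + e exactly (barring overflow), with no branch on |a| vs |b|."""
    s = a + b
    bv = s - a                       # the portion of b that made it into s
    e = (a - (s - bv)) + (b - bv)    # what was rounded away
    return s, e
```

The pair `(s, e)` is the unevaluated sum that double-double arithmetic builds on: `e` is exactly the rounding error of the addition.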
0
votes
0
answers
10
views
Specify the conditions Exponent and Mantissa sizes must meet, so that the minimal distance between representable numbers is no more than 1.
Using the following floating-point representation:
s - one sign bit
m - mantissa - real number in range [1, 2), in which 1 and the comma are skipped, size of M bits
c - Exponent - natural number, ...
0
votes
0
answers
53
views
What is the computational complexity of calculating determinants for matrices of finite precision floating-point numbers?
Following up from this older question, I understand that calculation of determinants for integer-valued matrices is possible with polynomial scaling. However, I have been unable to locate any ...
0
votes
1
answer
55
views
How to compute the successor to a given floating point number
Let $F$ be the set of all floating point numbers $n2^e$ such that $-2^{53} < n < 2^{53}$ and $-1074 \leq e \leq 970$. Let $F^* = F - \{\max(F)\}$
I assume $F$ not to be dense, and therefore there ...
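For the binary64 set $F$ described here, Python ships the operation directly (a sketch; `math.nextafter` and `math.ulp` require Python 3.9+):

```python
import math

x = 1.0
succ = math.nextafter(x, math.inf)   # next representable double above x
print(succ - x)                      # 2**-52: the spacing (ulp) at 1.0
print(math.ulp(x))                   # same spacing, queried directly
```

Implementing it by hand amounts to incrementing the 64-bit pattern of a positive finite double, since the IEEE 754 ordering of positive floats matches the ordering of their bit patterns.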
1
vote
0
answers
159
views
Show that $x+1$ is not backward stable
Suppose we use $\oplus$ to compute $x+1$, given $x \in \mathbb{C}$. $\widetilde{f(x)} = \mathop{\text{fl}}(x) \oplus 1$. This algorithm is stable but not backward stable. The reason is that for $x \...
1
vote
2
answers
173
views
Another way to compute the epsilon machine
Why does the following program compute the machine precision? I mean, it can be proved that the variable $u$ gives us the machine epsilon, but I don't know the reason for this.
Let
$a = \frac{4}{3}$
$b = a - ...
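The excerpt points at the classic $4/3$ trick (often attributed to Cleve Moler). A sketch of why it works in binary64: $4/3 = 1.\overline{01}_2$, so rounding to 53 bits cuts off a tail worth exactly $\tfrac13\,2^{-52}$, and each later step is exact:

```python
# fl(4/3) = 4/3 - (1/3)*2**-52 in binary64 (the repeating 01 tail is cut off)
a = 4.0 / 3.0
b = a - 1.0          # exact by Sterbenz's lemma: b = (1 - 2**-52)/3
c = b + b + b        # exact: the true sum 1 - 2**-52 is representable
u = abs(c - 1.0)     # exact again, leaving exactly one ulp of 1.0
print(u)             # 2.220446049250313e-16 == 2**-52
```

So $u = 2^{-52}$, the spacing between $1$ and the next double, which is the "machine epsilon" in the convention MATLAB and NumPy use.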
3
votes
0
answers
152
views
Justification for the definition of relative error: why is it not a metric?
The absolute error and relative error operators are very commonly encountered while reading about topics from the fields of floating-point arithmetic or approximation theory.
Absolute error is
${ae(a,...
0
votes
2
answers
99
views
Tricks in the floating point operations for better numerical results
I'm attempting to comprehend a passage from the book "Computational Modeling and Visualization of Physical Systems with Python" which I may be mentally fatigued to grasp. Here's the issue: ...
2
votes
1
answer
182
views
Is there a stable algorithm for every well-conditioned problem?
Reading these notes on condition numbers and stability, the summary states:
If the problem is well-conditioned then there is a stable way to solve it.
If the problem is ill-conditioned then there is ...
0
votes
0
answers
27
views
Floating Point Precision Algorithm
In my database, data is stored with a precision of 10 digits, Decimal(30,10).
The user can enter $x$ or $1/x$; I need to save it as $1/x$. If the user enters ...
0
votes
0
answers
60
views
Secant method optimization - initial guesses with floating point precision?
Say I want to find the root of $f(x) = e^{-x} - 5$, and assume I start with initial guesses $x_0 = -3$ and $x_1 = 3$.
I define my update function as $x_i = x_{i-1} - f(x_{i-1}) * \frac{x_{i-1} - x_{i-...
1
vote
1
answer
172
views
Does using smaller floating-point numbers decrease rounding errors?
I started learning about floating point by reading "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg. On page 4 he presents a proof for the ...
1
vote
0
answers
22
views
How to calculate converted value for each number in a set using a conversion rate, having its sum equal exactly a rounded fixed converted total?
Say I have three numeric values: a total, converted total, and a conversion rate. These are fixed, given numbers, and the two totals always have the precision of two decimal places.
...
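A common answer to this kind of reconciliation problem is the largest-remainder method: round every converted value to cents, then distribute the leftover cents to the entries whose rounding moved them furthest from their ideal value. A sketch (the helper name and argument layout are hypothetical, chosen to match the question's setup):

```python
def convert_preserving_total(values, rate, converted_total):
    """Hypothetical helper: convert each amount at `rate`, round to 2 decimals,
    and force the rounded parts to sum exactly to `converted_total`
    using the largest-remainder method. Works in integer cents."""
    raw = [v * rate * 100.0 for v in values]            # ideal amounts in cents
    cents = [round(x) for x in raw]
    residual = round(converted_total * 100.0) - sum(cents)
    step = 1 if residual >= 0 else -1
    # adjust first the entries whose rounding moved furthest against `step`
    order = sorted(range(len(values)), key=lambda i: step * (cents[i] - raw[i]))
    for k in range(abs(residual)):
        cents[order[k % len(cents)]] += step
    return [c / 100.0 for c in cents]
```

Doing the bookkeeping in integer cents avoids accumulating binary rounding error across the per-item sums.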
0
votes
0
answers
56
views
Finding an expression for $\sqrt{x^2 + z^2}$ that is more precise in floating point arithmetic?
Assuming that both $x$ and $z$ have no representation errors, and that $\vert z^2 \vert \ll \vert x^2 \vert$. There must exist an expression for $\sqrt{x^2 + z^2}$ that is the same in exact arithmetic ...
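The usual candidate is the factored form $|x|\sqrt{1+(z/x)^2}$, which is what library `hypot` implementations compute carefully; a sketch:

```python
import math

# naive sqrt(x*x + z*z) can overflow or underflow in the intermediate
# squares even when the result is perfectly representable
print(math.hypot(1e200, 1e-200))   # 1e+200, no overflow

# and for |z| << |x| the z contribution is handled as accurately
# as the format allows (here the true value rounds to |x| itself)
print(math.hypot(3.0, 1e-12))
```

Whether `hypot` is *correctly rounded* depends on the platform's libm, but it is the standard stable expression for this quantity.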
0
votes
0
answers
38
views
How does a computer calculate matrix scalar multiplication order of operations (flops)
I am trying to understand the number of flops in the Householder QR factorization. In one line of the algorithm, it says
\begin{gather*}
v = v / \lVert v \rVert_2
\end{gather*}
I was wondering what ...
1
vote
1
answer
168
views
On the axioms of floating-point arithmetic
As I understand there are two "axioms" that should be satisfied in floating-point arithmetic:
$$\forall x\in \mathbb R,\ \exists |\varepsilon|\leq\varepsilon_{\text{machine}},\ \mbox{fl} (x) ...
0
votes
1
answer
60
views
Representation of rounding error in floating point arithmetic. [duplicate]
It is well known that in a Floating point number system:
$$
\mathbb{F}:=\{\pm \beta^{e}(\frac{d_1}{\beta}+\dots +\frac{d_t}{\beta^t}): d_i \in \{0,\dots,\beta-1\},d_1\neq 0, e_{\min}\leq e \leq e_{\...
1
vote
0
answers
34
views
Expression of sum in floating point system
This is a question of an exam on Numerical Analysis I had:
Consider the floating point system of base $2$, with a maximum of $53$ significand digits, maximum exponent $1025$ and minimum exponent $-1022$. That ...
0
votes
2
answers
164
views
Evaluating $a(b + c)$ more accurately with FMA
I'm using machine-precision floating-point arithmetic, and every so often it happens that I need to evaluate an expression of the form $a(b + c)$. I found that the accuracy can be improved using FMA (...
2
votes
0
answers
82
views
Numerically stable evaluation of factored univariate real polynomial
Suppose we have a real univariate factored polynomial, meaning we have its factors: an arbitrary number of polynomials of degree less than or equal to two. To simplify things, if necessary, let's ...
0
votes
0
answers
104
views
Bias in Single Precision Floating numbers
I have a question about single-precision floating-point numbers. It is about the bias, which can be derived from the exponent part of this representation.
On searching Google, most ...
3
votes
1
answer
109
views
How to compute this "smooth max operator"?
I was seeking for an alternate way to activate each neuron of a neural network non-linearly. Eventually, I came up with the following binary operation:
$$
x \lor y = \log (\exp x + \exp y)
$$
With $-\...
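This operation is the two-argument log-sum-exp, and the standard stable evaluation factors out the larger argument so the remaining exponential never overflows; a sketch:

```python
import math

def smooth_max(x, y):
    """log(exp(x) + exp(y)) = max(x,y) + log1p(exp(-|x - y|)).
    Factoring out the max keeps exp()'s argument <= 0, so no overflow."""
    m, n = (x, y) if x >= y else (y, x)
    return m + math.log1p(math.exp(n - m))

print(smooth_max(1000.0, 0.0))   # 1000.0, whereas exp(1000) overflows naively
```

When the arguments are far apart, `exp(n - m)` underflows to 0 and the result collapses to the plain max, which is exactly the "smooth max" behaviour the question describes.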
0
votes
0
answers
25
views
How to multiply 2 arrays with unique non-integers to produce an array with unique results?
Is there an algorithm/formula to multiply two arrays (1D & 2D) of unique numbers such that the resultant array contains unique results?
Would one have to create the 2 initial arrays in a certain ...
1
vote
1
answer
137
views
What is the set of all numbers that can be represented with a floating-point format?
Computers use single- (or, for more precise calculations, double-) precision floating-point formats to represent a subset of real numbers. While a decent chunk of real numbers can be stored with these ...
1
vote
0
answers
54
views
Method for finding the largest positive difference between two pairs of IEEE 754 double precision floating point numbers and fixed-point numbers
I have two pairs of IEEE 754 double precision (64-bit) floating-point numbers and unsigned fixed-point numbers, and I'm trying to find which pair has the greatest difference.
The fixed-point numbers ...
0
votes
0
answers
44
views
Is converting between roots and coefficients of a polynomial numerically stable?
Assume we're on a computer using 32-bit floats (or something similar), and I'm converting back and forth between the $n$ coefficients of a polynomial and the corresponding $n$ roots of the polynomial. ...
0
votes
1
answer
53
views
Storing a decimal number in a computer with a finite mantissa
I am learning about numerical methods and the following link caught my attention:
https://www.iro.umontreal.ca/~mignotte/IFT2425/Disasters.html
So from what I understand 0.1 is not exactly ...
0
votes
0
answers
66
views
Proof of loss of orthogonality in Gram-Schmidt
I am stuck at understanding about how to derive the following proofs related to error bounds which are given in the following slides. Can anyone please explain to me how these are derived?
-1
votes
1
answer
65
views
fl(A) where A is a square matrix
We defined $fl(x)$ to be the function $fl:\mathbb{R} \rightarrow \mathbb R_b (t, s)$ (i.e., takes reals and outputs the float). What does $fl(A)$ mean when $A \in \mathbb R ^{n \times n} $? I assume ...
-3
votes
1
answer
152
views
trouble understanding floating point representation
I had a quiz last week on floating point representation. After the instructor graded the quiz, he walked us through each step so that we could see what we did wrong. I took notes so that I could study his ...
-1
votes
1
answer
66
views
Determine The Base of The Venusian Numeration System [closed]
this question is from Thomas Koshy's book called "Discrete Mathematics With Applications":
Any idea how to do this question? I can tell that the base of the system is at least 3 (since we ...
0
votes
1
answer
76
views
Fast computation of $x^{1/p}$, where $x\in\mathbb{R}^+$ and $p=2^{n}$, where $n\in\mathbb{N}$ with bit shifts?
There is plenty of literature regarding the legendary Fast inverse square root routine from Quake, but can we do something similar to compute $x^{1/p}$ as given in the title?
Given that $p$ is a power ...
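Setting the bit-level tricks aside, the structure that makes $p = 2^n$ special is that $x^{1/2^n}$ is just $n$ successive square roots, each correctly rounded in IEEE arithmetic; a baseline sketch to compare any bit-shift seed against:

```python
import math

def root_pow2(x, n):
    """x**(1/2**n) for x > 0 via n successive square roots.
    Each sqrt is correctly rounded, so the error grows by at most
    a fraction of an ulp per step."""
    for _ in range(n):
        x = math.sqrt(x)
    return x

print(root_pow2(65536.0, 4))   # 65536^(1/16) = 2.0, exact here
```

A Quake-style exponent-shift on the bit pattern (shifting the biased exponent right by $n$, with bias correction) gives a cheap initial guess, but it still needs Newton steps to reach full precision; the loop above is the simple correctly-behaved reference.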
0
votes
2
answers
76
views
Algorithm for drawing generalised circles
A generalised circle is either a circle in the plane or a line. The general equation of one is:
$$A(x^2 + y^2) + Bx + Cy + D=0,$$
where $4AD - B^2 - C^2 \leq 0$. This can be checked by completing the ...
2
votes
0
answers
61
views
Adding inverses of nilpotents as an extension of the "extended real numbers"
This is an idea that I had while playing with an automatic differentiation system built on dual numbers. This system, like most computer algebra systems built on floating point arithmetic, has the ...
0
votes
0
answers
613
views
Are there any ways to increase the precision in MATLAB without built in functions?
I am a beginner learning about MATLAB scientific computation, floating point numbers, and numerical error. When I am using a very small $x$ value for some equations, such as $y(x) = (\exp(x)-1-x)/x^2$,...
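For this particular $y(x)$ the standard fix needs no extra-precision tooling: even `expm1` cannot save the second subtraction for tiny $x$, so one switches to the Taylor series there. A Python sketch of the idea (MATLAB has `expm1` as well, and the same two-branch approach applies):

```python
import math

def y(x):
    """(exp(x) - 1 - x) / x^2 without catastrophic cancellation.
    For small |x| the numerator is ~x^2/2 and both subtractions cancel,
    so switch to the Taylor series 1/2 + x/6 + x^2/24 + ... there."""
    if abs(x) < 1e-4:
        return 0.5 + x / 6.0 + x * x / 24.0
    return (math.expm1(x) - x) / (x * x)

print(y(1e-9))   # ~0.5, whereas (math.exp(1e-9) - 1 - 1e-9)/1e-18 is garbage
```

The crossover threshold is a judgment call: below it the truncated series is accurate to well under the cancellation error of the direct formula, and above it the direct formula is fine.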
3
votes
2
answers
156
views
Explanation for MATLAB floating point number calculation?
I am a beginner studying scientific computation, more specifically floating point numbers and precision in matlab. When testing the outputs of 2 of the following equations, I am not sure how matlab ...
0
votes
1
answer
68
views
Round-Off Unit Formula [duplicate]
My textbook states the following:
If $x\in \mathbb R$ such that $x_{\text{min}}\leq |x| \leq x_{\text{max}}$, then $$fl(x) = x(1+\delta) \text{ with } |\delta | \leq u$$ where $$u = \frac12 \beta^{1-...