Collection that uses a bitset to store a huge set of unique integers from a given range (rev. 1)

Question

Some (skippable) context:

I'm writing program that minimizes an arbitrary logical function. Such function takes an unsigned integer as an argument and returns either true or false. The input to my program is a list of all values for which the minimized function returns true. Those values are called "minterms". Because I want to handle functions that take arguments up to 32 bits, the number of minterms may reach billions and I need a way to effectively store them.

Description of the code:

This is a collection that stores set of unique 32-bit unsigned integers from range 0..n where n is decided at runtime and can be as high as 2^32 - 1. In a typical case, the number of stored integers is close to the half of the size of the range.

The primary concern while writing this, was RAM usage. Using std::vector<std::uint32_t> would result in 16 GB of memory in the worst case. Instead, I've opted for using a std::vector<bool> as a bitset. Each bool in it represents an integer equal to its the index in the vector, where true means that the number is present and false that it is not. This reduces the memory usage to "merely" 512 MB.

Another, also very important concern, is the speed. This is the reason for which almost everything is inlined and contained in the header file. Fortunately, this way of storing the data allows for checking and storing arbitrary values in O(1) time and this is the may mode of interaction with the data.

The iterator is not invalidated when elements are added / removed, which is useful in a few algorithms.

I tried to make this code work in all the popular compilers. I also avoided using any external libraries to make it compile out of the box for every user.

Purpose of the review:

I think the algorithm I'm using is sound and decently optimized but of course I welcome any suggestions on how to make this code faster.

The main purpose of the review, however, is too gauge the quality and readability of the code.

I have quite a lot of experience with C++ but I'm mostly self-learned. I've never worked professionally with other people knowing this language so I might be not aware of some more obscure conventions in the language. Please tell me if I'm doing anything that an average professional C++ programmer would consider weird. One thing that I am aware of is that my naming convention does not match the one from the standard library. I just like the Java naming convention.

Do you think my code is readable? Are any names unclear? Should I add an explanatory comment somewhere?

Are there any aspects of the C++ language that would be in some way helpful but I seem unaware of? E.g. things like the attribute [[nodiscard]].

Roast my code:

This code uses 2 headers from my project that are not included in the review:

"Minterm.hh" defines the type Minterm which is just an alias to std::uint_fast32_t.
"global.hh" defines a value maxMinterm which depends on the program's input but it doesn't change after its initialization at the beginning. It's value is in range from 0 to 2^32 - 1.

Minterms.hh:

#pragma once

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

#include "global.hh"
#include "Minterm.hh"


class Minterms
{
    std::vector<bool> bitset;
    std::size_t size = 0;
    
public:
    using overlapping_t = std::vector<Minterm>;
    
    class ConstIterator
    {
        const Minterms &minterms;
        std::uint_fast64_t i;
        ConstIterator(const Minterms &minterms, const std::uint_fast64_t i) : minterms(minterms), i(i) {}
        friend class Minterms;
    public:
        [[nodiscard]] bool operator==(const ConstIterator &other) const { return this->i == other.i; }
        [[nodiscard]] bool operator!=(const ConstIterator &other) const { return this->i != other.i; }
        inline ConstIterator& operator++();
        inline ConstIterator& operator--();
        [[nodiscard]] Minterm operator*() const { return static_cast<Minterm>(i); }
    };
    
    Minterms() : bitset(static_cast<std::uint_fast64_t>(::maxMinterm) + 1, false) {}
    
    [[nodiscard]] bool operator==(const Minterms &other) const { return this->bitset == other.bitset; }
    [[nodiscard]] bool operator!=(const Minterms &other) const { return this->bitset != other.bitset; }
    
    [[nodiscard]] bool isEmpty() const { return size == 0; }
    [[nodiscard]] bool isFull() const { return size - 1 == ::maxMinterm; }
    [[nodiscard]] std::size_t getSize() const { return size; }
    [[nodiscard]] std::size_t getCapacity() const { return bitset.size(); }
    [[nodiscard]] bool check(const Minterm minterm) const { return bitset[minterm]; }
    [[nodiscard]] inline overlapping_t findOverlapping(const Minterms &other) const;  // Optimized for when there is little to none of them.
    inline bool add(const Minterm minterm);
    inline void add(const Minterms &other, const std::size_t overlappingCount);
    inline bool remove(const Minterm minterm);
    
    [[nodiscard]] inline ConstIterator begin() const;
    [[nodiscard]] ConstIterator end() const { return {*this, bitset.size()}; }
    [[nodiscard]] ConstIterator cbegin() const { return begin(); }
    [[nodiscard]] ConstIterator cend() const { return end(); }
    
#ifndef NDEBUG
    void validate() const;
#endif
};


Minterms::ConstIterator& Minterms::ConstIterator::operator++()
{
    for (++i; i != minterms.bitset.size() && !minterms.bitset[i]; ++i) { }
    return *this;
}

Minterms::ConstIterator& Minterms::ConstIterator::operator--()
{
    for (--i; i != 0 && !minterms.bitset[i]; --i) { }
    return *this;
}

Minterms::overlapping_t Minterms::findOverlapping(const Minterms &other) const
{
    overlapping_t overlapping;
    std::set_intersection(
            this->cbegin(), this->cend(),
            other.cbegin(), other.cend(),
            std::back_inserter(overlapping));
    return overlapping;
}

bool Minterms::add(const Minterm minterm)
{
    const bool previous = check(minterm);
    if (!previous)
    {
        bitset[minterm] = true;
        ++size;
    }
    return !previous;
}

void Minterms::add(const Minterms &other, const std::size_t overlappingCount)
{
    std::transform(
            other.bitset.begin(), other.bitset.end(),
            this->bitset.begin(), this->bitset.begin(),
            std::logical_or<bool>());
    this->size += other.size - overlappingCount;
}

bool Minterms::remove(const Minterm minterm)
{
    const bool previous = check(minterm);
    if (previous)
    {
        bitset[minterm] = false;
        --size;
    }
    return previous;
}

Minterms::ConstIterator Minterms::begin() const
{
    ConstIterator iter{*this, 0};
    if (!bitset.empty() && !bitset[0])
        ++iter;
    return iter;
}

Minterms.cc:

#include "./Minterms.hh"

#include <cassert>


#ifndef NDEBUG
void Minterms::validate() const
{
    std::size_t actualCount = 0;
    for ([[maybe_unused]] const Minterm minterm : *this)
        ++actualCount;
    assert(size == actualCount);
}
#endif

There is now another question with the second revision of this code: codereview.stackexchange.com/q/289262/78158 — Piotr Siupa, Commented Jan 31 at 23:20

G. Sliepen · Accepted Answer · 2024-01-27 22:44:25Z

Don't call the `class` itself `Minterms`

While your program might need to store a list of "minterms", the data structure you are using for this doesn't care what is stored in it. I would call the class you make Bitset or something like that, and then in your actual code you write something like:

Bitset minterms(…);

Consider not creating a `class`

Instead of creating your own class that wraps a std::vector<bool>, consider whether you can just use a std::vector<bool> directly. Most of your member functions do things you could directly do on a std::vector<bool>. I see two things that are different:

You have added a way to iterate over only the set bits in the bitset. However, you can add your own get_set_bits() function that takes a std::vector<bool> as an input, and returns a range that can be iterated over and that returns the indices of all the set bits. You can reuse your ConstIterator for that.
You keep track of the number of set bits. While you could call something like std::ranges::count() on a std::vector<bool> to count the number of set bits, that would have to iterate over the whole vector, which might take a lot of cycles if maxMinterm is very large.

If you don't need to know the number of set bits often and/or efficiently, then I would just use std::vector<bool> directly.

Use `std::bitset` if `maxMinterm` is known at compile time

If you know maxMinterm at compile time, then you can use std::bitset instead of a std::vector<bool>. It has a member function count() that returns the number of set bits.

Make it look and behave like a standard library container

If you do need to create your own class, then at least make it look and behave like the containers from the standard library. This way, someone already familiar with the STL doesn't need to learn that your class uses isEmpty() instead of empty() to check if the container is empty, instead of empty(). Furthermore, the standard algorithms that work on STL containers can also work on your container, but only if you provide the same interface. It also makes it easier to replace one type of container with another.

Remove functions that don't need to members. For example, findOverlapping() just calls std::set_intersection(). The user of your class can call that directly themselves on your class if you give it the right interface. Then they get more flexibility, for example if they want to do something other than have the result back-inserted into a std::vector<Minterm>.

Unsafe `add()`

The version of add that takes a reference to another Minterms and an overlappingCount is unsafe. It trusts that overlappingCount is passed correctly. But what if it is not? Then this->size will be incorrect. I would avoid the issue completely by just writing:

void Minterms::add(const Minterms &other)
{
    for (auto minterm: other) {
        add(minterm);
    }
}

About `operator--()`

With standard library iterators, it is never valid for the caller to decrement an iterator past begin(). So you don't need the check for i != 0, and can just write:

Minterms::ConstIterator& Minterms::ConstIterator::operator--()
{
    while (!minterms.bitset[--i]) {}
    return *this;
}

Keeping track of size in a variable is absolutely necessary. As I wrote, I deal with sets that can have billions of elements. The time of executing std::ranges::count() can be measured in literally seconds. Is there actually a good reason to not creating a class? It encapsulates things nicely so I don't need to worry about it being internally a bitset. E.g., I have for (const Minterm &minterm : minterms) in a few places in the program, which is very convenient to write and easy to read. — Piotr Siupa, Commented Jan 31 at 13:22
maxMinterm is not known at compile time. It comes from program's input. Although, there are only 33 possible values so theoretically I could make it into a template class. But then I would have to do it with the classes that use it to and I would end converting almost the entire program. This would probably increase it's size by order of magnitude. Alternatively, I could make some kind of wrapper that hides the template from other classes and chooses the right specialization at runtime but I think that would cripple performance too much. — Piotr Siupa, Commented Jan 31 at 13:29
The unsafe add() needs to stay for the same reason as the variable for size. I always know the count overlapping elements when I'm calling it and counting them again is too costly. I consider calling add() with a wrong count an undefined behavior, the same as decrementing begin(). However, I agree that if this class was meant more for general use and not strictly tailored to my needs, add() without the second argument would be better. Maybe I'll add it just in case someone want's to copy my code. — Piotr Siupa, Commented Jan 31 at 13:34
It seems you have a good understanding of the tradeoffs you are making, which is great! The safe add() I've shown doesn't have to count all the elements in bitset again; it just needs to count how many bits it flipped. That might cost a little bit extra, but on the other hand, you don't need to calculate overlappingCount any more, so maybe it evens out? — G. Sliepen, Commented Jan 31 at 15:00
I still calculate overlappingCount because I need it earlier in the algorithm that calls add(). I guess this solution is very specific to my particular case. — Piotr Siupa, Commented Jan 31 at 17:13

Stack Exchange Network

Collection that uses a bitset to store a huge set of unique integers from a given range (rev. 1)

1 Answer 1

Don't call the `class` itself `Minterms`

Consider not creating a `class`

Use `std::bitset` if `maxMinterm` is known at compile time

Make it look and behave like a standard library container

Unsafe `add()`

About `operator--()`

Not the answer you're looking for? Browse other questions tagged
c++
collections
c++20
memory-optimization
set
or ask your own question.

Linked

Hot Network Questions

Collection that uses a bitset to store a huge set of unique integers from a given range (rev. 1)

1 Answer 1

Don't call the class itself Minterms

Consider not creating a class

Use std::bitset if maxMinterm is known at compile time

Make it look and behave like a standard library container

Unsafe add()

About operator--()

Not the answer you're looking for? Browse other questions tagged c++collectionsc++20memory-optimizationset or ask your own question.

Linked

Related

Hot Network Questions

Don't call the `class` itself `Minterms`

Consider not creating a `class`

Use `std::bitset` if `maxMinterm` is known at compile time

Unsafe `add()`

About `operator--()`

Not the answer you're looking for? Browse other questions tagged
c++
collections
c++20
memory-optimization
set
or ask your own question.