Processing XYZ data from a large file

Question

I'm new to programming (1.5 years experience at university), and I'm working on my first computer vision related project! My goal is to take a file which might have CSV (comma separated) or txt format with XYZ coordinates and process the data. I want to round the X and Y coordinates into integers and leave the Z coordinate as a floating point.

Each file ranges from 200000 to 1400000 lines of XYZ data is is formatted as so:

# information about .txt file
253.9999929 58.0428367 -21.3930063253
.9999929 59.0435773 -21.2499391255
etc.....

Do note that the file might have commas, like a CSV file.

The function that is my main bottleneck is:

Loadfile(string filename);

I'm pretty sure my bottleneck is from the ifstream for reading the file one line at a time, but I'm having trouble finding ways to improve it. One idea I had was to convert the file into an array of bytes and read split it up into different domains and give it to multiple threads to read concurrently, but I'm not sure if that's a good practice.

Here is the my code so far:

class XYZCoordinate{
public:
    int x;
    int y;
    float z;

    XYZCoordinate(int X, int Y, float Z){
    x = X;
    y = Y;
    z = Z;
    }
};

class Loadfileoutput
{
public:
    int xmin;
    int xmax;
    int ymin;
    int ymax;
    vector<XYZCoordinate> values;

    Loadfileoutput(int xmn, int xmx, int ymn, int ymx, vector<XYZCoordinate> val){
    xmin = xmn;
    xmax = xmx;
    ymin = ymn;
    ymax = ymx;
    values = val;
    }
};

Loadfileoutput Loadfile(string filename)
{
    std::ifstream in(filename);
    if (!in) {
        cout << "Cannot open input file.\n";
        exit(EXIT_FAILURE);
    }

    std::string str;
    bool isfirst = true;
    int xmin, xmax, ymin, ymax;
    vector<XYZCoordinate> file = vector<XYZCoordinate>();
    while (std::getline(in, str)) //potential cause of bottleneck
    {
        if (str[0] == '#') //.skipping the .txt file header
            continue;
        vector<string> v = vector<string>();

        boost::split(v, str, boost::is_any_of(" ,"));

        string xstr = v[0];
        string ystr = v[1];
        string zstr = v[2];
        int xint, yint;
        float x,y,z;

        stringstream(v[0]) >> x;
        xint = (int)round(x);

        stringstream(v[1]) >> y;
        yint = (int)round(y);

        stringstream(v[2]) >> z;

        XYZCoordinate temp = XYZCoordinate(xint, yint, z);
        file.push_back(temp);

        //get bounds
        if (isfirst)
        {
            xmin = xint;
            xmax = xint;
            ymin = yint;
            ymax = yint;
            isfirst = false;
        }
        else
        {
            //set xmin/max for scaling
            if (xint > xmax)
                xmax = xint;
            else if (xint < xmin)
                xmin = xint;
            //set ymin/max for scaling
            if (yint > ymax)
                ymax = yint;
            else if (yint < ymin)
                ymin = yint;
        }
    }   
    Loadfileoutput output = Loadfileoutput(xmin,xmax,ymin,ymax,file);

    return output;
}

Why don't you just file >> x >> y >> z instead of reading a line, putting it into stringstream, thein reading them? It also looks like input file contains commas, whereas your example does not show that. — Incomputable, Commented Aug 15, 2018 at 17:49
Since you're already using boost, have you looked into Spirit? — yuri, Commented Aug 15, 2018 at 17:50
@CaseyPoon, you can ask iostream to treat comma as whitespace. Will it be sufficient? Of course, you can always grab a heavy one as yuri suggested using boost.Spirit. Anyway, please specify all of the possible input kinds in post itself. — Incomputable, Commented Aug 15, 2018 at 17:54
@CaseyPoon, I edited the post to include the information. Please do not miss those important details, as description was telling one requirement set, and the code another. It is quite confusing. — Incomputable, Commented Aug 15, 2018 at 18:21
What will you be doing with these coordinates? If you're just writing them straight out to another file, then you'll want to work as a line-by-line filter, instead of storing them into a vector. That will improve your performance greatly by allowing your OS to read ahead whilst you're processing and writing the output. — Toby Speight, Commented Aug 16, 2018 at 10:32

Edward · Accepted Answer · 2018-08-17 12:22:46Z

There are a number of things that may help you improve your program.

Don't abuse `using namespace std`

Putting using namespace std at the top of every program is a bad habit that you'd do well to avoid.

Use the required `#include`s

The code uses a number of #includes that are not listed. It was not difficult to infer, but it helps reviewers if the code is complete. After a bit of looking, it seems that this is what's needed:

#include <vector>
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>
#include <cmath>
#include <boost/algorithm/string.hpp>

Prefer modern initializers for constructors

The constructor use the more modern initializer style rather than the old style you're currently using. Instead of this:

XYZCoordinate(int X, int Y, float Z){
x = X;
y = Y;
z = Z;
}

I would write this:

XYZCoordinate(int X, int Y, float Z) : x{X}, y{Y}, z{Z} {}

However, in this case, for such a simple class, in real code I wouldn't write a constructor at all and would simply delete the line. The class can still be initialized like this: XYZCoordinate xyz{3, 5, -23.22}; via aggregate initialization.

Use better variable names

The variable names xmax, xmin, etc. are good, but the name file is not. The first names explain something about what the variable means within the context of the code, but the latter is only confusing. A better name might be coordinates. Similarly, Loadfileoutput is another poor name -- I'd suggest Coordinates for the name of that class.

Use C++ idiom

The code currently contains this:

vector<XYZCoordinate> file = vector<XYZCoordinate>();

That's really not necessary. Just write this instead:

vector<XYZCoordinate> coordinates{};

Prefer a `struct` to a `class` for objects with no invariants

C++ has two primary means to encapsulate some behavior and data: struct and class. The only difference is that a struct has its members public by default, while the class members are private by default. If there is no invariant, (a thing that must always be true for the class to be coherent), then it often makes sense to use a struct instead. In this case the XYZCoordinate class has no invariants and is probably better suited as a struct.

Write a custom extractor

Here's the code I'd prefer to write to read in and create the desired object:

std::ifstream in(filename);
Coordinates c;
in >> c;

This can be done by writing an extractor for the data. In this case, I'd write two extractors. One for the Coordinates class and one for the XYZCoordinate class. Here's one way to write one for the XYZCoordinate class. Here's the complete definition for that class, which I have converted to a struct:

struct XYZCoordinate{
    int x;
    int y;
    float z;
    friend std::istream& operator>>(std::istream& in, XYZCoordinate &xyz) {
        float a, b;
        in >> a >> b >> xyz.z;
        xyz.x = std::rint(a);
        xyz.y = std::rint(b);
        return in;
    }
};

Now we can write an extractor for the Coordinates class. Here's that complete class:

class Coordinates
{
    int xmin;
    int xmax;
    int ymin;
    int ymax;
    std::vector<XYZCoordinate> values;

public:
    friend std::ostream& operator<<(std::ostream& out, const Coordinates &c) {
        return out << "[ " << c.xmin << ", " << c.xmax << "], [ " 
            << c.ymin << ", " << c.ymax << "], with " << c.values.size() << " points\n";
    }
    friend std::istream& operator>>(std::istream& in, Coordinates &c) {
        // skip comment lines
        while (in.peek() == '#') {
            in.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); 
        }
        XYZCoordinate xyz;
        while (in >> xyz) {
            c.values.push_back(xyz);
            c.xmin = std::min<int>(c.xmin, xyz.x);
            c.xmax = std::max<int>(c.xmax, xyz.x);
            c.ymin = std::min<int>(c.xmin, xyz.y);
            c.ymax = std::max<int>(c.xmax, xyz.y);
        }
        return in;
    }
};

Note that the values of xmin, xmax, etc. are invariants for this class, so we leave it as a class with only private data members. If you need to access the items in the vector, you can add custom functions to handle that.

Provide a test driver

This is more about getting good reviews than the code itself, but it's often very useful if you provide some simple main that exercises the code. This helps reviewers (and yourself!) understand how it's to be used in context. In this case, I wrote this:

#include <vector>
#include <iostream>
#include <fstream>
#include <locale>
#include <limits>
#include <algorithm>

// all of the class code above goes here

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: xyz infile.txt\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    in.imbue(std::locale(std::locale(), new csv_reader()));
    Coordinates c;
    in >> c;
    std::cout << c;
}

The call to imbue uses a locale that treats commas as whitespace. That code is in this excellent answer. Now you don't have to explicitly check for commas or spaces and don't need boost. Everything here is standard C++.

Results

I wrote a program to create a set of 20 millon sets of random coordinates. Using that input, your original program took 41.9 seconds on my machine, while the one presented above took 14.8 seconds.

Out of curiosity, why x{X} instead of x(X)? Similarly, why vector<...> coordinates {}; instead of just vector<...> coordinates;? — anon, Commented Aug 15, 2018 at 21:18
Additional question, would it be considered standard practice to include the the modern constructor in the header file or should I keep it in the cpp file? — Casey Poon, Commented Aug 15, 2018 at 21:49
@NicHartley: I use the C++11 uniform initialization pretty much whenever I can, for the reasons listed in the link. — Edward, Commented Aug 15, 2018 at 22:21
@CaseyPoon: Good question -- for an extremely simple initialization like the one shown here for XYZCoordinate, I'd actually put it in neither file and let the compiler autogenerate it. — Edward, Commented Aug 15, 2018 at 22:23
Small correction: OP isn’t using “old style initialisation”. They’re not using intialisation, but instead assignment. This is a fundamental difference (it won’t work correctly with references and const variables, for instance). — Konrad Rudolph, Commented Aug 16, 2018 at 9:59

Community · Accepted Answer · 2020-06-10 13:24:26Z

Well, lets start from the CodeReview:

Since this is not really a library, I will omit my typical "Make it easy to use correctly, hard to use incorrectly", but it is still important thing to keep in mind.

Using classes instead of structs

In this case, it seems like struct should be used:

struct point {
    int x;
    int y;
    double z;
};

Better name, like Point3D might be applied. XYZCoordinate, although explicit, is a bit verbose. Matemathicians might argue that names miss coordinate system names, but I guess it is out of problem domain at the moment.

Passing by value

Although it is fine to accept by value and move at the call site, passing such big objects by value is rarely used, especially since read only view is needed. Prefer to use const T& for read only view, where T is the type of passed in object.

Do not call default constructor unless required

 vector<XYZCoordinate> file = vector<XYZCoordinate>();

should be just

vector<XYZCoordinate> file;

Calling default constructor explicitly increases chances of running into vexing parse. To call default constructor, just use {}:

std::vector<int>{};

Try to find better names
```
 class Loadfileoutput;
```

Is somewhat confusing. One would think that there should a file in it somewhere, but there is no file. Something like Dataset, or CollectedData might be better.

Knowledge of standard library

@R.Sahu showed you some suggestions on improving the performance of the reading part in .txt files. Here is an implementation which somewhat incorporates the first suggestion (misses the part about dealing with errors):

#include <iostream>
#include <cctype>
#include <vector>
#include <limits>

struct point {
    int x;
    int y;
    double z; 
};

std::istream& operator>>(std::istream& is, point& p) {
    double x, y, z;
    is >> x >> y >> z;
    p = {static_cast<int>(x), 
         static_cast<int>(y), 
         z}; //explicit conversion, as implicit is not allowed
    return is;
}

std::ostream& operator<<(std::ostream& os, const point& p) {
    return os << p.x << ' ' << p.y << ' ' << p.z;
}

struct custom_classification : std::ctype<char> {
    custom_classification() : ctype(make_table()) { }
private:
    static mask* make_table() {
        const mask* classic = classic_table();
        static std::vector<mask> v(classic, classic + table_size);
        v[','] |= space;
        return &v[0];
    }
};

std::vector<point> read_points(std::istream& is) {
    auto old_locale = is.imbue(std::locale(is.getloc(), new custom_classification));
    is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    
    auto points = std::vector(std::istream_iterator<point>(is), {});
    is.imbue(old_locale);
    return points;
}

int main() {
    auto points = read_points(std::cin);
    for (auto&& a_point: points) {
        std::cout << a_point << '\n';
    }
}

Demo on Wandbox.

There are several things going on here:

point is standalone.

It knows how to read itself from a stream, how to print itself into stream, and it handles that narrowing conversion on its own. Do note the use of aggregate initializer.
Ignoring commas and other formatting issues are delegated to stream. The old locale is restored, to behave as a good citizen and avoid surprising behaior when code gets modified.
Iterators are used extensively to reduce boilerplate and naked loops, although at some cost of being beginner friendly

To search for min-max while reading, one could use idiomatic while (first != last) loop:

dataset dset; //contains vector, min and max
auto first = std::istream_iterator<point>(is);
auto last = std::istream_iteratior<point>{};
//initialize dset to reasonable defaults, for min-max search

while (first != last) {
    auto next_point = *first++;
    if (next_point.x > dset.xmax) {
       dset.xmax = next_point.x;
    } else if (next_point.x < dset.xmin) {
        dset.xmin = next_point.x;
    }
    //ditto for y
    dset.values.push_back(next_point);
}

Good review, but I think this misses the fact that there may be comment lines in the file. — Edward, Commented Aug 15, 2018 at 19:49

R Sahu · Accepted Answer · 2018-08-15 17:57:04Z

8

Suggestion 1: Don't split the lines

I don't see the need for splitting each line into tokens and extracting the numbers from each token. You can extract all the numbers from a line by using a istringstream with the entire line.

Replace the lines

boost::split(v, str, boost::is_any_of(" ,"));

string xstr = v[0];
string ystr = v[1];
string zstr = v[2];
int xint, yint;
float x,y,z;

stringstream(v[0]) >> x;
xint = (int)round(x);

stringstream(v[1]) >> y;
yint = (int)round(y);

stringstream(v[2]) >> z;

by

std:istringstream istr(str);
if ( istr >> x >> y >> z )
{
   // Process the data
}
else
{
   // Deal with the error.
}

Suggestion 2: Use a `char []` instead of `std::string` for lines

Constructing and destrucing std::strings for each line is potentially expensive. Use char [] of sufficient size.

const size_t MAX_LINE_LENGTH = 1000; // Make it large enough for your app.
char line[MXX_LINE_LENGTH];
while ( in.getline(line, MAX_LINE_LENGTH) )
{
   // Use line
}

answered Aug 15, 2018 at 17:57

R Sahu

3,50212 silver badges20 bronze badges

1

\$\begingroup\$ It seems like problem description is incomplete. \$\endgroup\$
– Incomputable
Commented Aug 15, 2018 at 17:58
\$\begingroup\$ @Incomputable, after reading the comment about CSV files, it does seem that way. \$\endgroup\$
– R Sahu
Commented Aug 15, 2018 at 18:03
1

\$\begingroup\$ I guess I'll upvote it anyway. Newcomers' posts are usually half baked on the start anyway, as they are new to the post format and community. \$\endgroup\$
– Incomputable
Commented Aug 15, 2018 at 18:05
1

\$\begingroup\$ Actually, I learnt a lot from your contributions on SO, so it should be me thanking you. In some sense, I was raised here :) \$\endgroup\$
– Incomputable
Commented Aug 15, 2018 at 18:09

Add a comment |

Community · Accepted Answer · 2021-10-07 07:34:52Z

class XYZCoordinate{
public:
    int x;
    int y;
    float z;

    XYZCoordinate(int X, int Y, float Z){
    x = X;
    y = Y;
    z = Z;
    }
};

Integrate a formatting utility, like clang-format or astyle, into your tool-chain. Your indentations and spacing are inconsistent.

Do you plan on working with x and y as floats? Avoid transformations until you absolutely need them.

For constructors, prefer initialization over assignment.

class XYZCoordinate {
public:
    XYZCoordinate(int x, int y, float z) : x{x}, y{y}, z{z} {}

    int x;
    int y;
    float z;
};

Consider leveraging the type system to safely operate on the dimensions of your coordinate class. You can read up on strong types here and here

Loadfileoutput(int xmn, int xmx, int ymn, int ymx, vector<XYZCoordinate> val){

Avoid using namespace std;.

Your coordinate vector argument is being copied here. Treat vector<XYZCoordinate> either as a input-only argument or as a sink argument.

// In-Only - Pass by reference to const then copy
Loadfileoutput(/* ... */, const std::vector<XYZCoordinate>& val) 
  : /* ... */
  , values{val} // copied!
{}  

// Sink - Pass by value and move into place
Loadfileoutput(/* ... */, std::vector<XYZCoordinate> val) 
  : /* ... */
  , values{std::move(val)} // moved!
{}

    std::string str;

Poorly-chosen names can mislead the reader and cause bugs. It is important to use descriptive names that match the semantics and role of the underlying entities, within reason. Your intent is to use the string as a buffer, so just name it buffer.

    bool isfirst = true;
    int xmin, xmax, ymin, ymax;

Any time you have a situation where you need a boolean flag to distinguish between a value being set or not, consider using an optional type (boost::optional, std::optional, etc).

    vector<XYZCoordinate> file = vector<XYZCoordinate>();

Don't repeat yourself.

    std::vector<XYZCoordinate> coordinates(); 
    ^                          ^
    Type mentioned once        Has an appropriate name

    while (std::getline(in, str))
        if (str[0] == '#') //.skipping the .txt file header

std::getline does not skip leading whitespace. Use the IO manipulator std::ws:

    while (std::getline(in >> std::ws, str))
        if (str[0] == '#') //.skipping the .txt file header

If you really want to use std::getline and a string buffer, consider using string_views liberally to access the buffer. Trim and split are easy to implement for string views. Numeric conversions already exists (std::from_chars).

        vector<string> v = vector<string>();

std::vector doesn't have a small buffer optimization to take advantage of, so every loop on a line builds up and tears down this buffer. Move it outside the loop. That will also mean that you need to appropriately size it (3 records + remaining) and you'll need to wrap either boost::split or boost::is_any_of with that limits to 3 splits.

std::string does have a small string optimization, but it's implementation defined on how much space you have before you are forced to allocate.

The longest coordinate from your example above, including sign and decimal, is 14 characters in length. If it's possible for coordinates to be longer than the SSO supported capacity of your implementation, consider using a string view instead (boost::string_view, std::string_view) and use from_chars() to convert to your numeric type.

        boost::split(v, str, boost::is_any_of(" ,"));

Separators are an interesting detail you must take care with. Locales using the SI style may use spaces for its thousands separator. Almost half of the countries using the hindu-arabic numeral system use period for its decimal separator. A similar amount use the comma as a decimal separator. The rest use some combination of both and other characters. Unless you are sure everyone uses the same locale, you just can't split on comma. Similarly, CSV fields typically allow commas to be part of the value, as long as they are part of an escaped sequence,

212311231,3.14,"3,14"
         ^    ^  ^
         |    |  Don't split on this
         Split on these commas

If you want to support CSVs, then support the full specification or document the extent by which you support comma separated values.

        string xstr = v[0];
        string ystr = v[1];
        string zstr = v[2];

These are unused variables and should be removed. Your compiler didn't mention it because copying may have intended side effects for non-pod types. Keep in mind that you have pointers and references if you need to locally refer to something.

        int xint, yint;
        float x,y,z;

        stringstream(v[0]) >> x;
        xint = (int)round(x);

        stringstream(v[1]) >> y;
        yint = (int)round(y);

        stringstream(v[2]) >> z;

You construct and destruct 3 instances of std::stringstream, which is expensive.

Make sure you are using the C++ version of std::round by including <cmath>.

Calls to round can fail. Check to see if std::round flagged an error with std::fetestexcept.

        XYZCoordinate temp = XYZCoordinate(xint, yint, z);
        file.push_back(temp);

You've created a named temporary object and you copy it into your vector. Since you no longer need temp, just move it into the vector.

        XYZCoordinate temp(xint, yint, z);
        file.push_back(std::move(temp));

You can shorten this up by avoiding the named temporary.

        file.push_back(XYZCoordinate(xint, yint, z));

You can avoid the temporary by directly emplacing the objects arguments.

        file.emplace_back(xint, yint, z);

Consider another approach that doesn't rely on string conversions and the temporary shuffling. Others have mentioned implementing operator>>(istream&, XYZCoordinate&) which does a formatted read and the rounded conversion.

   friend std::istream& operator>>(istream& in, XYZCoordinate& c) {
        float f_x;
        float f_y;

        if (!(in >> f_x >> f_y >> c.z)) {
            return in;
        }
        
        c.x = static_cast<int>(std::round(f_x)); // check the error!
        c.y = static_cast<int>(std::round(f_y)); // check the error!
        return in;
    }

Back in the line reading loop, you simply move the sentry in the file, skipping whitespace and unnecessary portions (your data is 1 record per line, so skip everything after the third value).

    while (in >> std::ws) {      // skip any whitespace
        if (in.peek() == '#') {  // skip comment lines
            consume_line(in);    // by skipping to '\n'
            continue;
        }

        auto& coord = coordinates.emplace_back();  // create the object     
        if (!(in >> coord)) {                      // write its values
            // Read failed: Log, throw, something
        }
        consume_line(in); // the record has been read, to the next record!

        // use coord.x and coord.y for minmax finding.
    }

As the comment says, consume_line just skips all the remaining characters upto and including the end-of-line character.

std::istream& consume_line(std::istream& stream) {
    return stream.ignore(std::numeric_limits<std::streamsize>::max(), stream.widen('\n'));
}

    Loadfileoutput output = Loadfileoutput(xmin,xmax,ymin,ymax,file);
    return output;

Similarly on simplifying things, you can directly initialize output.

    Loadfileoutput output(xmin, xmax, ymin, ymax, file);
    return output;

You don't need a named return variable as you just returning it immediately and making no changes.

    return Loadfileoutput(xmin, xmax, ymin, ymax, file);

Your compiler will still be able to use return value optimization and construct the returned value in-place.

AJNeufeld · Accepted Answer · 2018-08-15 19:19:00Z

Preallocate space in your vector for storage of your points, with .reserve(size). Without preallocation, .push_back() may need to reallocate and copy your data points several times, which takes time.

Instead of counting lines, a reasonable estimate could be based on the file size ... say, size of file divided by 32, for 3 nine-digit values per line, plus two spaces, two commas and a new line. Over estimating is probably better than underestimating; resize down to actual size after you’ve read all the data.

Stack Exchange Network

Processing XYZ data from a large file

5 Answers 5

Don't abuse `using namespace std`

Use the required `#include`s

Prefer modern initializers for constructors

Use better variable names

Use C++ idiom

Prefer a `struct` to a `class` for objects with no invariants

Write a custom extractor

Provide a test driver

Results

Suggestion 1: Don't split the lines

Suggestion 2: Use a `char []` instead of `std::string` for lines

Not the answer you're looking for? Browse other questions tagged
c++
performance
beginner
object-oriented
boost
or ask your own question.

Hot Network Questions

Processing XYZ data from a large file

5 Answers 5

Don't abuse using namespace std

Use the required #includes

Prefer modern initializers for constructors

Use better variable names

Use C++ idiom

Prefer a struct to a class for objects with no invariants

Write a custom extractor

Provide a test driver

Results

Suggestion 1: Don't split the lines

Suggestion 2: Use a char [] instead of std::string for lines

Not the answer you're looking for? Browse other questions tagged c++performancebeginnerobject-orientedboost or ask your own question.

Related

Hot Network Questions

Don't abuse `using namespace std`

Use the required `#include`s

Prefer a `struct` to a `class` for objects with no invariants

Suggestion 2: Use a `char []` instead of `std::string` for lines

Not the answer you're looking for? Browse other questions tagged
c++
performance
beginner
object-oriented
boost
or ask your own question.