6
\$\begingroup\$

Description

A regex-based method of validating ISBN-10 as strings. Can be digits (or 'X' at the end) only, or with dashes, according to those of English-speaking publications (which are defined by the regular expressions contained in the code below).

The last digit, a check digit (unless it is an 'X'), is calculated in the following manner. Multiply each digit, starting from the leftmost digit, by a weight, and then sum the results. The check digit should be such that this sum is divisible by 11. The weight starts at 1, and increases by 1 for each digit. For example, consider the ISBN 0-306-40615-2. The sum is calculated as follows:

sum = 0*1 + 3*2 + 0*3 + 6*4 + 4*5 + 0*6 + 6*7 + 1*8 + 5*9 + 2*10 = 165
165 mod 11 = 0 // the check digit, 2, is valid.

I am a .NET developer dabbling in C++ for fun. With this exercise, I have tried to hide certain things from anything outside the namespace, by creating an anonymous namespace inside it. Essentially, I have tried to achieve what the private keyword does in C#, in C++ semantics. I am curious if I have done anything "un C++ like" in this code. I envisage this code could be expanded to have a function that validates ISBN-13 numbers.

Code

#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <algorithm>

namespace isbn_validation
{
    namespace // anonymous namespace
    {
        // regex expressions
        // dashless
        const std::regex isbn10_no_dashes(R"((\d{9})[\d|\X])");
        // with dashes
        const std::regex isbn10_dashes1(R"((\d{1})\-(\d{5})\-(\d{3})\-[\d|\X])");
        const std::regex isbn10_dashes2(R"((\d{1})\-(\d{3})\-(\d{5})\-[\d|\X])");
        const std::regex isbn10_dashes3(R"((\d{1})\-(\d{4})\-(\d{4})\-[\d|\X])");
        const std::regex isbn10_dashes4(R"((\d{1})\-(\d{5})\-(\d{3})\-[\d|\X])");
        const std::regex isbn10_dashes5(R"((\d{2})\-(\d{5})\-(\d{2})\-[\d|\X])");
        const std::regex isbn10_dashes6(R"((\d{1})\-(\d{6})\-(\d{2})\-[\d|\X])");
        const std::regex isbn10_dashes7(R"((\d{1})\-(\d{7})\-(\d{1})\-[\d|\X])");

        bool isbn10_check_digit_valid(std::string isbn10)
        {
            auto valid = false;

            // split it
            std::vector<char> split(isbn10.begin(), isbn10.end());

            // if the very last character is an 'X', don't bother with it
            if (split[9] == 'X')
            {
                return true;
            }

            // all digits
            // validate the last digit (check digit)
            int digit_sum = 0;
            int digit_index = 1;
            for (std::vector<char>::iterator it = split.begin(); it != split.end(); ++it)
            {
                digit_sum = digit_sum + ((*it - '0')*digit_index);

                digit_index++;
            }
            valid = !(digit_sum%11);

            return valid;
        }
    }

    bool valid_isbn10(std::string isbn)
    {
        // can take ISBN-10, with or without dashes
        auto valid = false;

        // check if it is a valid ISBN-10 without dashes
        if (std::regex_match(isbn, isbn10_no_dashes))
        {
            // validate the check digit
            valid = isbn10_check_digit_valid(isbn);
        }

        // check if it is a valid ISBN-10 with dashes
        if (std::regex_match(isbn, isbn10_dashes1) || std::regex_match(isbn, isbn10_dashes2) || std::regex_match(isbn, isbn10_dashes3) ||
            std::regex_match(isbn, isbn10_dashes4) || std::regex_match(isbn, isbn10_dashes5) || std::regex_match(isbn, isbn10_dashes6) || std::regex_match(isbn, isbn10_dashes7))
        {
            // remove the dashes
            isbn.erase(std::remove(isbn.begin(), isbn.end(), '-'), isbn.end());

            // validate the check digit
            valid = isbn10_check_digit_valid(isbn);
        }

        return valid;
    }
}
\$\endgroup\$
1
  • \$\begingroup\$ Are isbn10_dashes1 and isbn10_dashes4 identical? Maybe I got this wrong! But the mere difficulty in checking it comes from the redundancy in the regular expressions. You should at least consider whether to construct the strings from a template string by replacing the numeric values. That would emphasize the differences and make them easier verifiable. (As always, the flip side of abstraction is that spelling it out is trivial which has its merits as well.) \$\endgroup\$ Commented Nov 2, 2020 at 13:02

2 Answers 2

6
\$\begingroup\$

When you're comparing against 8 different patterns, and then simply removing - for 7 of those validations, why not just remove - initially, and then validate against the only remaining pattern.

Another thing to note is, in the patterns, at the end; you have a character set: [\d|\X]. This actually will match one of:

  • a digit
  • literal |
  • literal X characters (you don't need \X though).

This should instead be:

\d{9}[\dX]

A general outline of how the code should work:

  • Remove all - from given string
  • Check if length of string is exactly \$ 10 \$
  • Validate against the pattern rewritten above
  • Check if last character is X
  • Validate individual digit sum and divisibility.
\$\endgroup\$
3
  • 1
    \$\begingroup\$ I have no insight on the standard, but is it desirable or even correct to accept a string with 9+1 digits and an arbitrary number of dashes at any positions? \$\endgroup\$ Commented Nov 1, 2020 at 12:41
  • \$\begingroup\$ @BlameTheBits According to a cursory reading on wikipedia, it seems like different countries/publishers use different dash formatting, but generally splitting into 4-5 parts \$\endgroup\$
    – hjpotter92
    Commented Nov 1, 2020 at 12:57
  • \$\begingroup\$ They are split into several parts. I have considered only English publication ISNB formats in that regard. These ISBNs have specific digits that are valid for these parts, but I have not taken that into account, as that would have meant creating dozens more regexes. \$\endgroup\$
    – Al2110
    Commented Nov 1, 2020 at 21:52
5
\$\begingroup\$

hjpotter92 starts with the right idea:
Unify and simplify.
You just should go much further.

Also, don't give up if the last digit is X, that's a perfectly fine digit for your code.

You aren't (or shouldn't) be interested in all the dashes, so ignore them.
Could you even know that you have all the used patterns?

Take a closer look at wikipedia on ISBN10, and you will find that the pattern is completely regular. Every digit is multiplied by its position, even the last (which might be 10, represented by X).

As there is no need to modify the string, you can use a std::string_view, giving any caller maximum convenience and efficiency.
Yes, it only came with C++17, if your library doesn't supply it either use a free implementation or at least fall back to std::string const& to hopefully avoid copies.

Avoiding needless allocations (even though small-object-optimization probably saves your bacon for the string-argument to your helper-function) is a very C++ thing to do.

Not that it can fix your creation of a perfectly superfluous std::vector.

And now you can mark it noexcept and constexpr, giving the callers assurance it will always succeed and can even be done at compile-time.

By the way, for-range is a thing in C++ since C++11, and much preferred over manual iterator-juggling.

And if you do the same operation multiple times with different data, consider storing it in an array and looping, like you would also do in C#.

constexpr bool validate_isbn10(std::string_view s) noexcept {
    unsigned num = 0; // unsigned so overflow is harmless
    std::size_t found = 0; // std::size_t so cannot overflow at all
    for (auto c : s)
        if (c >= '0' && c <= '9')
            num += ++found * (c - '0');
        else if (c == 'X' && found == 9)
            num += ++found * 10;
        else if (c != '-')
            return false;
    return found == 10 && num % 11 == 0;
}
\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.