90

Moving some code from Python to C++.

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }

Thinking maps might be overkill? What would you use?

7
  • 21
    Why map might be overkill?
    – ForEveR
    Commented Mar 1, 2013 at 5:57
  • What are you planning on doing with them?
    – bchurchill
    Commented Mar 1, 2013 at 5:57
  • Can I define a map as a constant with the base values somehow in the class definiton?
    – y2k
    Commented Mar 1, 2013 at 5:58
  • 1
    @WHOEVENCARES Why couldn't you?
    – Rapptz
    Commented Mar 1, 2013 at 5:58
  • 1
    @user1881400 actually it's genetics.:D
    – Serid
    Commented May 6, 2018 at 16:22

9 Answers 9

155

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};
0
28

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

1
  • @WHOEVENCARES I am not sure if it would be faster than the pure conditional Benjamin Lindley proposed. However, at least the subtraction part of my function could be performed in a vector register for several characters in parallel.
    – jogojapan
    Commented Mar 1, 2013 at 6:55
24

While using a std::map is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong-typing:

// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class". 
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
    T, // Compliment of A
    A, // Compliment of T
    G, // Compliment of C
    C, // Compliment of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and its efficient.

Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.

6
  • But this still needs linear lookup time for char to base pair conversion
    – perreal
    Commented Mar 1, 2013 at 6:31
  • @perreal If by "Linear Lookup Time" you mean O(1), then yes, this whole premise is O(1) and is also maximally compressed for very little effort.
    – user1357649
    Commented Mar 1, 2013 at 6:33
  • @perreal Could you please explain how this is linear time? Genuinely interested.
    – Rapptz
    Commented Mar 1, 2013 at 6:33
  • @Rapptz, it is smt like BasePair CharToPair(char p) { for (i : 0..3) { if (BasePairToChar(i) == p) return BasePair(i); } }?
    – perreal
    Commented Mar 1, 2013 at 6:35
  • 1
    @ThePhD, I fail to understand your char to pair conversion.
    – perreal
    Commented Mar 1, 2013 at 6:39
13

Until I was really concerned about performance, I would use a function, that takes a base and returns its match:

char base_pair(char base)
{
    switch(base) {
        case 'T': return 'A';
        ... etc
        default: // handle error
    }
}

If I was concerned about performance, I would define a base as one fourth of a byte. 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the complement.

12

Here's the map solution:

#include <iostream>
#include <map>

typedef std::map<char, char> BasePairMap;

int main()
{
    BasePairMap m;
    m['A'] = 'T';
    m['T'] = 'A';
    m['C'] = 'G';
    m['G'] = 'C';

    std::cout << "A:" << m['A'] << std::endl;
    std::cout << "T:" << m['T'] << std::endl;
    std::cout << "C:" << m['C'] << std::endl;
    std::cout << "G:" << m['G'] << std::endl;

    return 0;
}
2
  • 9
    Unrelated but you are overflushing the stream.
    – Rapptz
    Commented Mar 1, 2013 at 6:22
  • 6
    @Rapptz: I wondered what you meant with 'overflushing', until i found out that std::endl does both a \n (newline) and std::flush.
    – trapicki
    Commented Jul 16, 2020 at 14:05
8

A table out of char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */
16
  • 2
    That's a very strange map.
    – Rapptz
    Commented Mar 1, 2013 at 6:00
  • 2
    This is an incredibly wasteful map. But, it gets... the job done... ?
    – user1357649
    Commented Mar 1, 2013 at 6:05
  • 4
    @ThePhD, incredibly? uses 256 bytes. What is the overhead of std::map? what is the lookup time?
    – perreal
    Commented Mar 1, 2013 at 6:07
  • 3
    You do realize this could be accomplished with a simple enumeration and a function? If you wanted to get slightly more fancy, you could use that enumeration into a table of other enums and use only 4 bytes. If you really wanted to grind some optomization gears, you'd pack ATCG into 2 bits (like what most sensible binary representations of DNA do). char is not a data type for DNA structures (there's a reason DNA-listing files are 4x bigger in ASCII).
    – user1357649
    Commented Mar 1, 2013 at 6:10
  • 6
    @spin_eight, this is a direct access table, lookup is O(1)
    – perreal
    Commented Mar 1, 2013 at 6:12
2

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:

static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]

;-)

1
  • I find this solution classy, even if not the fastest :D
    – PoneyUHC
    Commented Jul 6, 2023 at 12:56
2

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
static const Base base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

const Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: abort ();
    }
}

int
main (int argc, char *args) 
{
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
};

base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.

3
  • But this solution assumes the input comes as numbers 0, 1, 2, 3, rather than ASCII characters. You'd still have to perform a mapping of the input, correct?
    – jogojapan
    Commented Mar 1, 2013 at 6:53
  • 2
    Good point. I fixed it with the fastest mapping I could think of, and a prettier but possibly slower switch-based mapping. Commented Mar 1, 2013 at 7:32
  • 5
    Good tip to know if you are dealing with C++ The "smallest space solution" is 49 lines long ;) Commented Jul 13, 2019 at 8:54
0

crusty old 'C/Asm' coder here...

Take a look at the binary representation for ASCII symbols A,T,C and G :

  • A = 0100 0001
  • C = 0100 0011
  • T = 0101 0100
  • G = 0100 0111

By using c=(c>>1)&3; we can convert any symbol into a number 0..3 and this will work regardless of whether the input is provided in lower or upper case.

Now look at the pairs AT and CG:

  • A = 00 C = 01
  • T = 10 G = 11

To convert any of these symbols to its partner, you simply flip the high bit (XOR 2) ... and this works for converting them back again.

This is far simpler than any lookup, map, table or math.

Additional benefits

Because each symbol is only 2 bits wide, you can store 4 symbols in an byte. You can convert all four symbols using XOR 10101010b; (0xCC in hex)

If you're on a 64bit architecture, you can process 32 symbols at a time, with a single bitwise XOR. So, storage and processing can be super efficient.

Similarly, input ASCII can be converted 8 characters at a time using c=(c>>1)& 0x0303030303030303; over 64bits of char data.

Output

When you need to output ASCII again, this is likely AFTER all your processing is done. Simply define const Symbol="ACTG"; and lookup character Symbol[value&3]; for every 2 bits of your sequence. Value>>=2; will discard each character and present the next.

This indirect is still way faster than any map or dictionary. It doesn't even need to access memory, as "ACTG" fits in a 32bit register.

Note: None of these features require any conditionals, they are branchless :)

Not the answer you're looking for? Browse other questions tagged or ask your own question.