1

There seems to be an issue with inserting into the hashtable. I create about 8 threads, and in each thread I do the following code. Each thread receives a char[] array. The job of each thread is to tokenize this array (look for spaces). Once a token is found, I need to add it to the hashtable if it doesn't exist. If it does exist, then I need to add 1 to the current value of that token (the key).

Questions you might ask:

Why not convert from char[] to String?

I tried this, and since strings are immutable, I eventually ran out of memory (I am processing a 10g file), or I spend too long garbage collecting. With Character[], I am able to reuse the same variable and not take up extra space in memory.

What is the issue?

When I am done processing the entire file, I run the code:

for (Entry<Character [], Integer> e : wordCountMap.entrySet()) {
    System.out.println(Arrays.toString(e.getKey()) + " = " + e.getValue());
}

in my main function. What I get as a result is less than 100 key/value pairs. I know that there should be around 20,000. There somehow seems to be some overlap.

    Character [] charArray = new Character[8];
    for (i = 0; i < newbyte.length; i++) { //newbyte is a char[] from main
        if (newbyte[i] != ' ') {
            charArray[counter] = newbyte[i];
            counter++;
        }
        else { 
            check = wordCountMap.putIfAbsent(charArray, 1);
            if (check != null) { 
                wordCountMap.put(charArray, wordCountMap.get(charArray) + 1);
            }
            for (j = 0; j < counter; j++) {
                charArray[j] = null;
            }//Null out the array

ConcurrentMap<Character [], Integer> wordCountMap //this is the definition in main

As some of the comments below have suggested, I am actually passing the reference to charArray when the line:

wordCountMap.put(charArray, wordCountMap.get(charArray) + 1);

is executed. So my question is, how do I pass the value? It actually makes perfect sense now, as in the end there are about 320 key/value pairs- 8 threads, 40 loops (Each thread gets 250/8 MBs per iteration).

9
  • Also, each thread receives a char[] because it was the easiest to convert from a byte []. The input is actually all numbers, so an int array would likely work better, but that would require a 2nd conversion (unless there is a way to convert directly from a byte[] to an int[]).
    – DanGordon
    Commented Mar 26, 2014 at 22:37
  • You can edit your question if you need to add more details. Commented Mar 26, 2014 at 22:39
  • Don't understand why you expect there to be more than 8 entries in that map if there are 8 threads running. Each thread has a single array that is uses as map key? It's going to just keep updating the same value.
    – Affe
    Commented Mar 26, 2014 at 22:41
  • 1
    But when you put into the map, since Character[] is a reference, it's storing a reference into the map as the key, I think? It doesn't make a copy, as far as I know. So if you create a map entry with a reference to one of your charArrays, and then you change the charArray, you're changing the key in the map, which seems likely to get everything discombobulated. I'm not sure I'm right without checking the source, but it doesn't look like this would work.
    – ajb
    Commented Mar 26, 2014 at 22:42
  • 1
    Your problem is that you want to check if the entity exists in the hashmap without creating a new one. Since creating the new one will cause memory usage which will have to be GC-ed. However, your "checker" cannot be put into the HashMap as you are doing here. There is another problem. get() may not give you the correct value. According to docs, it is the value of the last "completed" put. So, you will get less count because two calls might give you the same answer. I will try to write a complete answer if I have time.
    – Chip
    Commented Mar 27, 2014 at 6:41

4 Answers 4

1

I do not believe this is achievable without synchronizing both the get() and put() operations.

As per the ConcurrentHashMap docs

Retrieval operations (including get) generally do not block, so may overlap with update operations (including put and remove). Retrievals reflect the results of the most recently completed update operations holding upon their onset.

This means if two of your thread encounter the same counter simultaneously, the get() will return the same value ( say 2 ) and both of them will insert 2+1=3. So, the number of tokens will be undercounted - i.e be 3 instead of 4.

To be consistent, you need to synchronize before the get() operation, which will greatly reduce the benefit of multithreading.

Here's how you would do that if you wanted to:

class Key {
   char[] buffer = new char[8];
   Key copy() {
       Key copy = new Key();
       for ( int i =0; i < 8; i++) {
          copy.buffer[i] = this.buffer[i];        
       }
   }
   public int hashCode() {
      return Arrays.hashCode(buffer);
   }
   public boolean equals(Object obj) {
      if ( obj instanceof Key) {
        return Arrays.equals(((Key) obj).buffer, this.buffer); 
      }
      return false;
   }
}
//YOur code modified:
Key checker = new Key();
for (i = 0; i < newbyte.length; i++) { //newbyte is a char[] from main
    if (newbyte[i] != ' ') {
        checker.buffer[counter] = newbyte[i];
        counter++;
    }
    else { 
            synchronized (wordCountMap) {
               Integer value = workCountMap.get(checker);
               if ( value == null ) {
                  workCountMap.put(checker.copy(), 1);    
               } else {
                  wordCountMap.put(checker.copy(), value + 1);
               }
            }
        for (j = 0; j < counter; j++) {
            checker.buffer[j] = null;
        }//Null out the array
   }

This will solve your memory problem, because you do a new() (via copy()) only if you have to insert into the table. So, memory used is the minimum that you need (not counting the i,j, checker, etc.). However, you lose almost all parallelism.

If I were you, I would break up the file into a number of fragments and process each fragment in a separate thread. Each thread can maintain its own hashmap. At the end of the whole file you will have n hashtables ( n being the number of threads ). You can then merge the n hashmap. The memory required would be n times the size of your previous hashmap.

Let me know if you want more detail on this approach and I will try to help.

0

When you use the array as the key, it is the reference to the array itself that gets used, not the contents. So changing the contents isn't going to make more entries in the map, it's just going to keep updating the same value. Consider the simple program:

public static void main(String[] args) throws Exception {
    Character[] charArray = new Character[8];
    charArray[1] = 'A';
    Set<Character[]> set = new HashSet<Character[]>();
    set.add(charArray);
    charArray[1] = 'B';
    System.out.println(set.contains(charArray));
}

The output of that is true because charArray is still the same Array, its contents aren't taken into consideration.

If you want to be able to get the content back later, at the end, such as in:

for (Entry<Character [], Integer> e : wordCountMap.entrySet()) {
    System.out.println(Arrays.toString(e.getKey()) + " = " + e.getValue());
}

You have to keep it somewhere! If it's too big for memory, you need to allocate more memory or use some kind of external storage. Maybe key the map on MD5s of the Strings and keep an NoSQL database on disk of MD5->Original String so you can get them back later?
In your code you wiped out the data as you went, but expected it to still be there at the end!

4
  • Each individual key is only 5 bytes long, and once its in the table I don't need it anymore elsewhere. How would I go about making a copy of it to be able to pass to putIfAbsent?
    – DanGordon
    Commented Mar 26, 2014 at 23:06
  • Arrays.copyOf(charArray, charArray.length); ? If you ran out of memory using Strings it doesn't seem like you would really be saving so very much to not also run out of memory like that, but who knows!
    – Affe
    Commented Mar 26, 2014 at 23:43
  • If I do the code: check = wordCountMap.putIfAbsent(Arrays.copyOf(charArray, counter), 1); Then I have a different problem. Every putIfAbsent call returns null for some reason. But now my table has duplicates (at least thats what it looks like)
    – DanGordon
    Commented Mar 27, 2014 at 3:16
  • Well yes, it would. Now every array that has the same content is a different array, so they are different keys. Sounds like you really ought to use a String or make a POJO that contains the array to use as key and check the actual contents for equality.
    – Affe
    Commented Mar 27, 2014 at 17:07
0

I think that instead of using a Character[] as the map key, you will need to define your own class that represents an 8-character array (*). You'll need to redefine equals() and hashCode() in that class; define equals() so that it returns true if all 8 characters are the same, and define hashCode() as some value that depends on those 8 characters. You can't redefine equals() or hashCode() for an array; that's why you'll need to define your own class. That class will use a char[] or Character[] internally.

The class should also have a copy or clone method of some sort, or a copy constructor, so that you can create a new object whose data (the 8 characters) is the same as the existing object.

Now, instead of this:

check = wordCountMap.putIfAbsent(charArray, 1);
if (check != null) { 
    wordCountMap.put(charArray, wordCountMap.get(charArray) + 1);
}

you will need to make sure you use a copy when putting a new key into the map. Using putIfAbsent as you have above would put a reference to your local variable in the map, which is wrong since your local variable could change. This is wrong too:

check = wordCountMap.putIfAbsent(new CharArray(charArray), 1);

where new CharArray(charArray) makes a copy of the existing array--that's what I mean by a "copy constructor". (I'm assuming CharArray is the name you've given your new class.) This is wrong because you would be creating new objects in the case when you don't need a new object, which you're trying to avoid. So probably something like

Integer existing = wordCountMap.get(charArray);
if (existing == null) {
    wordCountMap.put(new CharArray(charArray), 1);
} else {
    wordCountMap.put(charArray, existing + 1);
}

That should create a new CharArray only when needed, and it won't put into the map a reference to the CharArray that you plan to keep changing. You'll probably have to add some locking to the above to prevent race conditions.

(*) After looking over your post again, I'm not sure if an 8-character array is what you really want, but you did say new Character[8] in your code. The technique should work for any buffer size, though. You could set up your class so that the instances that could be mutable have a larger buffer, and the instances that you put in the hash map only keep as many characters as needed.

0

There are at least three problems: you modify the keys of the map, the hashCode of an array is reference-based, and there is a race condition here:

check = wordCountMap.putIfAbsent(charArray, 1);
if (check != null) { 
    wordCountMap.put(charArray, wordCountMap.get(charArray) + 1);
}

Other answers address the hashing, so I'll tackle the race condition. putIfAbsent is atomic, but put(increment(get())) is not. You can fix this by using an AtomicInteger instead of a normal Integer:

AtomicInteger check = wordCountMap.putIfAbsent(key.copy(), new AtomicInteger(1));
if (check != null) { 
    check.incrementAndGet();
}

There are allocations for the key and value here, but they will be easily garbage collected if the key is already present. If you wanted to avoid them, you could incur the overhead of an additional get(), or you could use one of the other suggestions in @Chip's answer.

Not the answer you're looking for? Browse other questions tagged or ask your own question.