Optimizing month abbreviation to number

Question

(This comes from a real-world requirement.)

The idea is easy - if you have the three-letter abbreviated name of a month, say "Jan", get the numeric month number (for "Jan", 1). Sure, sure, pick a language with tables or hashes and this is simple and straightforward - but for the purposes of this exercise, we care about optimizing for number of characters compared.

For the naive approach, doing this in whatever language:

if (month == "Jan") then m=1;
else if (month == "Feb") then m=2;
else if (month == "Mar") then m=3;
$\vdots$

if we make the obvious assumption that a string comparison works left to right and stops at the first mismatch, then when we run this for each day of the year (that is, for all $365$ days: "Jan 1", "Jan 2", ..., "Dec 30", "Dec 31"), this will end up requiring 3328 comparisons, or 9.12 character comparisons on average. Surely we can do better!

So the challenge is: write the lowest cost abbreviation-to-number converter you can.
How low can you go?

My current best is $\bf{1423}$ for an average of $\bf{3.90}$ character comparisons per call.

Current leading answer (best Verified submission) without hash/table use is by @ffao
(also) with a score of $\bf{1423}$ for an average of $\bf{3.90}$ character comparisons per call.
— this is believed to be optimal.

Current Accepted answer (best Verified submission) using hash/table as per rules is by
@Stewart Becker with a score of $\bf{487}$ for an average of $\bf{1.33}$ character comparisons per call.

The specifics:

Code gets executed in-order as written; optimizations done by a compiler don't count.
Relative comparisons not allowed - you get equal or not-equal only. (This is an actual limitation of the real-life application.)
Character arithmetic/conversion to integers is not allowed. (Also an actual limitation of the real-life application.) You have to work with the characters as themselves, sorry.
String compares are done left to right, and stops either at the first mismatch or when all characters have been compared and matched; one comparison is charged per character actually tested. (Comparing xyz to abc costs $1$ and is false; to xya, $3$ and false; to xyz, $3$ and true.)
You can use substrings or individual characters (C string-array indexing, Python slicing, perl/PHP substr(), etc.) - comparing substrings or single characters works the same as any other string compare.
For our purposes, a table/hash lookup counts as a single string compare of your key against a (probably theoretical) index.
Ignore leap years. There are 365 days in a year, and February has 28 days.
Your code needs to take an input that is a month abbreviation (exactly one of these: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) and return the corresponding month number (1 through 12).
Your code will be run 365 times, corresponding with each of the 365 days in a year, not necessarily in order. That is, 31 times it will be run with the input "Jan", 28 times with "Feb", 31 times with "Mar", 30 with "Apr", 31 with "May", 30 with "Jun", 31 with "Jul", 31 with "Aug", 30 with "Sep", 31 with "Oct", 30 with "Nov", and 31 with "Dec", in some random ordering.
Your score is the total number of characters your code compares for all 365 invocations.
Please provide your score and average comparison count with your answer!
I'll verify your score.
Best score gets the $\color{green}{\checkmark \small\text{Accept}}$!

______
^{Oh, and yes - this is a puzzle. Finding how to best do this is actually pretty interesting, and optimizing it even more so!}

@NL628 Any way you want to do it. Only the comparisons are counted, so any algorithm or technique you want to use to minimize those, go for it. — Rubio, Commented May 10, 2018 at 23:27
@Rubio - have you tried asking this on codegolf.SE? It might get you more eyes on it. — Phylyp, Commented May 11, 2018 at 3:08
I'm not on codegolf, but my impression (apparently misfounded?) is that their general goal is smallest code, not most efficient. In any case, I thought there was enough of an optimization puzzle here to make it interesting to our audience. (And I already had the solution, which has now been reasonably proven optimal, so I don't know that more eyes will add anything at this point :) But thanks for the thought!) — Rubio, Commented May 11, 2018 at 3:31
@Phylyp That's true, but shouldn't buy much more since you have to test one of the characters to distinguish them anyways. Maybe it isn't so surprising considering how many letters could go in two spots (although English is denser). — theREALyumdub, Commented May 11, 2018 at 4:02
@Rubio I believe they have a fastest-code tag along with a code-golf tag and also an efficient-solution tag, if I'm not mistaken :D — NL628, Commented May 11, 2018 at 4:54

Rubio · Accepted Answer · 2018-05-14 15:04:54Z

2

You can do much better than the posted answers by using hash tables.

Table N:
   a -> 1 // "Jan"
   u -> 6 // "Jun"

Table R:
   a -> 3 // "Mar"
   p -> 4 // "Apr"

Table X:
  b -> 2 // "Feb"
  c -> 12 // "Dec"
  g -> 8 // "Aug"
  l -> 7 // "Jul"
  n -> Table N
  p -> 9 // "Sep"
  r -> Table R  
  t -> 10 // "Oct"
  v -> 11 // "Nov"
  y -> 5 // "May"

Algorithm:

Look up the third letter in Table X. This uses a key which is a string of length 1 and by the rules this counts as 1 comparison. If you get a number then return it. Otherwise you get another table. Look up the second letter in this table to get the number to return. Note that most days take just 1 comparison - this is why we choose the third letter as it is the most diverse. Only 122 days (i.e. days in Jan, Mar, Apr and Jun) require a second comparison - for a total score of 487 comparisons or an average of 1.33.

Additional note:

Arguably, the "If you get a letter" test is another comparison - just not the kind that is being scored. If we also score this as 1 comparison then, by noting we have to do this every time, the total score becomes 852 or an average of 2.33.

edited May 14, 2018 at 15:04

Rubio♦

41.7k6 gold badges90 silver badges242 bronze badges

answered May 11, 2018 at 11:55

Stewart Becker

1,1138 silver badges16 bronze badges

$\begingroup$ 2nd line n -> 1... wouldn't that be a -> 1? $\endgroup$
– Phylyp
Commented May 11, 2018 at 12:37
$\begingroup$ So - I think this is a solution for the way I (mis)wrote the rules for hash lookups. If you can do this, then just building a table mapping each of the 12 strings to their values would let you solve the problem trivially with 3x365=1095 compares for a better score than mine, but of course if that was permissible then this puzzle would be kinda pointless. I probably should just exclude table/hash lookups, as the intro text implies; explaining what I intended for the cost of a lookup is probably not worth the trouble. +1 though for use of rules-as-unintentionally-written. $\endgroup$
– Rubio ♦
Commented May 11, 2018 at 21:25
$\begingroup$ There are a couple errors in your submission (Table N, r→1 should be a→1; Table X, c→3 should be c→12). Also your comparison count is wrong: Verified score is $487 (1.33)$. $\endgroup$
– Rubio ♦
Commented May 11, 2018 at 21:39
$\begingroup$ I've edited adding fixes $\endgroup$
– Stewart Becker
Commented May 14, 2018 at 9:05

Add a comment |

2012rcampion · Accepted Answer · 2018-05-11 03:17:58Z

First off, we should only test single characters. Any test gives us exactly one bit of information, no matter the number of comparisons, so comparing more than once at a time doesn't help.

If we could test membership in arbitrary sets, Huffman coding provides a lower bound of 1336 comparisons (3.66 average).

However, we cannot test for membership in arbitrary sets. In fact, the largest testable subset we can form is abbr[2] == 'a': Jan (31), Mar (31), and May (31), for 93 altogether. (Multi-character comparison again doesn't help, since it can only test for membership in more specific, i.e. smaller, sets.)

We can use this fact to guide our search for optimum trees. Starting with the list of all testable subsets, at each step we take one of those subsets and form all the decision trees with that subset on the left and the remainder on the right. Do this for each subset, and compute the remainder of the tree recursively, and you end up with a list of all possible decision trees.

Note that keeping only the best-scoring decision (sub)trees at each step prunes the search space considerably.

I find that the minimum trees have a score of:

1423/365 (3.8986), the same as Rubio and ffao.

Although the subtrees can vary, they all follow the same basic pattern:

First test for the abbr[2] == 'a' set (Jan (31), Mar (31), May (31)), then the abbr[2] == 'u' set (Jun (30), Jul (31), Aug (31)), then abbr[2] == 'e' (Feb (28), Sep (30), Dec (31)), and finally, if none of those match, test for Oct (31), otherwise Apr (30) or Nov (30).
Note that each time we take the largest distinguishable subset to turn into a subtree. This pattern continues in the subtrees. In the first, Jan, May, and Mar can be tested in any order, since they all have 31 days. But in the next, Either Jul or Aug (31) are tested first, never Jun (30). Each of the bottom two subtrees have exactly one month with 31 days, so their order is fixed.

Here's a picture to (hopefully) make it clear what I mean when I'm talking about trees:

ffao · Accepted Answer · 2018-05-11 00:41:54Z

If I understand the problem correctly, a good solution should be

To divide the months into groups of 3, then determine in which group the month lies with our first comparisons.

This gets a total of

1423 comparisons, or an average of 3.90 per call.

which is the same as Rubio's. I doubt we can improve this any further.

Code:

def get_month(month):
     if month[1]=='a': # 365 comparisons
         if month[0] == 'J': # 93 comparisons
             return 1
         if month[2] == 'r': # 62 comparisons
             return 3
         return 5
 
     if month[1] == 'u': # 272 comparisons
         if month[0] == 'A': # 92 comparisons
             return 8
         if month[2] == 'n': # 61 comparisons
             return 6
         return 7
 
     if month[1] == 'e': # 180 comparisons
         if month[2] == 'c': # 89 comparisons
             return 12
         if month[2] == 'p': # 58 comparisons
             return 9
         return 2
 
     if month[2] == 't': # 91 comparisons
         return 10
 
     if month[2] == 'r': # 60 comparisons
         return 4
 
     return 11

Oops, I actually missed a compare in my harness - so my result was off by a bit. Looking at your code I was surprised at your score, cuz your code is the same as mine! :) — Rubio, Commented May 11, 2018 at 0:37

TheLethalCipher · Accepted Answer · 2018-05-11 14:27:30Z

We could try to prove optimality with clever observations, but the small search space also allows us to write some simple code to find the optimum. The following C++ program finds the optimum number of comparisons and all solutions that obtain it, using the naïve definition that two solutions are distinct if they have a distinct branching tree of comparisons. It confirms the optimum

$1423$

comparisons found by @ffao and finds $20160$ optimum solutions. For demonstration, the solution found by @ffao is found in the list of optimum solutions. (Apologies for the lack of syntax highlighting.)

#include <algorithm>
#include <array>
#include <iostream>
#include <map>
#include <vector>

const int LEN = 3;
const std::array<std::string,12> MONTHS
  {"JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC"};
const std::array<int,12> DAYS {31,28,31,30,31,30,31,31,30,31,30,31};

typedef std::vector<int> month_set; // set of month indices
typedef std::pair<int,char> comp; // index, character comparison
// number of comparisons, list of comparisons that result in this number
typedef std::pair<int,std::vector<comp>> sol;
typedef std::vector<std::string> all_sols; // list of solution strings

std::map<month_set,sol> opt_sol; // map of optimum solutions
std::map<month_set,all_sols> all_sols_map; // map of list of solution strings

// gets all distinct comparisons from characters in set of months
std::vector<comp> get_comps(month_set ms) {
  std::vector<comp> v;
  for (auto m : ms)
    for (int i = 0; i < LEN; i++)
      v.push_back({i, MONTHS[m][i]});
  // remove duplicates
  std::sort(v.begin(), v.end());
  v.erase(std::unique(v.begin(), v.end()), v.end());
  return v;
}

// splits set of months according to comparison
std::pair<month_set,month_set> split_months(month_set ms, comp c) {
  month_set f, t;
  for (auto m : ms)
    if (MONTHS[m][c.first] == c.second)
      t.push_back(m);
    else
      f.push_back(m);
  return std::make_pair(f, t);
}

// updates the stored optimum solution according to a score and a comparison
int update_opt(month_set ms, int score, comp c) {
  auto it = opt_sol.find(ms);
  int opt_score = score;
  if (it == opt_sol.end() || score < (opt_score = it->second.first))
    opt_sol[ms] = std::make_pair(score, std::vector<comp>(1, c));
  else if (score == opt_score)
    opt_sol[ms].second.push_back(c); // add comparison to list
  return opt_score;
}

// sum the number of days in a set of months
int day_total(month_set ms) {
  int total = 0;
  for (auto m : ms)
    total += DAYS[m];
  return total;
}

// finds the optimum solutions for a set of months using top-down DP
int find_opt(month_set ms) {
  if (ms.size() == 1)
    return 0;
  auto it = opt_sol.find(ms);
  if (it != opt_sol.end())
    return it->second.first;
  auto comps = get_comps(ms);
  const int days = day_total(ms);
  int min_score = 0; // will be replaced
  for (auto c : comps) {
    auto p = split_months(ms, c);
    if (p.first.size() && p.second.size())
      min_score = update_opt(ms, days+find_opt(p.first)+find_opt(p.second), c);
  }
  return min_score;
}

std::string comp_to_string(comp c) {
  return '(' + std::to_string(c.first) + ',' + c.second + ')';
}

// finds all solution strings for a set of months using top-down DP
// format: (comp)[false branch][true branch]
// example: find_all_sols({5, 6}) == {"(2,L)[JUN][JUL]", "(2,N)[JUL][JUN]"}
all_sols find_all_sols(month_set ms) {
  if (ms.size() == 1)
    return {MONTHS[ms[0]]};
  auto it = all_sols_map.find(ms);
  if (it != all_sols_map.end())
    return it->second;
  all_sols v;
  for (auto c : opt_sol.find(ms)->second.second) {
    auto p = split_months(ms, c);
    auto str = comp_to_string(c);
    for (auto s : find_all_sols(p.first))
      for (auto t : find_all_sols(p.second))
        v.push_back(str + '[' + s + "][" + t + ']');
  }
  return (all_sols_map[ms] = v);
}

int main() {
  month_set ms;
  for (size_t i = 0; i < MONTHS.size(); i++)
    ms.push_back(i);
  std::cout << "Comparisons: " << find_opt(ms) << std::endl; // 1423
  std::cout << "Solutions: " << find_all_sols(ms).size() << std::endl; // 20160
  // solution on SE
  const std::string SE_SOL = "(1,A)"
                               "[(1,U)"
                                 "[(1,E)"
                                   "[(2,T)"
                                     "[(2,R)"
                                       "[NOV]"
                                       "[APR]]"
                                     "[OCT]]"
                                   "[(2,C)"
                                     "[(2,P)"
                                       "[FEB]"
                                       "[SEP]]"
                                     "[DEC]]]"
                                 "[(0,A)"
                                   "[(2,N)"
                                     "[JUL]"
                                     "[JUN]]"
                                   "[AUG]]]"
                               "[(0,J)"
                                 "[(2,R)"
                                   "[MAY]"
                                   "[MAR]]"
                                 "[JAN]]";
  for (auto s : find_all_sols(ms))
    if (s == SE_SOL)
      std::cout << s << std::endl; // found
  return 0;
}

Try it online!

I took the liberty of making your code available to run online via tio.run, so curious folk can see it run for themselves. — Rubio, Commented May 11, 2018 at 3:47
@TheLethalCoder FWIW, syntax highlighting isn't enabled on Puzzling.SE. (I approved it nonetheless, in case it will ever be enabled in the future.) — Glorfindel, Commented May 11, 2018 at 14:28

floraline · Accepted Answer · 2018-05-10 23:41:23Z

Here's hoping I understand the puzzle correctly. If the example provided is the #1 naive solution, mine is probably the #2. I beat the naive solution's score of 3416, but I did not beat the asker's score of 1252.

I have posted my psuedocode on pastebin here.

This should result in a score of 1842, or an average of 5.04 comparisons per call.

This solution assumes that only valid inputs will be passed in to the function.

To summarize my code:

I compare the third character of each abbrevation because it contains the most unique values. To resolve the two ambiguous cases, I compare again by another character. I also ordered the comparisons by number of days in the month, from least to most. I chose to put the ambiguous cases first, but perhaps I could get a better score if I changed the order. Regardless, this looks like another very naive implementation. I wanted to post it as an answer just to document my efforts.

Further thoughts:

I could probably marginally improve my score by re-ordering some of the comparisons, but I suspect that the real gains to be had here are to take a different approach entirely.

$\begingroup$ Fixed test harness - confirm 1842 (5.05) $\endgroup$
– Rubio ♦
Commented May 10, 2018 at 23:53 — Rubio, Commented May 10, 2018 at 23:53

Kelvin Ng · Accepted Answer · 2018-05-11 11:30:05Z

1

Not sure if I understand this bit correctly

For our purposes, a table/hash lookup counts as a single string compare of your key against a (probably theoretical) index.

If I brute force a hash function that could fit all the month abbreviation into an unique index. That would be a cost of 3 to compute the unique index? So, 3*365 = 1095?

answered May 11, 2018 at 11:30

Kelvin Ng

111 bronze badge

$\begingroup$ You can't do character arithmetic or conversion to integers or the like (was mentioned previously in the comments on the question, and I recently updated the question to include that). It wouldn't be possible to create a useful hash function without that. I may be misunderstanding what you're asking here, though. $\endgroup$
– Rubio ♦
Commented May 11, 2018 at 21:11
$\begingroup$ Ah, so I suspect you were asking about exactly the idea I mention here which was not intended but is allowed by the rules. Yes, that method gives a Verified score of $1095 (3.00)$. $\endgroup$
– Rubio ♦
Commented May 11, 2018 at 21:52

Add a comment |

Stack Exchange Network

Optimizing month abbreviation to number

The specifics:

6 Answers 6

Not the answer you're looking for? Browse other questions tagged
computer-puzzle
optimization
or ask your own question.

Hot Network Questions

Optimizing month abbreviation to number

The specifics:

6 Answers 6

Not the answer you're looking for? Browse other questions tagged computer-puzzleoptimization or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
computer-puzzle
optimization
or ask your own question.