Extract data from poorly formatted phonebook

Question

The Problem

John keeps a backup of his old personal phone book as a text file. On each line of the file he can find the phone number (formated as +X-abc-def-ghij where X stands for one or two digits), the corresponding name between < and > and the address.

Unfortunately everything is mixed, things are not always in the same order; parts of lines are cluttered with non-alpha-numeric characters (except inside phone number and name).

Examples of John's phone book lines:

"/+1-541-754-3010 156 Alphand_St. \n"

" 133, Green, Rd. NY-56423 ;+1-541-914-3010!\n"

" +48-421-674-8974 Via Quirinal Roma\n"

Could you help John with a program that, given the lines of his phone book and a phone number num returns a string for this number : "Phone => num, Name => name, Address => adress"

Examples

s = "/+1-541-754-3010 156 Alphand_St. <J Steeve>\n 133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010!\n"

phone(s, "1-541-754-3010") should return "Phone => 1-541-754-3010, Name => J Steeve, Address => 156 Alphand St."

If more than one person has the given phone number, return "Error => Too many people: {num}"

If no one has the queried phone number, return "Error => Not found: {num}"

My solution



import re
from string import ascii_letters as letters, digits
VALID_CHARS = letters + digits + "\. -"
def phone(strng, num):
    people = strng.split("\n")
    found = False
    right_person = ""
    for person in people:
        if num in person:
            if not found:
                right_person = person  
                found = True
            else:
                 return f"Error => Too many people: {num}"
    if not right_person:
        return f"Error => Not found: {num}"
    
    no_num = right_person.replace(num, "")
    name = re.search("<.*>", no_num).group()
    messy_address = no_num.replace(name, "")
    name = name[1:-1]
    partial_clean = re.sub(f"[^{VALID_CHARS}]", " ", messy_address)
    address = " ".join(partial_clean.split())
    return f"Phone => {num}, Name => {name}, Address => {address}"

Harith · Accepted Answer · 2024-03-26 16:05:08Z

Use type-hints:

def phone(strng, num):

At first glance (I hadn't read the problem specification yet), I had incorrectly inferred from the names that the first parameter was of type str and the second of type int. That wouldn't have been the case had you annotated them so:

def phone(strng: str, num: str) -> str:
    ...

Additionally, strng is too generic, phonebook or phonebook_lines would be more self-explanatory.

The name of the function phone is quite vague. It doesn't specify what the function does, and we are missing a docstring as well.

Use an automatic code-formatter:

Whitespace is free to use. You have been too stingy with it. The indentation is all over the place. Consider using black or ruff to format it according to PEP-8.

Missing test cases:

I do not see any unit tests. I'd have expected at least a main() function to exercise its correctness.

toolic · Accepted Answer · 2024-03-26 16:37:33Z

Overview

You've done an good job:

Encapsulating code into a function
You leveraged code written by others with the imports
Used meaningful names for many of the variables

Here are some adjustments for you to consider, mainly for coding style.

Lint check

pylint identified a few issues:

Inconsistent indentation
Trailing whitespace

You can use the black program to automatically modify the code for you.

Layout

Add blank lines after the import lines:

import re
from string import ascii_letters as letters, digits

VALID_CHARS = letters + digits + "\. -"

def phone(strng, num):

Documentation

Add a docstring to the top of the file to summarize the purpose of the code.

Also add a docstring for the function. Describe its inputs and the returned value.

Naming

The function name, phone, is not very descriptive. I think a better name would be something like format_phonebook.

The function input strng is too generic. Perhaps book would be better.

Booboo · Accepted Answer · 2024-03-27 13:37:30Z

Overview

What you have done is a good start. Is there room for improvement? I think so. In the following comments I will try not to repeat what has already been mentioned before in the previous reviews.

Correct Errors

You have VALID_CHARS = letters + digits + "\. -". But this generates a warning:

SyntaxWarning: invalid escape sequence '\.'

VALID_CHARS is used within a regular expression character class as in f"[^{VALID_CHARS}]" and in that context the '.' character does not need to be escaped since it will not be interpreted as the special regex character representing any character except newline.

Then when I call your function passing as follows:

unoformatted_phonebook = '''/+1-541-754-3010 156 Alphand_St. <J Steeve>
133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010!
'''

print(phone(unoformatted_phonebook, 'E Kustur'))

The output is:

Phone => E Kustur, Name => , Address => 133 Green Rd. NY-56423 1-541-914-3010

That is clearly not correct.

What Should Be Returned?

On a lookup you are returning either an error message or a formatted entry as a string. Wouldn't it be more flexible to return an object representing the entry or None if not found and allow the user to format the data or error message as they see fit?

Process the Unformatted Phonebook Once

Every time your function is called, it will scan through the entire unformatted phone book on average half the entries and then parse the correct entry to extract the name, phone and address. This is okay when you have two entries in the phone book, but what if you have a million entries?

The unformatted phonebook should be processed once and a dictionary created that maps a phone number to an entry consisting of a name and address. Once the unformatted phonebook had been processed phone numbers can then be looked up in \$O(1)\$ time.

What is the Abstraction?

What you are doing is creating a "reverse phonebook" that allows one to provide a phone number and return the name and address associated with that phone. Since we should be creating a dictionary once that will be subsequently used for all lookups, why not take advantage of Python's object-oriented capabilities and create a ReversePhonebook class that processes the unformatted phone book in its initializer and provides a lookup_by_phone method?

Better Choice of Names

It's already been mentioned that phone is not the best name for your function. But what about the argument names strng and num. What does strng tell me about this other than it is probably a string being passed? What does it actually represent? And what type of num are we expecting? Actually we are expecting a string value representing a phone number to look up.

When you split up the passed unformatted phone book on newline characters, you get a list of strings. What does each string represent? In your code you are referring to these as people (for the list) and person (for an individual element). But what you have is better thought of as a list of phone book entries rather than a list of people and these varibles should be named accordingly.

Perform Validation

You have name = re.search("<.*>", no_num).group(). But if no match is found this will raise an exception. When it comes to user-supplied input, take nothing for granted: you should check to see if the call to re.search actually returns a non-None value.

A Possible Implementation

The following incorporates the previous suggestions I and others have made.

"""
This module parses a poorly formatted phonebook into a well-formatted
one permitting efficient look-ups by name.
"""

from collections import namedtuple
import re
from string import ascii_letters as letters, digits

VALID_CHARS = letters + digits + ". -"

# A phonebook entry looked up by name:
PhoneBookEntry = namedtuple('PhoneBookEntry', ['name', 'address'])

class ReversePhoneBook:
    """This class implements the reverse phonebook."""

    def __init__(self, unformatted_phonebook: str):
        """See https://codereview.stackexchange.com/questions/291252/extract-data-from-poorly-formatted-phonebook
        for the format of the unformatted phonebook."""

        self._entries: dict[str, PhoneBookEntry] = {}

        phonebook_entries = unformatted_phonebook.strip().split('\n')
        name_regex = re.compile(r'<(.+)>')
        phone_regex = re.compile(r'\+([0-9]{1,2}-[0-9]{3}-[0-9]{3}-[0-9]{4})')
        invalid_chars_regex = re.compile(fr'[^{VALID_CHARS}]+')
        multiple_spaces_regex = re.compile(fr' +')

        for entry in phonebook_entries:
            name_match = name_regex.search(entry)
            phone_match = phone_regex.search(entry)
            if name_match is None or phone_match is None:
                raise RuntimeError(f'Invalid entry: {entry}')
            name = name_match[1]
            phone = phone_match[1]
            if phone in self._entries:
                raise RuntimeError(f'Duplicate phone {phone} for name {name}')

            # Remove invalid characters:
            cleaned_entry = invalid_chars_regex.sub(' ', entry)

            # Remove name and phone from entry:
            cleaned_entry = cleaned_entry.replace(name, '').replace(phone, '')

            # What's left is the address:
            address = cleaned_entry.strip()
            # Replace multiple spaces with a single space:
            address = multiple_spaces_regex.sub(' ', address)
            self._entries[phone] = PhoneBookEntry(name, address)

    def lookup_by_phone(self, phone: str) -> PhoneBookEntry | None:
        """Look up phonebook entry by name."""

        return self._entries.get(phone)

if __name__ == '__main__':
    unoformatted_phonebook = '''/+1-541-754-3010 156 Alphand_St. <J Steeve>
133, Green, Rd. <E Kustur> NY-56423 ;+1-541-914-3010!
'''
    phonebook = ReversePhoneBook(unoformatted_phonebook)
    for phone in ('1-541-754-3010', '1-541-914-3010'):
        entry = phonebook.lookup_by_phone(phone)
        if entry is None:
            print('No entry found for', phone)
        else:
            print(f'Phone => {phone}, Name => {entry.name}, Address => {entry.address}')

Prints:

Phone => 1-541-754-3010, Name => J Steeve, Address => 156 Alphand St.
Phone => 1-541-914-3010, Name => E Kustur, Address => 133 Green Rd. NY-5642

Future Extensions

Right now you are requiring phone numbers to be unique. But my spouse and I share the same phone number. Can't we each have an entry in the unformatted phonebook so that when a lookup is done by phone number two entries are returned?

What if you also wanted to provide a facility for the user to be able to look up an entry by name? Names are not unique, so this facility in general returns a list of matches.

How difficult given the above implementation would it be to implement these ideas compared to your original implementation?

Thanks for the help. As I understood it, classes were used to make objects that could be reüsed - wouldn't this Phonebook class only have one instance of itself, and thus not be very useful as a class? — Vessel, Commented Mar 28 at 13:05
First of all, in principle you could have many different instances of Phonebook each initialized with different unformatted phonebook files, even if that is not your particular use case. Even with a single instance the class, the class encapsulates the creation of the dictionary that holds the mapping between phone number and name/address info and the method(s) for accessing that dictionary. Think of the class as an implementation of a "phonebook" abstract data type that allows a client via the method lookup_by_phone to get information from it without having to know how it works. — Booboo, Commented Mar 28 at 14:22

Stack Exchange Network

Extract data from poorly formatted phonebook

The Problem

Examples

My solution

3 Answers 3

Use type-hints:

Use an automatic code-formatter:

Missing test cases:

Overview

Lint check

Layout

Documentation

Naming

Overview

Correct Errors

What Should Be Returned?

Process the Unformatted Phonebook Once

What is the Abstraction?

Better Choice of Names

Perform Validation

A Possible Implementation

Future Extensions

Not the answer you're looking for? Browse other questions tagged
python
programming-challenge
or ask your own question.

Hot Network Questions

Extract data from poorly formatted phonebook

The Problem

Examples

My solution

3 Answers 3

Use type-hints:

Use an automatic code-formatter:

Missing test cases:

Overview

Lint check

Layout

Documentation

Naming

Overview

Correct Errors

What Should Be Returned?

Process the Unformatted Phonebook Once

What is the Abstraction?

Better Choice of Names

Perform Validation

A Possible Implementation

Future Extensions

Not the answer you're looking for? Browse other questions tagged pythonprogramming-challenge or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
python
programming-challenge
or ask your own question.