You can group the items by first letter and just search the sublists, no string can start with a substring unless it has at least the same first letter:
from collections import defaultdict
def find(l):
    """Yield each word in *l* that no *other* word in *l* is a prefix of.

    Duplicates are kept: a word is never treated as a prefix of itself,
    and the ``word != other`` guard only excludes the identical string.
    """
    buckets = defaultdict(list)
    # A prefix necessarily shares its first letter with the word it
    # prefixes, so we only ever need to compare words within one bucket.
    for word in l:
        buckets[word[0]].append(word)
    for members in buckets.values():
        for word in members:
            # Keep the word unless some distinct member is a proper prefix.
            if all(word == other or not word.startswith(other) for other in members):
                yield word
print(list(find(l)))
['sdfdg', 'xc', 'ab']
This correctly filters the words. As you can see from the output below, the reduce-based function does not — 'tool'
should not be in the output:
In [56]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew"]
In [57]: reduce(r,l)
Out[57]: ['tool', 'ab', 'too', 'xc', 'sdfdg', 'xrew']
In [58]: list(find(l))
Out[58]: ['sdfdg', 'too', 'xc', 'xrew', 'ab']
It also does it efficiently:
In [59]: l = ["".join(sample(ascii_lowercase, randint(2,25))) for _ in range(5000)]
In [60]: timeit reduce(r,l)
1 loops, best of 3: 2.12 s per loop
In [61]: timeit list(find(l))
1 loops, best of 3: 203 ms per loop
In [66]: %%timeit
..... result = []
....: for element in lst:
....: is_prefixed = False
....: for possible_prefix in lst:
....: if element is not possible_prefix and element.startswith(possible_prefix):
....: is_prefixed = True
....: break
....: if not is_prefixed:
....: result.append(element)
....:
1 loops, best of 3: 4.39 s per loop
In [92]: timeit list(my_filter(l))
1 loops, best of 3: 2.94 s per loop
If you know the minimum string length is always > 1 you can optimise further: if the minimum string length is 2,
then for one word to be a prefix of another, the two words must share at least their first two letters:
def find(l):
    """Yield the words in *l* that no other word in *l* is a proper prefix of.

    Groups words on their first ``width`` characters, where ``width`` is the
    length of the shortest word: any prefix is at least that long, so a word
    and its prefix always land in the same group.
    """
    # shortest word length determines the grouping key width
    width = len(min(l, key=len))
    buckets = defaultdict(list)
    for word in l:
        buckets[word[:width]].append(word)
    for members in buckets.values():
        for word in members:
            if all(word == other or not word.startswith(other) for other in members):
                yield word
In [84]: timeit list(find(l))
100 loops, best of 3: 14.6 ms per loop
Lastly if you have dupes you may want to filter out the duplicated words from your list but you need to keep them to compare:
from collections import defaultdict,Counter
def find(l):
    """Yield words that are unique in *l* and not properly prefixed by another word.

    Unlike the earlier versions, duplicated words are dropped entirely:
    a global Counter over *l* tracks how many times each word appears.
    """
    counts = Counter(l)
    width = len(min(l, key=len))
    buckets = defaultdict(list)
    for word in l:
        buckets[word[:width]].append(word)
    for members in buckets.values():
        for word in members:
            if counts[word] != 1:
                # duplicated word — filtered out entirely
                continue
            if all(word == other or not word.startswith(other) for other in members):
                yield word
So if speed is important, an implementation using some variation of the code above is going to be significantly faster than your naive approach. There is also more data stored in memory, so you should take that into account too.
To save memory we can create a counter for each val/sublist so we only store a single counter dict of words at a time:
def find(l):
    """Yield unique, un-prefixed words from *l*, counting duplicates per bucket.

    Memory-friendlier variant: duplicates necessarily share a grouping key,
    so one Counter per bucket (alive only for that iteration) is enough —
    we never hold a count of the whole word list at once.
    """
    width = len(min(l, key=len))
    buckets = defaultdict(list)
    for word in l:
        buckets[word[:width]].append(word)
    for members in buckets.values():
        # only this bucket needs checking for duplicates
        counts = Counter(members)
        for word in members:
            prefixed = any(word.startswith(other) and word != other for other in members)
            if not prefixed and counts[word] == 1:
                yield word
creating a dict each loop adds 5ms so still < 20ms for 5k words.
The reduce method should work if the data is sorted:
reduce(r,sorted(l)) # -> ['ab', 'sdfdg', 'too', 'xc', 'xrew']
To make the difference clear between the behaviour:
In [202]: l = ["tool",'ab',"too", 'xc', 'abb',"toot", 'abed',
"abel", 'sdfdg', 'abfdsdg', 'xccc',"xcew","xrew","ab"]
In [203]: list(filter_list(l))
Out[203]: ['ab', 'too', 'xc', 'sdfdg', 'xrew', 'ab']
In [204]: list(find(l))
Out[204]: ['sdfdg', 'too', 'xc', 'xrew', 'ab', 'ab']
In [205]: reduce(r,sorted(l))
Out[205]: ['ab', 'sdfdg', 'too', 'xc', 'xrew']
In [206]: list(find_dupe(l))
Out[206]: ['too', 'xrew', 'xc', 'sdfdg']
In [207]: list(my_filter(l))
Out[207]: ['sdfdg', 'xrew', 'too', 'xc']
In [208]: "ab".startswith("ab")
Out[208]: True
So ab
is repeated twice; using a set or a dict without keeping track of how many times ab
appeared would mean we would consider that no other element satisfied the condition
"ab".startswith(other) == True
, which we can see is incorrect.
You can also use itertools.groupby to group based on the min index size:
def find_dupe(l):
l.sort()
mn = len(min(l, key=len))
for k, val in groupby(l, key=lambda x: x[:mn]):
val = list(val)
for v in val:
cn = Counter(val)
if not any(v.startswith(s) and v != s for s in val) and cn[v] == 1:
yield v
Based on your comments, we can adjust my first code if you don't think "dd".startswith("dd")
should be True with repeated elements:
l = ['abbb', 'xc', 'abb', 'abed', 'sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd','sdfdg', 'xc','abfdsdg', 'xccc', 'd','dd']
def find_with_dupe(l):
    """Deduplicate *l*, then yield the words not properly prefixed by any other word.

    Because the input is first passed through ``set``, repeated words collapse
    to one copy and so are compared against the other *distinct* words only.
    """
    d = defaultdict(list)
    srt = sorted(set(l))
    # BUG FIX: the original used len(srt[0]) — the length of the
    # *lexicographically* smallest word, not the shortest one. When those
    # differ, words are grouped on an over-long key and can land in a
    # different bucket from their own prefix, escaping the filter.
    ind = len(min(srt, key=len))
    # group by the first `ind` characters; a word and any prefix of it
    # always share this key, since every prefix has length >= ind
    for ele in srt:
        d[ele[:ind]].append(ele)
    for val in d.values():
        for v in val:
            # check each candidate against the other words in its bucket
            if not any(v.startswith(s) and v != s for s in val):
                yield v
print(list(find_with_dupe(l)))
['abfdsdg', 'abed', 'abb', 'd', 'sdfdg', 'xc']
Which, run on a random sample of text, completes in a fraction of the time your own code does:
In [15]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [16]: timeit list(find(l))
100 loops, best of 3: 19 ms per loop
In [17]: %%timeit
....: l = open("/home/padraic/Downloads/sample.txt").read().split()
....: for i in range(0, len(l) - 1):
....: for j in range(i + 1, len(l)):
....: if l[j].startswith(l[i]):
....: l[j] = l[i]
....: else:
....: if l[i].startswith(l[j]):
....: l[i] = l[j]
....:
1 loops, best of 3: 4.92 s per loop
Both returning identical output:
In [41]: l = open("/home/padraic/Downloads/sample.txt").read().split()
In [42]:
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
....:
In [43]:
In [43]: l2 = open("/home/padraic/Downloads/sample.txt").read().split()
In [44]: sorted(set(l)) == sorted(find(l2))
Out[44]: True