Let's say I have a text file with billions of text lines sorted alphabetically, like

Bar=10
Foo=6
Naz=42

How can I search for the line starting with Foo as efficiently as possible (the file contains billions of variables like this), given that the lines are sorted alphabetically and that the line I want must start with (or "contain", if that is easier to search for) a specific text?


Edit:

This question can be considered a duplicate of https://askubuntu.com/q/423886/10473. The answer is to use look, which is fast enough for this kind of search.
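For readers landing here, a minimal sketch of that approach. It assumes look is installed and that the file was sorted in plain byte order (LC_ALL=C sort), which is what I believe look's comparison expects when a file is given; the sample file and the key variable are made up for illustration.

```shell
# Build a tiny sorted sample (hypothetical data, modelled on the question).
# LC_ALL=C makes sort use plain byte order.
sample=$(mktemp)
printf '%s\n' 'Naz=42' 'Foobiz=7' 'Foo=6' 'Bar=10' | LC_ALL=C sort > "$sample"

# Appending "=" to the key keeps Foobiz from matching a search for Foo.
key='Foo'

# look exits non-zero when nothing matches, so report that case on stderr.
result=$(look "${key}=" "$sample") || echo "no match for ${key}" >&2
echo "$result"

rm -f "$sample"
```

Because look does a binary search on the sorted file, it only reads a handful of positions instead of scanning every line.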

  • What do you want out of the search? A "yes" or "no" or the actual line that matches, or just the number after =? Will you only be searching with a single string or with many separate strings (expecting many answers)? Do you care for substring matches (so that Foo matches not only Foo but also AhFoo and Foobiz, or Hoo=Foo etc.)? Are these variables that would be valid in a shell? Are there duplicated lines, or duplicated variable names?
    – Kusalananda
    Commented Jan 8, 2021 at 23:19
  • @Kusalananda I want the line (since I also want the variable value). I search for only one string at a time (say Foo or Bar or Naz). I won't search for "Naz=" nor "42" nor "Naz=21" nor "Naz=42". I actually want a "full match" from the start of the line (Foo matches Foo but not AhFoo nor Hoo=Foo); I don't care if it also matches Foobiz: I'm not looking for it, but if it makes the command easier, that's fine
    – Xenos
    Commented Jan 8, 2021 at 23:24
  • Binary search in a sorted text file
    Commented Jan 8, 2021 at 23:45
  • @ctrl-alt-delor Thanks, I didn't know look was actually what I was looking for. I got it working using ... | xargs -I "{}" look -f "{}" "sorted.txt" which returns the result within a second. You may post an answer if you want me to accept it and get the reputation from it ;) Thanks again!
    – Xenos
    Commented Jan 11, 2021 at 16:05

1 Answer

I don't know how this will scale to the volumes you're talking about, but it seems to work with a file containing this:

Foo=123
Foobar=646
Foobar=85489
Noo=8654
Noobar=8262

using this command:

awk -F= '{if ($1 > "Foobar") { exit } ; if ($1 == "Foobar") { print $0 } }' sorted.txt

This is just a proof of concept. It would be a simple matter to adapt it so that the term you are matching against is passed in.
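One way that adaptation might look, as a rough sketch: the key is handed to awk with -v. The function name find_key and the temporary sample file are hypothetical, and I'm assuming the file is sorted in byte order, hence LC_ALL=C. It is still a linear scan that merely stops early.

```shell
# Hypothetical helper: print every line whose name (the text before "=")
# equals $1, reading the sorted file $2.
find_key() {
    LC_ALL=C awk -F= -v key="$1" '
        # Concatenating "" forces a string (not numeric) comparison.
        ($1 "") > (key "")  { exit }   # sorted input: nothing later can match
        ($1 "") == (key "") { print }  # exact match on the variable name
    ' "$2"
}

# Usage with the sample data from above.
sample=$(mktemp)
printf '%s\n' 'Foo=123' 'Foobar=646' 'Foobar=85489' 'Noo=8654' 'Noobar=8262' > "$sample"
out=$(find_key Foobar "$sample")
echo "$out"
rm -f "$sample"
```

Note that awk's -v interprets backslash escapes in the value, so keys containing backslashes would need different handling.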

  • It didn't scale well: it takes several minutes to run. I ended up using look, which I didn't know about, from the comments on the question. Thanks anyway!
    – Xenos
    Commented Jan 11, 2021 at 16:03
  • Cool, glad you got there.
    – bxm
    Commented Jan 12, 2021 at 22:16
