28

The locate program of findutils scans one or more databases of filenames and displays any matches. This can be used as a very fast find command if the file was present during the last file name database update.

There are many kinds of databases nowadays,

So what kind of database does updatedb update and locate use?

Thanks.

5
  • Regardless of whether locate actually uses Berkeley DB, it's worth investigating: it's a very old, simple, effective disk-based key-value store.
    – pjc50
    Commented Jul 21, 2017 at 13:09
  • @pjc50 I'd love to. Where are the files for the database? How shall I view their contents?
    – Tim
    Commented Jul 21, 2017 at 13:10
  • For locate? serverfault.com/questions/454127/…
    – pjc50
    Commented Jul 21, 2017 at 13:11
  • 1
    "Page Not Found" , the link should be serverfault.com/questions/454127/…
    – Tim
    Commented Jul 21, 2017 at 13:12
  • So what do the "keys" and "values" represent in the database? If I understand Stephen Kitt's comment unix.stackexchange.com/questions/379725/… correctly, the database isn't key-value.
    – Tim
    Commented Jul 21, 2017 at 13:15

3 Answers

30

Implementations of locate/updatedb typically use specific databases tailored to their requirements, rather than a generic database engine. You’ll find those specific databases documented by each implementation; for example:

  • GNU findutils’ is documented in locatedb(5), and is pretty much just a list of files (with a specific compression algorithm);
  • mlocate’s is documented in mlocate.db(5), and can also be considered a list of directories and files (with metadata).
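
To make "a specific compression algorithm" a bit more concrete: the findutils format relies on front compression (incremental encoding), where each entry records how much of the previous path it shares and then stores only the remainder. Below is a minimal Python sketch of that idea. The function names are made up for illustration, and the (shared length, suffix) tuples are a simplification, not the on-disk byte layout described in locatedb(5).

```python
import os


def front_encode(paths):
    """Encode sorted paths as (shared_prefix_length, suffix) pairs.

    Illustrative only: the real locatedb format packs this information
    into bytes differently; see locatedb(5) for the actual layout.
    """
    encoded = []
    previous = ""
    for path in paths:
        # Number of leading characters shared with the previous entry.
        shared = len(os.path.commonprefix([previous, path]))
        encoded.append((shared, path[shared:]))
        previous = path
    return encoded


def front_decode(encoded):
    """Rebuild the full paths from (shared_prefix_length, suffix) pairs."""
    paths = []
    previous = ""
    for shared, suffix in encoded:
        path = previous[:shared] + suffix
        paths.append(path)
        previous = path
    return paths


# Hypothetical sample data; sorting maximises the shared prefixes.
SAMPLE_PATHS = sorted([
    "/usr/bin/find",
    "/usr/bin/locate",
    "/usr/bin/updatedb",
    "/usr/lib/locale",
])

if __name__ == "__main__":
    for shared, suffix in front_encode(SAMPLE_PATHS):
        print(shared, suffix)
```

Because neighbouring entries in a sorted file list tend to share long directory prefixes, most entries shrink to a small number plus a short suffix, which is why such a list stays compact without a general-purpose database engine.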
12
  • 1
    Thanks. Where and how can i learn the principles of designing and implementing specific databases tailored to specific requirements? I'd appreciate any references for reading.
    – Tim
    Commented Jul 20, 2017 at 12:27
  • 12
    Designing databases boils down to designing data structures, so learn about those, and then about size-versus-speed design trade-offs... I don’t know of a specific resource that would be good, perhaps something like Programming Pearls would be a nice introduction to the way of thinking about these topics (and not over-thinking them too).
    Commented Jul 20, 2017 at 12:33
  • Thanks. I have learned something about data structures, and the next question would be finding references and ways to go from data structures to databases.
    – Tim
    Commented Jul 20, 2017 at 12:38
  • 2
    Databases as used by locate are just data structures stored on disk, so going from the data structures to the corresponding databases is relatively straightforward. Moving to databases as your question presents them is another thing entirely; there are books and courses dedicated to those topics. Designing and developing a database management system such as MongoDB or PostgreSQL is one of the harder problems in computer science and software engineering today, especially when you throw in the distributed side of things.
    Commented Jul 20, 2017 at 12:43
  • 3
    I've done a fair bit with locatedb & mlocate.db over the years. I originally had Perl code to generate a locatedb for my dlocate program in Debian. I ended up discovering that just grepping a text file was many times faster than searching a locatedb, and given the size of disks these days the file size savings were insignificant. So I switched to just grep. I also have a local cron job that dumps mlocate.db to plain text after the mlocate cron job runs, which I search with a local qlocate shell script... much faster than running mlocate, and it also has some useful extra options.
    – cas
    Commented Jul 20, 2017 at 12:48
16
+100

It seems to be a flat file of C structs, written and read using the GNU libc obstack macros.

See sources

https://github.com/msekletar/mlocate/blob/master/src/updatedb.c#L720

https://github.com/msekletar/mlocate/blob/master/src/locate.c#L413

You could get something similar with

find / -xdev -type f -not -path \*\.git\/\* | gzip -9 > /tmp/files.gz
zgrep file_i_want /tmp/files.gz
3
  • 3
    Thanks. What are the two commands at the end doing?
    – Tim
    Commented Jul 20, 2017 at 20:58
  • 3
    @Tim The first command searches the filesystem (find) starting from the root directory (/), without descending into directories on other filesystems (-xdev), for regular files (-type f) that are not inside *.git directories (-not -path \*\.git\/\*). It compresses the output (| gzip -9) and saves it to the file /tmp/files.gz (> /tmp/files.gz). The next line uses zgrep to search for file_i_want inside the compressed file /tmp/files.gz.
    – piotrekkr
    Commented Jul 21, 2017 at 6:53
    The find example is very inspiring. I ran it without gzipping the output and used VS Code to check the contents. I also didn't expect find to traverse the / filesystem so quickly!
    – Rick
    Commented Jun 4, 2022 at 16:00
4

As far as I know, the database behind it is Berkeley DB, which is a daemonless key/value database. Follow the link for more info. Extract from Wikipedia:

Berkeley DB (BDB) is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is written in C with API bindings for C++, C#, Java, Perl, PHP, Python, Ruby, Smalltalk, Tcl, and many other programming languages. BDB stores arbitrary key/data pairs as byte arrays, and supports multiple data items for a single key. Berkeley DB is not a relational database.

The location of the database on RHEL/CentOS is /var/lib/mlocate/mlocate.db (I'm not sure about other distributions). The command locate --statistics reports the location of the database along with some statistics about it (example):

Database /var/lib/mlocate/mlocate.db:
        16,375 directories
        242,457 files
        11,280,301 bytes in file names
        4,526,116 bytes used to store database

For the mlocate format, here is the beginning of the man page:

A mlocate database starts with a file header: 8 bytes for a magic number ("\0mlocate" like a C literal), 4 bytes for the configuration block size in big endian, 1 byte for file format version (0), 1 byte for the “require visibility” flag (0 or 1), 2 bytes padding, and a NUL-terminated path name of the root of the database.

The header is followed by a configuration block, included to ensure databases are not reused if some configuration changes could affect their contents. The size of the configuration block in bytes is stored in the file header. The configuration block is a sequence of variable assignments, ordered by variable name. Each variable assignment consists of a NUL-terminated variable name and an ordered list of NUL-terminated values. The value list is terminated by one more NUL character. The ordering used is defined by the strcmp() function.
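
The layout quoted above can be read with a few lines of Python. This is only a sketch based on that man page excerpt: the function names, the returned dictionary keys, and the sample bytes (including the variable names in the sample configuration block) are illustrative assumptions, and it stops before the directory entries that follow the configuration block.

```python
import struct

MAGIC = b"\x00mlocate"  # 8-byte magic number from the man page excerpt


def parse_config_block(block):
    """Split the configuration block into {variable name: [values]}.

    Assumes no value is an empty string, since an empty string is
    what marks the end of each value list.
    """
    config = {}
    pos = 0
    while pos < len(block):
        end = block.index(b"\x00", pos)
        name = block[pos:end].decode("utf-8", errors="replace")
        pos = end + 1
        values = []
        while True:
            end = block.index(b"\x00", pos)
            value = block[pos:end]
            pos = end + 1
            if not value:  # the extra NUL that terminates the value list
                break
            values.append(value.decode("utf-8", errors="replace"))
        config[name] = values
    return config


def parse_mlocate_header(data):
    """Parse the file header and configuration block from raw bytes."""
    if data[:8] != MAGIC:
        raise ValueError("not an mlocate database")
    # 4-byte big-endian size, 1-byte version, 1-byte flag, 2 padding bytes.
    conf_size, version, visibility = struct.unpack(">IBB2x", data[8:16])
    root_end = data.index(b"\x00", 16)  # root path is NUL-terminated
    conf_start = root_end + 1
    return {
        "version": version,
        "require_visibility": bool(visibility),
        "root": data[16:root_end].decode("utf-8", errors="replace"),
        "config": parse_config_block(data[conf_start:conf_start + conf_size]),
    }


# Synthetic header built by hand for illustration; not taken from a real system.
SAMPLE_CONF = b"prune_bind_mounts\x001\x00\x00prunefs\x00NFS\x00proc\x00\x00"
SAMPLE_DB = (
    MAGIC
    + struct.pack(">IBB2x", len(SAMPLE_CONF), 0, 1)
    + b"/\x00"
    + SAMPLE_CONF
)

if __name__ == "__main__":
    print(parse_mlocate_header(SAMPLE_DB))
```

To try it on a real database, read the bytes of /var/lib/mlocate/mlocate.db (this usually requires elevated privileges) and pass them to parse_mlocate_header. Nothing in this layout resembles a Berkeley DB file.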

4
  • 2
    It depends on the implementation of locate/updatedb...
    Commented Jul 20, 2017 at 12:10
  • 3
    mlocate most definitely does not use Berkeley DB.
    Commented Jul 20, 2017 at 12:20
  • 3
    Do you have any source backing your Berkeley DB claim? The second part of your answer contradicts it.
    – Mat
    Commented Jul 21, 2017 at 7:23
  • locate --statistics was what I needed. Thanks.
    Commented Dec 19, 2022 at 23:33
