3

I am trying to use sed to extract a specific string from a line within a file. Currently I read the file in a while loop and search for a specific string. When that string is found I store the line, but I then need to use sed to parse it so that I only get the string between two slashes (it's a directory name, so I need to keep both the leading and trailing slashes if possible). Here is the loop I am running to search the file:

#!/bin/sh
file=configFile.conf
while read -r line
do
    if echo "$line" | grep -q "directory_root"
    then
        DIR_ROOT="$line"
    fi
done < "$file"
echo "$DIR_ROOT"
exit 0

The while loop works and echoes the following string:

directory_root /root/config/data/

I then need to use sed to reduce this to the following output, so that I can pass the correct directory name to another script:

/root/

Is it possible to use sed and regular expressions to extract only the above from the echoed output?

Thanks

5
  • is the idea that you only want the topmost directory in that path? Commented Jan 11, 2018 at 15:58
  • Short answer - yes it is... What are you trying to do though? Are you just trying to get the first string surrounded by slashes? Commented Jan 11, 2018 at 15:58
  • Yes, the goal here is to extract only the top most directory of any path that is found on the line, which should always be the first string surrounded by slashes Commented Jan 11, 2018 at 15:59
  • I think that others are right that there are easier approaches than sed for this task. However, since you asked for a sed based solution, I gave you one below. :) Commented Jan 11, 2018 at 16:12
  • Actually, you can do the whole thing in a single line in sed. I've updated my response below. Commented Jan 11, 2018 at 16:28

5 Answers

7

If you want to use sed, this would work:

~/tmp> str="directory_root /root/config/data/"
~/tmp> echo $str | sed 's|^[^/]*\(/[^/]*/\).*$|\1|'
/root/

Or as a one-liner (assuming the literal directory_root is in the line):

 cat file | sed -e 's|^directory_root \(/[^/]*/\).*$|\1|;tx;d;:x'

Explanation of regex in first example:

s| : use | as the delimiter (makes it easier to read in this case)

^ : match the beginning of the line

[^/]* : match all non-/ characters (the match is greedy, but it cannot cross a /, so it stops at the first /)

\( : start recording string 1

/ : match a literal /

[^/]* : match all non-/ characters

/ : match the closing literal /

\) : finish recording string 1

.* : match everything else to the end of the line

$ : match the end of the line

| : delimiter

\1 : replace the match with string 1

| : delimiter
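
To check how these pieces behave together, here is a minimal sketch that wraps the first expression in a shell function and feeds it two sample lines. The function name first_dir is made up for illustration; the sed expression itself is the one from the first example.

```shell
#!/bin/sh
# first_dir is a hypothetical helper name, not part of the original answer.
# It reads lines on stdin and applies the first sed expression unchanged.
first_dir() {
    sed 's|^[^/]*\(/[^/]*/\).*$|\1|'
}

# A line containing a full path is reduced to its top-level directory.
echo "directory_root /root/config/data/" | first_dir

# A line without two slashes does not match, so sed prints it unchanged.
echo "no slashes here" | first_dir
```

The first call prints /root/ and the second prints the input line as it was, which is why the one-liner needs extra commands to drop non-matching lines.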

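In the second example, I appended ;tx;d;:x, which stops sed from printing lines that do not match (see here): t branches to the label x when the substitution succeeded, and d deletes every other line. You can then run this on the entire file, and it will only print the lines it modified.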

~/tmp> echo "xx" > tmp.txt
~/tmp> echo "directory_root /root/config/data/" >> tmp.txt
~/tmp> echo "xxxx ttt" >> tmp.txt
~/tmp>
~/tmp> cat tmp.txt | sed -e 's|^directory_root \(/[^/]*/\).*$|\1|;tx;d;:x'
/root/
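
For comparison, here is a sketch of two ways to print only the matching lines: the branch-based form, written with separate -e options (which some non-GNU sed implementations may parse more reliably than labels separated by ;), and the -n option combined with the p flag. The function names and the temporary file are assumptions made for the demonstration, and mktemp is assumed to be available.

```shell
#!/bin/sh
# Build a throwaway sample file and remove it when the script exits.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' "xx" "directory_root /root/config/data/" "xxxx ttt" > "$sample"

# Branch form: t jumps to label x after a successful substitution;
# otherwise d deletes the line, so nothing is printed for it.
with_branch() {
    sed -e 's|^directory_root \(/[^/]*/\).*$|\1|' -e 'tx' -e 'd' -e ':x' "$1"
}

# -n form: suppress automatic printing and print only when s succeeded.
with_n_flag() {
    sed -n 's|^directory_root \(/[^/]*/\).*$|\1|p' "$1"
}

with_branch "$sample"
with_n_flag "$sample"
```

Both calls print /root/ for this sample file.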
4
  • Much cleaner than what I wrote. I started going down this path (of negation of slash within the match) but I didn't ever get it working cos I messed up the slash escaping. Using | as the sep is a very good idea in this context. Commented Jan 11, 2018 at 16:22
  • Great answer, the one liner does it cleanly and answer is well explained. Thanks Commented Jan 12, 2018 at 8:44
  • Yes, as @PesaThe has implied, you can also do sed -e 's|^directory_root \(/[^/]*/\).*$|\1|;tx;d;:x' file.txt, which does the same thing, but uses one fewer process (slightly more efficient). Outside of that, they produce the same result, which satisfies the adage: "There's more than one way to skin a cat"... (sorry, couldn't resist). Commented Jan 12, 2018 at 15:35
  • Also, note that you can start a new cycle by omitting the label name. So if you insist on using the t command: sed '...$|\1|;t;d' will have the same effect.
    – PesaThe
    Commented Jan 16, 2018 at 23:25
1

You don't necessarily need sed for this. You can just use bash:

#!/bin/bash

f="directory_root /asdf/asdfad/fad"
regex="^directory_root (\/\w+\/).*$"
if [[ $f =~ $regex ]]
then
    name="${BASH_REMATCH[1]}"
    echo "$name"
fi

This prints /asdf/

See: Capturing Groups From a Grep RegEx

1

You can use a two-step variable substitution to cut DIR_ROOT down to just the top-level directory (this assumes DIR_ROOT already holds only the path, such as /root/config/data/):

DIR_ROOT="${DIR_ROOT#/}"    # cut away the leading slash
DIR_ROOT="/${DIR_ROOT%%/*}"  # cut the trailing path and re-add the slash
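
As a variation on the same idea, here is a minimal POSIX sh sketch that also copes with a leading directory_root prefix and re-adds the trailing slash. The function name top_dir is hypothetical, and the sketch assumes the first slash on the line starts the path.

```shell
#!/bin/sh
# top_dir is a hypothetical helper; the subshell body "( ... )" keeps its
# variables from leaking into the caller.
top_dir() (
    line=$1
    case $line in
        */*) ;;          # at least one slash: carry on
        *) exit 1 ;;     # no slash at all: nothing to extract
    esac
    rest=${line#*/}      # drop everything up to and including the first slash
    top=${rest%%/*}      # drop the next slash and everything after it
    printf '/%s/\n' "$top"
)

top_dir "directory_root /root/config/data/"
```

This prints /root/. Note that a path with only one slash, such as /dir, still gets a trailing slash appended, and a bare / is not handled specially.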
3
  • 1
    Might need to add a slash at the end as well.
    – PesaThe
    Commented Jan 11, 2018 at 16:10
  • Should be straightforward to add, if adopted by OP and indeed needed :-D
    – user594138
    Commented Jan 11, 2018 at 16:12
  • 1
    Heh, I guess you are right :)
    – PesaThe
    Commented Jan 11, 2018 at 16:12
1

Since you asked for a sed solution, I have one for you:

$ s="directory_root /root/config/data"
$ echo "${s}" | sed -e 's/\//\x00/; s/\//\x00/; s/.*\x00\(.*\)\x00.*/\/\1\//;'
/root/

How does this work? Well, since sed doesn't have a non-greedy match, the trick is to use a series of search-and-replace commands to set things up so that you don't need non-greedy matching. The first s/// replaces the first slash with a NUL byte, and the second s/// does the same for the next slash. Now you have the first two slashes (only) replaced with a byte which isn't going to be in the input of any UNIX shell string, so you can extract the directory surrounded by \x00 with the regular, greedy sed search and replace (the third s///).

Cheers!

Note 1: this solution was partially inspired by an answer on unix stack exchange

Note 2: this solution requires GNU sed because of the null byte. If you're on BSD sed (macOS), you may just want to use some other separator which won't appear in your input.
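
A sketch of that alternative, assuming the byte 0x01 never occurs in the input; the variable sep and the function name first_dir_sep are made up for illustration:

```shell
#!/bin/sh
# Assumption: the byte 0x01 never appears in the input lines.
sep=$(printf '\001')

# first_dir_sep is a hypothetical name for this variant of the command.
first_dir_sep() {
    # Replace the first two slashes with the separator, then keep only
    # what sits between the two separators and put the slashes back.
    sed -e "s|/|$sep|" -e "s|/|$sep|" -e "s|.*$sep\(.*\)$sep.*|/\1/|"
}

echo "directory_root /root/config/data" | first_dir_sep
```

This prints /root/. A line with fewer than two slashes would come out with the separator byte still in it, so this variant is only suitable for lines known to contain a full path.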


PS: It's probably easier not to use sed.

1
  • I voted for @HardcoreHenry's answer cos it's actually much cleaner than what I've done here! Commented Jan 11, 2018 at 16:20
0
sed -rn 's|^directory_root[[:blank:]]+(/[^/]*/?).*|\1|p' data
  • -n: suppresses automatic printing of pattern space
  • -r: enables extended regular expressions (no need to escape + etc)
  • s|regex|replacement|: you can choose a different delimiter
  • p: prints the current pattern space only if the regex has been matched
  • [:blank:]: matches <tab> or <space>
  • ( regex ): captures a group that can later be referenced with \1, \2, ...

/[^/]*/? matches /, followed by any number of non-slashes, optionally followed by another /. This will correctly output /root/.

However, what if you happen to have directory_root / or directory_root /dir? That's what the /? is for. If you want to print the directory only if it's surrounded by / on both sides, just remove the ?.
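
To make the difference concrete, here is a small sketch comparing the two variants. The function names lenient and strict are made up, and -E is used in place of -r as an assumption about portability (both GNU and BSD sed accept -E).

```shell
#!/bin/sh
# lenient: the trailing slash after the directory name is optional.
lenient() {
    sed -n -E 's|^directory_root[[:blank:]]+(/[^/]*/?).*|\1|p'
}

# strict: the directory name must be followed by a slash.
strict() {
    sed -n -E 's|^directory_root[[:blank:]]+(/[^/]*/).*|\1|p'
}

printf '%s\n' "directory_root /root/config/data/" "directory_root /dir" | lenient
printf '%s\n' "directory_root /root/config/data/" "directory_root /dir" | strict
```

The lenient variant prints /root/ and /dir, while the strict one prints only /root/ and skips the line that has no second slash.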
