24

I have the following data (a list of R packages parsed from an R Markdown file) that I want to turn into a list I can pass to R to install:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

I want to turn the list into a list of the form:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I currently have a bash pipeline that goes from the raw file to the list above:

grep 'library(' Presentation.Rmd \
| grep -v '#' \
| cut -f2 -d\( \
| tr -d ')'  \
| sort | uniq

I want to add a step that turns the newlines into a comma-separated list. I've tried adding tr '\n' '","', which fails. I've also tried a number of the following Stack Overflow answers, which also fail:

This produces library(stringr)))phics) as the result.

This produces ,% as the result.

This answer (with the -i flag removed), produces output identical to the input.

5
  • Do the delimiters need to be comma-space, or is comma alone acceptable? Commented Jan 17, 2017 at 18:29
  • Either is fine, but I do need a quote character surrounding the string, either ' or ".
    – fbt
    Commented Jan 17, 2017 at 18:31
  • 2
    see also Turn list into single line with delimiter Commented Jan 17, 2017 at 19:20
  • Am I the first to notice that the input data and the script to process it, are completely incompatible. There will be no output. Commented Jan 19, 2017 at 21:43
  • The script I listed is how I generate the input data. Someone asked for it. The actual input data would look something like this. Note that Github changes the formatting to remove the new lines.
    – fbt
    Commented Jan 20, 2017 at 2:56

8 Answers

32

You can add quotes with sed and then merge the lines with paste, like this:

sed 's/^\|$/"/g'|paste -sd, -

If you are running a GNU coreutils based system (i.e. Linux), you can omit the trailing '-'.

If your input data has DOS-style line endings (as @phk suggested), you can modify the command as follows:

sed 's/\r//;s/^\|$/"/g'|paste -sd, -
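A variant of the same idea, as a sketch only: the `\|` alternation used above is not specified by POSIX for basic regular expressions, so whether it works depends on the sed implementation. Wrapping the whole line with `&` (the matched text) avoids it. The function name `quote_and_join` is made up for illustration, and the sample names are taken from the question.

```shell
# Wrap each line in double quotes using & (the matched text), then join the
# lines with commas. The explicit "-" tells paste to read standard input.
quote_and_join() {
    sed 's/.*/"&"/' | paste -sd, -
}

printf '%s\n' d3heatmap data.table ggplot2 | quote_and_join
```

For these three sample lines this should print `"d3heatmap","data.table","ggplot2"`.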
9
  • 2
    On MacOS (and maybe others), you will need to include a dash to indicate that the input is from stdin rather than a file: sed 's/^\|$/"/g'|paste -sd, -
    – cherdt
    Commented Jan 17, 2017 at 19:08
  • True, "coreutils" version of paste will accept both forms, but "-" is more POSIX. Thx !
    – zeppelin
    Commented Jan 17, 2017 at 19:21
  • 2
    Or just with sed alone: sed 's/.*/"&"/;:l;N;s/\n\(.*\)$/, "\1"/;tl' Commented Jan 17, 2017 at 20:09
  • 1
    @fbt The note I now added at the end of my answer applies here as well.
    – phk
    Commented Jan 17, 2017 at 20:52
  • 1
    @DigitalTrauma - not really a good idea; that would be very slow (might even hang with huge files) - see the answers to the Q I linked in my comment on the Q here; the cool thing is to use paste alone ;) Commented Jan 17, 2017 at 21:07
12
Using awk:

awk 'BEGIN { ORS="" } { print p"'"'"'"$0"'"'"'"; p=", " } END { print "\n" }' /path/to/list

Alternative with less shell escaping and therefore more readable:

awk 'BEGIN { ORS="" } { print p"\047"$0"\047"; p=", " } END { print "\n" }' /path/to/list

Output:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

Explanation:

The awk script itself without all the escaping is BEGIN { ORS="" } { print p"'"$0"'"; p=", " } END { print "\n" }. The variable p is only set after the first entry has been printed (before that it acts like an empty string). Every entry (or in awk-speak: record) is printed with p as a prefix and with single quotes around it. The awk output record separator variable ORS is not needed (since the prefix is doing its job for you) so it is set to be empty at the BEGINing. Oh, and we might want our output to END with a newline (e.g. so it works with further text-processing tools); should this not be needed, the part with END and everything after it (inside the single quotes) can be removed.
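The prefix trick described above can also be spelled out as a commented, multi-line script. This is only a sketch: `join_quoted` is a hypothetical helper name, and it reads the names from standard input instead of /path/to/list.

```shell
# join_quoted is a hypothetical helper name; it reads one name per line on stdin.
join_quoted() {
    awk '
        BEGIN { ORS = "" }            # print adds nothing after each record
        {
            print p "\047" $0 "\047"  # prefix, then the record in single quotes
            p = ", "                  # every record after the first gets ", "
        }
        END { print "\n" }            # finish the output with one newline
    '
}

printf '%s\n' plotly reshape2 scales | join_quoted
```

With the three sample names this should print `'plotly', 'reshape2', 'scales'`.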

Note

If you have Windows/DOS-style line endings (\r\n), you have to convert them to UNIX style (\n) first. To do this you can put tr -d '\015' at the beginning of your pipeline:

tr -d '\015' < /path/to/input.list | awk […] > /path/to/output

(Assuming you don't have any use for \rs in your file. Very safe assumption here.)

Alternatively, simply run dos2unix /path/to/input.list once to convert the file in-place.
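To see why carriage returns garble the result, here is a small sketch with two sample names from the question. The helper `dos_input` is made up for the demonstration and stands in for a real file with \r\n line endings. The stray \r ends up inside the quotes, and a terminal then jumps back to the start of the line and overwrites what was already printed.

```shell
# Two sample names with DOS (\r\n) line endings, as a stand-in for the real file.
dos_input() { printf 'scales\r\nstringr\r\n'; }

# Count the carriage returns that survive quoting: they sit inside the quotes.
dos_input | sed "s/.*/'&'/" | tr -cd '\r' | wc -c

# Delete them first and the join behaves as intended.
dos_input | tr -d '\r' | sed "s/.*/'&'/" | paste -sd, -
```

The first command should count two carriage returns; the second should print `'scales','stringr'`.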

6
  • When I run this command, I get ', 'stringr23aphics as the output.
    – fbt
    Commented Jan 17, 2017 at 20:21
  • @fbt See my latest note.
    – phk
    Commented Jan 17, 2017 at 20:31
  • 2
    print p"'"'"'"$0"'"'"'"; p=", "—holy quotes, Batman!
    – wchargin
    Commented Jan 17, 2017 at 21:23
  • I know, right‽ :) I thought about mentioning that in many shells print p"'\''"$0"'\''"; would have also worked (it's not POSIXy though), or alternatively using bash's C quoting strings ($'') even just print p"\'"$0"\'"; (might have required doubling other backslashes though) but there's already the other method using awk's character escapes.
    – phk
    Commented Jan 17, 2017 at 21:41
  • Wow, I can't believe you figured that out. Thank you.
    – fbt
    Commented Jan 18, 2017 at 17:44
8
+50

As @don_crissti's linked answer shows, the paste option borders on incredibly fast -- the Linux kernel's piping is more efficient than I would have believed if I hadn't just now tried it. Remarkably, if you can be happy with a single comma separating your list items rather than a comma+space, a paste pipeline

(paste -d\' /dev/null - /dev/null | paste -sd, -) <input

is faster than even a reasonable flex program(!)

%option 8bit main fast
%%
.*  { printf("'%s'",yytext); }
\n/(.|\n) { printf(", "); }

But if just decent performance is acceptable (and if you're not running a stress test, you won't be able to measure any constant-factor differences, they're all instant) and you want both flexibility with your separators and reasonable one-liner-y-ness,

sed "s/.*/'&'/;H;1h;"'$!d;x;s/\n/, /g'

is your ticket. Yes, it looks like line noise, but the H;1h;$!d;x idiom is the right way to slurp up everything. Once you can recognize that, the whole thing actually gets easy to read: it's s/.*/'&'/ followed by a slurp and an s/\n/, /g.
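Read step by step, the sed one-liner does the following. This is a sketch: `slurp_join` is a hypothetical wrapper name, and the script is split across two -e options purely so that the `$!` never appears inside double quotes.

```shell
slurp_join() {
    # 1. s/.*/'&'/  wrap the current line in single quotes
    # 2. H          append it to the hold space (after a newline)
    # 3. 1h         on line 1, overwrite the hold space instead, so it
    #               does not start with an empty line
    # 4. $!d        on every line but the last, stop here and read on
    # 5. x          on the last line, fetch the accumulated hold space
    # 6. s/\n/, /g  turn the newlines between entries into ", "
    sed -e "s/.*/'&'/" -e 'H;1h;$!d;x;s/\n/, /g'
}

printf '%s\n' ggplot2 htmltools htmlwidgets | slurp_join
```

For the three sample lines this should print `'ggplot2', 'htmltools', 'htmlwidgets'`. Changing the separator only means changing the replacement in step 6.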


edit: bordering on the absurd, it's fairly easy to get flex to beat everything else hollow, just tell stdio you don't need the builtin multithread/signalhandler sync:

%option 8bit main fast
%%
.+  { putchar_unlocked('\'');
      fwrite_unlocked(yytext,yyleng,1,stdout);
      putchar_unlocked('\''); }
\n/(.|\n) { fwrite_unlocked(", ",2,1,stdout); }

and under stress that's 2-3x quicker than the paste pipelines, which are themselves at least 5x quicker than everything else.
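For anyone wondering how the first paste pipeline in this answer manages to add the quotes, here is a sketch of its two stages run on two sample names from the question.

```shell
# paste reads one line from each file in turn and joins them with the
# delimiter. /dev/null always yields an empty field, so every stdin line
# comes out as: <empty> ' <line> ' <empty>
printf '%s\n' scales stringr | paste -d\' /dev/null - /dev/null

# The second paste (-s) then joins those quoted lines with commas.
printf '%s\n' scales stringr | paste -d\' /dev/null - /dev/null | paste -sd, -
```

The first command should print each name wrapped in single quotes on its own line, and the second should print `'scales','stringr'`.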

3
  • 1
    (paste -d\ \'\' /dev/null /dev/null - /dev/null | paste -sd, -) <infile | cut -c2- would do comma+space @ pretty much the same speed though as you noted, it's not really flexible if you need some fancy string as separator Commented Jan 18, 2017 at 11:33
  • That flex stuff is pretty damn cool man... this is the first time I see someone posting flex code on this site... big upvote ! Please post more of this stuff. Commented Jan 24, 2017 at 21:39
  • @don_crissti Thanks! I'll look for good opportunities, sed/awk/whatnot are usually better options just for the convenience value but there's often a pretty easy flex answer too.
    – jthill
    Commented Jan 25, 2017 at 22:19
4

I think the following should do just fine, assuming your data is in the file text:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

Let's use arrays which have the substitution down cold:

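#!/bin/bash
# Read the whitespace-separated words of the file into an array.
input=( $(cat text) )
# Build one string with every entry quoted and followed by a comma.
output=$(
for i in "${input[@]}"
        do
        printf "'%s'," "$i"
done
)
# Drop the trailing comma, then expand every comma to comma+space.
output=${output%,}
echo "${output//,/, }"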

The output of the script should be as follows:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I believe this was what you were looking for?

1
  • 2
    Nice solution. But while OP didn't explicitly ask for bash, and while it is safe to assume that someone might use it (after all AFAIK it's the most used shell), it still shouldn't be taken for granted. Also, there are parts where you could do a better job at quoting (putting in double quotes). For example, while the package names are unlikely to have spaces in them, it is still good convention to quote variables rather than not; you might want to run shellcheck.net over it and see the notes and explanations there.
    – phk
    Commented Jan 20, 2017 at 6:36
4

Python

Python one-liner:

$ python -c "import sys; print(','.join([repr(l.strip()) for l in sys.stdin]))" < input.txt                               
'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'

This works in a simple way: we redirect input.txt into stdin using the shell's < operator and iterate over its lines in a list comprehension, with .strip() removing the newlines and repr() creating a quoted representation of each line. The list is then joined into one big string via the .join() method, with , as the separator.

Alternatively we could use + to concatenate quotes to each stripped line.

 python -c "import sys;sq='\'';print(','.join([sq+l.strip()+sq for l in sys.stdin]))" < input.txt

Perl

Essentially the same idea as before: read all lines, strip the trailing newline, enclose each in single quotes, push everything into the array @cvs, and print the array values joined with commas.

$ perl -ne 'chomp; $sq = "\047" ; push @cvs,"$sq$_$sq";END{ print join(",",@cvs)   }'  input.txt                        
 'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'
2
  • IIRC, Python's join should be able to take an iterator, therefore there should be no need to materialize the stdin loop into a list
    – iruvar
    Commented Jan 20, 2017 at 6:37
  • @iruvar Yes, except look at OP's desired output - they want each word quoted, and we need to remove trailing newlines to ensure output is one line. You have an idea how to do that without a list comprehension ? Commented Jan 20, 2017 at 6:44
2

I often have a very similar scenario: I copy a column from Excel and want to convert the content into a comma separated list (for later usage in a SQL query like ... WHERE col_name IN <comma-separated-list-here>).

This is what I have in my .bashrc:

function lbl {
    TMPFILE=$(mktemp)
    cat $1 > $TMPFILE
    dos2unix $TMPFILE
    (echo "("; cat $TMPFILE; echo ")") | tr '\n' ',' | sed -e 's/(,/(/' -e 's/,)/)/' -e 's/),/)/'
    rm $TMPFILE
}

I then run lbl ("line by line") on the cmd line which waits for input, paste the content from the clipboard, press <C-D> and the function returns the input surrounded with (). This looks like so:

$ lbl
1
2
3
dos2unix: converting file /tmp/tmp.OGM6UahLTE to Unix format ...
(1,2,3)

(I don't remember why I put the dos2unix in here, presumably because this often causes trouble in my company's setup.)

1

Some versions of sed act a little differently, but on my Mac, I can handle everything but the "uniq" in sed:

sed -n -e '
# Skip commented library lines
/#/b
# Handle library lines
/library(/{
    # Replace line with just quoted filename and comma
    # Extra quoting is due to command-line use of a quote
    s/library(\([^)]*\))/'\''\1'\'', /
    # Exchange with hold, append new entry, remove the new-line
    x; G; s/\n//
    ${
        # If last line, remove trailing comma, print, quit
        s/, $//; p; b
    }
    # Save into hold
    x
}
${
    # Last line not library
    # Exchange with hold, remove trailing comma, print
    x; s/, $//; p
}
'

Unfortunately to fix the unique part you have to do something like:

grep library Presentation.Rmd | sort -u | sed -n -e '...'

--Paul

1
1

It is funny that, to install a plain-text list of R packages in R, nobody proposed a solution that uses the list directly in R; instead everyone fought with bash, perl, python, awk, sed or whatever to put quotes and commas into the list. This is not necessary at all and, moreover, does not address how to input and use the transformed list in R.

You can simply load the plain text file (say, packages.txt) as a data frame with a single variable, which you can extract as a vector that is directly usable by install.packages. So converting it into a usable R object and installing that list is just:

df <- read.delim("packages.txt", header=F, strip.white=T, stringsAsFactors=F)
install.packages(df$V1)

Or without an external file:

packages <-" 
d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr
"
df <- read.delim(textConnection(packages), 
header=F, strip.white=T, stringsAsFactors=F)
install.packages(df$V1)
