I need to get thousands of snippets of text from PDFs to a spreadsheet. They are short, seldom more than 2-3 rows, but each line break creates a new cell, and I have to repair that manually, which costs lots of time.

Because I have so many of them, using the "paste into Word and do a find-and-replace" workaround is just too time-wasting for me. Is there a way to have the line break disappear on copy? Maybe there is a viewer which offers a special copy mode for this, or has a plugin?

The documents are scientific articles. The text arrangement is quite linear. You can assume that the text I'm copying is not inside a table or a float, and not rotated or anything. (If such a thing happens, I think I'll deal with it manually). The text is frequently set in two columns, but I have no trouble marking just the text I need from its column. I don't need to preserve any special formatting. I'm willing to try a solution which removes all unprintable characters, for example. The texts are in English, it is OK if the solution only works in ASCII/strips all non-alphanumeric ASCII of the copied text.

I have a very strong preference for a solution which will work on Linux, possibly some kind of Okular plugin. But if there happens to be a Windows-only solution, I want to hear about it too. I have a license for a somewhat recent Acrobat Pro on the Windows machine.

  Did you try with foxit reader?
    – Kasun
    Commented Aug 13, 2014 at 7:58
  pdftotext is generally the best, but you'll still need some post-processing. See linuxquestions.org/questions/programming-9/…
    – Nemo
    Commented Apr 24, 2015 at 9:49
  @Kasun FoxitReader or whatever reader one uses is irrelevant: the pdf file is the one that introduces the linebreaks.

I had a similar problem while I was working on a text to speech script a while ago. My script would try to break up the text input into chunks by looking for newlines. With PDF files this would result in a mess because of the way each line ends with a newline.

So what I did was compose a few sed and tr commands that treat only newlines preceded by a full stop as actual line breaks. It wasn't very pretty but it worked.

Using this snippet I wrote a small script for you that I hope will help:


#!/bin/bash

# title: copy_without_linebreaks
# author: Glutanimate (github.com/glutanimate)
# license: MIT license

# Parses currently selected text and removes
# newlines that aren't preceded by a full stop

# Read the currently highlighted text (X11 primary selection)
SelectedText="$(xsel -o)"

ModifiedText="$(echo "$SelectedText" | \
    sed 's/\.$/.|/g' | sed 's/^\s*$/|/g' | tr '\n' ' ' | tr '|' '\n')"

#   - first sed command: replace end-of-line full stops with '|' delimiter and keep original periods.
#   - second sed command: replace empty lines with same delimiter (e.g.
#     to separate text headings from text)
#   - subsequent tr commands: remove existing newlines; replace delimiter with
#     newlines
# This is less than elegant but it works.

echo "$ModifiedText" | xsel -bi

The script uses xsel to parse the currently highlighted text and then modifies it with the sed and tr command-line I mentioned above. The processed text is then passed back to the clipboard via xsel -bi.

Here's how you can use the script in your scenario:

  1. Make sure you have xsel installed (sudo apt-get install xsel on (K)Ubuntu)
  2. save the script as copy_without_linebreaks or something similar and make it executable
  3. assign the script to a hotkey of your choice in your WM preferences
  4. highlight some text and press the hotkey
  5. The clipboard should automatically be filled with the modified text
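
As a rough illustration of what the sed and tr pipeline does, the sketch below wraps it in a shell function that reads from standard input, so it can be tried without xsel or a hotkey. The function name join_wrapped_lines and the final trimming step are additions for this example and are not part of the script above. Like the original, it assumes the text contains no '|' characters.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): joins wrapped lines
# read from stdin, keeping a break only after a line that ends in a full stop
# or at an empty line.
join_wrapped_lines() {
    sed 's/\.$/.|/' |               # mark end-of-line full stops with '|'
        sed 's/^[[:space:]]*$/|/' | # mark empty lines with '|' as well
        tr '\n' ' ' |               # turn every newline into a space
        tr '|' '\n' |               # turn the markers back into newlines
        sed 's/^ *//; s/ *$//'      # added step: trim spaces left at line edges
}

# Small demonstration on two wrapped sentences
printf 'This is a\nwrapped sentence.\nSecond sentence\ncontinues here.\n' | join_wrapped_lines
```

With xsel installed, the same function could be used as `xsel -o | join_wrapped_lines | xsel -bi`.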

This has been bugging me for years, so I figured out a general (Windows) solution using Autohotkey. Autohotkey is a lightweight, free, open-source scripting software for Windows to create hotkeys for almost anything imaginable.

When Ctrl+c is hit, the code only fires if the active window is a PDF reader, otherwise it simply copies the given selection as usual. In case of a PDF reader, it copies the selection, removes linebreaks and double spaces and puts result into the clipboard. If nothing is selected, the clipboard is practically untouched.

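#IfWinActive ahk_class classFoxitReader

^c::
    ; back up the current clipboard, then copy the selection
    old := ClipboardAll
    clipboard := ""
    send ^c
    clipwait 0.1
    ; nothing was selected: restore the backup
    if clipboard =
        clipboard := old
    else {
        ; replace line breaks with spaces, then collapse double spaces
        tmp := RegExReplace(clipboard, "(\S.*?)\R(.*?\S)", "$1 $2")
        clipboard := tmp
        StringReplace clipboard, clipboard, % "  ", % " ", A
        clipwait 0.1
    }
    old := ""
    tmp := ""
return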

The only thing to do before using this code is to find the window class name (ahk_class) of your reader. I use a single PDF reader for all cases (and I assume most people do that), FoxitReader, and its ahk_class is classFoxitReader. You can easily find the class for your own software with the WinGetClass command (e.g. AcrobatSDIWindow for Acrobat Reader).

If you prefer to read PDFs in your browser, this solution is not for you. Alternatively, you could simply remove the #IfWinActive ahk_class classFoxitReader line so that the code always fires, but then every copied selection, in any program, will be stripped of line breaks and double spaces.
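
For readers on Linux, the same two clean-up steps (replace line breaks with spaces, then collapse double spaces) could be sketched as a small shell filter. This is only an approximation of the AutoHotkey logic, not a port of it; the function name flatten_selection is made up for this example, and the carriage-return handling assumes the text may carry Windows-style line endings.

```shell
#!/bin/sh
# Hypothetical helper: flattens text read from stdin into a single line.
flatten_selection() {
    tr '\r\n' '  ' |           # turn carriage returns and newlines into spaces
        tr -s ' ' |            # squeeze runs of spaces into a single space
        sed 's/^ //; s/ $//'   # drop the single space left at either end
}

# Small demonstration with Windows-style line endings and a double space
printf 'first line\r\nsecond  line\r\n' | flatten_selection
```

Unlike the sentence-aware script in the earlier answer, this removes every line break, which matches the behaviour described for the AutoHotkey script.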

  This used to work for me before, but now it just seems to break Ctrl + C entirely. Windows 10.
    – mic
    Commented Feb 3, 2018 at 19:06
  @MiCl It still works at my end. What machine/OS/PDF reader do you use? Did you change anything? Like updating your reader? On the other hand, who knows what was updated by Win 10...
  For Sumatra PDF users, "classFoxitReader" in the above script should be replaced with "SUMATRA_PDF_FRAME".

Another thing that worked out for me was saving the pdf file as html. Paragraphs in the html stay intact, ready for copy&paste. Other file formats work as well, such as txt or rtf... This should also work on Linux systems.

  How do you save a PDF file as HTML?
    – Simon E.
    Commented Jan 2, 2020 at 3:27
  @Quasimodo, does someone know why html conveniently arranges those files?

A third approach using macros is shown here, but I haven't tried it. I pasted the macros here for future reference, macro 2 is by the author of the source - "Deborah Savadra" - and macro 1 by her reader "Benjamin":

macro 1:

Sub pagebreaks()
' pagebreaks Macro
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = "¬ ¬"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "¬"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

macro 2:

Sub pagebreaks()
' pagebreaks Macro
    With Selection.Find
        .Text = "^p^p"
        .Replacement.Text = "|"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "^p"
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
    With Selection.Find
        .Text = "|"
        .Replacement.Text = "^p^p"
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = False
        .MatchSoundsLike = False
        .MatchAllWordForms = False
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub

There is a Windows solution shown here. One has to download the file "PDF Copy-Paster.exe" and run it before copying and pasting. I tried it and it works just fine, except that it removes all line breaks, so if you copy multiple paragraphs you end up with only one.

There is a related question on SU with a bit of explanation; it may be of interest to someone...

  consider splitting your three approaches into three answers. It will be easier to vote them individually that way. (and, welcome to Superuser :-))
    – nik
    Commented Aug 13, 2014 at 9:34
  ok, I'll do that. (and thank you for the welcome)
    – Quasimodo
    Commented Aug 13, 2014 at 9:40
  Doesn't seem to remove line breaks for me, copying from Foxit Reader on Windows 10
    – mic
    Commented Mar 5, 2017 at 0:14

Actual Question : https://askubuntu.com/questions/1167026/detect-clipboard-copy-paste-event-and-modify-clipboard-contents

Credit goes to Kenn.

Based on Glutanimate's script.

Source : https://github.com/SidMan2001/Scripts/tree/master/PDF-Copy-without-Linebreaks-Linux

Remove Line Breaks when copying text from PDF (Linux) :

This bash script removes line breaks when copying text from PDF. It works for both the primary selection and the clipboard on Linux.


#!/bin/bash

# title: copy_without_linebreaks
# author: Glutanimate (github.com/glutanimate)
# modifier: Siddharth (github.com/SidMan2001)
# license: MIT license

# Parses currently selected text and removes
# newlines

# clipnotify blocks until the selection or clipboard changes
while ./clipnotify;
do
  SelectedText="$(xsel -o)"
  CopiedText="$(xsel -b)"
  # leave the contents alone when they contain file:/// URIs
  if [[ $SelectedText != *"file:///"* ]]; then
    ModifiedTextPrimary="$(echo "$SelectedText" | tr -s '\n' ' ')"
    echo -n "$ModifiedTextPrimary" | xsel -i
  fi
  if [[ $CopiedText != *"file:///"* ]]; then
    ModifiedTextClipboard="$(echo "$CopiedText" | tr -s '\n' ' '  )"
    echo -n "$ModifiedTextClipboard" | xsel -bi
  fi
done

Dependencies :

  1. xsel
    sudo apt-get install xsel
  2. clipnotify (https://github.com/cdown/clipnotify)
    You can use the pre-compiled clipnotify provided in the repository or compile yourself.

To compile clipnotify yourself :
sudo apt install git build-essential libx11-dev libxtst-dev
git clone https://github.com/cdown/clipnotify.git
cd clipnotify
sudo make

To USE :

  1. Download this repository as zip or copy and paste the script in a text editor and save it as copy_without_linebreaks.sh.
  2. Make sure that the script and clipnotify (precompiled or compiled yourself) are in the same folder.
  3. Open a terminal in the script's folder and set the permission
    chmod +x "copy_without_linebreaks.sh"
  4. Double click the script or run it by entering in the terminal:
    ./copy_without_linebreaks.sh
  5. Copy text in the PDF and paste it anywhere. Line breaks will be removed.
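
The check for "file:///" in the script is presumably there so that file lists copied from a file manager are not mangled; the answer itself does not say. A minimal sketch of that guard, written as a function that can be tried without xsel or clipnotify, might look like this. The function name flatten_unless_file_uri is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints its argument unchanged when it contains a
# file:/// URI, otherwise with newlines replaced by single spaces.
flatten_unless_file_uri() {
    case "$1" in
        *file:///*) printf '%s' "$1" ;;                # leave file lists untouched
        *) printf '%s' "$1" | tr -s '\n' ' ' ;;        # join lines, squeeze spaces
    esac
}

# Small demonstration on a two-line selection
flatten_unless_file_uri 'two
lines'
```

Keeping the guard and the clean-up in one function makes it easy to check the behaviour on sample text before wiring it to the clipboard.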

I know this is an old question, however I felt it would be useful to answer it because no other solution was as easy to use as this one.

Open your PDF file in Okular, a document viewer for Linux. Then choose Tools -> Table Selection Tool and select your text as if it were a table. Press Ctrl+C and you are ready to go.

  This works very well by pasting unformatted into LibreOffice (ctrl + shift + V) so it doesn't create a table. This answer should make it closer to the top, as it is more relevant to the question than other answers (i.e. a simple Linux + Okular solution).
    – stragu
    Commented Jul 18, 2017 at 5:31
  Just tried this and I still had the line endings when I pasted special and selected unformatted text. Maybe things have changed. Okular is version 0.24.2 LibreOffice is version

If you have Acrobat, click your cursor so the cursor is blinking in the text. (It won't work if you don't do that.) Go to Advanced, Accessibility, Add tags. It will take a few minutes if you have a large document, but much faster than manually removing breaks. Voila!


Easy solution from this page: http://www.iom3.org/news/how-instantly-remove-unwanted-line-breaks-when-copying-pdf

  1. copy the text you want from the PDF
  2. paste into a new Word document
  3. click “edit” then “replace”
  4. make sure you’re in the “find what” field
  5. click “more” then “special”
  6. select “paragraph mark” (top of the list)
  7. click into the “replace with” field
  8. press the space bar once
  9. click “replace all”
  10. click “ok” then close the “find & replace” box.

Slightly faffy but once you get the shortcuts under your fingers it's much quicker

  Copy and paste is not reliable, that's the entire point of the question. If one wants to clean up by search and replace, they would first convert to text with pdftotext and then use any text editor they like (with standard regex).
    – Nemo
    Commented Apr 24, 2015 at 9:55
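
The pdftotext route mentioned in the comment could be sketched as follows, under the assumption that paragraphs in the extracted text are separated by blank lines (which is not guaranteed for every PDF). The function name unwrap_paragraphs and the file name article.pdf are placeholders for this example.

```shell
#!/bin/sh
# Hypothetical helper: reads text from stdin and prints each paragraph
# (a block of lines separated by blank lines) on a single line.
unwrap_paragraphs() {
    # RS = "" puts awk in paragraph mode; assigning $1 rebuilds the record
    # with single spaces between all whitespace-separated words.
    awk 'BEGIN { RS = "" } { $1 = $1; print }'
}

# Small demonstration on two wrapped paragraphs
printf 'Line one\nline two.\n\nSecond para\ncontinues.\n' | unwrap_paragraphs
```

With poppler's pdftotext installed, the whole document could be processed with `pdftotext article.pdf - | unwrap_paragraphs`. Words hyphenated across line ends are not rejoined by this sketch.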

