I've got a CSS stylesheet on which I want to perform some analysis, and it seemed like a good idea to use regex inside Notepad++. Now I find I can't write the regex. Maybe it wasn't a good idea but, bad idea or not, I want to know how to do it.

I have an automatically-generated set of styles, labelled (mostly) block_1 to block_149. What I want to do first is to extract just the information about what margin settings each style specifies, since this appears to be one of the major differences. Some are plausible, especially the early ones for headings etc., but later ones appear to reflect the complex calculations of the original Word document. You can see both in the samples below:

[Note: I've added 2 spaces at the end of every line in order to get them to display properly here - those spaces do not exist in the original code. However, the original code (imported from Sigil) does have additional spacing at the start of every line - I'm not sure whether this will come out as spaces or as a tab character - I have been trying to use the whitespace indicator to cover all the options.]

.block_8 {  
    background-color: #FFF;  
    display: block;  
    font-family: "Calibri", sans-serif;  
    font-size: 1.125em;  
    font-weight: bold;  
    line-height: 1.2;  
    page-break-after: avoid;  
    text-align: center;  
    padding: 0;  
    margin: 0 2.25pt 0 0  
    }  
.block_9 {  
    border-bottom: 0;  
    border-top: 0;  
    display: block;  
    line-height: 1.2;  
    text-indent: 1.5em;  
    padding: 0;  
    margin: 0.3em 0  
    }  
.block_10 {  
    background-color: #FFF;  
    border-bottom: 0;  
    border-top: 0;  
    display: block;  
    font-family: serif;  
    font-size: 0.75em;  
    line-height: 12.2pt;  
    text-indent: 1.5em;  
    padding: 0;  
    margin: 0.3em 0  
    }  
...   

.block_113 {  
    background-color: #FFF;  
    border-bottom: 0;  
    border-top: 0;  
    display: block;  
    letter-spacing: -0.1pt;  
    line-height: 1.2;  
    text-indent: 1.5em;  
    padding: 0;  
    margin: 0.3em 0 0.3em 16.1pt  
    }  
.block_114 {  
    background-color: #FFF;  
    border-bottom: 0;  
    border-top: 0;  
    display: block;  
    font-family: serif;  
    font-size: 0.75em;  
    text-indent: 1.5em;  
    padding: 0;  
    margin: 0.3em 0.5pt 0.3em 0.7pt  
    }  

There are other differences and even the later ones, just for body text, have different numbers of entries.

What I would like to do is to have a regex which I could use in the first instance to reduce each of these entries just to: Block_(number) margin: (settings)

I had thought of extracting the different margin settings (T,R,B,L), but since the source can include 1,2,3 or 4 settings, sorting out those rules by regex is beyond my ambition. I have been using regex101.com to try extending from very simple recognition using just the margin settings, but managing to include all the (variable number of) extra lines between the block number and the margin settings has stumped me. Ideally I would like to be able to use a similar regex technique to extract other settings later on. I would also like to be able to cope with variable numbers of spaces and/or tabs in the layout.
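
For reference, the 1/2/3/4-value shorthand rule itself is easier to handle outside the regex, once the margin value has been extracted. The following is only a sketch in Python (not something Notepad++ can run); the function name expand_margin is made up for illustration, and it assumes the value is a plain space-separated list with no calc() or similar expressions.

```python
def expand_margin(value):
    """Expand a CSS margin shorthand into (top, right, bottom, left).

    Hypothetical helper: assumes a plain space-separated value such as
    "0.3em 0", optionally ending in a semicolon.
    """
    parts = value.strip().rstrip(";").split()
    if len(parts) == 1:
        # one value applies to all four sides
        top = right = bottom = left = parts[0]
    elif len(parts) == 2:
        # first value is top/bottom, second is right/left
        top = bottom = parts[0]
        right = left = parts[1]
    elif len(parts) == 3:
        # top, right/left, bottom
        top, right, bottom = parts
        left = right
    elif len(parts) == 4:
        top, right, bottom, left = parts
    else:
        raise ValueError(f"expected 1 to 4 values, got {len(parts)}: {value!r}")
    return top, right, bottom, left


print(expand_margin("0.3em 0"))               # ('0.3em', '0', '0.3em', '0')
print(expand_margin("0.3em 0 0.3em 16.1pt"))  # ('0.3em', '0', '0.3em', '16.1pt')
```

Keeping this step separate means the regex only has to find the margin line, not interpret it.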

Can anyone tell me how to do this? It's got to the stage where I can almost certainly do basic cut and paste more quickly, but now I want to know how to do the regex against the time when I may need it for another project.

EtA: I now have code which will do what I asked and now want more! The settings I wanted just happened to be the last ones in the block - suppose I wanted to select the line-height settings and isolate them by a similar process - as an alternative to the margin settings?

  • Without the exact text, it's impossible to tell if it will work for you, but it should be something like (\..*)\{(.*)(margin:.*)\} for the find, and \1\3 for the replace expression. Since your attributes are line-delimited, you have to check the box for ". matches newline".
    – Frank Thomas
    Commented Jul 7, 2016 at 13:54
  • @Frank Thomas Thanks, that appears to do the trick. The pointer to the ". matches newline" box was especially helpful. Interestingly, when I ran just one single Replace operation it extracted the info from the last block first, and so on up the text. Is this due to "greedy" quantifiers? I have been encouraged by your help and now hope to extend my analysis - see the addition in the main note.
    – deeplyblue
    Commented Jul 7, 2016 at 21:28
  • Yes, Frank's comment (which should be an answer) uses the very greedy ".*", which means it will match EVERYTHING up to the end of the document. Therefore the first match will be everything up to the last block, AND it will be the LAST and ONLY match. The reason you find all blocks, in reverse order, is presumably because you have selected the "Wrap around" option. There is a more efficient way to do this search, which (I hate to hijack, but) I'll provide as an answer.
    – jehad
    Commented Jul 14, 2016 at 1:51
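
The greedy behaviour discussed in these comments can be reproduced outside Notepad++. This is a minimal sketch in Python, which is an assumption made for illustration: Notepad++ uses the Boost regex engine, but greedy and lazy quantifiers should behave the same way in this case. The one-line css string is a made-up stand-in for the real stylesheet, and re.DOTALL plays the role of the ". matches newline" box.

```python
import re

# Made-up two-block stand-in for the stylesheet (not the real data).
css = ".a {margin: 1} .b {margin: 2}"

# The pattern from the first comment, with "." allowed to match newlines.
greedy = re.compile(r"(\..*)\{(.*)(margin:.*)\}", re.DOTALL)

# The first group swallows everything up to the LAST "{", so a single
# pass rewrites only the last block and leaves the earlier ones alone.
first_pass = greedy.sub(r"\1\3", css)
print(first_pass)   # .a {margin: 1} .b margin: 2

# Running the same replacement again reaches the block before it.
second_pass = greedy.sub(r"\1\3", first_pass)
print(second_pass)  # .a margin: 1 .b margin: 2

# A lazy quantifier stops at the first "}" instead, one block per match.
print(re.findall(r"\{.*?\}", css))  # ['{margin: 1}', '{margin: 2}']
```

On this reading, each Replace pass collapses only the last remaining block, which would account for results appearing from the bottom of the file upwards.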

1 Answer

Answer

Go to the "Replace" dialog of Notepad++ (Ctrl+H, or menu Search -> Replace...), and select the following options:

  • Under Search Mode, select the radio button "Regular expression".
  • Under Search Mode, tick the checkbox ". matches newline".

Use the following for "Find":

(\.\w*)[[:blank:]]*\{.*?(margin:[\w[:blank:][:punct:]]*).*?\}

And for "Replace with", use something like this (only the $1 and $2 are important):

$1 : $2

Explanation

Breaking the Find string into its components, from left-to-right, we have:

  • (\.\w*) : First, we need to find the name of the block. So, start with a literal "." (\.), followed by some alphanumeric/underscore characters (\w*). Placing these in parentheses makes them into a group, in this case the first group, $1.

  • [[:blank:]]*\{.*? : After the block's name, there may be some spaces ([[:blank:]]*) followed by an opening curly bracket (\{ - escaped with "\" because the brackets have special meaning in regex). Finally, we match ANYTHING (.*), including new lines, but as few as possible (hence ?), to get everything inside the block up to the next part (i.e. "margin"). Note, no part of this is grouped, because we're effectively throwing it away.

  • (margin:[\w[:blank:][:punct:]]*) : The next part of interest is the "margin" and its value. Therefore, this is grouped, and will become $2. First, we match the literal margin:, then its value, which will be a string of alphanumerics/underscores, punctuation and spaces or tabs (but no line-break characters). The reason for the more complex [\w[:blank:][:punct:]]*, as opposed to something like .*?, is that with ". matches newline" enabled a . would match any character, including line breaks and anything that may exist after the margin line, up to the closing curly bracket.

  • .*?\} : Finally, we match everything remaining in the block (in this case, it's just the new line at the end of the "margin" line) and the closing curly bracket. Again, to be discarded.
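
To try the same Find/Replace outside Notepad++, here is a rough Python sketch. It is an approximation, not the Boost syntax above: Python's re module has no POSIX bracket classes, so [ \t] stands in for [[:blank:]] and [^\r\n] (anything but a line break) stands in for [\w[:blank:][:punct:]]. The names BLOCK_MARGIN, SAMPLE_CSS and summarize_margins are made up for illustration, and the sample is abridged from the question.

```python
import re

# Approximate translation of the Find expression:
#   [ \t]    replaces [[:blank:]]
#   [^\r\n]  replaces [\w[:blank:][:punct:]]
# re.DOTALL corresponds to the ". matches newline" checkbox.
BLOCK_MARGIN = re.compile(
    r"(\.\w*)[ \t]*\{.*?(margin:[^\r\n]*).*?\}",
    re.DOTALL,
)

# Abridged from the blocks in the question.
SAMPLE_CSS = """\
.block_9 {
    border-bottom: 0;
    line-height: 1.2;
    padding: 0;
    margin: 0.3em 0
    }
.block_113 {
    letter-spacing: -0.1pt;
    line-height: 1.2;
    padding: 0;
    margin: 0.3em 0 0.3em 16.1pt
    }
"""


def summarize_margins(css):
    # r"\1 : \2" is Python's spelling of the "$1 : $2" replacement.
    return BLOCK_MARGIN.sub(r"\1 : \2", css)


print(summarize_margins(SAMPLE_CSS))
```

For this sample, each block is reduced to a single line of the form ".block_9 : margin: 0.3em 0". Note that a block with no margin line at all would let the lazy .*? run on into the following block, so this sketch assumes every block has one.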

  • Thank you, that was very clear. I can also see how it could be adapted to extract the values for e.g. "line-height" where the final group, which is not captured because it's to be discarded, would be much longer, but still covered by the same code. I know, however, that some margin values can be set using the % symbol - would that be covered by \w? And if I were using a variant on your code to get the "background-color: #FFF;" setting, would the # symbol be a problem?
    – deeplyblue
    Commented Jul 17, 2016 at 0:19
  • Yes, with a small modification, the regex I provided would work for the other lines you may want to capture. E.g. (\.\w*)[[:blank:]]*\{.*?(line-height:[\w[:blank:][:punct:]]*).*?\} will work for "line-height"; all I did was change the word "margin" to "line-height". The \w does not capture # and % signs; it's [:punct:] that captures them. Therefore, the regex as it is will be just fine for values that use hex (# prefixed) colours and percentages (%).
    – jehad
    Commented Jul 17, 2016 at 10:19
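
The "swap the property name" idea from the comment above can be sketched as a small Python helper. As before, this is an approximation rather than the Notepad++ expression: [ \t] and [^\r\n] stand in for the POSIX classes, and the names property_pattern, extract and SAMPLE are hypothetical. The 5% margin value is invented here purely to exercise the percent sign; it does not come from the question.

```python
import re


def property_pattern(name):
    """Build the block/property regex for an arbitrary CSS property name."""
    # re.escape protects any regex metacharacters in the property name.
    return re.compile(
        r"(\.\w*)[ \t]*\{.*?(" + re.escape(name) + r":[^\r\n]*).*?\}",
        re.DOTALL,  # same role as ". matches newline"
    )


def extract(css, name):
    """Return (block name, 'property: value') pairs for each block."""
    return property_pattern(name).findall(css)


# Based on .block_10 from the question; the "5%" is an invented value.
SAMPLE = """\
.block_10 {
    background-color: #FFF;
    line-height: 12.2pt;
    margin: 0.3em 5%
    }
"""

print(extract(SAMPLE, "background-color"))  # [('.block_10', 'background-color: #FFF;')]
print(extract(SAMPLE, "line-height"))       # [('.block_10', 'line-height: 12.2pt;')]
print(extract(SAMPLE, "margin"))            # [('.block_10', 'margin: 0.3em 5%')]
```

The trailing semicolon is kept in the captured value, just as [:punct:] would keep it in the original expression.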
