This is my string: file_1234_test.pdf
The task is to capture the filename without its extension, and also the number it contains.
So the result should be:

> Match 1 = file_1234_test.pdf
> Group 1 = file_1234_test
> Group 2 = 1234

I found Stack-58379142 but it does not answer my question.

I tested the following patterns on regex101 and regexstorm.

Step 1. as expected

> (.*)\.pdf
> Match 1 = file_1234_test.pdf
> Group 1 = file_1234_test

Step 2. as expected : greedy '+' quantifier

> (\d+)
> Match 1 = 1234
> Group 1 = 1234

Step 3. still as expected

> ((\d+).*)
> Match 1 = 1234_test.pdf
> Group 1 = 1234_test.pdf
> Group 2 = 1234

Step 4. once again as expected

> ((\d+).*)\.pdf
> Match 1 = 1234_test.pdf
> Group 1 = 1234_test
> Group 2 = 1234

Step 5. '+' quantifier suddenly became lazy

> (.*(\d+).*)\.pdf
> Match 1 = file_1234_test.pdf
> Group 1 = file_1234_test
> Group 2 = 4

Of course, (.*(\d{4}).*)\.pdf or (.*_(\d+).*)\.pdf work:

> Match 1 = file_1234_test.pdf
> Group 1 = file_1234_test
> Group 2 = 1234

But then the pattern is (as I feel it) needlessly narrow and too specific. What if I have a list of hundreds and ...

So, the question: is there a solution?

1 Answer

You could try this regex pattern: (.*?(\d+).*)\.pdf

The ? makes the first part, .*?, match lazily: it stops as soon as the rest of the pattern can match, so the whole run of digits is left for (\d+).
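
As a rough illustration, here is a minimal sketch comparing the greedy pattern from step 5 with the lazy one suggested here. It assumes Python's `re` module, which is not one of the engines used in the question (regex101 and regexstorm), although it treats these two patterns the same way. The variable names are only for illustration.

```python
import re

s = "file_1234_test.pdf"

# Step 5 from the question: the leading .* is greedy.
greedy = re.search(r"(.*(\d+).*)\.pdf", s)
# Suggested fix: the leading .*? is lazy.
lazy = re.search(r"(.*?(\d+).*)\.pdf", s)

# Group 1 is the same in both cases; only group 2 differs.
print(greedy.group(1), greedy.group(2), greedy.span(2))  # file_1234_test 4 (8, 9)
print(lazy.group(1), lazy.group(2), lazy.span(2))        # file_1234_test 1234 (5, 9)
```

The spans show where group 2 starts: the greedy version only hands over the last digit, while the lazy version stops before the first one.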

  • To be more clear: in step 5, + doesn't become lazy; it is always greedy. But since .* (the one on the left) is evaluated first (a pattern is tested from left to right) and is greedy too, it consumes as many characters as possible, leaving \d+ only the single last digit it needs in order to match. Making it lazy solves the problem.
    Commented Mar 4, 2023 at 14:01
  • Another possibility is to replace the dot with a character class that excludes digits.
    Commented Mar 4, 2023 at 14:06
  • @Trung Problem solved, so simple, should have found it myself
    – biburepo
    Commented Mar 4, 2023 at 16:35
  • @Casimir Appreciated your further clarification
    – biburepo
    Commented Mar 4, 2023 at 16:38
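
The character-class alternative mentioned in the comments could look something like the sketch below. It assumes Python's `re` module and uses \D (any non-digit) in place of the first dot; `PATTERN` and `split_name` are hypothetical names, and every file name except the first is made up for the example.

```python
import re

# \D* cannot cross a digit, so (\d+) always receives the whole first run of digits.
PATTERN = re.compile(r"(\D*(\d+).*)\.pdf")

def split_name(filename):
    """Return (stem, number), or None when the pattern does not match."""
    m = PATTERN.search(filename)
    return (m.group(1), m.group(2)) if m else None

# Only the first name comes from the question; the others are invented.
for name in ["file_1234_test.pdf", "report_77.pdf", "no_digits.pdf"]:
    print(name, "->", split_name(name))
```

Like the lazy version, this captures the first run of digits in the name without fixing its length or requiring an underscore in front of it. A name without any digit does not match at all, so the helper returns None for it.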
