This venerable blog post from 2009, which introduced proxy functions (wrapper functions), explains that steppable pipelines are required to implement them; the following quote suggests (but doesn't explicitly state) that they may have been created for that very purpose:
In particular, what you want to have happen is to be able to control the execution of the calling command – to control when it’s BEGINPROCESS(), PROCESSRECORD(), ENDPROCESS(), etc methods are called
Simply put, proxy functions, via steppable pipelines, allow you to implement a cmdlet (advanced function) by delegating most of the implementation to another cmdlet in a memory-efficient, streaming manner.
Specifically, a steppable pipeline allows you to delegate the implementation of your proxy function to a script block whose life cycle is kept in sync with the proxy function itself, in terms of initialization (begin block), per-object pipeline input processing (process block), and termination (end block), which means that a single instantiation of the wrapped cmdlet is in effect directly connected to the same pipeline as the proxy function itself.
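To make this life-cycle synchronization concrete, here is a minimal sketch that drives a steppable pipeline by hand, outside of any function. The $log list and the choice of ForEach-Object as the wrapped command are illustrative assumptions, not part of the original answer; they merely record when each block of the wrapped command runs.

```powershell
# Records the order in which the wrapped command's blocks run (illustrative helper).
$log = [System.Collections.Generic.List[string]]::new()

# The script block must consist of a single pipeline.
$sp = {
    ForEach-Object -Begin { $log.Add('begin') } -Process { $log.Add("process $_") } -End { $log.Add('end') }
}.GetSteppablePipeline()

$sp.Begin($true)                # $true: the pipeline should expect input; runs the begin block
$log.Add('-- between Begin() and Process()')
$null = $sp.Process(1)          # runs the process block for one input object
$null = $sp.Process(2)
$null = $sp.End()               # runs the end block

$log   # begin, -- between Begin() and Process(), process 1, process 2, end
```

Inside a proxy function, the same three calls are simply placed in the function's own begin, process, and end blocks.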
Conversely, this means: you don't strictly need a proxy function to write a wrapper function in the following scenarios:
- If your wrapper function doesn't need to support pipeline input.
- If you don't mind collecting all pipeline input first, before passing it all to the wrapped cmdlet at once, in your wrapper function's end block, which means that you're forgoing streaming processing.
  - While you may also get streaming processing if you call the wrapped cmdlet for each input object in your process block, doing so:
    - is inefficient (a full invocation of the wrapped cmdlet in every iteration, in a nested pipeline)
    - doesn't work for cmdlets that need to operate on all input as a whole, such as Format-* cmdlets or aggregating cmdlets such as Sort-Object and Group-Object
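The last point can be sketched with Sort-Object as the wrapped cmdlet. The two function names below are hypothetical and chosen only for illustration: the first calls Sort-Object once per input object in its process block, the second connects a single Sort-Object instance via a steppable pipeline.

```powershell
# Hypothetical wrapper: invokes Sort-Object separately for each input object.
function Sort-PerObject {
    [CmdletBinding()]
    param([Parameter(ValueFromPipeline)] $InputObject)
    process { $InputObject | Sort-Object }   # each call only ever sees one object
}

# Hypothetical wrapper: a single Sort-Object instance sees all input.
function Sort-ViaSteppable {
    [CmdletBinding()]
    param([Parameter(ValueFromPipeline)] $InputObject)
    begin {
        $sp = { Sort-Object }.GetSteppablePipeline($MyInvocation.CommandOrigin)
        $sp.Begin($PSCmdlet)
    }
    process { $sp.Process($InputObject) }
    end     { $sp.End() }
}

3, 1, 2 | Sort-PerObject      # 3, 1, 2 (input order; nothing was sorted as a whole)
3, 1, 2 | Sort-ViaSteppable   # 1, 2, 3
```

Collecting all input and calling Sort-Object once in the end block would also produce the sorted result; the steppable pipeline achieves it without the wrapper itself having to buffer the input.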
To illustrate the tradeoffs, the following are three different implementations of a wrapper function around Select-String that reports only the matching part of each matching line, as a string:
- Select-MatchProxy is a proper proxy function, i.e. it calls Select-String via a steppable pipeline, which amounts to streaming processing that involves only a single instantiation of Select-String.
- Select-MatchSimple calls a new Select-String instance in each process block invocation, which also amounts to streaming processing, but performs poorly; as noted above, this implementation approach isn't always feasible, depending on what cmdlet is being wrapped.
- Select-MatchCollect collects all pipeline input up front, and then passes it to Select-String in the end block, which forgoes streaming processing and is memory-intensive; however, in terms of runtime it actually performs slightly better than the proxy function.
function Select-MatchProxy {
  [CmdletBinding(PositionalBinding=$false)]
  param(
    [Parameter(Mandatory, ValueFromPipeline)]
    $InputObject,
    [Parameter(Mandatory, Position=0)]
    [string] $Pattern
  )
  begin {
    # Create a steppable pipeline around a single Select-String instance.
    $steppablePipeline = {
      Select-String -Pattern $Pattern | ForEach-Object { $_.Matches.Value }
    }.GetSteppablePipeline($myInvocation.CommandOrigin)
    # Passing $PSCmdlet connects the steppable pipeline to this function's own pipeline.
    $steppablePipeline.Begin($PSCmdlet)
  }
  process {
    # Relay each input object as it arrives (streaming).
    $steppablePipeline.Process($InputObject)
  }
  end {
    $steppablePipeline.End()
  }
}

function Select-MatchSimple {
  [CmdletBinding(PositionalBinding=$false)]
  param(
    [Parameter(Mandatory, ValueFromPipeline)]
    $InputObject,
    [Parameter(Mandatory, Position=0)]
    [string] $Pattern
  )
  process {
    # A full, separate Select-String invocation for every input object.
    Select-String -InputObject $InputObject -Pattern $Pattern |
      ForEach-Object {
        $_.Matches.Value
      }
  }
}

function Select-MatchCollect {
  [CmdletBinding(PositionalBinding=$false)]
  param(
    [Parameter(Mandatory, ValueFromPipeline)]
    $InputObject,
    [Parameter(Mandatory, Position=0)]
    [string] $Pattern
  )
  begin {
    $l = [System.Collections.Generic.List[object]]::new()
  }
  process {
    # Buffer all input; nothing is output until the end block runs.
    $l.Add($InputObject)
  }
  end {
    $l | Select-String -Pattern $Pattern | ForEach-Object { $_.Matches.Value }
  }
}
To compare runtimes, you can use the following code:
# Sample input array of 100,000 strings.
$array = ('foo', 'bar') * 50000

# Time 15 runs of each function, and report the average.
Time-Command -Count 15 { $array | Select-MatchProxy 'o+' },
                       { $array | Select-MatchSimple 'o+' },
                       { $array | Select-MatchCollect 'o+' }
Sample timings from a macOS 12.4 M1 Mac running PowerShell Core 7.3.0-preview.6, which give a sense of relative performance:
Factor Secs (15-run avg.) Command                           TimeSpan
------ ------------------ -------                           --------
1.00   0.916              $array | Select-MatchCollect 'o+' 00:00:00.9162298
1.12   1.025              $array | Select-MatchProxy 'o+'   00:00:01.0254835
5.38   4.930              $array | Select-MatchSimple 'o+'  00:00:04.9298495
The above uses the Time-Command function from this Gist.
Assuming you have looked at the linked Gist's source code to ensure that it is safe (which I can personally assure you of, but you should always check), you can install it directly as follows:
irm https://gist.github.com/mklement0/9e1f13978620b09ab2d15da5535d1b27/raw/Time-Command.ps1 | iex
$someStuff | A-Command will automatically call Begin() once, then Process() for each input item, then End(). With SteppablePipeline you get direct control over this flow; if you don't need that, then you don't need one.
If you call $_ | SomeCommand inside your command, you create a nested pipeline. With a steppable pipeline you can actually chain SomeCommand into the pipeline that your command is part of. This can be a performance improvement (e.g. when SomeCommand does expensive begin and end processing). In some cases you can provide correct results only by using a steppable pipeline (e.g. when SomeCommand is one of the Format-* cmdlets, which need to see the entire input).
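The cost of the nested pipeline can be made visible with a rough sketch that counts how often the wrapped command's begin block runs. All three function names and the $script:beginCount counter are hypothetical, introduced here only for illustration; the counter stands in for expensive setup work.

```powershell
# Hypothetical command whose begin block stands in for expensive initialization.
$script:beginCount = 0
function Invoke-Expensive {
    [CmdletBinding()]
    param([Parameter(ValueFromPipeline)] $InputObject)
    begin   { $script:beginCount++ }
    process { "item: $InputObject" }
}

# Nested pipeline: a new Invoke-Expensive invocation for every input object.
function Invoke-Nested {
    [CmdletBinding()]
    param([Parameter(ValueFromPipeline)] $InputObject)
    process { $InputObject | Invoke-Expensive }
}

# Steppable pipeline: one Invoke-Expensive instance chained into the caller's pipeline.
function Invoke-Chained {
    [CmdletBinding()]
    param([Parameter(ValueFromPipeline)] $InputObject)
    begin {
        $sp = { Invoke-Expensive }.GetSteppablePipeline($MyInvocation.CommandOrigin)
        $sp.Begin($PSCmdlet)
    }
    process { $sp.Process($InputObject) }
    end     { $sp.End() }
}

$script:beginCount = 0
$null = 1..3 | Invoke-Nested
"Nested:  begin ran $script:beginCount time(s)"   # 3

$script:beginCount = 0
$null = 1..3 | Invoke-Chained
"Chained: begin ran $script:beginCount time(s)"   # 1
```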