Indexing and searching
with Azure Functions and Search
Maarten Balliauw
“Find this type on”
In ReSharper and Rider
Search for namespaces
& types that are not yet referenced
“Find this type on”
Idea in 2013, introduced in ReSharper 9
(2015 -
Consists of
ReSharper functionality
A service that indexes packages and powers search
Azure Cloud Service (Web and Worker role)
Indexer uses NuGet OData feed, queried with $select=Id,Version,NormalizedVersion,LastEdited,Published&$orderby=LastEdited%20desc&$filter=LastEdited%20gt%20datetime%272012-01-01%27

NuGet over time...
Repo-signing announced August 10, 2018
Big chunk of packages signed
over holidays 2018/2019
Re-download all metadata & binaries
Very slow over OData
Is there a better way?
NuGet server-side API
NuGet talks to a repository
Can be on disk/network share or remote over HTTP(S)
V2 – OData based (used by pretty much all NuGet servers out there)
V3 – JSON based (, TeamCity, MyGet, Azure DevOps, GitHub repos)

V2 Protocol
Started as “OData-to-LINQ-to-Entities” (V1 protocol)
Optimizations added to reduce # of random DB queries (VS2013+ & NuGet 2.x)
Search – Package manager list/search
FindPackagesById – Package restore (Does it exist? Where to download?)
GetUpdates – Package manager updates (code in
V3 Protocol
JSON based
A “resource provider” of various endpoints per purpose
Catalog ( only) – append-only event log
Registrations – materialization of newest state of a package
Flat container – .NET Core package restore (and VS autocompletion)
Report abuse URL template
… (code in
How does work?
User uploads to
Data added to database
Data added to catalog (append-only data stream)
Various jobs run over catalog using a cursor
Registrations (last state of a package/version), reference catalog entry
Flatcontainer (fast restores)
Search index (search, autocomplete, NuGet Gallery search)
Catalog seems interesting!
Append-only stream of mutations on
Updates (add/update) and Deletes
Can continue where left off (uses a timestamp cursor)
Can restore to a given point in time
+ Page
+ Leaf
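The timestamp cursor described above can be sketched in C# roughly as follows. This is a minimal illustration, not the code used in the talk: it assumes the catalog JSON lists its children under `items`, each with an `@id` and a `commitTimeStamp`, and it works on an in-memory JSON string instead of calling the real service. The `CatalogCursor` class and `ItemsNewerThan` method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.Json;

public static class CatalogCursor
{
    // Returns the @id of every item committed after the cursor.
    // The same filter applies to the root (items are pages)
    // and to a page (items are leaves).
    public static List<string> ItemsNewerThan(string json, DateTimeOffset cursor)
    {
        var result = new List<string>();
        using var document = JsonDocument.Parse(json);
        foreach (var item in document.RootElement.GetProperty("items").EnumerateArray())
        {
            var commit = DateTimeOffset.Parse(
                item.GetProperty("commitTimeStamp").GetString()!,
                CultureInfo.InvariantCulture);

            // Anything at or before the cursor was already processed.
            if (commit > cursor)
            {
                result.Add(item.GetProperty("@id").GetString()!);
            }
        }
        return result;
    }
}
```

After a run, storing the highest commit timestamp seen as the new cursor lets the next run continue where this one left off.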

“Find this type on”
Refactor from using OData to using V3?
Mostly done, one thing missing: download counts (using search now)
Build a new version?
Welcome to this talk 
Building a new version
What do we need?
Watch the catalog for package changes
For every package change
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)

What do we need?
Watch the catalog for package changes periodic check
For every package change based on a queue
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions always up, flexible scale
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
Sounds like functions?
Sounds like functions!
[Architecture diagram: catalog, Watch catalog, Index command, Download command, Index package, Download package, Raw .nupkg, Index as JSON, Search index, Find type API, Find namespace API]
Collecting from catalog

Functions best practices
Each function should do only one thing
Easier error handling & scaling
Learn to use messages and queues
Asynchronous means of communicating, helps scale and avoid direct coupling
Collecting from catalog
(better version)
Help a function do only one thing
Trigger, provide input/output
Function code bridges those
Build your own!*
SQL Server binding
Dropbox binding
NuGet Catalog
*Custom triggers not officially supported (yet?)
Binding            Trigger  Input  Output
Timer              ✔
HTTP               ✔               ✔
Blob               ✔        ✔      ✔
Queue              ✔               ✔
Table                       ✔      ✔
Service Bus        ✔               ✔
EventHub           ✔               ✔
EventGrid          ✔
CosmosDB           ✔        ✔      ✔
IoT Hub            ✔
SendGrid, Twilio                   ✔
... ✔
Creating a trigger

We’re making progress!
Downloading packages
Next up: indexing
Opening up the .nupkg and reflecting on assemblies
Does not load the assembly being reflected into application process
Provides access to Portable Executable (PE) metadata in assembly
Store relation between package id+version and namespace+type
Azure Search? A database? Redis? Other?

System.Reflection.Metadata Free decompiler
using (var portableExecutableReader = new PEReader(assemblySeekableStream))
{
    var metadataReader = portableExecutableReader.GetMetadataReader();
    foreach (var typeDefinition in metadataReader.TypeDefinitions.Select(metadataReader.GetTypeDefinition))
    {
        // Only public types are indexed
        if (!typeDefinition.Attributes.HasFlag(TypeAttributes.Public)) continue;
        var typeNamespace = metadataReader.GetString(typeDefinition.Namespace);
        var typeName = metadataReader.GetString(typeDefinition.Name);
        // Skip compiler-generated types
        if (typeName.StartsWith("<") || typeName.StartsWith("__Static") ||
            typeName.Contains("c__DisplayClass")) continue;
        typeNames.Add($"{typeNamespace}.{typeName}");
    }
}
Azure Search
Scales across partitions and replicas
Define an index that will hold documents consisting of fields
Fields can be searchable, facetable, filterable, sortable, retrievable
Can’t be changed easily, think upfront!
Have to define what we want to search, and what we want to display
My function will also write documents to a JSON blob
Can re-index using Azure Search importer in case needed
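To make the "documents consisting of fields" and "write documents to a JSON blob" points concrete, here is a small sketch of a document shape and a JSON round trip. It is an assumption-laden illustration: `PackageDocumentSketch` and `IndexDump` are hypothetical names, the field set is reduced to a key, the package id and version, and the type names, and `System.Text.Json` stands in for whatever serializer the real function uses.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical document shape: one document per package id + version.
public class PackageDocumentSketch
{
    public string Identifier { get; set; } = "";      // document key
    public string PackageId { get; set; } = "";
    public string PackageVersion { get; set; } = "";
    public List<string> TypeNames { get; set; } = new List<string>();
}

public static class IndexDump
{
    // Serializes documents so they can be stored next to the raw .nupkg.
    public static string ToJson(IEnumerable<PackageDocumentSketch> documents) =>
        JsonSerializer.Serialize(documents);

    // Reads the dump back, for example when rebuilding the index later.
    public static List<PackageDocumentSketch> FromJson(string json) =>
        JsonSerializer.Deserialize<List<PackageDocumentSketch>>(json)
        ?? new List<PackageDocumentSketch>();
}
```

Keeping such a dump means a re-index can start from the stored JSON instead of downloading and scanning every package again.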
Indexing packages

“Do one thing well”
Our function shouldn’t care about creating a search index.
Better: return index operations, have something else handle those
Custom output binding?
Indexing packages
(better version)
Almost there…
HTTP trigger binding
[HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/find-type")] HttpRequest request
Options for trigger
Authentication (anonymous, a function/host key, a user token)
HTTP method
What the route looks like
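Once the trigger hands over the request, the function still has to read its query parameters. The sketch below parses the `name`, `allowPrerelease` and `latestVersion` parameters that appear in the example requests in the notes. The `FindTypeQuery` class is hypothetical, and the defaults (stable packages only, latest version only) are assumptions rather than the documented behaviour of the API.

```csharp
using System;
using System.Web;

public sealed class FindTypeQuery
{
    public string Name { get; }
    public bool AllowPrerelease { get; }
    public bool LatestVersion { get; }

    private FindTypeQuery(string name, bool allowPrerelease, bool latestVersion)
    {
        Name = name;
        AllowPrerelease = allowPrerelease;
        LatestVersion = latestVersion;
    }

    public static FindTypeQuery Parse(string queryString)
    {
        // ParseQueryString accepts the query with or without the leading '?'.
        var values = HttpUtility.ParseQueryString(queryString);
        var name = values["name"] ?? string.Empty;

        // Assumed defaults: false when allowPrerelease is missing or invalid,
        // true when latestVersion is missing or invalid.
        var allowPrerelease = bool.TryParse(values["allowPrerelease"], out var pre) && pre;
        var latestVersion = !bool.TryParse(values["latestVersion"], out var latest) || latest;

        return new FindTypeQuery(name, allowPrerelease, latestVersion);
    }
}
```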

Making search work
with ReSharper and Rider
One issue left...
Download counts - used for sorting and scoring search results
Change continuously on NuGet
Not part of V3 catalog
Could use search but that’s N(packages) queries
If that data existed, how to update search?
Merge data! new PackageDocumentDownloads(key, downloadcount)
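The idea behind merging is that a small document carrying only the key and the download count updates that one field and leaves the rest of the indexed document alone. The following in-memory sketch illustrates those semantics only. It does not use the Azure Search SDK, and `MergeSketch` and the field names are hypothetical.

```csharp
using System.Collections.Generic;

public static class MergeSketch
{
    // Copies only the supplied fields onto the stored document.
    // Returns false when no document with that key exists.
    public static bool Merge(
        IDictionary<string, Dictionary<string, object>> index,
        string key,
        IReadOnlyDictionary<string, object> partial)
    {
        if (!index.TryGetValue(key, out var existing))
        {
            return false;
        }

        foreach (var field in partial)
        {
            existing[field.Key] = field.Value; // other fields stay untouched
        }
        return true;
    }
}
```

Because the partial document is tiny, download counts can be refreshed often without re-sending type names or other metadata.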
We’re done!
We’re done!
Collect changes from NuGet catalog
Download binaries
Index binaries using PE Header
Make search index available in API
Trigger, input and output bindings
Each function should do only one thing

We’re done!
All our functions can scale (and fail) independently
Full index in May 2019 took ~12h on 2 B1 instances
Can be faster on more CPUs
~1.7 million packages ( homepage says)
~2.1 million packages (the catalog says)
~ 8 400 catalog pages
with ~ 4 200 000 catalog leaves
(hint: repo signing)
Closing thoughts…
Would deploy in separate function apps for cost
Trigger binding collects all the time so needs dedicated capacity (and thus, cost)
Others can scale within bounds (think of $$$)
Would deploy in separate function apps for failure boundaries
Trigger, indexing, downloading should not affect health of API
Are bindings portable...?
Avoid them if (framework) lock-in matters to you
They áre nice in terms of programming model…
Thank you!

  • 1. Indexing and searching with Azure Functions and Search Maarten Balliauw @maartenballiauw
  • 2. “Find this type on”
  • 3. “Find this type on” In ReSharper and Rider Search for namespaces & types that are not yet referenced
  • 4. “Find this type on” Idea in 2013, introduced in ReSharper 9 (2015 - Consists of ReSharper functionality A service that indexes packages and powers search Azure Cloud Service (Web and Worker role) Indexer uses NuGet OData feed$select=Id,Version,NormalizedVersion,LastEdited,Published&$ orderby=LastEdited%20desc&$filter=LastEdited%20gt%20datetime%272012-01-01%27
  • 6. NuGet over time... Repo-signing announced August 10, 2018 Big chunk of packages signed over holidays 2018/2019 Re-download all metadata & binaries Very slow over OData Is there a better way?
  • 8. NuGet talks to a repository Can be on disk/network share or remote over HTTP(S) HTTP(S) API’s V2 – OData based (used by pretty much all NuGet servers out there) V3 – JSON based (, TeamCity, MyGet, Azure DevOps, GitHub repos)
  • 9. V2 Protocol Started as “OData-to-LINQ-to-Entities” (V1 protocol) Optimizations added to reduce # of random DB queries (VS2013+ & NuGet 2.x) Search – Package manager list/search FindPackagesById – Package restore (Does it exist? Where to download?) GetUpdates – Package manager updates (code in
  • 10. V3 Protocol JSON based A “resource provider” of various endpoints per purpose Catalog ( only) – append-only event log Registrations – materialization of newest state of a package Flat container – .NET Core package restore (and VS autocompletion) Report abuse URL template Statistics … (code in
  • 11. How does work? User uploads to Data added to database Data added to catalog (append-only data stream) Various jobs run over catalog using a cursor Registrations (last state of a package/version), reference catalog entry Flatcontainer (fast restores) Search index (search, autocomplete, NuGet Gallery search) …
  • 12. Catalog seems interesting! Append-only stream of mutations on Updates (add/update) and Deletes Chronological Can continue where left off (uses a timestamp cursor) Can restore to a given point in time Structure Root + Page + Leaf
  • 14. “Find this type on” Refactor from using OData to using V3? Mostly done, one thing missing: download counts (using search now) Build a new version? Welcome to this talk 
  • 15. Building a new version
  • 16. What do we need? Watch the catalog for package changes For every package change Scan all assemblies Store relation between package id+version and namespace+type API compatible with all ReSharper and Rider versions Bonus points! Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
  • 17. What do we need? Watch the catalog for package changes periodic check For every package change based on a queue Scan all assemblies Store relation between package id+version and namespace+type API compatible with all ReSharper and Rider versions always up, flexible scale Bonus points! Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
  • 19. Sounds like functions! catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 21. Functions best practices @PaulDJohnston Each function should do only one thing Easier error handling & scaling Learn to use messages and queues Asynchronous means of communicating, helps scale and avoid direct coupling ...
  • 23. Bindings Help a function do only one thing Trigger, provide input/output Function code bridges those Build your own!* SQL Server binding Dropbox binding ... NuGet Catalog *Custom triggers not officially supported (yet?) Trigger Input Output Timer ✔ HTTP ✔ ✔ Blob ✔ ✔ ✔ Queue ✔ ✔ Table ✔ ✔ Service Bus ✔ ✔ EventHub ✔ ✔ EventGrid ✔ CosmosDB ✔ ✔ ✔ IoT Hub ✔ SendGrid, Twilio ✔ ... ✔
  • 25. We’re making progress! catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 27. Next up: indexing catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 28. Indexing Opening up the .nupkg and reflecting on assemblies System.Reflection.Metadata Does not load the assembly being reflected into application process Provides access to Portable Executable (PE) metadata in assembly Store relation between package id+version and namespace+type Azure Search? A database? Redis? Other?
  • 30. System.Reflection.Metadata using (var portableExecutableReader = new PEReader(assemblySeekableStream)) { var metadataReader = portableExecutableReader.GetMetadataReader(); foreach (var typeDefinition in metadataReader.TypeDefinitions.Select(metadataReader .GetTypeDefinition)) { if (!typeDefinition.Attributes.HasFlag(TypeAttributes.Public)) continue; var typeNamespace = metadataReader.GetString(typeDefinition.Namespace); var typeName = metadataReader.GetString(typeDefinition.Name); if (typeName.StartsWith("<") || typeName.StartsWith("__Static") || typeName.Contains("c__DisplayClass")) continue; typeNames.Add($"{typeNamespace}.{typeName}"); } }
  • 31. Azure Search “Search-as-a-Service” Scales across partitions and replicas Define an index that will hold documents consisting of fields Fields can be searchable, facetable, filterable, sortable, retrievable Can’t be changed easily, think upfront! Have to define what we want to search, and what we want to display My function will also write documents to a JSON blob Can re-index using Azure Search importer in case needed
  • 33. “Do one thing well” Our function shouldn’t care about creating a search index. Better: return index operations, have something else handle those Custom output binding?
  • 35. Almost there… catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 36. HTTP trigger binding [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "v1/find-type")] HttpRequest request Options for trigger Authentication (anonymous, a function/host key, a user token) HTTP method What the route looks like
  • 37. Making search work with ReSharper and Rider demo
  • 38. One issue left... Download counts - used for sorting and scoring search results Change continuously on NuGet Not part of V3 catalog Could use search but that’s N(packages) queries If that data existed, how to update search? Merge data! new PackageDocumentDownloads(key, downloadcount)
  • 39. We’re done! catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 40. We’re done! Functions Collect changes from NuGet catalog Download binaries Index binaries using PE Header Make search index available in API Trigger, input and output bindings Each function should do only one thing catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 41. We’re done! All our functions can scale (and fail) independently Full index in May 2019 took ~12h on 2 B1 instances Can be faster on more CPU’s ~ 1.7mio packages ( homepage says) ~ 2.1mio packages (the catalog says ) ~ 8 400 catalog pages with ~ 4 200 000 catalog leaves (hint: repo signing) catalog Watch catalog Index command Find type API Find namespace API Search index Index package Raw .nupkg Index as JSON Download packageDownload command
  • 42. Closing thoughts… Would deploy in separate function apps for cost Trigger binding collects all the time so needs dedicated capacity (and thus, cost) Others can scale within bounds (think of $$$) Would deploy in separate function apps for failure boundaries Trigger, indexing, downloading should not affect health of API Are bindings portable...? Avoid them if (framework) lock-in matters to you They áre nice in terms of programming model…

Editor's Notes

  2. Show feature in action in Visual Studio (and show you can see basic metadata etc.)
  3. Copied in 2017 in VS - Demo the feed quickly?
  4. Around 3 TB in May 2019
  5. Demo ODataDump quickly
  6. Demo: click around in the API to show some base things
  7. Raw API - click around in the API to show some base things, explain how a cursor could go over it: Root, Page, Leaf. Explain CatalogDump. NuGet.Protocol.Catalog comes from GitHub. CatalogProcessor fetches all pages between min and max timestamp. My implementation, BatchCatalogProcessor, fetches multiple pages at the same time and builds a “latest state”, which is much faster! It fetches leaves, and for every leaf calls into a simple method. Much faster, and easy to pause (keep track of min/max timestamp).
  8. LOL input, process, output More serious: events trigger code Periodic check for packages Queue message to index things API request runs a search No server management or capacity planning
  9. Will use storage queues in demos to be able to run things locally. Ideally use Service Bus topics or Event Grid (transactional).
  10. Create a new TimerTrigger function. We will need a function to index things from NuGet. The timer will trigger every X amount of time. The timer provides the last timestamp and the next timestamp, so we can run our collector for that period. Snippet: demo-timertrigger. Mention HttpClient is not used correctly: it is not disposed, so it will starve TCP connections at some point. Go over the code example and run it:

      var httpClient = new HttpClient();
      var cursor = new InMemoryCursor(timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow);
      var processor = new CatalogProcessor(
          cursor,
          new CatalogClient(httpClient, new NullLogger<CatalogClient>()),
          new DelegatingCatalogLeafProcessor(
              added =>
              {
                  log.LogInformation("[ADDED] " + added.PackageId + "@" + added.PackageVersion);
                  return Task.FromResult(true);
              },
              deleted =>
              {
                  log.LogInformation("[DELETED] " + deleted.PackageId + "@" + deleted.PackageVersion);
                  return Task.FromResult(true);
              }),
          new CatalogProcessorSettings
          {
              MinCommitTimestamp = timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow,
              MaxCommitTimestamp = timer.ScheduleStatus?.Next ?? DateTimeOffset.UtcNow,
              ServiceIndexUrl = ""
          },
          new NullLogger<CatalogProcessor>());
      await processor.ProcessAsync(CancellationToken.None);
  11. Each function should only do one thing! We are violating this.
  12. Go over Approach1 code – Enqueuer class. Mention we are using roughly the same code as before. The difference is that our function no longer does the work itself; instead it adds messages to a queue for processing later on. That Queue binding is interesting: this is where the input/output comes from. Instead of managing our own queue connection, we let the framework handle all plumbing so we can focus on adding messages. In Indexer, we use the Queue as an input binding, and read messages. We can now scale enqueuing and indexing separately! But are we there yet?
  14. Go over Approach2 code. Show this is MUCH simpler: a trigger binding that provides input, and a queue output binding to write that input to a queue.
      Let’s go over what it takes to build a trigger binding.
      NuGetCatalogTriggerAttribute – the data needed for the trigger to work. Go over properties and attributes.
      Hooking it up requires a binding configuration – NuGetCatalogTriggerExtensionConfigProvider. It says: if you see this specific binding, register it as a trigger that maps to some provider.
      So we need that provider – NuGetCatalogTriggerAttributeBindingProvider. The provider is there to create an object that provides data. In our case we need to store the NuGet catalog timestamp cursor, so we do that on storage, and then return the actual binding – NuGetCatalogTriggerBinding.
      In NuGetCatalogTriggerBinding, we have to specify how data can be bound. What if I use a different type of object than PackageOperation? What if someone used a Node.js or Python function instead of .NET? We need to define the shape of the data our trigger provides. PackageOperationValueProvider is also interesting: it provides the data shown in the portal diagnostics.
      CreateListenerAsync is where the actual trigger code will be created – NuGetCatalogListener. NuGetCatalogListener uses the BatchCatalogProcessor we had previously, and when a package is added or deleted it will call into the injected ITriggeredFunctionExecutor. ITriggeredFunctionExecutor is Azure Functions framework specific, but it’s the glue that will call into our function with the data we provide. Note StartAsync/StopAsync, where you can add startup/shutdown code.
      ONE THING LEFT THAT IS NOT DOCUMENTED – Startup.cs to register the binding. And since we are in a different class library, we also need Microsoft.Azure.WebJobs.Extensions referenced to generate \bin\Debug\netcoreapp2.1\bin\extensions.json.
      As a result our code is now MUCH cleaner; show it again and maybe also show it in action.
      Mention [Singleton(Mode = SingletonMode.Listener)] – we need to ensure this binding only runs single-instance (cursor clashes otherwise). This is due to how the catalog works; parallel processing is harder to do. But we can fix that by scaling the Indexer later on.
      Show Approach3 PopulateQueueAndTable. Same code, but a bit more production-worthy: sending data to two queues (indexing and downloading) and storing data in a table (and yes, violating “do one thing” again, but I call it architectural freedom).
  15. Next up will be downloading and indexing. Let’s start with downloading. Grab a copy of the .nupkg from NuGet and store it in a blob Redundancy - no need to re-download/stress NuGet on a re-index
  16. Go over Approach3 code. DownloadToStorage uses a QueueTrigger to run whenever a message appears in the queue. Note there is no singleton: we can scale this across multiple instances/multiple servers. It uses a Blob input binding that provides access to a blob. Note the parameters: the name of the blob is resolved based on data from other inputs, which is pretty nifty. Our code checks whether it’s an add or a delete, and either downloads+uploads to the blob reference, or deletes the blob reference.
  17. Next up will be indexing itself. There are a couple of things here…
  18. Go over Approach3 code. PackageIndexer uses a QueueTrigger to run whenever a message appears in the queue. It uses a Blob input binding that provides access to a blob where we can write our indexed entity – will show this later. Based on the package operation, we will add or delete from the index.
      RunAddPackageAsync has some plumbing, probably too much, to download the .nupkg file and store it on disk. Note: we store it on disk as we need a seekable stream. So why not a memory stream? Some NuGet packages are HUGE.
      Find the PEReader usage and show how it will index a given package’s public types and namespaces. All goes into a typeNames collection.
      Now: how do we add this info to the index? Show the PackageDocument class, which has MANY properties. First important point: the Identifier property has [Key] applied. Azure Search needs a key for the document so we can retrieve by key, which could be useful when updating existing content or to find a specific document and delete it from the index. Second important point: TypeNames is searchable. Also mention “simpleanalyzer”: “Divides text at non-letters and converts them to lower case.” Other analyzers remove stopwords and do other things; this one should be as searchable as possible. Other fields are sometimes searchable, sometimes facetable – a bit of a leftover from me thinking about search use cases. The R# API only searches on type name, so we could make everything else just retrievable as well.
      Of course, the index is not there by default, so we need to create it. We do this when our function is instantiated (static constructor, so only once per launch of our functions). Is this good? Yes, because it happens only once per server instance our function runs on. No, because we do it at one point in time: what if the index is deleted in between and needs to be recreated? Edge case, but a retry strategy could be a good idea...
      Next, we create our package document, and at one point we add it to a list of index actions, and to blob storage: indexActions.Add(IndexAction.MergeOrUpload(packageToIndex)); JsonSerializer.Serialize(jsonWriter, packagesToIndex);
      Writing to the index using a batch - var indexBatch = IndexBatch.New(actions); This is leftover code from earlier; a batch makes no sense for one document, but in case you want to do multiple in one go this is the way. Do beware that a batch can only be several MB in size; for this NuGet indexing I can only do ~25 in a batch before the payload is too large.
      That’s… it! Run approach 3 (for the last hour) and see functions being hit / packages added to the index. Go to the Azure Search portal as well, and show how the importer would work in case of fire.
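Since a batch has a size limit, a larger set of index actions has to be split before sending. The helper below is a generic sketch of that splitting step. `Batching` and `InBatchesOf` are hypothetical names, and the batch size is left to the caller because the workable number (about 25 in the talk's case) depends on the payload size.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Splits a sequence into consecutive lists of at most 'size' items.
    public static IEnumerable<List<T>> InBatchesOf<T>(IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));

        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }

        // Emit the final, partially filled batch if there is one.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```

Each returned list could then be turned into one index batch and sent on its own.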
  19. Go over Approach3 code. PackageIndexerWithCustomBinding is mostly the same code. One difference: it uses the [AzureSearchIndex] binding to write add/delete operations to the index instead.
      Go over how it works. Again, an attribute with settings – AzureSearchIndexAttribute. Also a configuration that registers the binding as an output binding using BindToCollector – AzureSearchExtensionConfigProvider.
      Now, what’s this OpenType? It’s some sort of dynamic type. If we want to create an AzureSearch output binding, we had better support more than just our PackageDocument use case! So we need a collector builder that can create the actual binding implementation based on the real type requested by our function parameter – AzureSearchAsyncCollectorBuilder. In AzureSearchAsyncCollectorBuilder, we do that. Very simple bootstrap code in this case, but it could be more complex depending on the type of binding you are creating.
      Our AzureSearchAsyncCollector uses the attribute to check for Azure Search connection details, as well as the type of operation we expect it to handle. Why not all? Well, IAsyncCollector only has Add and Flush. Note: Add is called manually, Flush at function completion – could use Flush to send things in a batch...
      The code itself is pretty straightforward. On Add, we add an action to search, with a retry in case the index does not exist – we then create it. The creation code is kind of interesting as we use some reflection in case we specify a given type of document to index. Why? Because when we do upserts, we may want to update just one or two properties, and can use a different DTO in that case (but still have the index shaped to the full document shape).
      Run when time left, but nothing fancy here...
  20. Now we need to make ReSharper talk to our search. We have the index, so that should be a breeze, right?
  21. Go over Web code RunFindTypeApiAsync and RunFindNamespaceAsync Both use “name” as their query parameter to search for RunInternalAsync does the heavy lifting Grabs other parameters Runs search, and collects several pages of results Why is this ForEachAsync there? Search index has multiple versions for every package id, yet ReSharper expects only the latest matching all parameters Azure Search has no group by / distinct by, so need to do this in memory. Doing it here by fetching a maximum number of results and doing the grouping manually. Use the collected data to build result. Add matching type names etc. Example requests: http://localhost:7071/api/v1/find-type?name=JsonConvert http://localhost:7071/api/v1/find-type?name=CamoServer&allowPrerelease=true&latestVersion=false In ReSharper (devenv /ReSharper.Internal, go to NuGet tool window, set base URL to Write some code that uses JsonConvert / JObject and try it out.
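The in-memory grouping mentioned in this note, keeping only the latest version per package id, can be sketched with LINQ. This is an illustration under simplifying assumptions: `SearchHit` and `LatestVersionFilter` are hypothetical names, and `System.Version` stands in for real NuGet version parsing, which also has to handle prerelease labels.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SearchHit
{
    public SearchHit(string packageId, Version packageVersion)
    {
        PackageId = packageId;
        PackageVersion = packageVersion;
    }

    public string PackageId { get; }
    public Version PackageVersion { get; }
}

public static class LatestVersionFilter
{
    // Groups hits by package id (case-insensitive) and keeps the highest version of each.
    public static List<SearchHit> LatestPerPackage(IEnumerable<SearchHit> hits) =>
        hits.GroupBy(hit => hit.PackageId, StringComparer.OrdinalIgnoreCase)
            .Select(group => group.OrderByDescending(hit => hit.PackageVersion).First())
            .ToList();
}
```

The trade-off is that the grouping only sees the results that were fetched, which is why the note talks about collecting several pages up to a maximum first.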