Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
Logstash is a tool for managing logs that allows for input, filter, and output plugins to collect, parse, and deliver logs and log data. It works by treating logs as events that are passed through the input, filter, and output phases, with popular plugins including file, redis, grok, elasticsearch and more. The document also provides guidance on using Logstash in a clustered configuration with an agent and server model to optimize log collection, processing, and storage.
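The input, filter, and output phases described above map directly onto a Logstash configuration file. A minimal sketch of that flow, where the file path and index name are hypothetical, not taken from the talk:

```conf
# Hypothetical pipeline: read a log file, parse each line with grok,
# and ship the structured events to Elasticsearch.
input {
  file {
    path => "/var/log/app/app.log"   # hypothetical path
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

The date placeholder in the index name gives one index per day, which is the clustered agent-and-server storage pattern the document recommends.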
Logstash + Elasticsearch + Kibana Presentation on Startit Tech Meetup
1. Logstash is an open source tool for collecting, processing, and storing logs and other event data. It allows centralized collection and parsing of logs from various sources before sending them to Elasticsearch for storage and indexing.
2. Kibana provides visualization and search capabilities on top of the logs stored in Elasticsearch, allowing users to easily explore and analyze log data.
3. The combination of Logstash, Elasticsearch, and Kibana provides a replacement for commercial log management tools like Splunk, with the ability to collect, parse, store, search, and visualize logs from many different sources in a centralized way.
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
This talk covers the basics of centralizing logs in Elasticsearch and all the strategies that make it scale with billions of documents in production. Topics include:
- Time-based indices and index templates to efficiently slice your data
- Different node tiers to de-couple reading from writing, heavy traffic from low traffic
- Tuning various Elasticsearch and OS settings to maximize throughput and search performance
- Configuring tools such as logstash and rsyslog to maximize throughput and minimize overhead
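The time-based indices in the first bullet are typically paired with an index template, so each day's new index automatically picks up the same settings and mappings. A minimal sketch of a template body (as sent to Elasticsearch's `_index_template` API; the pattern, shard counts, and field names are illustrative):

```json
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}
```

Any index whose name matches `logs-*` (for example `logs-2024.01.15`) is created with these settings, which is what makes daily slicing cheap to operate.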
Reactive Functional Programming with Java 8 on Android N
The document discusses reactive functional programming with Java 8 on Android N. It introduces reactive programming concepts like Observables and Subscribers. It provides an example of using RxJava to find PNG images in a folder and load them into a gallery, as compared to the vanilla Java approach. It also demonstrates creating Observables, Subscribers, transforming streams, handling REST responses, and subscribing to streams. Specifically, it shows an example of clicking a button to get a user's followers from GitHub, get details on each follower, filter by company, and update the UI with results.
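RxJava itself is Java, but the declarative-versus-imperative contrast the talk draws can be sketched in a few lines of Python. This stand-in filters PNG files from a folder both ways; it illustrates the style only and is not the talk's actual code:

```python
from pathlib import Path

def find_png_images(folder):
    """Declarative version: describe the result rather than build it."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.suffix.lower() == ".png")

def find_png_images_imperative(folder):
    """'Vanilla' version: explicit loop, mutation, and sort."""
    result = []
    for p in Path(folder).iterdir():
        if p.suffix.lower() == ".png":
            result.append(p.name)
    result.sort()
    return result
```

Both return the same list; the reactive approach in the talk adds asynchrony and composition on top of this declarative shape.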
This document introduces the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It provides instructions on setting up each component and using them together. Elasticsearch is a search engine that stores and searches data in JSON format. Logstash is an agent that collects logs from various sources, applies filters, and outputs to Elasticsearch. Kibana visualizes and explores the logs stored in Elasticsearch. The document demonstrates setting up each component and running a proof of concept to analyze sample log data.
This document provides an overview of the ELK stack, including Logstash for collecting and parsing logs, Elasticsearch for indexing logs, and Kibana for visualizing logs. It discusses using the open source ELK stack as an alternative to Splunk and provides instructions for getting started with a basic ELK implementation.
This document provides an overview of Retrofit, an open source library for Android and Java that allows making REST API calls in a simple and efficient manner. It discusses how to initialize Retrofit with an endpoint URL and adapter, define API methods using annotations, handle requests and responses both synchronously and asynchronously, and convert JSON responses to Java objects using Gson. Code samples are provided throughout to demonstrate common Retrofit tasks like making GET requests, handling API parameters and headers, and subscribing to asynchronous Observable responses.
This document discusses the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It provides an overview of each component, including that Elasticsearch is a search and analytics engine, Logstash is a data collection engine, and Kibana is a data visualization platform. The document then discusses setting up an ELK stack to index and visualize application logs.
Logstash is a tool for ingesting, processing, and storing data from various sources into Elasticsearch. It includes plugins for input, filter, and output functionality. Common uses of Logstash include parsing log files, enriching events, and loading data into Elasticsearch for search and analysis. The document provides an overview of Logstash and demonstrates how to install it, configure input and output plugins, and create simple and advanced processing pipelines.
Experiences in ELK with D3.js for Large Log Analysis and Visualization
This document discusses experiences using the ELK stack (Elasticsearch, Logstash, Kibana) and D3.js for large log analysis and visualization. It begins with an overview of network traffic logging at Kasetsart University, which generates over 30 terabytes of log data per day. It then demonstrates setting up an ELK testbed to index these logs in real-time for fast search and exploration in Kibana. Finally, it shows how D3.js can be used to create dynamic, real-time visualizations of the logged data.
This document discusses Logstash, an open source tool for collecting, parsing, and storing log files. It can ingest logs from various sources using inputs, apply filters to parse and transform log events, and output the structured data to destinations like Elasticsearch for search and analysis. The document provides an overview of Logstash's core functionality and components, demonstrates simple usage examples, and discusses integrating it with Kibana for visualizing and exploring log data. It also shares some lessons learned in production usage and points to additional resources.
Retrofit is a type-safe REST client library for Android and Java that allows defining REST APIs as Java interfaces. It simplifies HTTP communication by converting remote APIs into declarative interfaces. It supports synchronous, asynchronous, and observable API consumption. The Retrofit library was created by Square.
A case study of the usage of Gradle in the Ratpack web framework. First, we'll examine the Ratpack Gradle plugins, including their functionality, implementation, and testing. Next, we'll examine the build script for the Ratpack project itself. Here, we'll discuss various details of the project's build, including handling multiple projects, multiple types of testing, support for multiple styles of target hardware (developer workstations, cloud CI), and more. For each, we'll go over the desired behavior, how it was achieved, and why it was necessary.
The ELK Stack workshop covers real-world use cases and works with the participants to implement them. This includes an Elastic overview, Logstash configuration, creation of dashboards in Kibana, guidelines and tips on processing custom log formats, designing a system to scale, choosing hardware, and managing the lifecycle of your logs.
Indexing and searching NuGet.org with Azure Functions and Search - .NET fwday...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
Maarten Balliauw "Indexing and searching NuGet.org with Azure Functions and S...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps to find the correct NuGet package based on a public type name.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result. Expect code, insights into the service design process, and more!
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
https://blog.maartenballiauw.be/post/2019/07/30/indexing-searching-nuget-with-azure-functions-and-search.html
Everybody is consuming NuGet packages these days. It’s easy, right? But how can we create and share our own packages? What is .NET Standard? How should we version, create, publish and share our package?
Once we have those things covered, we’ll look beyond what everyone is doing. How can we use the NuGet client API to fetch data from NuGet? Can we build an application plugin system based on NuGet? What hidden gems are there in the NuGet server API? Can we create a full copy of NuGet.org?
Good questions! In this talk, we will get them answered.
Everybody is consuming or producing NuGet packages these days. It’s easy, right? We’ll look beyond what everyone is doing. How can we use the NuGet client API to fetch data from NuGet? Can we build an application plugin system based on NuGet? What hidden gems are there in the NuGet server API? Can we create a full copy of NuGet.org?
- What are Internal Developer Portal (IDP) and Platform Engineering?
- What is Backstage?
- How Backstage can help developers build a developer portal that makes their jobs easier
Jirayut Nimsaeng
Founder & CEO
Opsta (Thailand) Co., Ltd.
YouTube recording: https://youtu.be/u_nLbgWDwsA?t=850
Dev Mountain Tech Festival @ Chiang Mai
November 12, 2022
This document summarizes Nuxeo's Release 8.1 including new tools for launching performance tests on Nuxeo clusters, an instant share feature for temporarily granting access without account creation, Live Connect integration for Box file sharing, and expanded Elasticsearch integration. It also discusses Nuxeo Docker images, a Nuxeo code generator, a Polymer sample app, updated REST and automation clients, and upcoming branch management features.
This presentation was given at the Boston Django meetup on November 16, and surveyed several leading PaaS providers including Stackato, Dotcloud, OpenShift and Heroku.
For each PaaS provider, I documented the steps necessary to deploy Mezzanine, a popular Django-based CMS and blogging platform.
At the end of the presentation, I do a wrap-up of the different providers and provide a comparison matrix showing which providers have which features. This matrix is likely to go out-of-date quickly because these providers are adding new features all the time.
The document provides an overview of OGCE (Open Grid Computing Environment), which develops and packages reusable software components for science portals. Key components described include services, gadgets, tags, and how they fit together. Installation and usage of the various OGCE components is discussed at a high level.
With distributed tracing, we can track requests as they pass through multiple services, emitting timing and other metadata throughout, and this information can then be reassembled to provide a complete picture of the application’s behavior at runtime - Read more in https://blog.buoyant.io/2016/05/17/distributed-tracing-for-polyglot-microservices/ and https://www.rookout.com/
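The core idea, a trace ID generated once per request and attached to every timing record so the records can later be reassembled, can be sketched without any real tracing system. All names here are hypothetical:

```python
import time
import uuid

def new_trace_id():
    """One ID per incoming request; every downstream span carries it."""
    return uuid.uuid4().hex

def traced_call(trace_id, service_name, fn, spans, *args):
    """Run fn, recording a span (service, duration) tagged with the
    shared trace_id."""
    start = time.perf_counter()
    result = fn(*args)
    spans.append({
        "trace_id": trace_id,
        "service": service_name,
        "duration_s": time.perf_counter() - start,
    })
    return result

# A request passing through two "services":
spans = []
tid = new_trace_id()
total = traced_call(tid, "pricing", lambda x: x * 2,
                    spans, traced_call(tid, "inventory", lambda: 21, spans))
```

Because every span shares the same `trace_id`, a collector can group and order them into the complete per-request picture the paragraph describes.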
This document discusses Elsevier's SciVerse platform and developer network. It introduces SciVerse as a social network for scientific search and content that uses OpenSocial standards. It describes how SciVerse extends Apache Shindig to make apps contextual. It also discusses SciVerse's framework and content APIs that allow apps to access scientific content and metadata. Finally, it provides examples of object-oriented JavaScript coding and using the APIs to build mashups with third-party services.
This document discusses using Nutch, an open source web crawler, with Scala. It provides an overview of Nutch's architecture and how plugins can be written in Scala to extend its functionality. As an example, it describes how Scala was used to build a plugin for an aggregator application that crawls multiple suppliers, parses content to extract details, and passes this data to an actor for processing. The solution was able to crawl 5 suppliers and collect over 500k records using Nutch and 823 lines of Scala code.
Jilles van Gurp presents on the ELK stack and how it is used at Linko to analyze logs from application servers, Nginx, and Collectd. The ELK stack consists of Elasticsearch for storage and search, Logstash for processing and transporting logs, and Kibana for visualization. At Linko, Logstash collects logs and sends them to Elasticsearch for storage and search. Logs are filtered and parsed by Logstash using grok patterns before being sent to Elasticsearch. Kibana dashboards then allow users to explore and analyze logs in real time from Elasticsearch. While the ELK stack is powerful, there are some operational gotchas to watch out for, like node restarts impacting availability and field data caching.
This document describes how to use the ELK (Elasticsearch, Logstash, Kibana) stack to centrally manage and analyze logs from multiple servers and applications. It discusses setting up Logstash to ship logs from files and servers to Redis, then having a separate Logstash process read from Redis and index the logs to Elasticsearch. Kibana is then used to visualize and analyze the logs indexed in Elasticsearch. The document provides configuration examples for Logstash to parse different log file types like Apache access/error logs and syslog.
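Grok patterns are essentially named regular expressions. As a rough stand-in (a simplified pattern, not the real COMBINEDAPACHELOG definition), the structuring step Logstash performs on an Apache access-log line looks like this in plain Python:

```python
import re

# Simplified stand-in for a grok pattern matching the Apache common
# log format: client, timestamp, request line, status, bytes.
APACHE_COMMON = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_access_line(line):
    """Turn one raw log line into a structured event dict (or None)."""
    m = APACHE_COMMON.match(line)
    return m.groupdict() if m else None

event = parse_access_line(
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
```

The named groups become the event fields that Elasticsearch indexes, which is exactly what the grok filter in the Logstash configuration produces at scale.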
These slides present the following Twitter pipeline built on the ELK stack (Elasticsearch, Logstash, Kibana): https://github.com/melvynator/ELK_twitter They show how to integrate machine learning into your Twitter pipeline.
Harnessing the power of Nutch with Scala - Knoldus Inc.
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018 - Holden Karau
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API, compared to the untyped DataFrame API.
Apache Calcite (a tutorial given at BOSS '21) - Julian Hyde
The document provides instructions for setting up the environment and coding exercises for the BOSS '21 Copenhagen tutorial on Apache Calcite.
It includes the following steps:
1. Clone the GitHub repository containing the sample code and dependencies.
2. Compile the project.
It also outlines the draft schedule for the tutorial, which covers topics like an introduction to Calcite, a demonstration of SQL queries on CSV files, setting up the coding environment, using Lucene for indexing, and coding exercises to build parts of the logical and physical query plans in Calcite. The tutorial will be led by Stamatis Zampetakis from Cloudera and Julian Hyde from Google, who are both committers to the Apache Calcite project.
This document provides an overview and agenda for a meetup on distributed tracing using Jaeger. It begins with introducing the speaker and their background. The agenda then covers an introduction to distributed tracing, open tracing, and Jaeger. It details a hello world example, Jaeger terminology, and building a full distributed application with Jaeger. It concludes with wrapping up the demo, reviewing Jaeger architecture, and discussing open tracing's ability to propagate context across services.
Talk at RubyKaigi 2015.
Plugin architecture is a well-known technique for bringing extensibility to a program. Ruby has good language features for plugins, and RubyGems.org is an excellent platform for plugin distribution. However, creating a plugin architecture is not as easy as writing code without one: it requires a plugin loader, packaging, a loosely coupled API, and attention to performance. Loading two versions of a gem at once is an unsolved challenge in Ruby, though it has been solved in Java.
I have designed open-source software such as Fluentd and Embulk, which provide most of their functionality through plugins. I will talk about their plugin-based architecture.
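The plugin approach described above, a core that looks up behavior in a registry instead of hard-coding it, can be sketched in a few lines of Python. The filter names and classes below are hypothetical illustrations, not Fluentd's or Embulk's actual API:

```python
# Minimal plugin registry: the core delegates to whatever has been
# registered under a name, so new behavior ships as new plugins.
PLUGINS = {}

def register(name):
    """Decorator that records a plugin class under a lookup name."""
    def wrapper(cls):
        PLUGINS[name] = cls
        return cls
    return wrapper

@register("upcase")
class UpcaseFilter:
    def apply(self, event):
        return event.upper()

@register("reverse")
class ReverseFilter:
    def apply(self, event):
        return event[::-1]

def run_pipeline(event, plugin_names):
    """Apply the named plugins to an event, in order."""
    for name in plugin_names:
        event = PLUGINS[name]().apply(event)
    return event
```

The core (`run_pipeline`) never names a concrete filter, which is the loose coupling that makes third-party plugins possible; the hard parts the talk covers, such as loading, packaging, and versioning, sit on top of this shape.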
Examiness hints and tips from the trenches - Ismail Mayat
This document provides an overview of tools and techniques for working with the Examine search engine in Umbraco, including:
- Tools like Luke and the Examine Dashboard for debugging indexes.
- Using the GatheringNodeData event to merge fields, add fields like node type aliases, and handle errors during indexing.
- Indexing different media types like PDFs using Tika.
- Techniques for search highlighting, boosting documents, and deploying index changes across environments.
- Faceted search capabilities and using the index as an object database.
The presenter encourages exploring the full capabilities of Examine and provides examples of how to optimize indexing and searching.
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Similar to CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and Search (20)
Bringing nullability into existing code - dammit is not the answer.pptxMaarten Balliauw
The C# nullability features help you minimize the likelihood of encountering that dreaded System.NullReferenceException. Nullability syntax and annotations give hints as to whether a type can be nullable or not, and better static analysis is available to catch unhandled nulls while developing your code. What's not to like?
Introducing explicit nullability into an existing code bases is a Herculean effort. There's much more to it than just sprinkling some `?` and `!` throughout your code. It's not a silver bullet either: you'll still need to check non-nullable variables for null.
In this talk, we'll see some techniques and approaches that worked for me, and explore how you can migrate an existing code base to use the full potential of C# nullability.
Nerd sniping myself into a rabbit hole... Streaming online audio to a Sonos s...Maarten Balliauw
After buying a set of Sonos-compatible speakers at IKEA, I was disappointed there's no support for playing audio from a popular video streaming service. They stream Internet radio, podcasts and what not. Well, not that service I want it to play!
Determined - and not knowing how deep the rabbit hole would be - I ventured on a trip that included network sniffing on my access point, learning about UPnP and running a web server on my phone (without knowing how to write anything Android), learning how MP4 audio is packaged (and has to be re-packaged). This ultimately resulted in an Android app for personal use, which does what I initially wanted: play audio from that popular video streaming service on Sonos.
Join me for this story about an adventure that has no practical use, probably violates Terms of Service, but was fun to build!
Building a friendly .NET SDK to connect to SpaceMaarten Balliauw
Space is a team tool that integrates chats, meetings, git hosting, automation, and more. It has an HTTP API to integrate third party apps and workflows, but it's massive! And slightly opinionated.
In this session, we will see how we built the .NET SDK for Space, and how we make that massive API more digestible. We will see how we used code generation, and incrementally made the API feel more like a real .NET SDK.
Microservices for building an IDE - The innards of JetBrains Rider - NDC Oslo...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices.
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider.
We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
JetBrains Australia 2019 - Exploring .NET’s memory management – a trip down m...Maarten Balliauw
This document discusses .NET memory management and the garbage collector. It explains that the CLR manages memory in a heap and the garbage collector reclaims unused memory. It describes how objects are allocated in generations and discusses how to help the garbage collector perform better by reducing allocations, using value types when possible, and properly disposing of objects. The document also provides examples of hidden allocations and demonstrates tools for analyzing memory usage like ClrMD and dotMemory Unit.
Approaches for application request throttling - Cloud Developer Days PolandMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
Approaches for application request throttling - dotNetCologneMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
CodeStock - Exploring .NET memory management - a trip down memory laneMaarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
ConFoo Montreal - Approaches for application request throttlingMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
Microservices for building an IDE – The innards of JetBrains Rider - TechDays...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory...Maarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
VISUG - Approaches for application request throttlingMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
What is going on - Application diagnostics on Azure - TechDays FinlandMaarten Balliauw
We all like building and deploying cloud applications. But what happens once that’s done? How do we know if our application behaves like we expect it to behave? Of course, logging! But how do we get that data off of our machines? How do we sift through a bunch of seemingly meaningless diagnostics? In this session, we’ll look at how we can keep track of our Azure application using structured logging, AppInsights and AppInsights analytics to make all that data more meaningful.
ConFoo - Exploring .NET’s memory management – a trip down memory laneMaarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
Speaking from experience building MyGet.org: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Best Programming Language for Civil EngineersAwais Yaseen
The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era.
Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datams and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
The Rise of Supernetwork Data Intensive ComputingLarry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Bert Blevins
Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc
Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk.
What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year?
Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year.
This webinar will review:
- Key changes to privacy regulations in 2024
- Key themes in privacy and data governance in 2024
- How to maximize your privacy program in the second half of 2024
UiPath Community Day Kraków: Devs4Devs ConferenceUiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to spike your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Measuring the Impact of Network Latency at TwitterScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
3. “Find this type on NuGet.org”
In ReSharper and Rider
Search for namespaces & types that are not yet referenced
4. “Find this type on NuGet.org”
Idea in 2013, introduced in ReSharper 9
(2015 - https://www.jetbrains.com/resharper/whatsnew/whatsnew_9.html)
Consists of
ReSharper functionality
A service that indexes packages and powers search
Azure Cloud Service (Web and Worker role)
Indexer uses NuGet OData feed
https://www.nuget.org/api/v2/Packages?$select=Id,Version,NormalizedVersion,LastEdited,Published&$orderby=LastEdited%20desc&$filter=LastEdited%20gt%20datetime%272012-01-01%27
6. NuGet over time...
Repo-signing announced August 10, 2018
Big chunk of packages signed over holidays 2018/2019
Re-download all metadata & binaries
Very slow over OData
Is there a better way?
https://blog.nuget.org/20180810/Introducing-Repository-Signatures.html
8. NuGet talks to a repository
Can be on disk/network share or remote over HTTP(S)
HTTP(S) APIs
V2 – OData based (used by pretty much all NuGet servers out there)
V3 – JSON based (NuGet.org, TeamCity, MyGet, Azure DevOps, GitHub repos)
9. V2 Protocol
Started as “OData-to-LINQ-to-Entities” (V1 protocol)
Optimizations added to reduce # of random DB queries (VS2013+ & NuGet 2.x)
Search – Package manager list/search
FindPackagesById – Package restore (Does it exist? Where to download?)
GetUpdates – Package manager updates
https://www.nuget.org/api/v2 (code in https://github.com/NuGet/NuGetGallery)
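To make the V2 shape concrete, here is a small Python sketch (the deck's own code is C#) of how a restore client could build the FindPackagesById query; the helper name is hypothetical, and the URL shape follows the public OData conventions:

```python
# Hypothetical helper building the V2 FindPackagesById query used during
# package restore; the URL shape follows the public OData conventions.
from urllib.parse import quote

V2_BASE = "https://www.nuget.org/api/v2"

def find_packages_by_id_url(package_id):
    return f"{V2_BASE}/FindPackagesById()?id='{quote(package_id)}'"

print(find_packages_by_id_url("Newtonsoft.Json"))
# https://www.nuget.org/api/v2/FindPackagesById()?id='Newtonsoft.Json'
```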
10. V3 Protocol
JSON based
A “resource provider” of various endpoints per purpose
Catalog (NuGet.org only) – append-only event log
Registrations – materialization of newest state of a package
Flat container – .NET Core package restore (and VS autocompletion)
Report abuse URL template
Statistics
…
https://api.nuget.org/v3/index.json (code in https://github.com/NuGet/NuGet.Services.Metadata)
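The "resource provider" idea can be sketched in a few lines of Python (illustrative, not the deck's C# code): a client fetches the service index once and looks up the endpoint for the resource type it needs. The sample index below is abridged, not fetched live.

```python
# Minimal sketch: resolving a V3 resource endpoint from the service index.
# The structure mirrors https://api.nuget.org/v3/index.json (abridged sample).

def resolve_resource(service_index, resource_type):
    """Return the @id (base URL) of the first resource matching @type."""
    for resource in service_index.get("resources", []):
        if resource.get("@type") == resource_type:
            return resource["@id"]
    return None

sample_index = {
    "version": "3.0.0",
    "resources": [
        {"@type": "Catalog/3.0.0", "@id": "https://api.nuget.org/v3/catalog0/index.json"},
        {"@type": "SearchQueryService", "@id": "https://azuresearch-usnc.nuget.org/query"},
    ],
}

print(resolve_resource(sample_index, "Catalog/3.0.0"))
```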
11. How does NuGet.org work?
User uploads to NuGet.org
Data added to database
Data added to catalog (append-only data stream)
Various jobs run over catalog using a cursor
Registrations (last state of a package/version), reference catalog entry
Flatcontainer (fast restores)
Search index (search, autocomplete, NuGet Gallery search)
…
12. Catalog seems interesting!
Append-only stream of mutations on NuGet.org
Updates (add/update) and Deletes
Chronological
Can continue where left off (uses a timestamp cursor)
Can restore NuGet.org to a given point in time
Structure
Root https://api.nuget.org/v3/catalog0/index.json
+ Page https://api.nuget.org/v3/catalog0/page0.json
+ Leaf https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/adam.jsgenerator.1.1.0.json
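The cursor pattern over that root/page/leaf structure can be sketched as follows (a Python illustration with in-memory stand-ins for the JSON documents; real code would fetch them over HTTP, and the field names follow the catalog's `commitTimeStamp`/`items` convention):

```python
# Sketch of walking the catalog with a timestamp cursor: skip whole pages
# older than the cursor, then yield only the leaves committed after it.

def leaves_since(catalog_root, pages, cursor):
    """Yield catalog leaves committed after the cursor timestamp, oldest first."""
    for page_ref in catalog_root["items"]:
        if page_ref["commitTimeStamp"] <= cursor:
            continue  # a page's commitTimeStamp is its newest leaf: skip it whole
        page = pages[page_ref["@id"]]
        for leaf in sorted(page["items"], key=lambda l: l["commitTimeStamp"]):
            if leaf["commitTimeStamp"] > cursor:
                yield leaf

root = {"items": [{"@id": "page0.json", "commitTimeStamp": "2019-01-02T00:00:00Z"}]}
pages = {"page0.json": {"items": [
    {"@id": "a.1.0.0.json", "commitTimeStamp": "2019-01-01T00:00:00Z"},
    {"@id": "b.2.0.0.json", "commitTimeStamp": "2019-01-02T00:00:00Z"},
]}}

new_leaves = list(leaves_since(root, pages, "2019-01-01T12:00:00Z"))
print([l["@id"] for l in new_leaves])  # only the leaf newer than the cursor
```

After processing, the consumer stores the newest `commitTimeStamp` it saw as the new cursor, which is what lets it continue where it left off.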
14. “Find this type on NuGet.org”
Refactor from using OData to using V3?
Mostly done, one thing missing: download counts (using search now)
https://github.com/NuGet/NuGetGallery/issues/3532
Build a new version?
Welcome to this talk
16. What do we need?
Watch the NuGet.org catalog for package changes
For every package change
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
17. What do we need?
Watch the NuGet.org catalog for package changes → periodic check
For every package change → based on a queue
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions → always up, flexible scale
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
19. Sounds like functions!
[Architecture diagram: a "Watch catalog" function monitors the NuGet.org catalog and emits download commands; a "Download package" function stores the raw .nupkg and emits index commands; an "Index package" function writes to the search index and to an index-as-JSON blob; "Find type API" and "Find namespace API" functions query the search index.]
21. Functions best practices
@PaulDJohnston https://medium.com/@PaulDJohnston/serverless-best-practices-b3c97d551535
Each function should do only one thing
Easier error handling & scaling
Learn to use messages and queues
Asynchronous means of communicating, helps scale and avoid direct coupling
...
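The "one thing per function, connected by queues" idea can be sketched like this (an illustrative Python stand-in, using an in-process `queue.Queue` where the real system would use Azure Storage queues; function names mirror the diagram, the payloads are made up):

```python
# One job per function; queues decouple them and let each scale independently.
import queue

download_queue = queue.Queue()
index_queue = queue.Queue()

def watch_catalog(changed_packages):
    """One job only: turn catalog changes into download commands."""
    for package in changed_packages:
        download_queue.put(package)

def download_package():
    """One job only: fetch and store the .nupkg, then request indexing."""
    package = download_queue.get()
    # ... download and store the raw .nupkg here ...
    index_queue.put(package)

watch_catalog([("Newtonsoft.Json", "12.0.1")])
download_package()
indexed = index_queue.get()
print(indexed)
```

Because the only contract between the stages is the message shape, a failing stage can retry its message without touching the others.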
25. We’re making progress!
[The architecture diagram again: NuGet.org catalog, Watch catalog, Download command, Download package, Raw .nupkg, Index command, Index package, Index as JSON, Search index, Find type API, Find namespace API.]
27. Next up: indexing
(architecture diagram, as before)
28. Indexing
Opening up the .nupkg and reflecting on assemblies
System.Reflection.Metadata
Does not load the assembly being reflected into application process
Provides access to Portable Executable (PE) metadata in assembly
Store relation between package id+version and namespace+type
Azure Search? A database? Redis? Other?
30. System.Reflection.Metadata
using (var portableExecutableReader = new PEReader(assemblySeekableStream))
{
    var metadataReader = portableExecutableReader.GetMetadataReader();

    foreach (var typeDefinition in metadataReader.TypeDefinitions
        .Select(metadataReader.GetTypeDefinition))
    {
        if (!typeDefinition.Attributes.HasFlag(TypeAttributes.Public)) continue;

        var typeNamespace = metadataReader.GetString(typeDefinition.Namespace);
        var typeName = metadataReader.GetString(typeDefinition.Name);

        if (typeName.StartsWith("<") || typeName.StartsWith("__Static") ||
            typeName.Contains("c__DisplayClass")) continue;

        typeNames.Add($"{typeNamespace}.{typeName}");
    }
}
31. Azure Search
“Search-as-a-Service”
Scales across partitions and replicas
Define an index that will hold documents consisting of fields
Fields can be searchable, facetable, filterable, sortable, retrievable
Can’t be changed easily, think upfront!
Have to define what we want to search, and what we want to display
My function will also write documents to a JSON blob
Can re-index using Azure Search importer in case needed
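A minimal sketch of what such an index document could look like, as a plain C# class. The field names here are assumptions for illustration; in the actual index each field would additionally be marked searchable, filterable, or retrievable via the Azure Search SDK's attributes:

```csharp
// Hypothetical document shape for the package index. In Azure Search, the
// key field uniquely identifies the document; TypeNames would be marked
// searchable (simple analyzer), the rest mostly retrievable.
public class PackageDocumentSketch
{
    public string Identifier { get; set; }    // key, e.g. "newtonsoft.json@12.0.1"
    public string PackageId { get; set; }
    public string PackageVersion { get; set; }
    public string[] TypeNames { get; set; }   // searchable
    public long DownloadCount { get; set; }   // used for scoring/sorting
}
```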
33. “Do one thing well”
Our function shouldn’t care about creating a search index.
Better: return index operations, have something else handle those
Custom output binding?
35. Almost there…
(architecture diagram, as before)
38. One issue left...
Download counts - used for sorting and scoring search results
Change continuously on NuGet
Not part of V3 catalog
Could use search but that’s N(packages) queries
https://github.com/NuGet/NuGetGallery/issues/3532
If that data existed, how to update search?
Merge data! new PackageDocumentDownloads(key, downloadcount)
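The merge trick only needs a partial document carrying the document key plus the field that changed; Azure Search's merge operation then updates just those properties and leaves the rest of the document untouched. A sketch of such a class (name from the slide, shape assumed):

```csharp
// Partial document for a merge operation: only the key and the field to
// update. All other fields in the indexed document stay as they are.
public class PackageDocumentDownloads
{
    public PackageDocumentDownloads(string key, long downloadCount)
    {
        Identifier = key;
        DownloadCount = downloadCount;
    }

    public string Identifier { get; }    // must match the index's key field
    public long DownloadCount { get; }
}
```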
39. We’re done!
(architecture diagram, as before)
40. We’re done!
Functions
Collect changes from NuGet catalog
Download binaries
Index binaries using PE Header
Make search index available in API
Trigger, input and output bindings
Each function should do only one thing
(architecture diagram, as before)
41. We’re done!
All our functions can scale (and fail)
independently
Full index in May 2019 took ~12h on 2 B1 instances
Can be faster on more CPUs
~1.7 million packages (according to the NuGet.org homepage)
~2.1 million packages (according to the catalog)
~8,400 catalog pages
with ~4,200,000 catalog leaves
(hint: repo signing)
(architecture diagram, as before)
42. Closing thoughts…
Would deploy in separate function apps for cost
Trigger binding collects all the time so needs dedicated capacity (and thus, cost)
Others can scale within bounds (think of $$$)
Would deploy in separate function apps for failure boundaries
Trigger, indexing, downloading should not affect health of API
Are bindings portable...?
Avoid them if (framework) lock-in matters to you
They *are* nice in terms of the programming model…
Show feature in action in Visual Studio (and show you can see basic metadata etc.)
Feature was copied into Visual Studio 2017 - https://www.hanselman.com/blog/VisualStudio2017CanAutomaticallyRecommendNuGetPackagesForUnknownTypes.aspx
Demo the feed quickly?
Around 3 TB in May 2019
Demo ODataDump quickly
Demo: click around in the API to show some base things
Raw API - click around in the API to show some base things, explain how a cursor could go over it
Root https://api.nuget.org/v3/catalog0/index.json
Page https://api.nuget.org/v3/catalog0/page0.json
Leaf https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/adam.jsgenerator.1.1.0.json
Explain CatalogDump
NuGet.Protocol.Catalog comes from GitHub
CatalogProcessor fetches all pages between min and max timestamp
My implementation, BatchCatalogProcessor, fetches multiple pages at the same time and builds a “latest state” – much faster!
Fetches leaves, for every leaf calls into a simple method
Much faster, easy to pause (keep track of min/max timestamp)
LOL input, process, output
More serious: events trigger code
Periodic check for packages
Queue message to index things
API request runs a search
No server management or capacity planning
Will use storage queues in the demos to be able to run things locally. Ideally use Service Bus topics or Event Grid (transactional)
Create a new TimerTrigger function
We will need a function to index things from NuGet
Timer will trigger every X amount of time
Timer provides last timestamp and next timestamp, so we can run our collector for that period
Snippet: demo-timertrigger
Mention HttpClient is not used correctly here: a new instance is created per invocation and never reused, so it will starve TCP connections at some point
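The usual fix is to share a single HttpClient for the lifetime of the process, e.g. via a static field. This is a general .NET pattern, not code from the demo:

```csharp
using System.Net.Http;

public static class Http
{
    // One shared instance for the whole process: the underlying sockets are
    // pooled and reused instead of a new connection being opened (and left
    // lingering) on every function invocation.
    public static readonly HttpClient Client = new HttpClient();
}
```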
Go over code example and run it
var httpClient = new HttpClient();
var cursor = new InMemoryCursor(timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow);

var processor = new CatalogProcessor(
    cursor,
    new CatalogClient(httpClient, new NullLogger<CatalogClient>()),
    new DelegatingCatalogLeafProcessor(
        added =>
        {
            log.LogInformation("[ADDED] " + added.PackageId + "@" + added.PackageVersion);
            return Task.FromResult(true);
        },
        deleted =>
        {
            log.LogInformation("[DELETED] " + deleted.PackageId + "@" + deleted.PackageVersion);
            return Task.FromResult(true);
        }),
    new CatalogProcessorSettings
    {
        MinCommitTimestamp = timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow,
        MaxCommitTimestamp = timer.ScheduleStatus?.Next ?? DateTimeOffset.UtcNow,
        ServiceIndexUrl = "https://api.nuget.org/v3/index.json"
    },
    new NullLogger<CatalogProcessor>());

await processor.ProcessAsync(CancellationToken.None);
Each function should only do one thing! We are violating this.
Go over Approach1 code – Enqueuer class
Mention we are using roughly the same code as before
Differences are that our function is now no longer doing things itself, instead it’s adding messages to a queue for processing later on
That Queue binding is interesting. This is where the input/output comes from. Instead of managing our own queue connection, we let the framework handle all plumbing so we can focus on adding messages.
In Indexer, we use the Queue as an input binding, and read messages.
We can now scale enqueuing and indexing separately! But are we there yet?
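The queue messages themselves are just serialized package operations: the enqueuer writes JSON, the indexer's queue trigger reads it back. A minimal sketch of such a payload; the type and property names are assumptions, not the demo's actual PackageOperation type:

```csharp
using System.Text.Json;

// Hypothetical queue message shape shared by enqueuer and indexer.
public class PackageOperationMessage
{
    public string PackageId { get; set; }
    public string PackageVersion { get; set; }
    public string Operation { get; set; } // "Add" or "Delete"

    // Enqueuer side: serialize before adding to the queue.
    public string ToQueueMessage() => JsonSerializer.Serialize(this);

    // Indexer side: deserialize the message the trigger hands us.
    public static PackageOperationMessage FromQueueMessage(string json) =>
        JsonSerializer.Deserialize<PackageOperationMessage>(json);
}
```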
Go over Approach2 code
Show this is MUCH simpler – a trigger binding that provides input, and a queue output binding to write that input to a queue
Let’s go over what it takes to build a trigger binding
NuGetCatalogTriggerAttribute – the data needed for the trigger to work – go over properties and attributes
Hooking it up requires a binding configuration – NuGetCatalogTriggerExtensionConfigProvider
It says: if you see this specific binding, register it as a trigger that maps to some provider
So we need that provider – NuGetCatalogTriggerAttributeBindingProvider
Provider is there to create an object that provides data. In our case we need to store the NuGet catalog timestamp cursor, so we do that on storage, and then return the actual binding – NuGetCatalogTriggerBinding
In NuGetCatalogTriggerBinding, we have to specify how data can be bound. What if I use a different type of object than PackageOperation? What if someone used a node.js or Python function instead of .NET? We need to define the shape of the data our trigger provides.
PackageOperationValueProvider is also interesting, this provides data shown in the portal diagnostics
CreateListenerAsync is where the actual trigger code will be created – NuGetCatalogListener
NuGetCatalogListener uses the BatchCatalogProcessor we had previously, and when a package is added or deleted it will call into the injected ITriggeredFunctionExecutor
ITriggeredFunctionExecutor is Azure Functions framework specific, but it’s the glue that will call into our function with the data we provide
Note StartAsync/StopAsync where you can add startup/shutdown code
ONE THING LEFT THAT IS NOT DOCUMENTED – Startup.cs to register the binding.
And since we are in a different class library, also need Microsoft.Azure.WebJobs.Extensions referenced to generate \bin\Debug\netcoreapp2.1\bin\extensions.json
As a result our code is now MUCH cleaner, show it again and maybe also show it in action
Mention [Singleton(Mode = SingletonMode.Listener)] – we need to ensure this binding only runs single-instance (cursor clashes otherwise). This is due to how the catalog works; parallel processing is harder to do. But we can fix that by scaling the Indexer later on.
Show Approach3 PopulateQueueAndTable
Same code, but a bit more production worthy
Sending data to two queues (indexing and downloading)
Storing data in a table (and yes, violating “do one thing” again but I call it architectural freedom)
Next up will be downloading and indexing. Let’s start with downloading.
Grab a copy of the .nupkg from NuGet and store it in a blob
Redundancy - no need to re-download/stress NuGet on a re-index
Go over Approach3 code
DownloadToStorage uses a QueueTrigger to run whenever a message appears in queue
Note no singleton: we can scale this across multiple instances/multiple servers
Uses a Blob input binding that provides access to a blob
Note the parameters: the name of the blob is resolved based on data from other inputs, which is pretty nifty
Our code checks whether it’s an add or a delete, and either downloads + uploads to the blob reference, or deletes the blob reference
Next up will be indexing itself. There are a couple of things here…
Go over Approach3 code
PackageIndexer uses a QueueTrigger to run whenever a message appears in queue
Uses a Blob input binding that provides access to a blob where we can write our indexed entity – will show this later
Based on package operation, we will add or delete from the index
RunAddPackageAsync has some plumbing, probably too much, to download the .nupkg file and store it on disk
Note: we store it on disk as we need a seekable stream. So why not a memory stream? Some NuGet packages are HUGE.
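Getting a seekable stream without holding the whole package in memory can be sketched like this. This is generic .NET, not the demo's exact plumbing:

```csharp
using System.IO;

static class SeekableCopy
{
    // Copy a (possibly non-seekable) source stream to a temp file and hand
    // back a seekable FileStream. PEReader needs to seek; a MemoryStream is
    // risky because some .nupkg files are huge. DeleteOnClose cleans up the
    // temp file when the caller disposes the stream.
    public static FileStream ToTempFile(Stream source)
    {
        var path = Path.GetTempFileName();
        var file = new FileStream(path, FileMode.Create, FileAccess.ReadWrite,
            FileShare.None, 81920, FileOptions.DeleteOnClose);
        source.CopyTo(file);
        file.Position = 0; // rewind so the caller can start reading
        return file;
    }
}
```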
Find PEReader usage and show how it will index a given package’s public types and namespaces
All goes into a typeNames collection.
Now: how do we add this info to the index?
Show PackageDocument class, has MANY properties
First important: the Identifier property has [Key] applied. Azure Search needs a key for the document so we can retrieve by key, which is useful when updating existing content or to find a specific document and delete it from the index.
Second important: TypeNames is searchable. Also mention “simpleanalyzer”: “Divides text at non-letters and converts them to lower case.” Other analyzers remove stopwords and do other things, this one should be as searchable as possible.
Other fields are sometimes searchable, sometimes facetable – a bit of leftover from me thinking about search use cases. The R# API only searches on type name, so we could make everything else just retrievable as well.
Of course, index is not there by default, so need to create it. We do this when our function is instantiated (static constructor, so only once per launch of our functions)
Is this good? Yes, because only once per server instance our function runs on. No because we do it at one point, what if the index is deleted in between and needs to be recreated? Edge case, but a retry strategy could be a good idea...
Next, we create our package document, and at one point we add it to a list of index actions, and to blob storage
indexActions.Add(IndexAction.MergeOrUpload(packageToIndex));
JsonSerializer.Serialize(jsonWriter, packagesToIndex);
Writing to index using batch - var indexBatch = IndexBatch.New(actions);
Leftover code from earlier; a batch makes no sense for one document, but in case you want to do multiple in one go, this is the way. Do beware a batch can only be several MB in size; for this NuGet indexing I can only do ~25 documents in a batch before the payload is too large.
That’s… it!
Run approach 3 (for last hour) and see functions being hit / packages added to index
Go to Azure Search portal as well, show how importer would work in case of fire
Go over Approach3 code
PackageIndexerWithCustomBinding is mostly the same code
One difference: it uses the [AzureSearchIndex] binding to write add/delete operations to the index instead
Go over how it works. Again, an attribute with settings – AzureSearchIndexAttribute
Also a configuration that registers the binding as an output binding using BindToCollector – AzureSearchExtensionConfigProvider
Now, what’s this OpenType?
It’s some sort of dynamic type. If we want to create an AzureSearch output binding, we better support more than just our PackageDocument use case!
So we need a collector builder that can create the actual binding implementation based on the real type requested by our function parameter – AzureSearchAsyncCollectorBuilder
In AzureSearchAsyncCollectorBuilder, we do that. Very simple bootstrap code in this case, but could be more complex depending on the type of binding you are creating.
Our AzureSearchAsyncCollector uses the attribute to check for Azure Search connection details, as well as the type of operation we expect it to handle. Why not all? Well, IAsyncCollector only has Add and Flush.
Note: add called manually, flush at function complete – could use flush to send things in a batch...
Code itself pretty straightforward. On Add, we add an action to search. With a retry in case the index does not exist – we then create it.
Creation code is kind of interesting, as we use some reflection in case we specify a given type of document to index.
Why? Because when we do upserts, we may want to update just one or two properties, and can use a different DTO in that case (but still have the index shaped to the full document shape)
Run when time left, but nothing fancy here...
Now we need to make ReSharper talk to our search. We have the index, so that should be a breeze, right?
Go over Web code
RunFindTypeApiAsync and RunFindNamespaceAsync
Both use “name” as their query parameter to search for
RunInternalAsync does the heavy lifting
Grabs other parameters
Runs search, and collects several pages of results
Why is this ForEachAsync there?
Search index has multiple versions for every package id, yet ReSharper expects only the latest matching all parameters
Azure Search has no group by / distinct by, so need to do this in memory. Doing it here by fetching a maximum number of results and doing the grouping manually.
Use the collected data to build result. Add matching type names etc.
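The in-memory grouping can be sketched with LINQ: group hits by package id and keep the highest version per group. This sketch uses System.Version for simplicity; NuGet's own semver comparer would be more correct for real package versions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ResultGrouping
{
    // Collapse multiple versions of the same package id down to the latest
    // one, since the search index holds a document per package version but
    // ReSharper only expects the latest match.
    public static List<(string Id, string Version)> LatestPerPackage(
        IEnumerable<(string Id, string Version)> hits)
    {
        return hits
            .GroupBy(h => h.Id, StringComparer.OrdinalIgnoreCase)
            .Select(g => g.OrderByDescending(h => Version.Parse(h.Version)).First())
            .ToList();
    }
}
```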
Example requests:
http://localhost:7071/api/v1/find-type?name=JsonConvert
http://localhost:7071/api/v1/find-type?name=CamoServer&allowPrerelease=true&latestVersion=false
https://nugettypesearch.azurewebsites.net/api/v1/find-type?name=JsonConvert
In ReSharper (devenv /ReSharper.Internal, go to NuGet tool window, set base URL to https://nugettypesearch.azurewebsites.net/api/v1/)
Write some code that uses JsonConvert / JObject and try it out.