Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
Logstash is a tool for managing logs that allows for input, filter, and output plugins to collect, parse, and deliver logs and log data. It works by treating logs as events that are passed through the input, filter, and output phases, with popular plugins including file, redis, grok, elasticsearch and more. The document also provides guidance on using Logstash in a clustered configuration with an agent and server model to optimize log collection, processing, and storage.
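The input, filter, and output phases described above map directly onto a Logstash configuration file. A minimal sketch of that flow, where the file path and index name are hypothetical, not taken from the talk:

```conf
# Hypothetical pipeline: read a log file, parse each line with grok,
# and ship the structured events to Elasticsearch.
input {
  file {
    path => "/var/log/app/app.log"   # hypothetical path
    start_position => "beginning"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

The date placeholder in the index name gives one index per day, which is the clustered agent-and-server storage pattern the document recommends.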
Logstash + Elasticsearch + Kibana Presentation on Startit Tech Meetup
1. Logstash is an open source tool for collecting, processing, and storing logs and other event data. It allows centralized collection and parsing of logs from various sources before sending them to Elasticsearch for storage and indexing.
2. Kibana provides visualization and search capabilities on top of the logs stored in Elasticsearch, allowing users to easily explore and analyze log data.
3. The combination of Logstash, Elasticsearch, and Kibana provides a replacement for commercial log management tools like Splunk, with the ability to collect, parse, store, search, and visualize logs from many different sources in a centralized way.
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
This talk covers the basics of centralizing logs in Elasticsearch and all the strategies that make it scale with billions of documents in production. Topics include:
- Time-based indices and index templates to efficiently slice your data
- Different node tiers to de-couple reading from writing, heavy traffic from low traffic
- Tuning various Elasticsearch and OS settings to maximize throughput and search performance
- Configuring tools such as logstash and rsyslog to maximize throughput and minimize overhead
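The time-based indices in the first bullet are typically paired with an index template, so each day's new index automatically picks up the same settings and mappings. A minimal sketch of a template body (as sent to Elasticsearch's `_index_template` API; the pattern, shard counts, and field names are illustrative):

```json
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}
```

Any index whose name matches `logs-*` (for example `logs-2024.01.15`) is created with these settings, which is what makes daily slicing cheap to operate.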
Reactive Functional Programming with Java 8 on Android N
The document discusses reactive functional programming with Java 8 on Android N. It introduces reactive programming concepts like Observables and Subscribers. It provides an example of using RxJava to find PNG images in a folder and load them into a gallery, as compared to the vanilla Java approach. It also demonstrates creating Observables, Subscribers, transforming streams, handling REST responses, and subscribing to streams. Specifically, it shows an example of clicking a button to get a user's followers from GitHub, get details on each follower, filter by company, and update the UI with results.
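RxJava itself is Java, but the declarative-versus-imperative contrast the talk draws can be sketched in a few lines of Python. This stand-in filters PNG files from a folder both ways; it illustrates the style only and is not the talk's actual code:

```python
from pathlib import Path

def find_png_images(folder):
    """Declarative version: describe the result rather than build it."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.suffix.lower() == ".png")

def find_png_images_imperative(folder):
    """'Vanilla' version: explicit loop, mutation, and sort."""
    result = []
    for p in Path(folder).iterdir():
        if p.suffix.lower() == ".png":
            result.append(p.name)
    result.sort()
    return result
```

Both return the same list; the reactive approach in the talk adds asynchrony and composition on top of this declarative shape.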
This document introduces the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It provides instructions on setting up each component and using them together. Elasticsearch is a search engine that stores and searches data in JSON format. Logstash is an agent that collects logs from various sources, applies filters, and outputs to Elasticsearch. Kibana visualizes and explores the logs stored in Elasticsearch. The document demonstrates setting up each component and running a proof of concept to analyze sample log data.
This document provides an overview of the ELK stack, including Logstash for collecting and parsing logs, Elasticsearch for indexing logs, and Kibana for visualizing logs. It discusses using the open source ELK stack as an alternative to Splunk and provides instructions for getting started with a basic ELK implementation.
This document provides an overview of Retrofit, an open source library for Android and Java that allows making REST API calls in a simple and efficient manner. It discusses how to initialize Retrofit with an endpoint URL and adapter, define API methods using annotations, handle requests and responses both synchronously and asynchronously, and convert JSON responses to Java objects using Gson. Code samples are provided throughout to demonstrate common Retrofit tasks like making GET requests, handling API parameters and headers, and subscribing to asynchronous Observable responses.
This document discusses the ELK stack, which consists of Elasticsearch, Logstash, and Kibana. It provides an overview of each component, including that Elasticsearch is a search and analytics engine, Logstash is a data collection engine, and Kibana is a data visualization platform. The document then discusses setting up an ELK stack to index and visualize application logs.
Logstash is a tool for ingesting, processing, and storing data from various sources into Elasticsearch. It includes plugins for input, filter, and output functionality. Common uses of Logstash include parsing log files, enriching events, and loading data into Elasticsearch for search and analysis. The document provides an overview of Logstash and demonstrates how to install it, configure input and output plugins, and create simple and advanced processing pipelines.
Experiences in ELK with D3.js for Large Log Analysis and Visualization
This document discusses experiences using the ELK stack (Elasticsearch, Logstash, Kibana) and D3.js for large log analysis and visualization. It begins with an overview of network traffic logging at Kasetsart University, which generates over 30 terabytes of log data per day. It then demonstrates setting up an ELK testbed to index these logs in real-time for fast search and exploration in Kibana. Finally, it shows how D3.js can be used to create dynamic, real-time visualizations of the logged data.
This document discusses Logstash, an open source tool for collecting, parsing, and storing log files. It can ingest logs from various sources using inputs, apply filters to parse and transform log events, and output the structured data to destinations like Elasticsearch for search and analysis. The document provides an overview of Logstash's core functionality and components, demonstrates simple usage examples, and discusses integrating it with Kibana for visualizing and exploring log data. It also shares some lessons learned in production usage and points to additional resources.
Retrofit is a type-safe REST client library for Android and Java that allows defining REST APIs as Java interfaces. It simplifies HTTP communication by converting remote APIs into declarative interfaces. It supports synchronous, asynchronous, and observable API consumption. The Retrofit library was created by Square.
A case study of the usage of Gradle in the Ratpack web framework. First, we'll examine the Ratpack Gradle plugins, including their functionality, implementation, and testing. Next, we'll examine the build script for the Ratpack project itself. Here, we'll discuss various details of the project's build, including handling multiple projects, multiple types of testing, support for multiple styles of target hardware (developer workstations, cloud CI), and more. For each, we'll go over the desired behavior, how it was achieved, and why it was necessary.
The ELK Stack workshop covers real-world use cases and works with the participants to implement them. This includes an Elastic overview, Logstash configuration, creation of dashboards in Kibana, guidelines and tips on processing custom log formats, designing a system to scale, choosing hardware, and managing the lifecycle of your logs.
Indexing and searching NuGet.org with Azure Functions and Search - .NET fwday...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
Maarten Balliauw "Indexing and searching NuGet.org with Azure Functions and S...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps to find the correct NuGet package based on a public type name.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result. Expect code, insights into the service design process, and more!
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
Which NuGet package was that type in again? In this session, let's build a "reverse package search" that helps find the correct NuGet package based on a public type.
Together, we will create a highly scalable serverless search engine using Azure Functions and Azure Search that performs three tasks: listening for new packages on NuGet.org (using a custom binding), indexing packages in a distributed way, and exposing an API that accepts queries and gives our clients the best result.
https://blog.maartenballiauw.be/post/2019/07/30/indexing-searching-nuget-with-azure-functions-and-search.html
Everybody is consuming NuGet packages these days. It’s easy, right? But how can we create and share our own packages? What is .NET Standard? How should we version, create, publish and share our package?
Once we have those things covered, we’ll look beyond what everyone is doing. How can we use the NuGet client API to fetch data from NuGet? Can we build an application plugin system based on NuGet? What hidden gems are there in the NuGet server API? Can we create a full copy of NuGet.org?
Good questions! In this talk, we will get them answered.
Everybody is consuming or producing NuGet packages these days. It’s easy, right? We’ll look beyond what everyone is doing. How can we use the NuGet client API to fetch data from NuGet? Can we build an application plugin system based on NuGet? What hidden gems are there in the NuGet server API? Can we create a full copy of NuGet.org?
- What are Internal Developer Portal (IDP) and Platform Engineering?
- What is Backstage?
- How Backstage can help developers build a developer portal that makes their jobs easier
Jirayut Nimsaeng
Founder & CEO
Opsta (Thailand) Co., Ltd.
YouTube recording: https://youtu.be/u_nLbgWDwsA?t=850
Dev Mountain Tech Festival @ Chiang Mai
November 12, 2022
This document summarizes Nuxeo's Release 8.1 including new tools for launching performance tests on Nuxeo clusters, an instant share feature for temporarily granting access without account creation, Live Connect integration for Box file sharing, and expanded Elasticsearch integration. It also discusses Nuxeo Docker images, a Nuxeo code generator, a Polymer sample app, updated REST and automation clients, and upcoming branch management features.
This presentation was given at the Boston Django meetup on November 16, and surveyed several leading PaaS providers including Stackato, Dotcloud, OpenShift and Heroku.
For each PaaS provider, I documented the steps necessary to deploy Mezzanine, a popular Django-based CMS and blogging platform.
At the end of the presentation, I do a wrap-up of the different providers and provide a comparison matrix showing which providers have which features. This matrix is likely to go out-of-date quickly because these providers are adding new features all the time.
The document provides an overview of OGCE (Open Grid Computing Environment), which develops and packages reusable software components for science portals. Key components described include services, gadgets, tags, and how they fit together. Installation and usage of the various OGCE components is discussed at a high level.
With distributed tracing, we can track requests as they pass through multiple services, emitting timing and other metadata throughout, and this information can then be reassembled to provide a complete picture of the application’s behavior at runtime - Read more in https://blog.buoyant.io/2016/05/17/distributed-tracing-for-polyglot-microservices/ and https://www.rookout.com/
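The core idea, a trace ID generated once per request and attached to every timing record so the records can later be reassembled, can be sketched without any real tracing system. All names here are hypothetical:

```python
import time
import uuid

def new_trace_id():
    """One ID per incoming request; every downstream span carries it."""
    return uuid.uuid4().hex

def traced_call(trace_id, service_name, fn, spans, *args):
    """Run fn, recording a span (service, duration) tagged with the
    shared trace_id."""
    start = time.perf_counter()
    result = fn(*args)
    spans.append({
        "trace_id": trace_id,
        "service": service_name,
        "duration_s": time.perf_counter() - start,
    })
    return result

# A request passing through two "services":
spans = []
tid = new_trace_id()
total = traced_call(tid, "pricing", lambda x: x * 2,
                    spans, traced_call(tid, "inventory", lambda: 21, spans))
```

Because every span shares the same `trace_id`, a collector can group and order them into the complete per-request picture the paragraph describes.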
This document discusses Elsevier's SciVerse platform and developer network. It introduces SciVerse as a social network for scientific search and content that uses OpenSocial standards. It describes how SciVerse extends Apache Shindig to make apps contextual. It also discusses SciVerse's framework and content APIs that allow apps to access scientific content and metadata. Finally, it provides examples of object-oriented JavaScript coding and using the APIs to build mashups with third-party services.
This document discusses using Nutch, an open source web crawler, with Scala. It provides an overview of Nutch's architecture and how plugins can be written in Scala to extend its functionality. As an example, it describes how Scala was used to build a plugin for an aggregator application that crawls multiple suppliers, parses content to extract details, and passes this data to an actor for processing. The solution was able to crawl 5 suppliers and collect over 500k records using Nutch and 823 lines of Scala code.
Jilles van Gurp presents on the ELK stack and how it is used at Linko to analyze logs from application servers, Nginx, and Collectd. The ELK stack consists of Elasticsearch for storage and search, Logstash for processing and transporting logs, and Kibana for visualization. At Linko, Logstash collects logs and sends them to Elasticsearch for storage and search. Logs are filtered and parsed by Logstash using grok patterns before being sent to Elasticsearch. Kibana dashboards then allow users to explore and analyze logs in real time from Elasticsearch. While the ELK stack is powerful, there are some operational gotchas to watch out for, like node restarts impacting availability and field data caching.
This document describes how to use the ELK (Elasticsearch, Logstash, Kibana) stack to centrally manage and analyze logs from multiple servers and applications. It discusses setting up Logstash to ship logs from files and servers to Redis, then having a separate Logstash process read from Redis and index the logs to Elasticsearch. Kibana is then used to visualize and analyze the logs indexed in Elasticsearch. The document provides configuration examples for Logstash to parse different log file types like Apache access/error logs and syslog.
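Grok patterns are essentially named regular expressions. As a rough stand-in (a simplified pattern, not the real COMBINEDAPACHELOG definition), the structuring step Logstash performs on an Apache access-log line looks like this in plain Python:

```python
import re

# Simplified stand-in for a grok pattern matching the Apache common
# log format: client, timestamp, request line, status, bytes.
APACHE_COMMON = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_access_line(line):
    """Turn one raw log line into a structured event dict (or None)."""
    m = APACHE_COMMON.match(line)
    return m.groupdict() if m else None

event = parse_access_line(
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)
```

The named groups become the event fields that Elasticsearch indexes, which is exactly what the grok filter in the Logstash configuration produces at scale.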
These slides present the following Twitter pipeline built on the ELK stack (Elasticsearch, Logstash, Kibana): https://github.com/melvynator/ELK_twitter They show how to integrate machine learning into your Twitter pipeline.
Harnessing the power of Nutch with Scala - Knoldus Inc.
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018 - Holden Karau
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API, compared to the untyped DataFrame API.
Apache Calcite (a tutorial given at BOSS '21) - Julian Hyde
The document provides instructions for setting up the environment and coding exercises for the BOSS '21 Copenhagen tutorial on Apache Calcite.
It includes the following steps:
1. Clone the GitHub repository containing the sample code and dependencies.
2. Compile the project.
It also outlines the draft schedule for the tutorial, which covers topics like an introduction to Calcite, a demonstration of SQL queries on CSV files, setting up the coding environment, using Lucene for indexing, and coding exercises to build parts of the logical and physical query plans in Calcite. The tutorial will be led by Stamatis Zampetakis from Cloudera and Julian Hyde from Google, who are both committers to the Apache Calcite project.
This document provides an overview and agenda for a meetup on distributed tracing using Jaeger. It begins with introducing the speaker and their background. The agenda then covers an introduction to distributed tracing, open tracing, and Jaeger. It details a hello world example, Jaeger terminology, and building a full distributed application with Jaeger. It concludes with wrapping up the demo, reviewing Jaeger architecture, and discussing open tracing's ability to propagate context across services.
Talk at RubyKaigi 2015.
Plugin architecture is a well-known technique for bringing extensibility to a program. Ruby has good language features for plugins, and RubyGems.org is an excellent platform for plugin distribution. However, creating a plugin architecture is not as easy as writing code without one: it requires a plugin loader, packaging, a loosely coupled API, and attention to performance. Loading two versions of a gem at once is an unsolved challenge in Ruby, though it has been solved in Java.
I have designed open-source software such as Fluentd and Embulk, which provide most of their functionality through plugins. I will talk about their plugin-based architecture.
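The plugin approach described above, a core that looks up behavior in a registry instead of hard-coding it, can be sketched in a few lines of Python. The filter names and classes below are hypothetical illustrations, not Fluentd's or Embulk's actual API:

```python
# Minimal plugin registry: the core delegates to whatever has been
# registered under a name, so new behavior ships as new plugins.
PLUGINS = {}

def register(name):
    """Decorator that records a plugin class under a lookup name."""
    def wrapper(cls):
        PLUGINS[name] = cls
        return cls
    return wrapper

@register("upcase")
class UpcaseFilter:
    def apply(self, event):
        return event.upper()

@register("reverse")
class ReverseFilter:
    def apply(self, event):
        return event[::-1]

def run_pipeline(event, plugin_names):
    """Apply the named plugins to an event, in order."""
    for name in plugin_names:
        event = PLUGINS[name]().apply(event)
    return event
```

The core (`run_pipeline`) never names a concrete filter, which is the loose coupling that makes third-party plugins possible; the hard parts the talk covers, such as loading, packaging, and versioning, sit on top of this shape.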
Examiness hints and tips from the trenches - Ismail Mayat
This document provides an overview of tools and techniques for working with the Examine search engine in Umbraco, including:
- Tools like Luke and the Examine Dashboard for debugging indexes.
- Using the GatheringNodeData event to merge fields, add fields like node type aliases, and handle errors during indexing.
- Indexing different media types like PDFs using Tika.
- Techniques for search highlighting, boosting documents, and deploying index changes across environments.
- Faceted search capabilities and using the index as an object database.
The presenter encourages exploring the full capabilities of Examine and provides examples of how to optimize indexing and searching.
Large Scale Crawling with Apache Nutch and Friendslucenerevolution
Presented by Julien Nioche, Director, DigitalPebble
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Large Scale Crawling with Apache Nutch and FriendsJulien Nioche
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
Similar to CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and Search (20)
Bringing nullability into existing code - dammit is not the answer.pptxMaarten Balliauw
The C# nullability features help you minimize the likelihood of encountering that dreaded System.NullReferenceException. Nullability syntax and annotations give hints as to whether a type can be nullable or not, and better static analysis is available to catch unhandled nulls while developing your code. What's not to like?
Introducing explicit nullability into an existing code bases is a Herculean effort. There's much more to it than just sprinkling some `?` and `!` throughout your code. It's not a silver bullet either: you'll still need to check non-nullable variables for null.
In this talk, we'll see some techniques and approaches that worked for me, and explore how you can migrate an existing code base to use the full potential of C# nullability.
Nerd sniping myself into a rabbit hole... Streaming online audio to a Sonos s...Maarten Balliauw
After buying a set of Sonos-compatible speakers at IKEA, I was disappointed there's no support for playing audio from a popular video streaming service. They stream Internet radio, podcasts and what not. Well, not that service I want it to play!
Determined - and not knowing how deep the rabbit hole would be - I ventured on a trip that included network sniffing on my access point, learning about UPnP and running a web server on my phone (without knowing how to write anything Android), learning how MP4 audio is packaged (and has to be re-packaged). This ultimately resulted in an Android app for personal use, which does what I initially wanted: play audio from that popular video streaming service on Sonos.
Join me for this story about an adventure that has no practical use, probably violates Terms of Service, but was fun to build!
Building a friendly .NET SDK to connect to SpaceMaarten Balliauw
Space is a team tool that integrates chats, meetings, git hosting, automation, and more. It has an HTTP API to integrate third party apps and workflows, but it's massive! And slightly opinionated.
In this session, we will see how we built the .NET SDK for Space, and how we make that massive API more digestible. We will see how we used code generation, and incrementally made the API feel more like a real .NET SDK.
Microservices for building an IDE - The innards of JetBrains Rider - NDC Oslo...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices.
NDC Sydney 2019 - Microservices for building an IDE – The innards of JetBrain...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider.
We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
JetBrains Australia 2019 - Exploring .NET’s memory management – a trip down m...Maarten Balliauw
This document discusses .NET memory management and the garbage collector. It explains that the CLR manages memory in a heap and the garbage collector reclaims unused memory. It describes how objects are allocated in generations and discusses how to help the garbage collector perform better by reducing allocations, using value types when possible, and properly disposing of objects. The document also provides examples of hidden allocations and demonstrates tools for analyzing memory usage like ClrMD and dotMemory Unit.
Approaches for application request throttling - Cloud Developer Days PolandMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
Approaches for application request throttling - dotNetCologneMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
CodeStock - Exploring .NET memory management - a trip down memory laneMaarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
ConFoo Montreal - Microservices for building an IDE - The innards of JetBrain...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
ConFoo Montreal - Approaches for application request throttlingMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
Microservices for building an IDE – The innards of JetBrains Rider - TechDays...Maarten Balliauw
Ever wondered how IDE’s are built? In this talk, we’ll skip the marketing bit and dive into the architecture and implementation of JetBrains Rider. We’ll look at how and why we have built (and open sourced) a reactive protocol, and how the IDE uses a “microservices” architecture to communicate with the debugger, Roslyn, a WPF renderer and even other tools like Unity3D. We’ll explore how things are wired together, both in-process and across those microservices. Let’s geek out!
JetBrains Day Seoul - Exploring .NET’s memory management – a trip down memory...Maarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
VISUG - Approaches for application request throttlingMaarten Balliauw
Speaking from experience building a SaaS: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
What is going on - Application diagnostics on Azure - TechDays FinlandMaarten Balliauw
We all like building and deploying cloud applications. But what happens once that’s done? How do we know if our application behaves like we expect it to behave? Of course, logging! But how do we get that data off of our machines? How do we sift through a bunch of seemingly meaningless diagnostics? In this session, we’ll look at how we can keep track of our Azure application using structured logging, AppInsights and AppInsights analytics to make all that data more meaningful.
ConFoo - Exploring .NET’s memory management – a trip down memory laneMaarten Balliauw
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
Speaking from experience building MyGet.org: users are insane. If you are lucky, they use your service, but in reality, they probably abuse. Crazy usage patterns resulting in more requests than expected, request bursts when users come back to the office after the weekend, and more! These all pose a potential threat to the health of our web application and may impact other users or the service as a whole. Ideally, we can apply some filtering at the front door: limit the number of requests over a given timespan, limiting bandwidth, ...
In this talk, we’ll explore the simple yet complex realm of rate limiting. We’ll go over how to decide on which resources to limit, what the limits should be and where to enforce these limits – in our app, on the server, using a reverse proxy like Nginx or even an external service like CloudFlare or Azure API management. The takeaway? Know when and where to enforce rate limits so you can have both a happy application as well as happy customers.
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
The .NET Garbage Collector (GC) is really cool. It helps providing our applications with virtually unlimited memory, so we can focus on writing code instead of manually freeing up memory. But how does .NET manage that memory? What are hidden allocations? Are strings evil? It still matters to understand when and where memory is allocated. In this talk, we’ll go over the base concepts of .NET memory management and explore how .NET helps us and how we can help .NET – making our apps better. Expect profiling, Intermediate Language (IL), ClrMD and more!
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Best Programming Language for Civil EngineersAwais Yaseen
The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era.
Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datams and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
The Rise of Supernetwork Data Intensive ComputingLarry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Bert Blevins
Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc
Six months into 2024, and it is clear the privacy ecosystem takes no days off!! Regulators continue to implement and enforce new regulations, businesses strive to meet requirements, and technology advances like AI have privacy professionals scratching their heads about managing risk.
What can we learn about the first six months of data privacy trends and events in 2024? How should this inform your privacy program management for the rest of the year?
Join TrustArc, Goodwin, and Snyk privacy experts as they discuss the changes we’ve seen in the first half of 2024 and gain insight into the concrete, actionable steps you can take to up-level your privacy program in the second half of the year.
This webinar will review:
- Key changes to privacy regulations in 2024
- Key themes in privacy and data governance in 2024
- How to maximize your privacy program in the second half of 2024
UiPath Community Day Kraków: Devs4Devs ConferenceUiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to spike your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Measuring the Impact of Network Latency at TwitterScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
3. “Find this type on NuGet.org”
In ReSharper and Rider
Search for namespaces & types that are not yet referenced
4. “Find this type on NuGet.org”
Idea in 2013, introduced in ReSharper 9
(2015 - https://www.jetbrains.com/resharper/whatsnew/whatsnew_9.html)
Consists of
ReSharper functionality
A service that indexes packages and powers search
Azure Cloud Service (Web and Worker role)
Indexer uses NuGet OData feed
https://www.nuget.org/api/v2/Packages?$select=Id,Version,NormalizedVersion,LastEdited,Published&$orderby=LastEdited%20desc&$filter=LastEdited%20gt%20datetime%272012-01-01%27
6. NuGet over time...
Repo-signing announced August 10, 2018
Big chunk of packages signed over holidays 2018/2019
Re-download all metadata & binaries
Very slow over OData
Is there a better way?
https://blog.nuget.org/20180810/Introducing-Repository-Signatures.html
8. NuGet talks to a repository
Can be on disk/network share or remote over HTTP(S)
HTTP(S) APIs
V2 – OData based (used by pretty much all NuGet servers out there)
V3 – JSON based (NuGet.org, TeamCity, MyGet, Azure DevOps, GitHub repos)
9. V2 Protocol
Started as “OData-to-LINQ-to-Entities” (V1 protocol)
Optimizations added to reduce # of random DB queries (VS2013+ & NuGet 2.x)
Search – Package manager list/search
FindPackagesById – Package restore (Does it exist? Where to download?)
GetUpdates – Package manager updates
https://www.nuget.org/api/v2 (code in https://github.com/NuGet/NuGetGallery)
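To make the V2 shape concrete, here is a small Python sketch (the deck's own code is C#) of how a restore client could build the FindPackagesById query; the helper name is hypothetical, and the URL shape follows the public OData conventions:

```python
# Hypothetical helper building the V2 FindPackagesById query used during
# package restore; the URL shape follows the public OData conventions.
from urllib.parse import quote

V2_BASE = "https://www.nuget.org/api/v2"

def find_packages_by_id_url(package_id):
    return f"{V2_BASE}/FindPackagesById()?id='{quote(package_id)}'"

print(find_packages_by_id_url("Newtonsoft.Json"))
# https://www.nuget.org/api/v2/FindPackagesById()?id='Newtonsoft.Json'
```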
10. V3 Protocol
JSON based
A “resource provider” of various endpoints per purpose
Catalog (NuGet.org only) – append-only event log
Registrations – materialization of newest state of a package
Flat container – .NET Core package restore (and VS autocompletion)
Report abuse URL template
Statistics
…
https://api.nuget.org/v3/index.json (code in https://github.com/NuGet/NuGet.Services.Metadata)
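The "resource provider" idea can be sketched in a few lines of Python (illustrative, not the deck's C# code): a client fetches the service index once and looks up the endpoint for the resource type it needs. The sample index below is abridged, not fetched live.

```python
# Minimal sketch: resolving a V3 resource endpoint from the service index.
# The structure mirrors https://api.nuget.org/v3/index.json (abridged sample).

def resolve_resource(service_index, resource_type):
    """Return the @id (base URL) of the first resource matching @type."""
    for resource in service_index.get("resources", []):
        if resource.get("@type") == resource_type:
            return resource["@id"]
    return None

sample_index = {
    "version": "3.0.0",
    "resources": [
        {"@type": "Catalog/3.0.0", "@id": "https://api.nuget.org/v3/catalog0/index.json"},
        {"@type": "SearchQueryService", "@id": "https://azuresearch-usnc.nuget.org/query"},
    ],
}

print(resolve_resource(sample_index, "Catalog/3.0.0"))
```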
11. How does NuGet.org work?
User uploads to NuGet.org
Data added to database
Data added to catalog (append-only data stream)
Various jobs run over catalog using a cursor
Registrations (last state of a package/version), reference catalog entry
Flatcontainer (fast restores)
Search index (search, autocomplete, NuGet Gallery search)
…
12. Catalog seems interesting!
Append-only stream of mutations on NuGet.org
Updates (add/update) and Deletes
Chronological
Can continue where left off (uses a timestamp cursor)
Can restore NuGet.org to a given point in time
Structure
Root https://api.nuget.org/v3/catalog0/index.json
+ Page https://api.nuget.org/v3/catalog0/page0.json
+ Leaf https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/adam.jsgenerator.1.1.0.json
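The cursor pattern over that root/page/leaf structure can be sketched as follows (a Python illustration with in-memory stand-ins for the JSON documents; real code would fetch them over HTTP, and the field names follow the catalog's `commitTimeStamp`/`items` convention):

```python
# Sketch of walking the catalog with a timestamp cursor: skip whole pages
# older than the cursor, then yield only the leaves committed after it.

def leaves_since(catalog_root, pages, cursor):
    """Yield catalog leaves committed after the cursor timestamp, oldest first."""
    for page_ref in catalog_root["items"]:
        if page_ref["commitTimeStamp"] <= cursor:
            continue  # a page's commitTimeStamp is its newest leaf: skip it whole
        page = pages[page_ref["@id"]]
        for leaf in sorted(page["items"], key=lambda l: l["commitTimeStamp"]):
            if leaf["commitTimeStamp"] > cursor:
                yield leaf

root = {"items": [{"@id": "page0.json", "commitTimeStamp": "2019-01-02T00:00:00Z"}]}
pages = {"page0.json": {"items": [
    {"@id": "a.1.0.0.json", "commitTimeStamp": "2019-01-01T00:00:00Z"},
    {"@id": "b.2.0.0.json", "commitTimeStamp": "2019-01-02T00:00:00Z"},
]}}

new_leaves = list(leaves_since(root, pages, "2019-01-01T12:00:00Z"))
print([l["@id"] for l in new_leaves])  # only the leaf newer than the cursor
```

After processing, the consumer stores the newest `commitTimeStamp` it saw as the new cursor, which is what lets it continue where it left off.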
14. “Find this type on NuGet.org”
Refactor from using OData to using V3?
Mostly done, one thing missing: download counts (using search now)
https://github.com/NuGet/NuGetGallery/issues/3532
Build a new version?
Welcome to this talk
16. What do we need?
Watch the NuGet.org catalog for package changes
For every package change
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
17. What do we need?
Watch the NuGet.org catalog for package changes → periodic check
For every package change → based on a queue
Scan all assemblies
Store relation between package id+version and namespace+type
API compatible with all ReSharper and Rider versions → always up, flexible scale
Bonus points!
Easy way to re-index later (copy .nupkg binaries + dump index to JSON blobs)
19. Sounds like functions!
[Architecture diagram: a "Watch catalog" function monitors the NuGet.org catalog and emits download commands; a "Download package" function stores the raw .nupkg and emits index commands; an "Index package" function writes to the search index and to an index-as-JSON blob; "Find type API" and "Find namespace API" functions query the search index.]
21. Functions best practices
@PaulDJohnston https://medium.com/@PaulDJohnston/serverless-best-practices-b3c97d551535
Each function should do only one thing
Easier error handling & scaling
Learn to use messages and queues
Asynchronous means of communicating, helps scale and avoid direct coupling
...
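The "one thing per function, connected by queues" idea can be sketched like this (an illustrative Python stand-in, using an in-process `queue.Queue` where the real system would use Azure Storage queues; function names mirror the diagram, the payloads are made up):

```python
# One job per function; queues decouple them and let each scale independently.
import queue

download_queue = queue.Queue()
index_queue = queue.Queue()

def watch_catalog(changed_packages):
    """One job only: turn catalog changes into download commands."""
    for package in changed_packages:
        download_queue.put(package)

def download_package():
    """One job only: fetch and store the .nupkg, then request indexing."""
    package = download_queue.get()
    # ... download and store the raw .nupkg here ...
    index_queue.put(package)

watch_catalog([("Newtonsoft.Json", "12.0.1")])
download_package()
indexed = index_queue.get()
print(indexed)
```

Because the only contract between the stages is the message shape, a failing stage can retry its message without touching the others.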
25. We’re making progress!
[The architecture diagram again: NuGet.org catalog, Watch catalog, Download command, Download package, Raw .nupkg, Index command, Index package, Index as JSON, Search index, Find type API, Find namespace API.]
27. Next up: indexing
(architecture diagram, as before)
28. Indexing
Opening up the .nupkg and reflecting on assemblies
System.Reflection.Metadata
Does not load the assembly being reflected into application process
Provides access to Portable Executable (PE) metadata in assembly
Store relation between package id+version and namespace+type
Azure Search? A database? Redis? Other?
30. System.Reflection.Metadata
using (var portableExecutableReader = new PEReader(assemblySeekableStream))
{
    var metadataReader = portableExecutableReader.GetMetadataReader();

    foreach (var typeDefinition in metadataReader.TypeDefinitions
        .Select(metadataReader.GetTypeDefinition))
    {
        if (!typeDefinition.Attributes.HasFlag(TypeAttributes.Public)) continue;

        var typeNamespace = metadataReader.GetString(typeDefinition.Namespace);
        var typeName = metadataReader.GetString(typeDefinition.Name);

        if (typeName.StartsWith("<") || typeName.StartsWith("__Static") ||
            typeName.Contains("c__DisplayClass")) continue;

        typeNames.Add($"{typeNamespace}.{typeName}");
    }
}
31. Azure Search
“Search-as-a-Service”
Scales across partitions and replicas
Define an index that will hold documents consisting of fields
Fields can be searchable, facetable, filterable, sortable, retrievable
Can’t be changed easily, think upfront!
Have to define what we want to search, and what we want to display
My function will also write documents to a JSON blob
Can re-index using Azure Search importer in case needed
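A minimal sketch of what such an index document could look like, as a plain C# class. The field names here are assumptions for illustration; in the actual index each field would additionally be marked searchable, filterable, or retrievable via the Azure Search SDK's attributes:

```csharp
// Hypothetical document shape for the package index. In Azure Search, the
// key field uniquely identifies the document; TypeNames would be marked
// searchable (simple analyzer), the rest mostly retrievable.
public class PackageDocumentSketch
{
    public string Identifier { get; set; }    // key, e.g. "newtonsoft.json@12.0.1"
    public string PackageId { get; set; }
    public string PackageVersion { get; set; }
    public string[] TypeNames { get; set; }   // searchable
    public long DownloadCount { get; set; }   // used for scoring/sorting
}
```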
33. “Do one thing well”
Our function shouldn’t care about creating a search index.
Better: return index operations, have something else handle those
Custom output binding?
35. Almost there…
(architecture diagram, as before)
38. One issue left...
Download counts - used for sorting and scoring search results
Change continuously on NuGet
Not part of V3 catalog
Could use search but that’s N(packages) queries
https://github.com/NuGet/NuGetGallery/issues/3532
If that data existed, how to update search?
Merge data! new PackageDocumentDownloads(key, downloadcount)
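The merge trick only needs a partial document carrying the document key plus the field that changed; Azure Search's merge operation then updates just those properties and leaves the rest of the document untouched. A sketch of such a class (name from the slide, shape assumed):

```csharp
// Partial document for a merge operation: only the key and the field to
// update. All other fields in the indexed document stay as they are.
public class PackageDocumentDownloads
{
    public PackageDocumentDownloads(string key, long downloadCount)
    {
        Identifier = key;
        DownloadCount = downloadCount;
    }

    public string Identifier { get; }    // must match the index's key field
    public long DownloadCount { get; }
}
```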
39. We’re done!
(architecture diagram, as before)
40. We’re done!
Functions
Collect changes from NuGet catalog
Download binaries
Index binaries using PE Header
Make search index available in API
Trigger, input and output bindings
Each function should do only one thing
(architecture diagram, as before)
41. We’re done!
All our functions can scale (and fail)
independently
Full index in May 2019 took ~12h on 2 B1 instances
Can be faster on more CPUs
~1.7 million packages (according to the NuGet.org homepage)
~2.1 million packages (according to the catalog)
~8,400 catalog pages
with ~4,200,000 catalog leaves
(hint: repo signing)
(architecture diagram, as before)
42. Closing thoughts…
Would deploy in separate function apps for cost
Trigger binding collects all the time so needs dedicated capacity (and thus, cost)
Others can scale within bounds (think of $$$)
Would deploy in separate function apps for failure boundaries
Trigger, indexing, downloading should not affect health of API
Are bindings portable...?
Avoid them if (framework) lock-in matters to you
They *are* nice in terms of the programming model…
Show feature in action in Visual Studio (and show you can see basic metadata etc.)
Feature was copied into Visual Studio 2017 - https://www.hanselman.com/blog/VisualStudio2017CanAutomaticallyRecommendNuGetPackagesForUnknownTypes.aspx
Demo the feed quickly?
Around 3 TB in May 2019
Demo ODataDump quickly
Demo: click around in the API to show some base things
Raw API - click around in the API to show some base things, explain how a cursor could go over it
Root https://api.nuget.org/v3/catalog0/index.json
Page https://api.nuget.org/v3/catalog0/page0.json
Leaf https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/adam.jsgenerator.1.1.0.json
Explain CatalogDump
NuGet.Protocol.Catalog comes from GitHub
CatalogProcessor fetches all pages between min and max timestamp
My implementation, BatchCatalogProcessor, fetches multiple pages at the same time and builds a “latest state” – much faster!
Fetches leaves, for every leaf calls into a simple method
Much faster, easy to pause (keep track of min/max timestamp)
LOL input, process, output
More serious: events trigger code
Periodic check for packages
Queue message to index things
API request runs a search
No server management or capacity planning
Will use storage queues in the demos to be able to run things locally. Ideally use Service Bus topics or Event Grid (transactional)
Create a new TimerTrigger function
We will need a function to index things from NuGet
Timer will trigger every X amount of time
Timer provides last timestamp and next timestamp, so we can run our collector for that period
Snippet: demo-timertrigger
Mention HttpClient is not used correctly here: a new instance is created per invocation and never reused, so it will starve TCP connections at some point
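The usual fix is to share a single HttpClient for the lifetime of the process, e.g. via a static field. This is a general .NET pattern, not code from the demo:

```csharp
using System.Net.Http;

public static class Http
{
    // One shared instance for the whole process: the underlying sockets are
    // pooled and reused instead of a new connection being opened (and left
    // lingering) on every function invocation.
    public static readonly HttpClient Client = new HttpClient();
}
```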
Go over code example and run it
var httpClient = new HttpClient();
var cursor = new InMemoryCursor(timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow);

var processor = new CatalogProcessor(
    cursor,
    new CatalogClient(httpClient, new NullLogger<CatalogClient>()),
    new DelegatingCatalogLeafProcessor(
        added =>
        {
            log.LogInformation("[ADDED] " + added.PackageId + "@" + added.PackageVersion);
            return Task.FromResult(true);
        },
        deleted =>
        {
            log.LogInformation("[DELETED] " + deleted.PackageId + "@" + deleted.PackageVersion);
            return Task.FromResult(true);
        }),
    new CatalogProcessorSettings
    {
        MinCommitTimestamp = timer.ScheduleStatus?.Last ?? DateTimeOffset.UtcNow,
        MaxCommitTimestamp = timer.ScheduleStatus?.Next ?? DateTimeOffset.UtcNow,
        ServiceIndexUrl = "https://api.nuget.org/v3/index.json"
    },
    new NullLogger<CatalogProcessor>());

await processor.ProcessAsync(CancellationToken.None);
Each function should only do one thing! We are violating this.
Go over Approach1 code – Enqueuer class
Mention we are using roughly the same code as before
Differences are that our function is now no longer doing things itself, instead it’s adding messages to a queue for processing later on
That Queue binding is interesting. This is where the input/output comes from. Instead of managing our own queue connection, we let the framework handle all plumbing so we can focus on adding messages.
In Indexer, we use the Queue as an input binding, and read messages.
We can now scale enqueuing and indexing separately! But are we there yet?
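The queue messages themselves are just serialized package operations: the enqueuer writes JSON, the indexer's queue trigger reads it back. A minimal sketch of such a payload; the type and property names are assumptions, not the demo's actual PackageOperation type:

```csharp
using System.Text.Json;

// Hypothetical queue message shape shared by enqueuer and indexer.
public class PackageOperationMessage
{
    public string PackageId { get; set; }
    public string PackageVersion { get; set; }
    public string Operation { get; set; } // "Add" or "Delete"

    // Enqueuer side: serialize before adding to the queue.
    public string ToQueueMessage() => JsonSerializer.Serialize(this);

    // Indexer side: deserialize the message the trigger hands us.
    public static PackageOperationMessage FromQueueMessage(string json) =>
        JsonSerializer.Deserialize<PackageOperationMessage>(json);
}
```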
Go over Approach2 code
Show this is MUCH simpler – a trigger binding that provides input, and a queue output binding to write that input to a queue
Let’s go over what it takes to build a trigger binding
NuGetCatalogTriggerAttribute – the data needed for the trigger to work – go over properties and attributes
Hooking it up requires a binding configuration – NuGetCatalogTriggerExtensionConfigProvider
It says: if you see this specific binding, register it as a trigger that maps to some provider
So we need that provider – NuGetCatalogTriggerAttributeBindingProvider
Provider is there to create an object that provides data. In our case we need to store the NuGet catalog timestamp cursor, so we do that on storage, and then return the actual binding – NuGetCatalogTriggerBinding
In NuGetCatalogTriggerBinding, we have to specify how data can be bound. What if I use a different type of object than PackageOperation? What if someone used a node.js or Python function instead of .NET? We need to define the shape of the data our trigger provides.
PackageOperationValueProvider is also interesting, this provides data shown in the portal diagnostics
CreateListenerAsync is where the actual trigger code will be created – NuGetCatalogListener
NuGetCatalogListener uses the BatchCatalogProcessor we had previously, and when a package is added or deleted it will call into the injected ITriggeredFunctionExecutor
ITriggeredFunctionExecutor is Azure Functions framework specific, but it’s the glue that will call into our function with the data we provide
Note StartAsync/StopAsync where you can add startup/shutdown code
ONE THING LEFT THAT IS NOT DOCUMENTED – Startup.cs to register the binding.
And since we are in a different class library, also need Microsoft.Azure.WebJobs.Extensions referenced to generate \bin\Debug\netcoreapp2.1\bin\extensions.json
As a result our code is now MUCH cleaner, show it again and maybe also show it in action
Mention [Singleton(Mode = SingletonMode.Listener)] – we need to ensure this binding only runs single-instance (cursor clashes otherwise). This is due to how the catalog works; parallel processing is harder to do. But we can fix that by scaling the Indexer later on.
Show Approach3 PopulateQueueAndTable
Same code, but a bit more production worthy
Sending data to two queues (indexing and downloading)
Storing data in a table (and yes, violating “do one thing” again but I call it architectural freedom)
Next up will be downloading and indexing. Let’s start with downloading.
Grab a copy of the .nupkg from NuGet and store it in a blob
Redundancy - no need to re-download/stress NuGet on a re-index
Go over Approach3 code
DownloadToStorage uses a QueueTrigger to run whenever a message appears in queue
Note no singleton: we can scale this across multiple instances/multiple servers
Uses a Blob input binding that provides access to a blob
Note the parameters: the name of the blob is resolved based on data from other inputs, which is pretty nifty
Our code checks whether it’s an add or a delete, and either downloads + uploads to the blob reference, or deletes the blob reference
Next up will be indexing itself. There are a couple of things here…
Go over Approach3 code
PackageIndexer uses a QueueTrigger to run whenever a message appears in queue
Uses a Blob input binding that provides access to a blob where we can write our indexed entity – will show this later
Based on package operation, we will add or delete from the index
RunAddPackageAsync has some plumbing, probably too much, to download the .nupkg file and store it on disk
Note: we store it on disk as we need a seekable stream. So why not a memory stream? Some NuGet packages are HUGE.
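Getting a seekable stream without holding the whole package in memory can be sketched like this. This is generic .NET, not the demo's exact plumbing:

```csharp
using System.IO;

static class SeekableCopy
{
    // Copy a (possibly non-seekable) source stream to a temp file and hand
    // back a seekable FileStream. PEReader needs to seek; a MemoryStream is
    // risky because some .nupkg files are huge. DeleteOnClose cleans up the
    // temp file when the caller disposes the stream.
    public static FileStream ToTempFile(Stream source)
    {
        var path = Path.GetTempFileName();
        var file = new FileStream(path, FileMode.Create, FileAccess.ReadWrite,
            FileShare.None, 81920, FileOptions.DeleteOnClose);
        source.CopyTo(file);
        file.Position = 0; // rewind so the caller can start reading
        return file;
    }
}
```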
Find PEReader usage and show how it will index a given package’s public types and namespaces
All goes into a typeNames collection.
Now: how do we add this info to the index?
Show PackageDocument class, has MANY properties
First important: the Identifier property has [Key] applied. Azure Search needs a key for the document so we can retrieve by key, which is useful when updating existing content or to find a specific document and delete it from the index.
Second important: TypeNames is searchable. Also mention “simpleanalyzer”: “Divides text at non-letters and converts them to lower case.” Other analyzers remove stopwords and do other things, this one should be as searchable as possible.
Other fields are sometimes searchable, sometimes facetable – a bit of leftover from me thinking about search use cases. The R# API only searches on type name, so we could make everything else just retrievable as well.
Of course, index is not there by default, so need to create it. We do this when our function is instantiated (static constructor, so only once per launch of our functions)
Is this good? Yes, because only once per server instance our function runs on. No because we do it at one point, what if the index is deleted in between and needs to be recreated? Edge case, but a retry strategy could be a good idea...
Next, we create our package document, and at one point we add it to a list of index actions, and to blob storage
indexActions.Add(IndexAction.MergeOrUpload(packageToIndex));
JsonSerializer.Serialize(jsonWriter, packagesToIndex);
Writing to index using batch - var indexBatch = IndexBatch.New(actions);
Leftover code from earlier; a batch makes no sense for one document, but in case you want to do multiple in one go, this is the way. Do beware a batch can only be several MB in size; for this NuGet indexing I can only do ~25 documents in a batch before the payload is too large.
That’s… it!
Run approach 3 (for last hour) and see functions being hit / packages added to index
Go to Azure Search portal as well, show how importer would work in case of fire
Go over Approach3 code
PackageIndexerWithCustomBinding is mostly the same code
One difference: it uses the [AzureSearchIndex] binding to write add/delete operations to the index instead
Go over how it works. Again, an attribute with settings – AzureSearchIndexAttribute
Also a configuration that registers the binding as an output binding using BindToCollector – AzureSearchExtensionConfigProvider
Now, what’s this OpenType?
It’s some sort of dynamic type. If we want to create an AzureSearch output binding, we better support more than just our PackageDocument use case!
So we need a collector builder that can create the actual binding implementation based on the real type requested by our function parameter – AzureSearchAsyncCollectorBuilder
In AzureSearchAsyncCollectorBuilder, we do that. Very simple bootstrap code in this case, but could be more complex depending on the type of binding you are creating.
Our AzureSearchAsyncCollector uses the attribute to check for Azure Search connection details, as well as the type of operation we expect it to handle. Why not all? Well, IAsyncCollector only has Add and Flush.
Note: add called manually, flush at function complete – could use flush to send things in a batch...
Code itself pretty straightforward. On Add, we add an action to search. With a retry in case the index does not exist – we then create it.
Creation code is kind of interesting, as we use some reflection in case we specify a given type of document to index.
Why? Because when we do upserts, we may want to update just one or two properties, and can use a different DTO in that case (but still have the index shaped to the full document shape)
Run when time left, but nothing fancy here...
Now we need to make ReSharper talk to our search. We have the index, so that should be a breeze, right?
Go over Web code
RunFindTypeApiAsync and RunFindNamespaceAsync
Both use “name” as their query parameter to search for
RunInternalAsync does the heavy lifting
Grabs other parameters
Runs search, and collects several pages of results
Why is this ForEachAsync there?
Search index has multiple versions for every package id, yet ReSharper expects only the latest matching all parameters
Azure Search has no group by / distinct by, so need to do this in memory. Doing it here by fetching a maximum number of results and doing the grouping manually.
Use the collected data to build result. Add matching type names etc.
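The in-memory grouping can be sketched with LINQ: group hits by package id and keep the highest version per group. This sketch uses System.Version for simplicity; NuGet's own semver comparer would be more correct for real package versions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ResultGrouping
{
    // Collapse multiple versions of the same package id down to the latest
    // one, since the search index holds a document per package version but
    // ReSharper only expects the latest match.
    public static List<(string Id, string Version)> LatestPerPackage(
        IEnumerable<(string Id, string Version)> hits)
    {
        return hits
            .GroupBy(h => h.Id, StringComparer.OrdinalIgnoreCase)
            .Select(g => g.OrderByDescending(h => Version.Parse(h.Version)).First())
            .ToList();
    }
}
```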
Example requests:
http://localhost:7071/api/v1/find-type?name=JsonConvert
http://localhost:7071/api/v1/find-type?name=CamoServer&allowPrerelease=true&latestVersion=false
https://nugettypesearch.azurewebsites.net/api/v1/find-type?name=JsonConvert
In ReSharper (devenv /ReSharper.Internal, go to NuGet tool window, set base URL to https://nugettypesearch.azurewebsites.net/api/v1/)
Write some code that uses JsonConvert / JObject and try it out.