25

In an answer to Strange new database 'Data.StackExchangePOC' is messing up cross-site SEDE queries our DBA extraordinair Aaron Bertrand suggested to have a feature request for a view to list databases that can safely be used in a multi-database query.

For context, multi-database queries are possible by doing some nifty trickery that rely on the meta-data found in the sys schema of SQL Server. These tricks fail when unexpected databases show up (1, 2, 3 , 4)

There have been prior proposals to add extra meta-data to the schema, for example:

This feature request is explicitly not a duplicate of any of those prior requests. I'm asking here to implement exactly what Aaron suggested and only that as I assume that can be done now and not in 6 to 8 weeks.

So without further ado here is Aaron's proposal:

A feature request that would likely be easy for me to implement would be a view that provides a list of databases that:

  • are not system databases
  • are not secret POC databases
  • are online
  • are accessible by svc_sede
  • contain a Posts table

This would present you with a reliable list that will be updated when new sites are added. It would really reduce the grunt work in queries like Glorfindel's, and prevent you from having to parse this question for that same list. In fact, the view could be used to generate the markdown in that question.

Please create that view for us.

2
  • 4
    feature-request: staff write my feature-requests :)
    – starball
    Commented Apr 8 at 9:54
  • 9
    @starball feature-request: stop writing feature-requests as staff because they only create work for me. :-)
    – Aaron Bertrand Staff
    Commented Apr 8 at 13:02

2 Answers 2

31
+500

I've built something you will hopefully find useful, as it will present a list of databases (and a bunch of additional information) that you can use for cross-database queries.

Before I explain further, though, I'm going to start with some caveats:


Caveats

  1. Due to the principle of least disruption™, as of April 8th, 2024, these views mentioned below are only exposed in 3D Printing - just happened to alphabetically be the first non-busy Data Explorer site (maybe this is a clever ploy to change that!). The rest of the sites will have them enabled on our next SEDE refresh.

  2. The views reference data that is pulled on demand from each database as it becomes available on Sunday, and then cached. It's a heavy query so, immediately after a site comes online, the information might take a couple of minutes to be refreshed in the views.

  3. Site names that contain ampersands come back as the "fixed" name that doesn't break SEDE. For example, Anime & Amiga is presented as Anime and Amiga. For background, see this question.

  4. SEDE simply doesn't have access to some of the information folks have requested in the past, like the launch date (which you can probably approximate from MIN(Posts.CreationDate)), whether a site is in beta, and even the data dump filename. The latter is just {LOWER(site_url without https://).7z}, but is complicated by the fact that Stack Overflow now pushes 8 separate files to archive.org.

  5. I'll document them here but you won't find these views (or their definitions) in Data Explorer. If you know why, or think you know why, please don't guess out loud - I won't comment. Let's just all agree to call it magic. 🎩


dbo.sede_databases

This view has the following columns:

column name data type description
database_id int database_id from sys.databases.
database_name nvarchar(128) name from sys.databases.
site_name nvarchar(64) Name from Sites, e.g. Ubuntu or Webapps Meta.
tiny_name nvarchar(50) TinyName from Sites, which is also the API site parameter, e.g. askubuntu or meta.webapps.
long_name nvarchar(64) LongName from Sites, e.g. Ask Ubuntu or Web Applications Meta.
site_type nvarchar(9) For meta sites other than this one, meta_site. For this site and all main sites, main_site.
site_url nvarchar(2048) Clickable link to the site, e.g. https://askubuntu.com or https://webapps.meta.stackexchange.com.
sede_url nvarchar(2048) Clickable link to the site-specific Data Explorer. Note that a link to meta effectively lands you at its parent.
api_site_parameter nvarchar(50) The api site parameter, e.g. askubuntu or meta.webapps.
initialized datetime When we started populating that database.
made_available datetime When the database came online and was ready for queries.
processing_time time(3) The duration between initialized and made_available, in hh:mm:ss.fff format.
questions int Total number of questions in this site at the time of the current refresh.
answers int Total number of answers in this site at the time of the current refresh.
latest_post datetime The timestamp of the last post captured in this refresh.
notes varchar(46) This will be non-NULL when a database is in transition (Database is not ready.) or when the database is currently being refreshed (... refresh in progress since {hh:mm:ss} UTC ...). For the former, this should be rare and very brief. For the latter, the existing database is still available, it's just the old copy. If you want to get just online databases with the most recent week, use WHERE notes IS NOT NULL. If you want to wait until all databases are online, just check WHERE NOT EXISTS (SELECT 1 FROM dbo.sede_databases WHERE notes IS NOT NULL.

Looping query alternatives

One of the main things that prompted me to prompt you to submit a feature request is because it was painful to watch people craft queries that rely on parsing markdown from a meta question - something arguably quite fragile - simply to be able to query each database using information I knew we already had.

CREATE TABLE #results (
  db           sysname,
  Id           int,
  Title        nvarchar(250),
  ViewCount    int,
  Score        int,
  CreationDate datetime);

DECLARE @exec nvarchar(500), @sql nvarchar(max), @c cursor;

SELECT @sql = N'INSERT #results
  (db,Id,Title,ViewCount,Score,CreationDate)
  SELECT DB_NAME(), p.Id, p.Title, p.ViewCount, p.Score, p.CreationDate
  FROM Posts AS p INNER JOIN Users AS u
    ON p.OwnerUserId = u.Id
    WHERE p.PostTypeId = 1 AND u.AccountId = @AccountId;';

SET @c = cursor FOR SELECT database_name
  FROM dbo.sede_databases WHERE notes IS NULL;

OPEN @c;
FETCH NEXT FROM @c INTO @exec;

WHILE @@FETCH_STATUS <> -1
BEGIN
  SET @exec = QUOTENAME(@exec) + N'.sys.sp_executesql';
  EXEC @exec @sql, N'@AccountId int', ##AccountId:int##;
  FETCH NEXT FROM @c INTO @exec;
END

SELECT Site        = sd.long_name, 
       Question    = CONCAT(sd.site_url, '/q/', r.Id, '|', r.Title),
       [# views]   = r.ViewCount,
       r.Score,
       [Posted on] = r.CreationDate
  FROM dbo.sede_databases AS sd
  INNER JOIN #results AS r 
     ON r.db = sd.database_name
  ORDER BY r.ViewCount DESC;

Aggregate query alternatives

This view also lets you simplify some other queries I've seen out there to obtain aggregates about the state of various databases and how long they took. Some of that information is available directly from sys.databases, but an aggregate variation that comes to mind is:

The first query there...

...returns information about the oldest and newest databases, any databases currently refreshing, and the average duration. (And a couple of other rows I got too lazy to grok, but I think they're covered in the next query?) We can roughly replicate most of this query against the new views like this:

;WITH dbs AS
(
  SELECT database_id, 
         database_name,
         made_available,
         initialized,
         ma_days_ago = DATEDIFF(DAY, made_available, GETDATE()),
         init_days_ago = DATEDIFF(DAY, initialized, GETDATE()),
         notes
  FROM dbo.sede_databases
)
SELECT 1, category = 'oldest', * FROM
(
  SELECT TOP (1) db = RTRIM(database_id), 
             Description = database_name, 
             create_date = made_available, 
             [days ago] = ma_days_ago
    FROM dbs WHERE notes IS NULL
    ORDER BY made_available
) AS o
UNION ALL SELECT 2, 'newest', * FROM 
(
  SELECT TOP (1) db = RTRIM(database_id), 
             database_name, 
             made_available, 
             ma_days_ago
    FROM dbs WHERE notes IS NULL
    ORDER BY made_available DESC
) AS n
UNION ALL SELECT 3, 'restoring', * FROM
(
  SELECT db = RTRIM(database_id), 
         database_name, 
         initialized, 
         init_days_ago
    FROM dbs WHERE notes LIKE N'...%since%UTC%'
) AS r
UNION ALL SELECT 6, 'duration (min)', * FROM
(
  SELECT c  = CONCAT('count ', COUNT(*)), 
         av = CONCAT('avg ', CONVERT(decimal(12,2), 
              AVG(1.0*DATEDIFF(MINUTE, initialized, made_available)))),
         x  = null, y  = null FROM dbs) AS av;

The second query there...

...shows the earliest and latest database, how many days ago that was, whether there are any databases still restoring, and then how many databases there are and how long the whole process took. This query can work against the new view like this (arguably, it's not much simpler):

SELECT [earliest date (UTC)] = FirstDB,
       [latest date (UTC)]   = LastDB,
       [days ago]            = DATEDIFF(DAY, LastDB, getdate()),
       restoring             = COALESCE(restoring, N'None'),
       [count]               = DBCount,
       minutes               = DATEDIFF(MINUTE, FirstDB, LastDB)
FROM 
(
  SELECT FirstDB   = MIN(made_available),
         LastDB    = MAX(made_available),
         DBCount   = COUNT(*),
         restoring = (
           SELECT STRING_AGG(database_name, ', ')
             FROM dbo.sede_databases
             WHERE notes LIKE N'...%since%UTC%'
         )
    FROM dbo.sede_databases
    WHERE notes IS NULL
) AS details;

The only difference is the new variation shows a comma-separated list of any databases still restoring, which will be helpful since the process is now parallel. Also, the made_available timestamp is up to a few seconds later because it includes time spent on further database operations not captured by sys.databases.

And the third query there...

...is also captured above.


Generating markdown

There's a question over here about getting URLs and site names for a cross-site SEDE query:

One of the answers there has a big chunk of markdown presenting URL, Site name ("long name"), and Database name for each database, ordered by "long name" (so that, e.g., Ask Ubuntu and Ask Ubuntu Meta appear together).

I've created a view called dbo.sede_databases_markdown that will generate this markdown for you, with the following columns:

column name data type description
rn int row number, for sorting by "long name"
content nvarchar(max) markdown with headers

My example query looks like this (but I also added some cache-busting):

SELECT content 
  FROM dbo.sede_databases_markdown
  ORDER BY rn;

If you don't want the markdown headers, you can say

SELECT content 
  FROM dbo.sede_databases_markdown
  AND rn > 0
  ORDER BY rn;

To make it easy to copy and paste the results (which will have some extra things you'll need to manually remove, like the content result header, a long line of -----, and 300+ row(s) returned), use the "Text-only results" option:

Check the box


But wait, there's more!

If you didn't just want to know about databases, I can also tell you a little about the tables in each site!

dbo.sede_tables

Some additional information here about each table, including the ability to see which tables are the biggest (without running expensive COUNT(*) queries) and which took the longest to complete (though that information will become less interesting as the process changes).

column name data type description
database_id int database_id from sys.databases.
database_name nvarchar(128) name from sys.databases.
table_name nvarchar(128) name from sys.tables.
latest_date datetime The latest timestamp (e.g. CreationDate) for tables that have such a column.
row_count bigint Number of rows in the table, according to sys.partitions. Note: For Posts, this is PostsWithDeleted; if you want non-deleted total number of Posts, use answers + questions from dbo.sede_databases.
initialized datetime Timestamp of initial table creation.
made_available datetime Timestamp of final operation against table.
processing_time time(3) Duration of initialized -> made_available, in {hh:mm:ss.fff} format.

An example query that shows how long each table took for Stack Overflow (keeping in mind, this weekend is a bad reference point that involved me literally falling asleep; next weekend's should look a lot better):

SELECT table_name, 
       latest_date, 
       row_count,
       initialized,
       made_available,
       processing_time
  FROM dbo.sede_tables
 WHERE database_name = N'StackOverflow';

There's still more!

Well, this was mostly to help me.

dbo.sede_sites

This is just a slimmer version of dbo.sede_databases, without all the additional computation and materialization. It just has four columns:

column name data type description
site_name nvarchar(64) Name from Sites, e.g. Ubuntu or Webapps Meta.
site_url nvarchar(2048) Clickable link to the site, e.g. https://askubuntu.com or https://webapps.meta.stackexchange.com.
database_name nvarchar(128) name from sys.databases.
long_name nvarchar(64) LongName from Sites, e.g. Ask Ubuntu or Web Applications Meta.
site_id int Id from Sites.

Most importantly, this can be used to join the database you're in to get a couple of extra properties, with a very simple index seek against a pretty narrow row:

SELECT s.site_name, s.site_url, s.long_name
  FROM dbo.sede_sites AS s
  WHERE database_name = DB_NAME();

And yet more!

A common request has been to look at a specific account (say, your own) across multiple sites, without having to loop and scan entire Users tables to find the matching AccountId. So, I offer...

dbo.sede_users

This is meant to provide some of the information you'll spot in the network profile / accounts tab; namely, reputation, question count, and answer count.* For example, here are my top 5 sites by reputation:

SELECT  TOP (5)
        ss.site_name,
        profile = CONCAT(ss.site_url, '/u/', su.user_id),
        su.reputation,
        su.question_count,
        su.answer_count
   FROM dbo.sede_users AS su
  INNER JOIN dbo.sede_sites AS ss
     ON su.site_id = ss.site_id
  WHERE su.account_id = ##AccountId:int##
  ORDER BY su.reputation DESC;

Results:

Top 5 sites by reputation

* Note that this view is refreshed once daily, around 00:00 UTC.

Column list (mostly self-explanatory):

column name data type description
account_id int Id from Accounts.
site_id int Id from Sites (and site_id from dbo.sede_sites).
user_id int Id from Users. (Sorry for the keyword collision, another vicious debate I had with myself about keeping the established convention or letting IntelliSense be angry. Lazy won.)
reputation int Reputation from the network profile accounts tab.
question_count int Questions from the network profile accounts tab.
answer_count int Answers from the network profile accounts tab.

To keep things simple, sorry, I did not include badge counts, so it is not a full representation. You could replicate a partial row from this page for a single site using this query:

  SELECT su.user_id, 
         su.reputation,
         su.question_count,
         su.answer_count
    FROM dbo.sede_users AS su
   INNER JOIN dbo.sede_sites AS ss
      ON su.site_id = ss.site_id
   WHERE su.account_id = ##AccountId:int##
     AND ss.database_name = DB_NAME();

Results:

Partial representation of accounts tab for a single site

Or for all sites, you can use this query:

  SELECT ss.long_name,
         profile = CONCAT(ss.site_url, '/u/', su.user_id),
         su.reputation,
         su.question_count,
         su.answer_count
    FROM dbo.sede_users AS su
   INNER JOIN dbo.sede_sites AS ss
      ON su.site_id = ss.site_id
   WHERE su.account_id = ##AccountId:int##;

Results:

Partial representation of accounts tab across all of my sites

But to get the badges, it's a little more complex (and again, this is for a single site only):

;WITH src AS 
(
  SELECT su.user_id, 
         su.reputation,
         su.question_count,
         su.answer_count
    FROM dbo.sede_users AS su
   INNER JOIN dbo.sede_sites AS ss
      ON su.site_id = ss.site_id
   WHERE su.account_id = ##AccountId:int##
     AND ss.database_name = DB_NAME()
)
SELECT reputation,
       question_count, 
       answer_count,
       [bronze badges] = [1],
       [silver badges] = [2],
       [gold badges]   = [3]
  FROM src
 OUTER APPLY
       (
         SELECT class, BadgeCount = COUNT(*)
           FROM dbo.Badges AS b
          WHERE b.UserId = src.user_id
          GROUP BY class
       ) AS b
 PIVOT
 (
   MAX(BadgeCount) 
   FOR [class] IN ([1],[2],[3])
 ) AS p;

Results:

Adding badges to the mix

There isn't a direct way to combine getting global information across all sites with local information within each site, without creating your own #temp table and looping. At least not yet.


Have fun

These views are provided as is and without any warranty or support. I will not be taking bug reports at this time, because that wiped me out, I didn't really get approval to work on this and, also, I don't want to have to update any of the above markup or queries. 🥱

3
  • 11
    I've just tested this and it is pretty awesome. Magic was as expected. Anyone that wants to copy-out the markdown output: Put a check-mark in 'text-only results' field before you run the query. I can't explain how grateful we are for you doing this, specially in your spare time.
    – rene
    Commented Apr 9 at 7:17
  • 3
    @rene Thanks, updated.
    – Aaron Bertrand Staff
    Commented Apr 9 at 11:04
  • (As a side note on "the launch date (which you can probably approximate from MIN(Posts.CreationDate))", migrated posts from older sites will break this approximation) Commented May 24 at 13:40
6
+500

Posting as a separate answer because it's even more elaborate than the original one and solves the problem in a slightly different way.

Several folks have asked for direct support for cross-site queries:

While Tim Stone is right that there is a mechanism built into the application code to facilitate this kind of query, it was first written not long after Android 1.0 was released, possibly hasn't been touched since Tim's comment over a decade ago, and was intentionally disabled at some point early on, for reasons unknown. As part of demonstrating the views solution in the other answer, it really struck me that these feature requests deserved another look, and it was far simpler for me to fulfill this request independently (and without touching the application code).

So, I've added a stored procedure that should help. It is along the same lines as the "execute in the context of each database" procedure, sp_ineachdb, that I wrote back at the beginning of Covid to solve a similar class of problems (and to overcome a pretty serious bug in the undocumented and unsupported sp_MSforeachdb). In 2020, I somehow convinced Brent Ozar to include it in the SQL Server First Responder Kit, and the community has been helping evolve and improve it ever since.

Fast forward to April 2024, when I used some of those techniques to create...


dbo.sede_ineachdb

This procedure takes the command you specify and runs it in the context of each site's database. By default, it will collect the results for you, and present you with a single output table at the end. The goal was to reduce the hassle involved with creating your own #temp table, figuring out the full list of databases to loop through, then looping, inserting results from each database into the #temp table, and finally pulling out the results. This is potentially your new one-stop shop to do all of that in a single call! After a couple of quick examples, there are some limitations and known issues you should review.

Very simple examples

This query returns a single result set with each site's database name, without any manual looping, and without having to cross-reference the site name anywhere:

EXEC dbo.sede_ineachdb 
     @SQLCommand = N'SELECT db = DB_NAME();';

The query returns 364 rows; here are partial results:

A query returning just 364 DB_NAME()s

Want other information, like the site name, long name (which is subtly different, e.g. check Ubuntu), and URL? Want to leave out meta sites?

EXEC dbo.sede_ineachdb 
     @SQLCommand = N'SELECT db_name   = DB_NAME(), 
                            site_name = [$site_name$],
                            long_name = [$long_name$],
                            site_url  = [$site_url$];',
     @IncludeMetaSites = 0;

Partial results:

A query returning attributes for non-meta sites

How about just printing out some of this information? There is an additional argument that avoids the overhead of data collection:

EXEC dbo.sede_ineachdb
     @SQLCommand = N'PRINT ''======'' + char(13) + char(10)
                     + CONCAT_WS
                       (
                         char(13) + char(10),
                         DB_NAME(),
                         [$long_name$],
                         [$site_url$]
                       );',
     @CollectResultsForMe = 0;

Partial results:

PRINT output for attributes for every site

You can borrow from these queries here, but maybe not before reading the rest of this answer:


Parameters

In addition to the expected @SQLCommand parameter and others mentioned above, here are all of the parameters. You may find them useful, depending on what type of cross-site query you're trying to achieve.

Parameter Data Type Default Description
@SQLCommand nvarchar(4000) This may seem short, but I don't want to encourage queries that require a max type. Anecdotally, most queries I've seen that large were only so long because they had to do exactly the thing this procedure intends to solve: build complex loops to run queries per database, or generate large dynamic SQL strings that mention each database explicitly.
@IncludeMainSites bit 1 Include all non-meta sites and this site (StackExchange.Meta), which is kind of in a grey area, but more of a main site than a meta.
@IncludeMetaSites bit 1 Include all meta sites except this site (StackExchange.Meta).
@IncludeMainMeta bit 1 Include this site (StackExchange.Meta) regardless of the above setting. It will be included with main sites unless you set it to 0, but not with meta sites unless you set it to 1. Probably the truth table I wrestled with the most during this work...
@CollectResultsForMe bit 1 For standard SELECT queries, this will attempt to put each database's results into a #temp table, and then return the whole #temp table in one shot. Nested EXEC is supported, as are many query patterns, as long as they end up producing a single result set and don't try to write. You'd want to set this to 0 if you want to issue PRINT commands or want to manage your own #temp table inside the command you pass to the procedure. But like the views above, I can't guarantee unforeseen behavior, and won't be able to provide support if you find a way to break it.
@ErrorOnSkippedSites bit 0 Set this to 1 if you want execution to halt in the event any site is missing due to transition change (not because you set any of the @Include... parameters to 0). If any site is unavailable before the loop starts, there will be a message telling you which databases would have been skipped - up to the first four. (Fun fact: I almost shipped this with a pretty glaring pluralization bug I would have never lived down.) A database may be skipped more quietly if state changes as the query is about to run for that database. You are most likely to see this overnight Saturday and into Sunday (UTC), during the weekly refresh, so just wait it out. If you see this at other times, or you ever see more than four, there's probably a bigger problem afoot.

Metadata tokens

There are currently four metadata tokens available to swap in site-specific information without having to write your own joins against the new sede_ views (and I explain why below). While there are likely others that make sense, I can't make any promises about development time to get them in.

Token Description
[$site_name$] Name from Sites, or site_name from dbo.sede_databases - e.g. Ubuntu or Webapps Meta. We'll wrap this in single quotes for you, so just do e.g. EXEC ... N'SELECT CONCAT([$site name$], N'':foo''); to get Ubuntu:foo.
[$long_name$] LongName from Sites, e.g. Ask Ubuntu or Web Applications Meta. We'll wrap this in single quotes for you, so just do e.g. EXEC ... N'SELECT CONCAT([$long name$], N'':foo''); to get Ask Ubuntu:foo.
[$site_url$] Clickable link to the site, e.g. https://askubuntu.com or https://webapps.meta.stackexchange.com. Similarly to site name, use e.g. EXEC ... N'SELECT CONCAT([$site url$], N''/q/'', Posts.Id)...; to get a link to that post's question, e.g. https://askubuntu.com/q/3.
[$site_id$] Id from Sites.

Tokens will also work with a space instead of an underscore.


More examples

The inspiration behind this procedure struck me when I used the above views to create a replacement for a looping query that would be far easier to adapt to other use cases. My first attempt with the views was still extraordinarily complex and, while I am comfortable writing cursors and collecting information in #temp tables in a loop, that scaffolding is annoying to build every time. The procedure takes most of that work away:

The original A messy replacement The final version
The original: Networkwide questions per account, ordered by view count A shorter—but still messy, IMHO—replacement using the new views: Aaron's top questions by view count With this new procedure, it is much tidier: Aaron's top questions by view count using sede_ineachdb
If you're a visual person (click to enlarge):
The progression of a cross-site query

You might note that the final version does not preserve the order, an issue mentioned below. Clicking on the column header in this specific case will work, but that won't always be practical.

Example: Count of users per site

For a query as simple as SELECT COUNT(*) FROM Users (1, 2), the procedure is a simpler way to handle this, and we can also avoid scanning growing tables by using sys.partitions instead. Here is the version I wrote:

EXEC dbo.sede_ineachdb @SQLCommand = N'

  SELECT Site      = CONCAT([$site_url$],''|'',[$long_name$]),
         UserCount = SUM([rows])
    FROM sys.partitions
   WHERE [object_id] = OBJECT_ID(N''Users'')
     AND index_id IN (0,1);';

If you need to filter or aggregate, that obviously won't work. You may want to know, for example, how many users were created on each site last month:

EXEC dbo.sede_ineachdb @SQLCommand = N'
     SELECT SiteName = [$site_name$],
            NewUsers = COUNT(*) 
       FROM Users
      WHERE CreationDate >= DATE_BUCKET(MONTH, 2, GETDATE())
        AND CreationDate <  DATE_BUCKET(MONTH, 1, GETDATE());';

Example: My badge count on each site

Toward the end of this answer, I showed how you could use the dbo.sede_users view to get badges by class across sites. The stored procedure makes this collection easy; unfortunately, the stored procedure doesn't currently support CTEs, so the query has been reorganized slightly to use a subquery instead. Otherwise, the version I wrote is really not much more complicated than the single-site version:

EXEC dbo.sede_ineachdb @SQLCommand = N'

SELECT Url,
       reputation,
       question_count,
       answer_count,
       [bronze badges] = COALESCE([1], 0),
       [silver badges] = COALESCE([2], 0),
       [gold badges]   = COALESCE([3], 0)
FROM 
(
  SELECT su.user_id, 
         su.reputation,
         su.question_count,
         su.answer_count,
         Url = CONCAT([$site_url$], ''/users/'', 
               su.user_id, ''/?tab=badges&sort=class|'', [$long_name$])
    FROM dbo.sede_users AS su
   WHERE su.account_id = ##AccountId:int##
     AND su.site_id = [$site_id$]
) AS src
OUTER APPLY
    (
      SELECT class, BadgeCount = COUNT(*)
        FROM dbo.Badges AS b
       WHERE b.UserId = src.user_id
       GROUP BY class
    ) AS b
 PIVOT
 (
   MAX(BadgeCount) 
   FOR [class] IN ([1],[2],[3])
 ) AS p;';

Example: Top 50 by total reputation, network-wide

rene wrote this query to show how you could use the stored procedure to pull the display name out of each site's database for all users with network-wide reputation over 9,000.

I took this idea a little further, using the stored procedure and the above views to craft a query that is certainly more elaborate, and just returns the top 50 accounts, but returns additional information from each site - like their highest-scoring post (and its score).

It's a lot to look at, too much to transcribe here, but check out Top 50 accounts by network rep, plus their top post from their "best" site.


Known issues

Area Description
Output If you do something out of bounds, you might see messages referencing "6 to 8 weeks." These are not real estimates and do not imply forthcoming fixes or improvements. You might also see references to other memes, like "Nice try, robot." That doesn't mean I think you're a robot; at worst, it just means I have a weirder sense of humor than you. Some error messages are intentionally vague for various reasons. If you're getting an error trying to make your query working within the procedure call, please try making it function against a single database first. You won't be able to use the above tokens, though.
Output By default, if you send a PRINT or other non-SELECT command, the procedure will bless your heart and then just not do anything. This isn't C#; you can't store PRINT output. If you just want to print output using a PRINT or similar statement, also set @CollectResultsForMe = 0, like in one of the first examples here.
Output If you use @CollectResultsForMe = 1, CTEs are not currently supported. The problem is that the collection pattern uses INSERT <temp table> <your query> - which doesn't work if <your query> starts with WITH. I am contemplating ways they could be implemented (maybe a token like [$insert_goes_here$]?), but it isn't a critical gap since most CTEs can be converted to derived tables pretty easily. The ones that can't usually involve recursion, in which case we probably don't want you bludgeoning us with that anyway. 😂
Output There is still a 50,000 (or so) row limit in total, if using @CollectResultsForMe = 1. The procedure will stop collecting data for you once you exceed this threshold, and whatever's been collected for you so far, that's what you get (along with a message telling you which database it stopped on). So, you still need to be conservative about how much data you expect to pull back. If you try to collect results for all databases individually (e.g. running a SELECT using @CollectResultsForMe = 0) to get around this limit, without otherwise processing along the way, you'll likely have a challenging time paging through 364 query results and having a stable and responsive browser.
Output If you don't properly alias your output columns, we'll alias them for you, with names derived from NEWID() to prevent collisions - e.g. AA23347C.1, AA23347C.2, AA23347C.3, and so on.
Output The procedure doesn't support ordering or additional filtering or parameters, and will return data in whatever order SQL Server feels like. If you have fewer than 5,000 rows, you can sort by the column headers in the UI. If more, that might be an issue. The procedure does support nested EXEC, so you could dump the output into a #temp table and then implement any filtering or complex ordering expression(s) against the #temp table.
Performance Please use the metadata tokens where practical rather than joining against the existing sede_ views. Some views perform extraneous calculations that you won't always need and, with no support for push-down predicates, if you reference a view in a query against each database, you'll potentially be materializing it for all databases, every time.
Performance We're essentially doing the same thing here as your own manual looping, so performance is usually the same (or, where I can make it so, even better). In some cases, though, you might find performance to be a little worse, especially if you join to the sede_ views—but that's potentially just a trade-off for convenience. If you have a scenario where the old approach worked fine and the new approach is unacceptably slow or, worse, times out, please provide permalinks for both examples, and I'll try to take a look (just no promises about turnaround time).
Performance Interaction with existing tokens like [Post Link] has not been tested at all, so... use at your own risk.
Execution plan "Include execution plan" is not currently supported. It is not so much an issue of permissions or hiding what's going on, just the complexity of stripping out the SET SHOWPLAN commands and preventing the app from sending things separately. Also, not sure what you would do with 364 essentially identical copies of the execution plan. Suggestion: If you're trying to test the execution plan of your query, run it against just a single database first, without calling the procedure.

Other planned enhancements (later):

Area Description
Output A parameter like @IncludeDatabaseList/@ExcludeDatabaseList so that you can provide a specific set of databases to enumerate (or skip).
Output Parameters like @DatabaseNamePattern/@SitePattern so that you can match on a site or database name pattern, like '%math%'. You can currently do the latter in your query using: N'... WHERE LOWER([$site name$]) LIKE N''%math%''...'; (we've got some case-sensitive collations here and there, sorry).

Kudos

I did not win an Oscar here, so this is not some big acceptance speech, but I want to acknowledge that I did not do this on my own. Big thanks to Andy Mallon for once again (read: still/always) being a great sounding board for some of my wildest ideas, Glorfindel for putting in the grunt work for some of the more challenging cross-site queries I've come across, and to rene and starball for talking through this idea with me, kicking the tires on an early version, sending me great feedback, and providing multiple examples to showcase.

2
  • 3
    I asked for a view, not a stored procedure. So -1 from me. All kidding aside: This is a great extension and helpful to improve existing and future queries where data analysis across sites is needed. I also think this new stored procedure opens cross-site query-ing for users that found the previous methods above their capabilities.
    – rene
    Commented Apr 23 at 16:56
  • 3
    Very cool! Again, I'd like to reiterate: chat.meta.stackexchange.com/transcript/89?m=9786245#9786245 (@rene) Commented Apr 23 at 18:41

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .