Talk:Quarry

About this board

Previous discussion was archived at Talk:Quarry/Archive 1 on 2015-04-17. This is the discussion area for Quarry itself and for help with individual queries.

Quarry / SQL query - how to fix OperationalError('table resultsets already exists')?

Gluo88 (talkcontribs)

I was able to complete the above SQL query, as seen in https://quarry.wmcloud.org/history/84807/911494/884555.

However, when I ran the same query again at https://quarry.wmcloud.org/query/84807, I got the error "OperationalError('table resultsets already exists')". How should I fix this error?

I guess that resultsets may be a temporary table produced by running my previous SQL query. I tried to execute the following:

DROP TABLE IF EXISTS resultsets;

However, I got the following error: "Access denied; you need (at least one of) the SUPER, READ_ONLY ADMIN privilege(s) for this operation".

Can anyone help me with this issue?

Thanks.

Gluo88 (talkcontribs)

I just found that the above query has now completed. The issue seems to have resolved itself, at least for now, although I still don't know the reason for the "OperationalError('table resultsets already exists')".

However, my query https://quarry.wmcloud.org/query/84817 has just failed with the same "OperationalError('table resultsets already exists')".

Does anyone know how to handle "OperationalError('table resultsets already exists')"?

Thanks.

Gluo88 (talkcontribs)

The issue seems to have resolved itself again. Did someone help in the background?

Thanks a lot.

Quarry / SQL optimization - using DB indexes?

Fl.schmitt (talkcontribs)

Hi, I'm trying to optimize a simple SQL query for pages on commonswiki (table page) that start with a certain string, e.g. SELECT * FROM page WHERE page_title LIKE "Building%" ORDER BY page_title; (see also https://quarry.wmcloud.org/query/83277 for an example with a smaller result set).

The SQL Optimizer on Toolforge tells me that this query would use filesort instead of indexes, which causes performance issues ("Query plan 1.1 is using filesort or a temporary table. This is usually an indication of an inefficient query. If you find your query is slow, try taking advantage of available indexes to avoid filesort."). The DB schema documentation tells me that there should be an index (key: page_name_title) on the page_title and page_namespace columns. The MySQL docs tell me that I could add USE INDEX (page_name_title) to force the use of that index. But if I add that clause to my query (SELECT * FROM page USE INDEX (page_name_title) WHERE page_title LIKE "Building%" ORDER BY page_title;), the SQL optimizer complains: "Query error: Key 'page_name_title' doesn't exist in table 'page'".

At the Commons village pump, I've learned that the replicas may lack the indices. So I'm not sure whether there's a way to optimize such a query on Quarry. I would prefer using Quarry over the Toolforge CLI because Quarry allows for linking queries and results.

Matěj Suchánek (talkcontribs)

This query cannot use the index since you don't have a condition on page_namespace.
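
For illustration, a sketch of the amended query (assuming namespace 0 here; substitute whichever namespace you need). With the leading page_namespace condition, the (page_namespace, page_title) index can satisfy both the filter and the ORDER BY:

SELECT * FROM page
WHERE page_namespace = 0          # leading column of the page_name_title index
AND page_title LIKE 'Building%'   # range scan on the second index column
ORDER BY page_title;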

Fl.schmitt (talkcontribs)

Aww - ok :-) - yes, with such a condition, it works like a charm - thanks a lot!

Long Query Time

SoySauceOnRice (talkcontribs)

Hi, I am doing some research on sockpuppets. I'm trying to use the SQL database to assist in building my dataset, but queries seem to take a long time.

For instance, this previous query ran in 196 seconds:

https://quarry.wmcloud.org/query/61732

Whereas my identical query has been going for over 9 hours:

https://quarry.wmcloud.org/query/83588

A previous instance had been running for over a week before I restarted it.

Am I doing something wrong, or are these running times typical?

On another note, I have been looking for complete SQL dumps to run my own instance, so that I don't cause issues with excessive queries; however, I can only find complete XML dumps (https://dumps.wikimedia.org/enwiki/20240601/).

Is there any way to get a full SQL dump (complete with article revision history)?

Thank you for your help.

Matěj Suchánek (talkcontribs)

Using the Toolforge CLI, the query returns 187814 rows in 31.187 sec. Maybe that's too much for Quarry.

Query for wikipedia user registrations by day

Felipeangelim (talkcontribs)

Hi! I was trying to get the time series of new user registrations on Wikipedia, by day (two columns, day and count), but my query is really slow. Am I doing something wrong?

SELECT
  DATE(user_registration) AS registration_date,
  COUNT(*) AS user_count
FROM
  user
WHERE
  user_editcount > 10
GROUP BY
  DATE(user_registration)
ORDER BY
  registration_date;
Matěj Suchánek (talkcontribs)

There is no index on user_registration or user_editcount, so you are doing nothing wrong; it's just impossible to use a better query plan than scanning the whole table.
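
You can confirm this yourself with EXPLAIN (a sketch; the exact plan output depends on the MariaDB version running on the replicas):

EXPLAIN
SELECT DATE(user_registration) AS registration_date, COUNT(*) AS user_count
FROM user
WHERE user_editcount > 10
GROUP BY DATE(user_registration);

# Expect type = ALL in the plan, i.e. a full scan of the user table.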

Matěj Suchánek (talkcontribs)

You can try querying Special:Log/newusers instead. For example:

SELECT LEFT(log_timestamp, 8), COUNT(*)
FROM logging_logindex
JOIN user ON user_name = REPLACE(log_title, '_', ' ')
WHERE log_type = 'newusers'
AND log_action IN ('create', 'newusers')
AND log_timestamp > '2006'
AND user_editcount > 10
GROUP BY LEFT(log_timestamp, 8);

There are even more possible values for log_action, but I'm not sure whether the JOIN would work for them, too.

All files in a category (for categories with million+ files)

Schlurcher (talkcontribs)

Hi, I'm trying to efficiently get a list of all files in a category with 1'000'000+ files, for example all files on Commons in "Category:Flickr images reviewed by FlickreviewR 2". This is the simplest query I could come up with, but it takes forever to produce output: https://quarry.wmcloud.org/query/83527

Is there a better way, or does this require another tool (and if so, which one)?

TheDJ (talkcontribs)

You would have to write a dedicated tool that splits the request into multiple chunks and writes the results to a file. You could use the Special:MyLanguage/API:Categorymembers API, for instance, listing 500 entries per request (the limit) and passing the continue parameter from one request to the next. Or get a database dump installed locally and export the list from there. A million is a lot; it is not a type of request that Wikimedia optimises for.
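
If you stay on the replicas, the same chunking idea can be sketched in SQL with keyset pagination (the category title is the one from the question; the chunk size of 500 is arbitrary):

SELECT cl_from, page_title
FROM categorylinks
JOIN page ON page_id = cl_from
WHERE cl_to = 'Flickr_images_reviewed_by_FlickreviewR_2'
AND cl_from > 0   # replace 0 with the last cl_from of the previous chunk
ORDER BY cl_from
LIMIT 500;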

Also, depending on what you are trying to achieve, there might be alternate or better ways to reach that goal than listing a million-plus entries in one go.

Help finding articles with a given template

Wizmut (talkcontribs)

I am trying to find articles that use "Template:Auto short description" but not "Template:Short description". Is this possible?

Matěj Suchánek (talkcontribs)

Though it is possible using an SQL query, you can also use the default search.
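
For reference, the search approach would look something like this in the default search box (CirrusSearch keywords; a sketch for the request as stated):

hastemplate:"Auto short description" -hastemplate:"Short description"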

Wizmut (talkcontribs)

Thank you. But I also need to sort by the length of the induced short description, so it has to be SQL.

Matěj Suchánek (talkcontribs)

I think I misunderstood your request. "Template:Short description" is used inside articles, but "Template:Auto short description" is only for documentation purposes. So you actually want articles with a template documented as "short description inducing" without "Template:Short description", right?

I made this query (hope it's correct):

SELECT page_title, LENGTH(pp_value) AS len, pp_value
FROM page_props
JOIN page ON page_id = pp_page
LEFT JOIN templatelinks ON tl_from = page_id AND tl_target_id = 137  # Short_description
WHERE pp_propname = 'wikibase-shortdesc'
AND tl_target_id IS NULL
AND page_namespace = 0
AND page_is_redirect = 0
AND EXISTS (
 SELECT 1 FROM templatelinks
 JOIN linktarget ON lt_id = tl_target_id
 JOIN page ON page_namespace = lt_namespace AND page_title = lt_title
 JOIN categorylinks ON cl_from = page_id
 WHERE tl_from = page_id AND cl_to = 'Templates_that_generate_short_descriptions'
);

It took 10 minutes and the only returned entry was for en:Main_Page.

Wizmut (talkcontribs)

The "AND tl_target_id = 137" seems to filter in only articles with explicit SDs. But changing it to "AND tl_target_id <> 137" gives a list of template links that aren't T:short_description. I'm not sure how to make it return a list of pages instead of template links.

Quarry appears to have stopped working properly

The Anome (talkcontribs)

I seem to be having problems with queries running and/or completing. Is anyone else seeing this?

SQL query in Quarry for getting the current statements on a Wikidata item given its QID

Rdrg109 (talkcontribs)

I am learning how to use Quarry to query Wikidata data. I have written these two queries:

  1. Get the number of statements on a Wikidata item given its QID
  2. Get all revisions on a Wikidata item given its QID

I now want to write a SQL query that gets the current statements on a Wikidata item given its QID (the returned rows should be the same as those returned by this SPARQL query). This is my initial idea of how to do it: I know that the revision table contains all the revisions of a given page and that the comment table contains information on the introduced changes, so one way to do this is to parse the comment_text column (see an example in the results of query 2 above), which describes the changes, and determine the latest ones. I think this method is complex to implement in SQL, since I would need to determine which changes were not edited by any later changes. I wonder if there's a simpler approach, or a table that already contains the current statements on a Wikidata item.

I know that the current statements on a Wikidata item can easily be obtained with SPARQL (this query already does that), but since I'm learning how to query data in Quarry (i.e. using SQL), reading a SQL query that does this would help me understand more about how data is stored and queried in Quarry.

Matěj Suchánek (talkcontribs)

"I wonder if there's a simpler approach or a table that already contains this information." SQL queries are not suitable for the Wikidata data model; that's why the Wikidata Query Service exists. Some SQL queries are possible, but they are rather management-oriented, not data-oriented.
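
For example, a management-oriented query that is feasible on the replicas (a sketch against wikidatawiki; it counts an item's revisions rather than reading its statements):

SELECT COUNT(*) AS revisions
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = 0
AND page_title = 'Q42';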

Rdrg109 (talkcontribs)

For the record, I asked a similar question in Libera Chat's #wikimedia-cloud channel, and a user replied the following:

18:50 <rdrg109> For the record, I have created a topic with that question in Talk:Quarry here: https://www.mediawiki.org/wiki/Topic:Y3u6dz3ci6eqlura
19:08 <+wm-bb> <<hidden user>> rdrg109: it’s basically not doable. it’s best to use SPARQL instead
19:30 <rdrg109> <<hidden user>>: Ok, thanks for the help!
Dipsacus fullonum (talkcontribs)

It is impossible to reconstruct all claims from comment_text. There are several Wikibase API commands that create or edit entities without specifying in the comment text which claims were created, removed or modified. See, for example, the creation of d:Q125692383 today, which was created with 5 claims at once without that being visible in the comment.

Internal server error when trying to stop a query

Tom.Reding (talkcontribs)

I tried using two different browsers, Chrome and Firefox, but got the same result. Queries 79955 and 82289 have been in queued status for ~18 hours.

task T363644
