Search Modules with the search API

Authored by dcausse on Fri, Jul 5, 8:07 AM.


	https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch="mw.wikibase"&gsrnamespace=828&prop=info&gsrqiprofile=empty&gsrinfo=totalhits&utf8=&gsrwhat=text&gsrlimit=500&format=json

	gsrnamespace=828
	Note that you can also put the namespace in the query itself as a prefix, e.g. "Module:search string", but I think gsrnamespace is the better choice if you know the namespace id.

	gsrqiprofile=empty
	This is an expert param. It's true that if ranking does not matter, skipping the reranking step might save some CPU cycles on the search cluster.

	utf8=
	did you mean utf8=1?

	gsrlimit=500
	This is the max we allow for normal use cases. If the number of pages you want to extract is always lower than 500, it is easy. If not, you might have to make multiple calls using the continuation system, but there is still a hard limit that prevents you from digging past the 10000th result; in other words, gsrlimit+gsroffset must be < 10000. There could be techniques to work around this limitation (continuation would not be based on offsets), but we never actually got time to expose them from the API.
	If you have to iterate and make multiple calls, note that the ordering is not guaranteed to be stable (it generally is, but...), so you might skip some docs. If this is important, you can set gsrsort, for instance gsrsort=create_timestamp_asc.
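
	The iteration described above can be sketched roughly as follows. This is a minimal sketch, not a reference client: the function names (base_params, http_fetch, iter_pages), the User-Agent string and the MAX_WINDOW constant are made up for illustration, and it assumes the API returns the usual "continue" object that gets merged into the next request.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
MAX_WINDOW = 10000  # per the note above: gsrlimit + gsroffset must stay below this


def base_params(search, namespace=828, limit=500, sort="create_timestamp_asc"):
    """Parameters from the URL above, plus gsrsort for a stable ordering."""
    return {
        "action": "query",
        "generator": "search",
        "gsrsearch": search,
        "gsrnamespace": str(namespace),
        "gsrwhat": "text",
        "gsrinfo": "totalhits",
        "gsrlimit": str(limit),
        "gsrsort": sort,
        "prop": "info",
        "utf8": "1",
        "format": "json",
    }


def http_fetch(params):
    """Perform one GET request and decode the JSON body (hypothetical helper)."""
    req = Request(
        API_URL + "?" + urlencode(params),
        headers={"User-Agent": "module-search-sketch/0.1 (example)"},
    )
    with urlopen(req) as resp:
        return json.load(resp)


def iter_pages(search, fetch=http_fetch, limit=500):
    """Yield page objects, following continuation until the result window ends."""
    params = base_params(search, limit=limit)
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get("pages", {}).values()
        cont = data.get("continue")
        if not cont:
            return  # no more results
        if int(cont.get("gsroffset", 0)) + limit >= MAX_WINDOW:
            return  # the next call would go past the hard limit
        # Merge the continuation values into a fresh copy of the parameters.
        params = {**base_params(search, limit=limit), **cont}
```

	Passing fetch as an argument is only a convenience so the loop can be tried against canned responses before pointing it at the real cluster.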

	gsrsearch="mw.wikibase"
	Here you perform a "phrase" query. If I understand correctly, you are interested in Scribunto modules.
	You could use insource: here. insource has 2 modes:
	insource:"mw.wikibase" runs a phrase query on the tokenized version of the source text. It should be fast, but it may also match text like "mw, wikibase" or "MW Wikibase". If you don't get much noise, this is the version I recommend.
	insource:/mw\.wikibase/ runs the regular expression engine. It should be slower but is very precise. It is case sensitive by default but does not take word boundaries into account, so it might match mw.wikibaseSomething. When using the regex, it's important to filter as much as you can using other criteria (here the filter on the namespace is very important). Note that the regex engine also has strong limitations (it supports a very limited number of features). It uses optimization techniques that only work if you have at least 3 consecutive chars in your expression, e.g. insource:/a.bc/ is not optimized, insource:/a.bcd/ is (it will first filter all docs having the trigram "bcd").
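
	To get a feel for the "3 consecutive chars" rule before sending a regex, something like the following rough check could be used. It is only a heuristic under my own assumptions, not the engine's actual logic: it treats a fixed set of characters as regex operators, counts a backslash-escaped char as one literal char, and ignores the fact that quantifiers can make a char optional. The function names are hypothetical.

```python
# Characters treated as regex operators by this sketch (an assumption, not
# the engine's exact syntax table).
REGEX_META = set('.?+*|{}[]()"#@&<>~')


def longest_literal_run(pattern):
    """Length of the longest run of consecutive literal characters."""
    best = run = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            run += 1  # an escaped char counts as one literal char
            i += 2
        elif ch in REGEX_META:
            run = 0  # an operator breaks the literal run
            i += 1
        else:
            run += 1
            i += 1
        best = max(best, run)
    return best


def has_trigram(pattern):
    """True if the pattern contains at least 3 consecutive literal chars."""
    return longest_literal_run(pattern) >= 3
```

	With the examples above, has_trigram("a.bc") is False and has_trigram("a.bcd") is True.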

	Since you are searching for code, you might want to exclude doc pages. This can be done with "contentmodel:Scribunto"; the query would look like: insource:"mw.wikibase" contentmodel:Scribunto
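
	If the query string is built by a tool, the two insource modes and the contentmodel filter can be combined with a small helper along these lines. This is only a sketch: module_source_query is a made-up name, and the caller is assumed to have already escaped the needle (quotes for the phrase mode, "/" and regex operators for the regex mode).

```python
def module_source_query(needle, regex=False, code_only=True):
    """Build a gsrsearch value for searching module source text."""
    if regex:
        # Regex mode: precise but slower, e.g. insource:/mw\.wikibase/
        term = "insource:/" + needle + "/"
    else:
        # Phrase mode: fast, matches on the tokenized source text
        term = 'insource:"' + needle + '"'
    if code_only:
        # Restrict to Scribunto content, which excludes doc pages
        term += " contentmodel:Scribunto"
    return term
```

	For example, module_source_query("mw.wikibase") gives insource:"mw.wikibase" contentmodel:Scribunto, which can then be passed as gsrsearch together with gsrnamespace=828.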

	Overall, everything above is something you can use. You should not worry too much about perf, except when using regex (to avoid timeouts) and/or if you're writing an automated tool that will possibly make a lot of requests.
