Paste P65853

Search Modules with the search API

Authored by dcausse on Fri, Jul 5, 8:07 AM.
https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch="mw.wikibase"&gsrnamespace=828&prop=info&gsrqiprofile=empty&gsrinfo=totalhits&utf8=&gsrwhat=text&gsrlimit=500&format=json
gsrnamespace=828
Note that you can also put the namespace in the query itself as a prefix, e.g. "Module:search string", but I think gsrnamespace is the better option when you know the namespace ID.
gsrqiprofile=empty
This is an expert parameter. It's true that if ranking does not matter, skipping the reranking might save some CPU cycles on the search cluster.
utf8=
did you mean utf8=1?
gsrlimit=500
This is the maximum we allow for normal use cases. If the number of pages you want to extract is always below 500, it's easy. If not, you will have to make multiple calls using the continuation system, but there is still a hard limit that prevents you from digging past the 10000th result; in other words, gsrlimit+gsroffset must be < 10000. There are techniques that could work around this limitation, but we never got the time to expose them through the API (continuation would then not be based on offsets).
If you have to iterate over multiple calls, note that the ordering is not guaranteed to be stable (it generally is, but...), so you might skip some docs. If this matters, you can use gsrsort, for instance gsrsort=create_timestamp_asc.
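The continuation loop and the 10000-result window described above could be sketched as follows. This is only an illustration under assumptions: iter_search_pages and the fetch callable are hypothetical names (fetch stands for whatever HTTP client you use and must return the decoded JSON response), the response is assumed to use the default JSON layout where query.pages is an object keyed by page ID, and the bound is taken literally from the note above.

```python
from typing import Callable, Dict, Iterator

# Taken from the note above: gsrlimit + gsroffset must stay below this value.
MAX_WINDOW = 10000


def iter_search_pages(fetch: Callable[[Dict[str, str]], dict],
                      base_params: Dict[str, str]) -> Iterator[dict]:
    """Yield pages from successive search calls, following API continuation.

    `fetch` is a hypothetical callable that performs one API request with the
    given parameters and returns the decoded JSON response as a dict.
    """
    params = dict(base_params)
    limit = int(base_params["gsrlimit"])
    while True:
        offset = int(params.get("gsroffset", "0"))
        if offset + limit >= MAX_WINDOW:
            break  # going deeper would exceed the hard limit
        data = fetch(params)
        # Assumes the default JSON layout: pages is a dict keyed by page ID.
        yield from data.get("query", {}).get("pages", {}).values()
        cont = data.get("continue")
        if not cont:
            break  # no more results
        # Merge the continuation values (e.g. gsroffset) into the next request.
        params = {**base_params, **{k: str(v) for k, v in cont.items()}}
```

Passing gsrsort=create_timestamp_asc in base_params keeps the ordering stable across the calls, as suggested above.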
gsrsearch="mw.wikibase"
Here you perform a "phrase" query. If I understand correctly, you are interested in Scribunto modules.
For this you can use insource:, which has 2 modes:
insource:"mw.wikibase", which runs a phrase query on the tokenized version of the source text. It should be fast but may also match text like "mw, wikibase" or "MW Wikibase". If you don't get much noise, this is the version I recommend.
insource:/mw\.wikibase/, which runs the regular expression engine. It should be slower but is very precise. It's case sensitive by default but does not take word boundaries into account, so it might match mw.wikibaseSomething. When using a regex it's important to filter as much as you can using other criteria (here the filter on the namespace is very important). Note that the regex engine also has strong limitations (it supports a very limited number of features). It relies on optimization techniques that only work if you have at least 3 consecutive chars in your expression, e.g. insource:/a.bc/ is not optimized, while insource:/a.bcd/ is (it will first filter all docs having the trigram "bcd").
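To check quickly whether a regex is likely to benefit from the trigram optimization, a rough heuristic is to look for a run of at least 3 consecutive literal characters. The sketch below is not the search engine's actual logic: longest_literal_run and likely_optimized are hypothetical helpers, they ignore alternation and grouping, and they conservatively treat escaped letters or digits as non-literal.

```python
# Characters that end a run of literal characters in this rough heuristic.
BREAKING = set(".+|()[]}^$\\")
# Quantifiers that make the preceding character optional.
OPTIONAL = set("*?{")


def longest_literal_run(pattern: str) -> int:
    """Length of the longest run of mandatory literal characters (heuristic)."""
    best = run = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            if pattern[i + 1].isalnum():
                # Escaped letter or digit: treated as non-literal, to be safe.
                best, run = max(best, run), 0
            else:
                # Escaped punctuation such as \. is one literal character.
                run += 1
            i += 2
            continue
        if ch in OPTIONAL:
            # The character just before the quantifier may be absent.
            best, run = max(best, run - 1), 0
        elif ch in BREAKING:
            best, run = max(best, run), 0
        else:
            run += 1
        i += 1
    return max(best, run)


def likely_optimized(pattern: str) -> bool:
    """True if the pattern contains at least one literal trigram."""
    return longest_literal_run(pattern) >= 3
```

With the examples above, likely_optimized("a.bc") is False and likely_optimized("a.bcd") is True.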
Since you are searching for code, you might want to exclude doc pages. This can be done with "contentmodel:Scribunto"; the query would look like: insource:"mw.wikibase" contentmodel:Scribunto
Overall, everything above is something you can use. You should not worry too much about performance, except when using regexes (to avoid timeouts) or when writing an automated tool that might make a lot of requests.
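Putting the parameters together, the request URL could be built roughly like this. It is a minimal sketch: build_module_search_url is a hypothetical helper, it assumes utf8=1 was the intended value, and it does not escape double quotes inside the search term.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL at the top of this paste.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_module_search_url(term: str, limit: int = 500) -> str:
    """Build a search URL for Scribunto modules whose source contains `term`."""
    # Phrase query on the source text, restricted to Scribunto content.
    search = f'insource:"{term}" contentmodel:Scribunto'
    params = {
        "action": "query",
        "generator": "search",
        "gsrsearch": search,
        "gsrnamespace": "828",    # Module namespace
        "gsrqiprofile": "empty",  # expert param: skip reranking
        "gsrinfo": "totalhits",
        "gsrwhat": "text",
        "gsrlimit": str(limit),   # 500 is the max for normal use cases
        "prop": "info",
        "utf8": "1",              # assumption: utf8=1 was intended
        "format": "json",
    }
    # urlencode percent-encodes the quotes, colons and spaces in gsrsearch.
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_module_search_url("mw.wikibase"))
```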