I have a huge table (millions of rows) which, in essence, looks like this:
datatime tagname interesting somemore columns
2014-12-04 20:00:00 grp1_tagA 77 0 0
2014-12-04 20:00:00 grp1_tagB 88 0 0
2014-12-04 20:00:00 grp1_tagC 99 0 0
2014-12-04 20:00:00 grp2_tagA 11 0 0
2014-12-04 20:00:00 grp2_tagB 22 0 0
2014-12-04 20:00:00 grp2_tagC 13 0 0
2014-12-04 21:00:00 grp1_tagA 17 0 0
2014-12-04 21:00:00 grp1_tagC 28 0 0
2014-12-04 21:00:00 grp1_tagC 29 0 0
2014-12-04 21:00:00 grp2_tagA 31 0 0
2014-12-04 21:00:00 grp2_tagB 62 0 0
2014-12-04 21:00:00 grp2_tagC 53 0 0
2014-12-04 22:00:00 grp1_tagA 87 0 0
2014-12-04 22:00:00 grp1_tagB 48 0 0
2014-12-04 22:00:00 grp1_tagC 99 0 0
2014-12-04 22:00:00 grp2_tagA 51 0 0
2014-12-04 22:00:00 grp2_tagB 42 0 0
2014-12-04 22:00:00 grp2_tagC 53 0 0
In the real table, there are tens of groups, each group has ~100 tags, and for each group and tag there are several years' worth of hourly data (so a few tens of thousands of rows per tagname), currently amounting to about 8 million rows. At a later stage, other tables, which have a smaller time interval and are hence even bigger, will come into play.
I need a FAST way to get all data out of the table which has to do with a certain group (say, group 1, i.e. tagname starting with "grp1"), in some date range (data to be sent to some client's browser for visualization.)
So I want to produce a "group 1 digest" table like this:
A simplistic query would be something like this (dropping the date constraint for now):
SELECT A.`datatime` as `datatime`,
A.`interesting` as tagA, B.`interesting` as tagB, C.`interesting` as tagC
FROM `everything` A, `everything` B, `everything` C
WHERE
A.`datatime` = B.`datatime` AND
A.`datatime` = C.`datatime` AND
A.`tagname` = "grp1_tagA" AND
B.`tagname` = "grp1_tagB" AND
C.`tagname` = "grp1_tagC"
It's actually a little more complicated, because at some dates some tags might have data while others don't, and I also want the rows with partial data. So with one more row
what I want is
A possible query to this end is:
SELECT GLUE.thyme, A.iwant as tagA, B.iwant as tagB, C.iwant as tagC FROM
(SELECT distinct `datatime` as thyme from `everything`) GLUE left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagA") A on GLUE.thyme = A.thyme left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagB") B on GLUE.thyme = B.thyme left join
(SELECT `datatime` as thyme, `interesting` as iwant from `everything` where `tagname` = "grp1_tagC") C on GLUE.thyme = C.thyme
Problem: The "real world" version of this query is not fast enough. I tested the above query structure with 34 tag names (making 35 table joins), adding a date constraint like where/and datatime >= '2013-12-04' to each of the subqueries, so that a total of 8760 rows (i.e. 1 year of data) was returned. The resulting run time was 2 and a half minutes. I'm targeting something well below half a minute, which is the time to transfer the data over the internet.
The big table has a composite primary key index on datatime and tagname, and an index (key) on datatime.
How can I get the data faster with a better equivalent query?
Regarding the tagname column, you should at least consider having two separate columns, groupname and tagwithingroup. Doing so will make the queries easier to write, almost certainly. I'd review dropping the existing tagname column (or make it hold what I called tagwithingroup); you can create its value easily enough when you need to present it. You'd probably want an index on the groupname and tagwithingroup columns, maybe with the date/time column added as the third item in the index (which might then be a unique index).
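As a rough illustration of the split-column layout, here is a minimal sketch. It runs through Python's built-in sqlite3 module only so that it is self-contained; the SQL is plain enough that it should carry over to MySQL with the column types adjusted. The table name everything_split, the index name idx_group_tag_time, the column types, and the handful of sample rows are all assumptions made for the example. The digest query uses conditional aggregation (MAX over CASE) instead of the chain of left joins from the question; that is a swapped-in technique, and whether it is actually faster on the real table would have to be measured.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with group and tag in separate columns."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: names and types are assumptions, not the real schema.
    conn.execute("""
        CREATE TABLE everything_split (
            datatime       TEXT NOT NULL,
            groupname      TEXT NOT NULL,
            tagwithingroup TEXT NOT NULL,
            interesting    INTEGER
        )""")
    # Index on group and tag, with the date/time column as the third item.
    conn.execute("""
        CREATE UNIQUE INDEX idx_group_tag_time
        ON everything_split (groupname, tagwithingroup, datatime)""")
    # Demo rows only; grp1/tagB at 22:00 is left out to show a partial row.
    rows = [
        ("2014-12-04 20:00:00", "grp1", "tagA", 77),
        ("2014-12-04 20:00:00", "grp1", "tagB", 88),
        ("2014-12-04 20:00:00", "grp1", "tagC", 99),
        ("2014-12-04 20:00:00", "grp2", "tagA", 11),
        ("2014-12-04 22:00:00", "grp1", "tagA", 87),
        ("2014-12-04 22:00:00", "grp1", "tagC", 99),
    ]
    conn.executemany("INSERT INTO everything_split VALUES (?, ?, ?, ?)", rows)
    return conn


# One output row per datatime; a tag without data at that time comes out NULL.
DIGEST_SQL = """
    SELECT datatime,
           MAX(CASE WHEN tagwithingroup = 'tagA' THEN interesting END) AS tagA,
           MAX(CASE WHEN tagwithingroup = 'tagB' THEN interesting END) AS tagB,
           MAX(CASE WHEN tagwithingroup = 'tagC' THEN interesting END) AS tagC
    FROM everything_split
    WHERE groupname = ? AND datatime >= ?
    GROUP BY datatime
    ORDER BY datatime
"""


def group_digest(conn, group, since):
    """Return the pivoted rows for one group from a start date onwards."""
    return conn.execute(DIGEST_SQL, (group, since)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in group_digest(conn, "grp1", "2014-12-04"):
        print(row)
```

One difference from the left-join query in the question: this version only returns times at which the chosen group has at least one row, whereas the GLUE subquery there returns every distinct datatime in the table. The list of tags still has to be spelled out in the SELECT, so for ~100 tags the statement would normally be generated by the application.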