TL;DR: Is SQL the way to go when trying to access one huge data table?
As the title says. In astronomy we often have to deal with huge data files. Usually we can manage with various optimisation methods inside Python. However, RAM inevitably runs out, and 90 GB ASCII files slow progress or stop it outright. I have no experience with SQL beyond a couple of hour-long YouTube videos.
Until now we have worked with files formatted as follows:
# RA DEC umag uerr gmag gerr imag ierr Au Ag Ai dist dist_err [Fe/H] [Fe/H]_err
263.73815 -39.34179 22.866 0.229 19.830 0.007 17.064 0.016 4.682 3.738 2.030 8.731 0.858 -0.30 0.49
263.81260 -39.36577 22.640 0.037 19.964 0.003 17.220 0.007 5.204 4.154 2.256 8.754 0.886 -0.68 0.53
263.83048 -39.34130 22.891 0.451 20.358 0.000 17.639 0.006 5.073 4.050 2.200 10.951 1.150 -0.73 0.58
...
Here, each row represents one "star" and its measured values for the given file, as described in the header. We have an incoming file which will weigh 90+ GB, and we need to prepare for the data access. We usually use ASCII tables (or ecsv) and NumPy, AstroPy, Pandas, and other libraries commonly used in astrophysics. ASCII tables also allow the use of awk, cat, and similar commands from the terminal. On top of that, the files can be opened in Notepad (or vim/gedit) and inspected when a reference is needed.
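For context on the RAM problem: pandas can already stream a file like this in chunks rather than loading it whole. A minimal sketch, with the column names taken from the header above ([Fe/H] renamed to FeH for convenience) and a tiny in-memory stand-in for the real file:

```python
import io
import pandas as pd

# Column names from the header above ([Fe/H] renamed FeH for convenience).
COLS = ["RA", "DEC", "umag", "uerr", "gmag", "gerr", "imag", "ierr",
        "Au", "Ag", "Ai", "dist", "dist_err", "FeH", "FeH_err"]

# Small in-memory stand-in for the 90 GB file (the three sample rows above).
sample = io.StringIO(
    "263.73815 -39.34179 22.866 0.229 19.830 0.007 17.064 0.016 "
    "4.682 3.738 2.030 8.731 0.858 -0.30 0.49\n"
    "263.81260 -39.36577 22.640 0.037 19.964 0.003 17.220 0.007 "
    "5.204 4.154 2.256 8.754 0.886 -0.68 0.53\n"
    "263.83048 -39.34130 22.891 0.451 20.358 0.000 17.639 0.006 "
    "5.073 4.050 2.200 10.951 1.150 -0.73 0.58\n"
)

# chunksize turns read_csv into an iterator: only one chunk sits in RAM at a time.
n_kept = 0
for chunk in pd.read_csv(sample, sep=r"\s+", names=COLS, chunksize=2):
    n_kept += len(chunk[chunk["umag"] < 22.8])
print(n_kept)  # 1 of the 3 sample stars passes the cut
```

This doesn't replace SQL, but it is the cheapest way to keep the current ASCII workflow usable while deciding.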
I understand SQL is made for relational databases and that spreading data across tables is good practice. However, I do not see how it would be useful to split tables like ours, considering each star has only one set of values associated with it. Huge astronomical databases (such as GAIA) use ADQL (a variant of SQL) to access just one huge table, but I would say that is warranted since they have a vast amount of data spread across many columns, which SQL filters well. We, however, don't really go over 20 columns, and we use all of them.
These are the kinds of operations we intend to run when accessing the table:
- Filter rows to only keep stars with an 'umag' value within a given range
- Perform basic operations on the values (let's say, + or - 180)
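Both of those operations map onto a single SQL query. A minimal sketch using Python's built-in sqlite3 module (the `stars` table name and the choice of shifting RA by 180 are illustrative, not anything prescribed):

```python
import sqlite3

# In-memory SQLite database with a hypothetical "stars" table;
# column names follow the header shown earlier.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stars (RA REAL, DEC REAL, umag REAL)")
con.executemany("INSERT INTO stars VALUES (?, ?, ?)", [
    (263.73815, -39.34179, 22.866),
    (263.81260, -39.36577, 22.640),
    (263.83048, -39.34130, 22.891),
])

# Filter to a umag range and shift RA by -180 in the same query.
rows = con.execute(
    "SELECT RA - 180.0, DEC, umag FROM stars "
    "WHERE umag BETWEEN 22.5 AND 22.8"
).fetchall()
print(rows)  # one matching star
```

The filtering and the arithmetic both happen inside the database, so only the matching rows ever reach Python.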
In the future (if this works out) we'd probably add a couple of tables, and some stars will appear in different catalogues. A common thing we do is search for stars that are present in two or more catalogues. So:
- Obtain stars that are present in Table1 and Table2
After accessing the data, we'll need to work with it in Python, proceed with the science and make graphics. There are other ways to optimise this work, such as using only .fits files and going deeper into software we already use (such as topcat). So:
- How big is the impact of accessing and filtering data through an SQL query in the context of astronomy (as loosely described above)?
- What factors would you use to determine that impact?
- How demanding on the hardware is importing the 90 GB ASCII table into an SQL server? Is it more efficient than storing the tables separately as fits (binary) tables and operating on them directly?
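On the import side, one option worth knowing about is SQLite: a single file on disk, no server process, and the load can be streamed so peak RAM stays small regardless of file size. A hedged sketch, abbreviated to three columns and an in-memory database standing in for a file on disk:

```python
import io
import sqlite3
import pandas as pd

# Abbreviated to three columns; the real table would list all ~20.
COLS = ["RA", "DEC", "umag"]
sample = io.StringIO(
    "263.73815 -39.34179 22.866\n"
    "263.81260 -39.36577 22.640\n"
    "263.83048 -39.34130 22.891\n"
)

con = sqlite3.connect(":memory:")  # use a filename here for a real 90 GB import
# Append chunk by chunk, so only one chunk is ever held in memory.
for chunk in pd.read_csv(sample, sep=r"\s+", names=COLS, chunksize=2):
    chunk.to_sql("stars", con, if_exists="append", index=False)

n = con.execute("SELECT COUNT(*) FROM stars").fetchone()[0]
print(n)  # 3
```

The import is I/O-bound rather than RAM-bound, which is the property that matters for a 90 GB file.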
select ... where ... order by ...
is not overly complex, and you will get back the subset of the data that matches your `where ...` clause.
- this is simpler than you might expect; e.g. `pip install pyodbc` might be a starting point.
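Concretely, that pattern needs nothing beyond the standard library if you start with SQLite instead of a full server (pyodbc becomes relevant once an actual database server is in play). A sketch of select/where/order by with illustrative values:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stars (RA REAL, umag REAL)")
con.executemany("INSERT INTO stars VALUES (?, ?)",
                [(263.738, 22.866), (263.813, 22.640), (263.830, 22.891)])

# select ... where ... order by ...: the subset comes back already sorted.
rows = con.execute(
    "SELECT RA, umag FROM stars WHERE umag < 22.9 ORDER BY umag"
).fetchall()
print([umag for _, umag in rows])  # [22.64, 22.866, 22.891]
```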