
TL;DR: Is SQL the way to go for accessing one huge data table?

As title. In astronomy we often have to deal with huge data files. Usually we manage with various optimisation methods inside Python, but RAM inevitably runs out, and 90GB ASCII files slow progress down or stop it outright. I have no experience with SQL, apart from a couple of hour-long YouTube videos.

Until now we have worked with files formatted as such:

# RA DEC umag uerr gmag gerr imag ierr Au Ag Ai dist dist_err [Fe/H] [Fe/H]_err
263.73815 -39.34179 22.866 0.229 19.830 0.007 17.064 0.016 4.682 3.738 2.030 8.731 0.858 -0.30 0.49
263.81260 -39.36577 22.640 0.037 19.964 0.003 17.220 0.007 5.204 4.154 2.256 8.754 0.886 -0.68 0.53
263.83048 -39.34130 22.891 0.451 20.358 0.000 17.639 0.006 5.073 4.050 2.200 10.951 1.150 -0.73 0.58
...

Here, each row represents one "star" and its measured values for the given file, as listed in the header. We have an incoming file which will weigh 90+GB and we need to prepare for the data access. We usually use ASCII tables (or ecsv) and NumPy, AstroPy, Pandas and other libraries common in astrophysics. ASCII tables also allow the use of awk, cat, and similar commands from the terminal. On top of that, the files can be opened in Notepad (or vim/gedit) and inspected if we need a reference.

I understand SQL is made for relational databases and that spreading data between tables is good practice. However, I do not see how it would be useful to split up tables such as ours, considering each star has only one set of values associated with it. Huge astronomical databases (such as GAIA) use ADQL (a variant of SQL) for accessing just one huge table, but I would say that is warranted since they have a vast amount of data spread across many columns, which SQL filters well. We, however, don't really go over 20 columns, and we use all of them.

These are the kinds of operations we intend to run when accessing the table:

  • Filter rows to keep only stars with an 'umag' value within a given range
  • Perform basic arithmetic on the values (let's say, + or - 180)
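Both of these operations can be expressed in a single SQL query. A minimal sketch using Python's built-in sqlite3 module (the table name "stars", the trimmed column set, and the sample values are assumptions for illustration, not our real schema):

```python
import sqlite3

# Hypothetical table mirroring a few of the columns above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stars (ra REAL, dec REAL, umag REAL)")
con.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    [(263.73815, -39.34179, 22.866),
     (263.81260, -39.36577, 22.640),
     (10.50000, 41.20000, 18.100)],
)

# Filter rows by an umag range and apply the arithmetic shift in the query,
# so only the matching subset ever reaches Python.
rows = con.execute(
    "SELECT ra - 180.0, dec, umag FROM stars WHERE umag BETWEEN 20 AND 23"
).fetchall()
print(rows)  # two of the three sample rows pass the filter
```

The point is that the filtering and arithmetic happen inside the database engine; Python only ever sees the already-reduced result set.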

In the future (if this works out) we'd probably add a couple of tables and some stars will appear in different catalogues. A common thing we do is search for stars that are present amongst two or more catalogues. So:

  • Obtain stars that are present in Table1 and Table2
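The "present in both catalogues" operation is exactly what SQL joins are for. A sketch, again with sqlite3 and invented table/column names, matching on a shared identifier (a real astronomical cross-match would instead match on RA/Dec within a tolerance):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table1 (star_id TEXT, umag REAL);
CREATE TABLE table2 (star_id TEXT, gmag REAL);
INSERT INTO table1 VALUES ('a', 22.8), ('b', 22.6), ('c', 22.9);
INSERT INTO table2 VALUES ('b', 19.9), ('c', 20.3), ('d', 21.0);
""")

# Stars present in both catalogues: an inner join on the shared identifier.
common = con.execute("""
    SELECT star_id, t1.umag, t2.gmag
    FROM table1 AS t1
    JOIN table2 AS t2 USING (star_id)
""").fetchall()
print(common)  # stars 'b' and 'c' appear in both tables
```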

After accessing the data, we'll need to work with it in Python, proceed with the science and make plots. There are other ways to optimise this work, such as using only .fits files and going deeper into software we already use (such as topcat). So:

  • How big is the impact of accessing and filtering data through an SQL query for the context of astronomy (as explained loosely above)?
  • What factors would you use to determine that impact?
  • How demanding for the hardware is importing the 90GB ASCII table for storage in an SQL server? Is it more efficient than storing the tables separately as fits (binary) tables and trying to operate directly on them?
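On the import question: the load itself does not need to hold the 90GB file in RAM if it is streamed in batches. A minimal sketch with sqlite3, reading the whitespace-delimited format shown above (the table name, the trimmed three-column schema, and the inline sample stand in for the real file):

```python
import io
import sqlite3

# Stand-in for the real 90 GB ASCII file; a real run would pass
# open("catalog.txt") instead of this StringIO.
sample = """\
# RA DEC umag
263.73815 -39.34179 22.866
263.81260 -39.36577 22.640
263.83048 -39.34130 22.891
"""

con = sqlite3.connect(":memory:")  # use a file path for a persistent DB
con.execute("CREATE TABLE stars (ra REAL, dec REAL, umag REAL)")

def batches(fh, size=50_000):
    """Yield rows in fixed-size batches so memory use stays flat."""
    batch = []
    for line in fh:
        if line.startswith("#"):
            continue  # skip the header line
        batch.append([float(v) for v in line.split()])
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

with con:
    for batch in batches(io.StringIO(sample)):
        con.executemany("INSERT INTO stars VALUES (?, ?, ?)", batch)

# Index the columns you filter on *after* the bulk load; the index is
# what makes range queries fast without scanning the whole table.
con.execute("CREATE INDEX idx_umag ON stars (umag)")
print(con.execute("SELECT COUNT(*) FROM stars").fetchone()[0])
```

The one-off import cost is mostly disk I/O; the recurring win is that indexed queries afterwards touch only the rows they need.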
  • A couple of thoughts: SQL (e.g. Postgres) is likely to be A LOT faster than Python. One factor to consider is how you will index the data. Indexes can massively impact speed of access, and it does not seem you have considered that. On the flip side, it likely takes more space on disk, and more memory, to parse a giant CSV. There is nothing wrong with a single big table.
    – davidgo
    Commented Sep 23, 2023 at 10:03
  • Note, you can access an SQL database from Python; select ... where ... order by ... is not overly complex, and you will get back a subset of the data that corresponds to your where clause - this is simpler than you might expect. E.g. pip install pyodbc might be a starting point.
    – Hannu
    Commented Sep 23, 2023 at 11:42
  • Are there better ways to store your data? Definitely! Is SQL likely to be one of those? Most likely. But this is really broad. I recommend having someone more familiar with databases or data management, and hopefully as into astronomical bodies as you, look at the data and advise how it might better be stored and served. Storing the data is only one part: you also need to retrieve and present it (the database is the "back-end", and there's a "front-end" user interface). There may already be solutions that exist, or this will require an entire system to be designed and built. Commented Sep 27, 2023 at 17:14
