Dear Habr, I think I need advice from a good DBA. Or a good architect. Or anyone.
The fact is that in one French village a programmer will retire this summer. And everything would be fine, but this programmer wrote 15 years ago and up to now has maintained a database of catalogs of double stars (an observatory is located right next to the village). Now this database of some kind of mutual cooperation agreement will move to our institute, where it will be further supported. Most likely supported by me.
As a close examination has shown, this program is now, rather, of museum interest and in such a form it cannot be placed on our server. This is a set of csh scripts in csh that select data from epic-sized text files. Awk, sed, and French are widely used in places where awk and sed are not enough. In general, we are now talking about the almost complete rewriting of everything from scratch, using SQL and any programming languages, we will not focus on them.
The question is: how to correctly transfer the following data structure to SQL:
')
The database consists of:
Catalogs. Approximate number - two or three dozen. It is possible to add new ones. Directories are either unchanged or updated entirely. Each directory is a table with a dozen or so fields. Also, a directory can contain sets of files (for example, light curves), with names referenced in the table. Each directory contains different fields, field sets intersect weakly. The search currently takes place only by the identifiers (names) of the components, but in the future it is also planned to search by a specific set of parameters, each parameter, in general, a function of several fields. Say, it will be necessary to find all the stars in a certain radius around a point on the celestial sphere, or, say, all stars like T Tau, with periods of <10 days.
Each
entry in the catalog corresponds to either an
object — a component of a double (or multiple) star, a minority of such directories, or a
pair — a double star. For some binary stars there is information only about one component and, say, about the orbital period. Just the second component is not visible. Stars can be not only double, but also multiples. Most often, a multiple star is broken into a set of doubles (in different directories this can be done in different ways). Upon request, you will need to find all the components of a multiple system. The number of entries in directories varies and varies from several tens to hundreds of thousands.
In addition, there are
cross-identification tables, created partly by hand, by different authors, sometimes conflicting. They determine the correspondence between each other in different directories.
As a result, it is necessary that upon request (by identifier or parameters) the user receives all available information stored in directories. And not only by the requested objects / pairs, but also by other components of the dual / multiple system, where the requested objects are included, if any.
Attention, question: How to correctly scatter such heterogeneous data on the tables so that the addition of new directories does not require changing the structure of the tables and the search and selection was fast enough?
I didn’t succeed in solving this problem, especially since I am a physicist rather than a DBA :)