
E023 Astrology as a Statistical Modeling Framework
Statistical Models in Astrology is the third part in the series on Statistical Constructs in Indian Astrology. This will be the final part written purely from a statistical perspective. A Tamil version of this essay is available here. There is also a YouTube video in Tamil language that elaborates the statistical dimensions of Indian astrology.
In the first part of this series, we explored why the rules of astrology must be validated. In the second part, we looked at some basic statistical concepts and methods, along with an example of how to predict an unknown quantity based on a known one.
In this third part, we will examine how a statistical or econometric model is constructed from data, what its outputs are, and what their utility is. You can consider this the last part written from the statistical side before we delve into the astrological side.
Who will benefit from this part?
I write this with the awareness that it could be difficult to understand for astrologers who are not familiar with statistics. The purpose of this essay is simply to give you an introduction. It is sufficient if you can grasp the gist of the article.
For those familiar with statistics and working in the field, this part may serve as a recap. This section might help you to re-evaluate what you already know and to correctly understand some concepts you may have previously overlooked.
Now, let’s dive straight into the article. In the second part of this series, we looked at an example question: how to estimate a person’s weight using only a measuring tape. This part aims to find a solution to that example based on sample data.
The goal of this part is to show you the real outcomes of statistics instead of merely telling a theoretical story. My underlying intention is that by doing so, your confidence in what I say will increase.
Statistical Models – Sample Data
This part is a compilation of a sample data analysis. The data used herein is imaginary. You can download this “|” delimited data to recreate the outputs I have shared in this article.
A screenshot from this dataset is shown below for your quick reference.

The key takeaway from this article should be the approaches and statistical principles highlighted. The underlying data may change, but the approaches do not.
Disclaimer*:
*This sample data may differ from reality. I request that you do not make any real-life decisions based on the rules or statistical formulas derived from this data and its results.
The Example Under Consideration
Let me remind you of the example I mentioned in the previous part of this series:
Let’s assume you are going for a job interview at an office. As part of the assessment, you are given a measuring tape. Seated opposite you are some men and women in middle age, average height, and build.
You are given a question: using this measuring tape alone, you must find the approximate weight of all of them. You are allowed to ask the interviewer further questions or request information or other data if needed. Now, what will you do? What kind of questions, data, or information will you ask for to quickly arrive at the answer?
In response to this, I listed four methods. Among them, I ranked methods like averages, formulas, and formulas refined based on data as the best solutions.
In this part, we will see how the formulas mentioned in Approach 4 can be derived using the sample data I mentioned above.
Our Example
Scientifically, many factors influence body weight: gender, height, age, diet, obesity, exercise habits, race (Asian, Caucasian, etc.) are some of the key determinants.
However, in our example question, we only have three types of data: gender, body measurements (height and girth), and weight. Since gender is a nominal variable, we already saw that it cannot be directly used in calculations.
We know that as people’s height increases, their weight also tends to increase. Nevertheless, not everyone of the same height has the same weight. We also know that body weight can change based on a person’s circumference or girth. People with bellies and rolls tend to weigh more, don’t they?
The circumference of a person’s abdomen (Stomach) can be used to measure girth. Additionally, the circumference of their buttocks/seat could also be used as a supplementary measure. It is natural for women to have larger buttocks, isn’t it? The circumference of other specific body parts might also be useful. However, we will try to find the answer using only the measurements of the variables we currently possess.
Statistical Models Step 1: Knowing the Data Range and Distribution
Let’s assume the data of the 190 individuals we have taken truly represents the general population. This basic assumption is crucial. If this is incorrect, the formula we derive may not work accurately for new individuals. For example, in astrology, if all the rules and experiences you have learned pertain only to Moving and Fixed signs, you won’t be able to predict accurately for those with dual (ubhaya) signs, right? It’s the same here!
This collection lists the data for 93 women and 97 men (a total of 190 people). The data for any two individuals is not the same. Therefore, if we were to explain each person individually, we might end up with 190 equations! Furthermore, we are not guaranteed that the data of the new person whose weight we are going to estimate will be among these 190. Thus, we cannot make a direct decision based only on this data. We need a more generalized solution.
Statistical Measures
Statistics has many techniques and calculations to summarize the distribution of data for all these sample individuals. The important ones are measures of central tendency and measures of dispersion. Key measures of central tendency include the mean (or average), median, and mode, while the minimum and maximum mark the two ends of the data range. The minimum (bottom), maximum (top), and mean (middle) for the four variables in our dataset are given in the image below.

This image visualizes that these values are slightly different for the two genders, male and female. Therefore, this suggests that it may be wiser to create separate statistical models or formulas based on gender.
We cannot reach direct conclusions with these summary measures alone. Using the mean, we can only make an approximate conclusion, as mentioned in Approach 2 in the previous part.
However, the minimum and maximum help us find the range within which our formulas will work. For example, the formula we create between women’s height and weight will only be applicable to women whose height is between 145 cm and 179 cm (the boundary used for model creation). When we use the formula to estimate the weight of women outside this height range, its reliability is reduced.
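As a rough illustration of this step, the sketch below computes the minimum, mean, and maximum per gender and checks whether a new person falls inside the sampled range. The heights are made-up values, not the article's dataset, and the names `summarize` and `in_training_range` are hypothetical helpers introduced only for this example.

```python
from statistics import mean

# Hypothetical heights in cm; these are NOT the article's 190 records.
heights_by_gender = {
    "female": [145.0, 152.5, 158.0, 163.0, 171.0, 179.0],
    "male": [158.0, 165.0, 170.5, 176.0, 183.0, 190.0],
}


def summarize(values):
    """Return the minimum, mean, and maximum of a list of numbers."""
    return {"min": min(values), "mean": mean(values), "max": max(values)}


def in_training_range(value, values):
    """True when value lies inside the range covered by the sample."""
    return min(values) <= value <= max(values)


summary = {gender: summarize(vals) for gender, vals in heights_by_gender.items()}
for gender, stats in summary.items():
    print(gender, stats)

# A 160 cm woman is inside the sampled range; a 185 cm woman is not.
print(in_training_range(160.0, heights_by_gender["female"]))
print(in_training_range(185.0, heights_by_gender["female"]))
```

A formula fitted on this sample would be trusted only for people for whom the range check returns True.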
Measures of dispersion tell us the narrower range, inside this boundary, where our formula will fit even better.
A Box Plot or Box and Whisker Plot for all variables is given below. The dark bar you see in the center of this plot indicates the data range within which your formula will work most reliably. In our example of women’s height, you can expect a formula created using height to estimate a woman’s weight best if her height is between 153 and 172 cm. Similar boundaries can be easily found for other variables.

This plot shows that the data for women’s stomach measurements is widely dispersed. The image also indicates that some abnormal values, or outliers (marked with a small red circle), could potentially affect your analysis. Such sample records should be removed before creating a formula. I have kept them in my analysis to show how they might affect the results.
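One common way to flag such outliers is the 1.5 x IQR rule that many box plots use. The sketch below assumes that rule and uses made-up stomach measurements; the article does not state which rule its plotting software applied, so treat both the rule and the data as assumptions.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence) using the k * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1  # interquartile range: the middle half of the data
    return q1, q3, q1 - k * spread, q3 + k * spread


def find_outliers(values, k=1.5):
    """Return the values lying outside the lower and upper fences."""
    _q1, _q3, low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical stomach circumferences in inches (not the article's data).
stomach_inches = [28, 29, 30, 30, 31, 32, 33, 34, 35, 52]

print(iqr_fences(stomach_inches))
print(find_outliers(stomach_inches))
```

Records flagged this way are candidates for review; whether to drop them remains a judgment call for the analyst.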
The equivalent to consider in astrology: In astrology, there is no clear definition of the range up to which a rule applies. Astrologers often fail in their predictions when they apply general astrological rules to outliers or special cases (data beyond the defined boundary).
Statistical Models Step 2: Knowing the Relationship Between Variables
In the next stage of our analysis, let’s look at how the data is organized between any two individual variables. In the images below, individual data points are displayed with Weight on the Y-axis and the Predictor Variables on the X-axis. Data points for men are shown in green and data points for women are shown in red. Each point represents the data for one person.
In the data plot of Height and Weight you see below, you can generally observe that the weight of men is higher than the weight of women. Furthermore, you can see that men’s weight increases more rapidly as their height increases. The weights of many men of the same height are clustered closely together. Conversely, for women, weight rises more slowly with height, and there is a greater variation in weight among women of the same height.

Similarly, we can see the data distribution for the Stomach circumference and Weight below. These data points are somewhat more spread out for both men and women. We see that even a slight increase in men’s waist size is associated with a rapid increase in weight.

The next image below shows the data distribution for the Buttocks circumference and Weight. Here too, the data distribution is somewhat wide. So far, we have looked at the relationship between the effect (Y) and the causes (X). Next, let’s see how the three X factors that help us predict are related to each other.

A crucial rule in statistical modeling is that the predictor variables should not be closely related to one another (a problem known as multicollinearity). If two predictor variables are highly correlated, it is enough to use only one of them in our equation, because the second adds little new information. A good statistical equation should explain the effect with the minimum number of explanatory variables, which is the principle of parsimony. Based on this, let’s look at the data distribution between our three factors here.
In the images below:

From the plot on the left, we understand that there is a direct relationship between Height and Waist size.
From the plot in the middle, we understand that there is a direct relationship between Height and Buttocks size.
In both of these plots, it is also evident that there is no separate distribution pattern for male and female.
Similarly, the plot on the right shows that there is a very close and strong correlation between Waist and Buttocks size. Statistics guides us that it is sufficient to use only one of these two in our formula.
We have now reasonably identified the variables for finding the equation. It seems that Height and one of the other two factors can be used to find the weight.
Let’s now view what we have seen so far as compiled knowledge rather than individual data points. The statistical formula called the Pearson Correlation Coefficient (r) summarizes the correlation between two variables as a single number. This coefficient can range from -1 to +1. If it is close to -1, there is a very strong negative correlation between the two variables, and if it is close to +1, there is a very strong positive correlation. If the value is 0, there is no linear correlation between the two variables. As a rough rule of thumb, a coefficient above 0.5 in absolute value suggests that one variable can be usefully estimated from the other.
Detailed explanations of these concepts can be found in any basic statistics textbook, and readers new to this subject are encouraged to explore and learn more. These explanations are typically available in high school textbooks as well.
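For readers who prefer to see the calculation, here is a minimal sketch of the Pearson correlation coefficient written from its textbook definition. The height and weight values are invented for illustration and are not taken from the article's dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical sample: taller people tend to weigh more.
heights_cm = [150, 155, 160, 165, 170]
weights_kg = [50, 54, 57, 62, 66]

print(round(pearson_r(heights_cm, weights_kg), 3))
```

A value near +1 here reflects the almost straight-line rise of weight with height in this toy sample.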
Let’s return to our article. I have calculated and provided the correlation coefficients between the variables we are considering, based on our data and separated by gender, below. We will only consider the cells colored green and yellow.

For example, in the Men’s table, the correlation coefficient between Height and Weight is 0.96. This confirms that height will be an excellent factor for finding weight. Next, the coefficient between Weight and Stomach is 0.68. This is higher than the Buttocks coefficient of 0.63. Therefore, Stomach size could be a superior predictive factor between the two. The coefficient between Stomach and Buttocks is 0.88. Since this is very high, only one of them will be sufficient for us.
Similarly, you can evaluate the table for women. All three variables can individually help us find the weight. The best among them is Height. Next, Stomach size, and then Buttocks size can be used.
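The reasoning applied to the men's table can be sketched as a small screening rule: when two predictors are highly correlated with each other, keep the one that correlates more strongly with weight. The correlations below are the men's values quoted above; the cutoff of 0.8 and the function name `screen_predictors` are assumptions made only for this illustration.

```python
# Men's correlations with Weight, as quoted in the article.
corr_with_weight = {"Height": 0.96, "Stomach": 0.68, "Buttocks": 0.63}

# Correlation between predictors; only this pair is quoted in the text.
corr_between = {("Stomach", "Buttocks"): 0.88}

# Hypothetical cutoff above which two predictors are treated as redundant.
REDUNDANCY_CUTOFF = 0.8


def screen_predictors(with_target, between, cutoff):
    """Drop the weaker member of every highly correlated predictor pair."""
    keep = set(with_target)
    for (a, b), r in between.items():
        if abs(r) >= cutoff and a in keep and b in keep:
            weaker = a if abs(with_target[a]) < abs(with_target[b]) else b
            keep.discard(weaker)
    # Strongest predictor of the target first.
    return sorted(keep, key=lambda name: -abs(with_target[name]))


selected = screen_predictors(corr_with_weight, corr_between, REDUNDANCY_CUTOFF)
print(selected)  # ['Height', 'Stomach']
```

This is only a first-pass filter; the regression steps later in the article decide which variables actually enter the equation.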
Take a pause here and think! Compare what we have seen so far with the individual variables used for prediction in astrology. All the astrological variables we use individually can be considered factors that help find the effects or events, just like in statistics. In astrology, do we have any measure, like the one we find in statistics, to determine which variable is the best? Do we have deeper qualifying metrics to decide which predictors are superior among the lot? With the current generation’s expertise in statistics, is it not our duty to create them?
Statistical Models Step 3: Calculating Data Relationships as an Equation or Formula (Estimation)
So far, we have only looked at the importance of variables. In the next stage, we will create the best equation(s). There are various methods to create such equations. The most important of them is Regression Analysis. Please refer to standard textbooks to learn more about this workhorse, which is used in most modern predictive modeling frameworks.
In this method, we attempt to find the equation of a straight line that passes through the data points in a two-dimensional space. The straight line must be positioned such that the sum of the squares of the errors in our prediction (the differences between reality and prediction) is the minimum (the Ordinary Least Squares, or OLS, technique). Accordingly, I have used this method to find the equation for weight through individual variables for both genders and displayed them in the graphs below.


Here, we find two things: a best-fitting mathematical equation (approximate, but reliable based on the data) and a measure of how well it fits the data (the coefficient of determination, R2, in this case). What makes statistics special is the additional calculations it provides for finding the best mathematical equation.
For example, the equation between women’s Height and Weight is:
y = 0.4916x - 18.545 | R2 = 0.8637
This means that a woman’s weight (y) can be calculated by the equation:
{Weight} = 0.4916 {Height (cm)} - 18.545
In this equation, the slope 0.4916 tells us that for every 1 cm increase in height, the estimated weight increases by 0.4916 kg. This equation is applicable for estimating the weight of women within the height range of 145 to 179 cm (mentioned as an example – do not use in reality).
The value R2 = 0.8637 shown here indicates that 86% of the variation occurring in weight can be explained by the changes occurring in height. You can similarly understand the remaining equations.
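To make the mechanics concrete, the following sketch fits a straight line by ordinary least squares and computes R2 from scratch. The sample values are hypothetical; only the last two lines use the article's women's equation, to show what the slope means in practice.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical sample (not the article's data).
heights_cm = [150, 155, 160, 165, 170, 175]
weights_kg = [55, 58, 59, 63, 66, 67]

slope, intercept = fit_line(heights_cm, weights_kg)
print(round(slope, 4), round(intercept, 3))
print(round(r_squared(heights_cm, weights_kg, slope, intercept), 4))

# The article's women's equation: one extra cm adds 0.4916 kg to the estimate.
article_estimate = 0.4916 * 160 - 18.545
print(round(article_estimate, 3))  # 60.111
```

The fitted numbers from the toy sample will differ from the article's, since the underlying data differ; the procedure is the point.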
Statistical Models – Refining the Equation
The six equations we saw above estimate weight as a result of a single variable. Three equations per gender is too many. Also, the full information content of the three types of variables we have has not been fully utilized yet. There are still unexplained variations in weight. If we include the other girth-related variables with height, our equation could be further improved.
First, we must see if the other two girth-related variables can additionally explain any variations in weight, after accounting for the variations in weight explained by height. Statistical calculations like partial correlation will help with this.
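A partial correlation can be sketched with the standard first-order formula, which removes the influence of a control variable from both of the other two. The data below are invented, and the function names are hypothetical; the article's own figures came from SAS and are not reproduced here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def partial_correlation(ys, zs, xs):
    """Correlation between ys and zs after controlling for xs."""
    r_yz = pearson_r(ys, zs)
    r_yx = pearson_r(ys, xs)
    r_zx = pearson_r(zs, xs)
    return (r_yz - r_yx * r_zx) / sqrt((1 - r_yx ** 2) * (1 - r_zx ** 2))


# Hypothetical measurements (not the article's data).
height_cm = [160, 165, 170, 175, 180, 185]
stomach_in = [30, 34, 31, 36, 33, 38]
weight_kg = [55, 62, 63, 72, 71, 81]

# How strongly is stomach size related to weight once height is accounted for?
print(round(partial_correlation(weight_kg, stomach_in, height_cm), 3))
```

A partial correlation well above zero suggests the girth variable still carries information about weight that height alone does not.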
Two variable model:
Now, let’s try to include girth along with height to explain weight in our equation. I have conducted such a data analysis, separated by gender, and provided the results for you below. This data analysis was done using the SAS 9.4 statistical software. The Ordinary Least Squares (OLS) method was used here as well. Additionally, I derived this equation by imposing some statistical constraints on the variables for them to enter the equation. The details of these analytical constraints cannot be covered within the scope of this article. If you wish to know, you can search for them online and learn.
We are trying to find the following equation here:
{Weight} = f({Height, Abdomen Circumference, Buttocks Circumference, Gender}) + e_i


Based on our data analysis, I have obtained the following equations:
Women
{Weight} = -17.42762854 + 0.404423695 {Height (cm)} + 0.358308166 {Stomach (inch)}
R2 = 0.9030 (Partial R2: Height = 0.8637 + Stomach = 0.0393 = 0.9030)
Men
{Weight} = -64.98131551 + 0.674555744 {Height (cm)} + 0.729737511 {Stomach (inch)}
R2 = 0.9648 (Partial R2: Height = 0.9264 + Stomach = 0.0384 = 0.9648)
For the men’s equation, Height explains 92.6% of the changes in weight, and adding the Stomach size helps to additionally find another 3.8% of the weight change. Together, these two help to predict a total of 96.5% of the weight change. We can similarly analyze the women’s equation.
In this equation, Buttocks size is not sufficient to explain weight beyond height and stomach size, so it is excluded.
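The reported figures can be cross-checked with simple arithmetic. Assuming the "Partial R2" values are the increases in R2 as each variable enters the model (the usual meaning in stepwise selection output), the share of the variation left over after height that stomach size explains is the increment divided by the unexplained remainder. The helper name below is hypothetical.

```python
def share_of_remaining_variation(r2_before, r2_increment):
    """Fraction of the still-unexplained variation captured by the new variable."""
    return r2_increment / (1 - r2_before)


# R2 values quoted in the article: height alone, then the gain from stomach size.
reported = {
    "women": {"height": 0.8637, "stomach_gain": 0.0393},
    "men": {"height": 0.9264, "stomach_gain": 0.0384},
}

for group, r2 in reported.items():
    total = r2["height"] + r2["stomach_gain"]
    share = share_of_remaining_variation(r2["height"], r2["stomach_gain"])
    print(group, round(total, 4), round(share, 3))
```

Under this reading, stomach size accounts for roughly half of the variation that height leaves unexplained for men, and a bit under a third for women, even though its raw increment looks small in both cases.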
Using these two formulas, we can find the weight of an individual who falls within our data boundaries. The men’s equation is clearly the stronger of the two. It can be said that the women’s equation has been slightly affected by some of the data points we did not exclude.
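The two equations can be applied as in the sketch below. The coefficients are copied from the article; the function names and the example inputs are hypothetical. Only the women's height range (145 to 179 cm) is stated in the text, so the range check returns None where no boundary is known. As the disclaimer says, the data are imaginary, so the outputs are for illustration only.

```python
# Coefficients quoted in the article (weight in kg, height in cm, stomach in inches).
COEFFICIENTS = {
    "female": {"intercept": -17.42762854, "height": 0.404423695, "stomach": 0.358308166},
    "male": {"intercept": -64.98131551, "height": 0.674555744, "stomach": 0.729737511},
}

# Only the women's height boundary is stated in the article.
HEIGHT_RANGE_CM = {"female": (145.0, 179.0)}


def estimate_weight(gender, height_cm, stomach_in):
    """Apply the gender-specific two-variable equation."""
    c = COEFFICIENTS[gender]
    return c["intercept"] + c["height"] * height_cm + c["stomach"] * stomach_in


def within_known_range(gender, height_cm):
    """True or False when a boundary is known, None when it is not."""
    if gender not in HEIGHT_RANGE_CM:
        return None
    low, high = HEIGHT_RANGE_CM[gender]
    return low <= height_cm <= high


# Hypothetical inputs.
print(round(estimate_weight("female", 160, 32), 2), within_known_range("female", 160))
print(round(estimate_weight("male", 175, 34), 2), within_known_range("male", 175))
```

Keeping the range check next to the formula makes it harder to apply the equation silently to someone outside the sampled boundaries.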
For those interested in knowing more, refer to the graph labeled Leverage vs RStudent in the image “Fit Diagnosis for Weight” below, where some data points are significantly distant (circled). I have also included additional statistical measures for your info.
Women:

Men:

From the tables above, we can see what percentage each of the two factors in our equations contributes to determining weight, using the STD Beta As % values given in the table. Height plays a 78.4% role in determining men’s weight. Stomach (waist) size plays the remaining 21.6% role. Similarly, in the women’s equation, Height plays a 75% role and Stomach (waist) size plays the remaining 25%. Remember that all these are based on imaginary data. Such statistical calculations improve the quality of our predictions.
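One plausible way to arrive at such percentages is to standardize each coefficient (multiply it by the spread of its variable and divide by the spread of weight) and then express each absolute value as a share of the total. This is an assumption about how "STD Beta As %" was derived, not a confirmed description of the article's SAS output. The standard deviations below are made up, so the printed shares will not match the table.

```python
def standardized_betas(coefs, sd_x, sd_y):
    """Rescale raw coefficients so they are comparable across units."""
    return {name: coefs[name] * sd_x[name] / sd_y for name in coefs}


def percent_shares(betas):
    """Express each absolute standardized beta as a percentage of their sum."""
    total = sum(abs(b) for b in betas.values())
    return {name: 100 * abs(b) / total for name, b in betas.items()}


# Men's raw coefficients from the article.
men_coefs = {"height": 0.674555744, "stomach": 0.729737511}

# Hypothetical standard deviations (cm, inches, kg); not from the article's data.
men_sd_x = {"height": 8.0, "stomach": 2.0}
men_sd_weight = 6.0

betas = standardized_betas(men_coefs, men_sd_x, men_sd_weight)
shares = percent_shares(betas)
print({name: round(value, 1) for name, value in shares.items()})
```

Standardizing first matters because the raw coefficients are in different units (kg per cm versus kg per inch) and cannot be compared directly.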
What we have seen so far is only a small harbor in the vast ocean of statistics. In the same way, through statistics, we can discover and confirm the hidden relationships in data as formulas.
Statistical Models – Article Summary
Let us summarize the gist of what we have seen so far:
- You would have seen that a variable or effect can be determined by many different explanatory variables or factors. Astrologically, this means that multiple astrological variables can help to find an outcome.
- We also found that there could be only a few superior factors that explain the outcome sufficiently well. We also learned how to statistically find the order of their importance. In astrological terms, this is about deciding which factor to look at first, in order of importance, when making predictions.
- We found that it is possible to obtain better equations using more than one factor than with individual factors. This means that better/superior results can be obtained using more than one astrological factor/variable/dimension. For example, you can give a better prediction if you include the Lagna (Ascendant) along with the Rasi (Moon Sign), right? Take it like that.
- We saw that not all the factors we know are needed to predict an outcome. That is, when we have sugarcane itself, we do not need the Iluppai flower (Mahua) for sweetness. We might resort to the Iluppai (Mahua) flower only if there is no sugar. In astrology, you can compare that with not needing all variables to predict a high level outcome.
- We saw that calculating the importance of variables in terms of statistical importance makes it easy to convey the importance of a factor without confusion. That is, we learned that we must find how much importance to give to each astrological variable.
- We learned the data boundaries within which a formula works correctly. That is, when an unknown horoscope falls outside the definitions of our known astrological rules, it is best to avoid predicting an unknown matter about it.
Now, imagine how amazing your predictions would be if all these superior statistical methods you have studied were available in astrology. However, it is a very sad and unsettling fact that there are no such statistical confirmations available for any variable, rule, or formula used to determine an effect in astrology. Yet thinking about how to develop those statistical measures is the necessary first step toward finding a solution, right?
My Views and Some Things for Your Consideration:
From what I know so far, I would say that the fundamental statistical constructs for astrological rules were laid very strongly by our predecessors in astrology. I ask you a question to prove this: Have you ever noticed any two astrological variables that are exactly the same? For example, have you seen the Rasi lords, planetary aspects, exaltation/debilitation, ruling, or trine houses being the same anywhere? Can you guess why our predecessors created so many unique astrological variables like Bhava (house), Bhava Lord, Graha (planet), Karaka (signification), Exaltation, Debilitation, Retrogression, Parivartana (exchange), Graha Yuddha (planetary war), Combustion (Astangama), Shadbala (six-fold strength), Rasi Chart, Navamsa, Shodasa Varga (sixteen divisional charts), Dasa-Bhukti (planetary periods), Ashtakavarga, etc.?
Why so many variables?
If you understand why sugarcane and the Iluppai flower were created in this world, you will understand why our predecessors created so many astrological variables. All they tried to achieve through these many variables and constructs was how to neatly solve a very complex, large puzzle. Their expert knowledge of mathematics has remained with us as astrological rules. It is always better to know how to cook a meal than just knowing how to eat it, isn’t it?
I conclude the statistical part of this article series here in my attempt to connect the two different worlds of Astrology and Statistics. In the upcoming parts, we will directly approach each of the astrological constructs from a statistical perspective.
Coming up next…
In the next part, I will try to introduce you to the important astrological variables of the traditional Parasari method with their statistical uniqueness and correlations. It is going to be like a kid trying to identify, one by one, all the building blocks of a large ship. When you read it, you are sure to be astonished by the statistical dimensions that exist in astrology that you have overlooked until now! 😉
Thank you for reading this far! 😊
Feedback and sharing are welcome.
