Identification of Sources Causing Air Pollution in Indian Cities using Hierarchical Agglomerative Cluster Analysis

The distressing levels of air pollution in India are becoming a health hazard to the inhabitants. It is important to note that, due to the nation's continued urbanisation and its heavy reliance on coal for electricity generation, air pollution is expected to get worse in some areas of India over the next few decades. The present study aims to address the issue by identifying the sources causing air pollution using Hierarchical Agglomerative Cluster Analysis (HACA). Two years of daily data (2018 and 2019) from sixteen selected air pollution monitoring stations, downloaded from the publicly available source Kaggle.com, were used for the study. The stations were selected based on diverse environmental conditions and local sources. HACA was successful in grouping the monitoring stations into four clusters based on their average Air Quality Index (AQI) level. These four clusters are named Low Pollution, Moderate Pollution, High Pollution and Very High Pollution Regions (LPR, MPR, HPR and VHPR), with average AQI of 96, 135, 173 and 227 respectively. Discriminant Analysis (DA) confirmed the resulting clusters with 100% accuracy. It was found that stations with similar environmental factors, regional sources, and pollution levels were clustered together. Despite numerous actions taken by the authorities to reduce air pollution, it was noticed that topographical conditions play an essential role in the rise of pollution. This study helps the authorities concerned to implement different strategies based on local sources and topographical conditions.


Introduction
According to scientific studies, numerous Indian cities and regions experience poor air quality because of harmful emissions. Anthropogenic activities associated with rising urbanization and booming industrialization are the main cause of air pollutant emissions and poor air quality. For the past 20 years, the situation in India has been characterised by rising population growth and its effects on air quality. Air pollution has significantly increased cases of the following diseases: lung and tracheal cancers, lower respiratory infections, bronchitis, ischemic heart disease, and chronic obstructive pulmonary disease. 1 According to the World Health Organization, 2 air pollution was responsible for one out of every nine fatalities in 2012. Based on pollutant emissions, India was ranked the fifth most polluted nation, with 21 of the 30 most polluted cities located there. 3 The Health Effects Institute has listed air pollution as one of the top five global causes of death, making it a major issue for both global health and the environment. 4 The death toll from outdoor air pollution in India in 2019 was reported to be above 980,000. 5 Emissions including particulate matter (PM), surface ozone (O3), nitrogen oxides (NOx) and sulphur dioxide (SO2) have the potential to harm people's respiratory and cardiovascular systems.
Construction, motor vehicles, and dust are the main sources of PM10, whereas a variety of combustion processes, such as coal combustion, garbage burning, stubble burning and cooking, are the main sources of PM2.5. Because they are so tiny, particles with a diameter of 10 micrometers or less can enter the lungs and cause major health issues. Notably, fine particulate matter pollution is expected to get worse in some parts of India over the next few decades due to the nation's continuous urbanisation and strong reliance on coal for electricity production.
After particulate matter, nitrogen oxides (NOx), which are precursors to ground-level ozone (O3), are India's next pollutants of primary concern. According to the WHO, 19 cities have NO2 levels over the national annual standard (40 µg/m³), which is still anticipated to have detrimental effects on health, particularly on the lung development of new-borns (WHO, Air Quality Guidelines). After NO2 and O3, SO2 is the next most critical target for mitigation, because one city now exceeds the annual standard (50 µg/m³) and eleven more cities have moderate concentrations. Vasudha and Venkateswara Rao 6 used factor analysis and concluded that the technique can be used to classify different localities based on air pollutants. In their subsequent study of the same areas, they used multiple linear regression to identify the contribution of different air contaminants to AQI. 7 The main objective of the current study is to identify regions with extremely high and high levels of air pollution, as well as the contributing sources, in order to determine the best ways to reduce these levels and improve the locals' access to clean air.

Materials and Methods
Hierarchical Agglomerative Cluster Analysis (HACA)
In a hierarchical classification, the data are not immediately sorted into a predetermined number of classes or clusters. Instead, the classification consists of a series of partitions that can range from a single cluster containing all individuals to n clusters that each contain one individual. Hierarchical clustering approaches fall into two subcategories: agglomerative methods and divisive methods. Agglomerative methods progressively combine the n individuals into groups, whereas divisive methods gradually split the n individuals into smaller groupings. Both the agglomerative path and the divisive path produce hierarchical classifications that can be depicted by a dendrogram, a two-dimensional diagram that shows the fusions or divisions made at each stage of the analysis.
The most widely used hierarchical approaches are agglomerative techniques. They produce a sequence of partitions of the data, the first of which consists of n single-member "clusters" and the last of which consists of a single group that includes all n individuals. At each stage, the individuals or groups that are closest to one another (or most similar) are combined. In 1963, Ward developed Ward's method, 8 a particular form of agglomerative hierarchical clustering. Ward's approach creates clusters having the least within-cluster variance. Clustering is done using an analysis of variance approach rather than distance measurements alone. The method relies on calculating the error sum of squares (ESS), which is the sum of the squared distances between each point and its cluster mean.

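$$ESS_j = \sum_{i=1}^{n_j} \left( X_{ij} - \bar{X}_j \right)^2 \tag{1}$$
where $X_{ij}$ is the ith observation of the jth cluster, $\bar{X}_j$ is the mean of the jth cluster and $n_j$ is the number of observations in it. The error sum of squares for all clusters is the sum of the $ESS_j$ values:
$$ESS = \sum_{j=1}^{k} ESS_j$$
where k is the number of clusters.

VASUDHA & RAO, Curr. World Environ., Vol. 18(2) 580-588 (2023)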
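The error sum of squares described above can be illustrated with a short sketch. It assumes a single variable per observation (for example, a station's mean AQI). The function names `cluster_ess` and `total_ess` and the numbers are hypothetical and are not taken from the study's data.

```python
from statistics import fmean


def cluster_ess(values):
    """ESS of one cluster: squared deviations of its points from the cluster mean."""
    mean = fmean(values)
    return sum((x - mean) ** 2 for x in values)


def total_ess(clusters):
    """Total ESS: the per-cluster ESS values summed over all k clusters."""
    return sum(cluster_ess(c) for c in clusters)


# Illustrative (made-up) mean AQI values grouped into k = 2 clusters.
clusters = [[90.0, 100.0], [170.0, 180.0, 190.0]]
ess_per_cluster = [cluster_ess(c) for c in clusters]
ess_total = total_ess(clusters)
```

With several pollutant variables per station, the squared deviation would presumably become a squared Euclidean distance to the cluster centroid; the one-dimensional case is used here only to keep the arithmetic visible.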
As the initial step of the procedure, a total of n clusters with a single element each is created, where n is the number of observations. The mean of each of these one-element clusters is identical to its single observation. In the first stage of the algorithm, two elements are combined into one cluster such that the error sum of squares (ESS) grows as little as possible. Merging the two closest observations in the dataset achieves this. At each subsequent stage, the merger that causes the smallest rise in ESS is carried out.
This keeps the separation between the observations and the cluster centres as small as possible. The procedure is repeated until a single cluster contains all the observations.
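The merging procedure can be sketched as a naive implementation, written for clarity rather than speed. It assumes one-dimensional observations and stops once a requested number of clusters remains instead of continuing to a single cluster. The function `ward_clusters` and the AQI values are hypothetical illustrations, not the authors' code or data.

```python
from itertools import combinations
from statistics import fmean


def ess(values):
    """Error sum of squares of one cluster around its own mean."""
    mean = fmean(values)
    return sum((x - mean) ** 2 for x in values)


def ward_clusters(points, k):
    """Start from one-element clusters and repeatedly merge the pair whose
    merger increases the total ESS the least, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Increase in total ESS caused by merging clusters i and j.
        def cost(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return ess(a + b) - ess(a) - ess(b)

        i, j = min(combinations(range(len(clusters)), 2), key=cost)
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Made-up mean AQI values for eight stations.
station_aqi = [95.0, 97.0, 134.0, 136.0, 172.0, 174.0, 226.0, 228.0]
groups = ward_clusters(station_aqi, 4)
```

Recording the ESS increase at every merge would give the heights needed to draw a dendrogram. For real data, a library routine would be preferable to this quadratic search over pairs.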

Discriminant Analysis (DA)
Discriminant analysis is a statistical method for data analysis where the dependent variable is categorical and the independent variables are interval variables.
Once the analysis sample has been determined, the discriminant function coefficients can be estimated. There are two major strategies available. In the direct method, the discriminant function is estimated with all the predictors included at once, regardless of each variable's discriminating power.
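As a minimal sketch of the direct method, the example below fits a linear discriminant rule for a single predictor with a pooled within-group variance, then checks how many training observations are assigned to their own group. The function names, the group labels used as dictionary keys and the AQI values are assumptions made for illustration. The study itself used six pollutant variables and presumably statistical software.

```python
from math import log
from statistics import fmean


def fit_lda_1d(groups):
    """Estimate group means, pooled within-group variance and priors."""
    n_total = sum(len(v) for v in groups.values())
    means = {label: fmean(v) for label, v in groups.items()}
    ss_within = sum(
        (x - means[label]) ** 2 for label, v in groups.items() for x in v
    )
    pooled_var = ss_within / (n_total - len(groups))
    priors = {label: len(v) / n_total for label, v in groups.items()}
    return means, pooled_var, priors


def classify(x, means, pooled_var, priors):
    """Assign x to the group with the largest linear discriminant score."""
    def score(label):
        m = means[label]
        return x * m / pooled_var - m * m / (2 * pooled_var) + log(priors[label])

    return max(means, key=score)


# Made-up AQI observations for four hypothetical clusters.
training = {
    "LPR": [94.0, 98.0],
    "MPR": [133.0, 137.0],
    "HPR": [171.0, 175.0],
    "VHPR": [225.0, 229.0],
}
model = fit_lda_1d(training)
hits = sum(
    classify(x, *model) == label for label, v in training.items() for x in v
)
accuracy = hits / sum(len(v) for v in training.values())
```

The resubstitution accuracy computed here is optimistic by construction, because the same observations are used for fitting and for checking.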
The stepwise method is an alternative strategy. In stepwise discriminant analysis, the predictor variables are entered sequentially, based on their ability to discriminate between groups. 9

Data Collection
Six primary air pollutant components' daily data (PM10, PM

2 reveals that the average pollution levels at VHPR are twice the standard value and that the AQI values at stations under HPR are well above the standard value. Residents of the Talcher coalfield and the surrounding area continued to experience poor air quality because PM2.5 and PM10 were present above the allowable limits. 10 However, the AQI values of the Jaipur, Amritsar, Brajrajnagar, Kolkata and Visakhapatnam stations, which fall under MPR, are just above the standard values. At Jaipur and Amritsar this may be due to stubble burning and wildfires. Kolkata and Visakhapatnam are densely populated and highly industrialized, resulting in unhealthy AQI. Pollution in Brajrajnagar may be due to the location of the Orient Paper mills. Areas under LPR have an average AQI of 96, which is just below the unhealthy AQI (100). Pollution in these areas is being controlled either naturally or by local authorities. The present study aims to identify the sources at stations that fall under VHPR and HPR, because the hazardous levels of pollution there cause severe ill-health among the inhabitants. Excessive population density is one more factor responsible for the increase in air pollution in these cities. An additional problem is the increasing amount of dust coming from the Thar Desert due to the ongoing destruction of the Aravalli range.
In recent years, Patna, the capital city of Bihar, has seen an increase in pollution levels, as have other cities in India. Traffic is the main contributor to the high concentration of particulate matter in Patna, which is the major air quality issue there, whereas other pollutants such as NO2 and SO2 are within the NAAQ standards. Air quality has been noted to deteriorate drastically during the winter months due to the condensation of fine particulate matter in the lower parts of the atmosphere.

Pollution Causing Sources in Jorapokhar and Talcher
In India, coal is the main fossil fuel utilised to generate electricity. Mining operations are becoming increasingly busy because of the increased demand for coal. Coal mining-related activities have a severe negative influence on the environment, including changes to the landform, land use/land cover, and distribution of flora. Jorapokhar, a neighbourhood town in Dhanbad (the second most populated city in Jharkhand state), and Talcher in Odisha state are small towns but have extensive coal mines and fertilizer units. Dhanbad, known for its rich coal reserves and industries, is the most polluted city in India. High levels of SPM and dust are a severe issue in the mining districts of Talcher. Suspended particulate matter in the Talcher region of Odisha had risen to an alarming level of 1848 kg/km², while the levels of nitrogen dioxide (NO2) are generally within the permitted limits in coal mining areas. 12 According to calculations, the air quality index (AQI) for areas affected by coal mine fires is roughly 1.5 times higher than for places not affected by mine fires.

The contribution of each pollutant in regions corresponding to different pollution levels is illustrated in the box-and-whisker plots shown in Fig. 4. The concentrations of PM2.5 and PM10 in Fig. 4a and Fig. 4b show a gradual increase from LPR to VHPR. This indicates that PM2.5 and PM10 are the major pollutants contributing to AQI. Not much variation in CO was found across the regions. The contributions of O3 and NO2 at HPR are very low compared with other regions; however, SO2 is found to be high. Sulphur dioxide resulting from coal combustion is the major source of air pollution at Talcher. Gopinath et al 15

Conclusions
There is a serious need to act quickly on the increasing air pollution levels in India's cities.
This motivated the authors to attempt to pinpoint the sources of various air contaminants at sites with differing environmental conditions. A cluster algorithm helps in grouping the data into homogeneous clusters sharing an underlying property. Using HACA, selected monitoring stations were grouped into clusters, and the results were verified by DA.
It was noticed that stations with the same environmental conditions, local sources and pollution levels were grouped into the same clusters. Despite numerous steps taken by the authorities to reduce air pollution, it was found that the regions within the same cluster share comparable local sources that significantly contribute to the growth in pollution levels. Among the clusters formed, urban regions with large populations and heavy traffic are the most highly polluted (VHPR), with an average AQI greater than 200, which is hazardous to residents. The high level of air pollution in these regions (VHPR) may be due to their proximity to stubble burning areas and the desert. The major contribution to air pollution in the regions falling under HPR, with AQI greater than 150, is the existence of coal mines and the combustion of coal by fertilizer industries. The direct dependence of AQI on particulate matter was another aspect observed in this study. This work highlights the prominence of local sources in raising the levels of air pollution.
The findings aid the authorities concerned in applying different measures at the various study sites, such as: implementing pollution control systems in industries and mines, conducting awareness campaigns, staggering the operating hours of offices and educational institutions, promoting less-polluting alternatives to stubble burning, encouraging the use of electric vehicles, constructing the road infrastructure required to avoid traffic congestion, implementing stringent rules and imposing heavy fines on defaulters.

Fig. 3: AQI monitoring stations located in the major cities of India, depicted as clusters generated from stepwise DA

Table 1: F-values of six air pollution variables using ANOVA (column groups: Cluster, Error)
Initially, the data from all the stations were subjected to the K-means cluster algorithm to identify the number of clusters suitable for the analysis. Based on the F-values of the air pollution variables in the ANOVA table, four clusters were chosen. The higher the F-value, the stronger the variable's contribution to separating the clusters. It was observed from Table 1 that the contributions of all the variables except CO were significant.
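The F-value used to judge each variable can be illustrated with a one-way ANOVA computed by hand: the between-cluster mean square divided by the within-cluster (error) mean square. The function `anova_f` and the PM2.5 values are hypothetical and serve only to show the calculation, not to reproduce Table 1.

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)  # "Cluster" mean square
    ms_within = ss_within / (n - k)    # "Error" mean square
    return ms_between / ms_within


# Made-up PM2.5 values for three illustrative clusters.
pm25_by_cluster = [[40.0, 44.0], [60.0, 64.0], [90.0, 94.0]]
f_value = anova_f(pm25_by_cluster)
```

A variable whose values barely differ between clusters, as reported for CO, would give a small F because its between-cluster mean square is small relative to the error mean square.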
2.5, CO, O3, NO2 and SO2) was downloaded along with AQI from openly accessible data at Kaggle.com for a two-year period (January 2018 to December 2019) from 16 AQI monitoring stations [listed in Fig. 1] located in the major cities of India. These AQI stations were selected based on consistent availability of the data. To do the cluster analysis for this study, stations were selected based on their

: Dendrogram showing AQI monitoring stations located in the major cities of India.
The clustering produced four convincing clusters, as the stations within each group share homogeneous traits.

Table 4: Wilks' Lambda and Chi-square statistic F of three AQI levels
14 The range of the scale is 0 to 1, with 1 denoting complete lack of discrimination. 14 Smaller values of Lambda in Table 4 emphasize the significant difference between the levels. The Chi-square statistic F, with significance level < 0.01, indicates that the corresponding function explains the group membership of each level well. Clusters validated by discriminant analysis are depicted in Fig. 3.
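For a single variable, Wilks' Lambda reduces to the ratio of the within-group sum of squares to the total sum of squares, which makes the 0 to 1 scale easy to see. The sketch below computes this univariate form together with Bartlett's chi-square approximation. It is an assumption that the Chi-square statistic in Table 4 was obtained this way. The function names and data are hypothetical.

```python
from math import log
from statistics import fmean


def wilks_lambda_1d(groups):
    """Univariate Wilks' Lambda: within-group SS divided by total SS."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return ss_within / ss_total


def bartlett_chi_square(lam, n_obs, n_vars, n_groups):
    """Bartlett's approximation: -(N - 1 - (p + g) / 2) * ln(Lambda)."""
    return -(n_obs - 1 - (n_vars + n_groups) / 2) * log(lam)


# Made-up values for three well-separated groups.
groups = [[40.0, 44.0], [60.0, 64.0], [90.0, 94.0]]
lam = wilks_lambda_1d(groups)
chi_square = bartlett_chi_square(lam, n_obs=6, n_vars=1, n_groups=3)
```

Well-separated groups give a Lambda close to 0 and a large chi-square value, while groups with identical means give a Lambda of 1.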
16 used the field emission scanning electron microscope and observed major sulphide minerals in Talcher. Bhanu Pandey et al 16 reported high values of SO2 in the coal mines area in the Jharia coal fields near Jorapokhar. The high concentration of SO2 is due to the production of urea, using coal as a feedstock, by the Fertilizer Corporation of India Ltd (FCIL) established in Talcher. https://tflonline.co.in/about.html.