This repository contains a set of scripts that train a tree-based machine-learning (ML) model to assess the vulnerability of households in a country to droughts.
- Household data: high-quality household budget survey data (the Household Expenditure and Income Survey, HEIS) for Iran is collected on an annual basis from the publicly available website of the country's statistical agency. Importantly, the data contains consumption variables (e.g. “pcer” and “pcer_17”) and poverty line values (e.g. “pline685”).
- Geospatial data: climate and drought-related data is collected from several sources to capture the drought hazard dynamics in the selected country (i.e. Iran). The data is extracted at the ADM2 level and provided as monthly mean aggregates covering the period 2011-2020. Climate-related variables are taken from the ERA5-Land dataset and include both temperature and precipitation. Drought-related variables are taken from several sources to help categorize drought events as meteorological, hydrological, or agricultural. They include the Agricultural Stress Index (ASI) provided by the Global Information and Early Warning System on Food and Agriculture (GIEWS) of the Food and Agriculture Organization of the United Nations (FAO), the Self-calibrating Palmer Drought Severity Index (scPDSI) provided by CRU_ts4.06, the Standardized Precipitation-Evapotranspiration Index (SPEI) provided by the Consejo Superior de Investigaciones Científicas (CSIC), and the Standardized Precipitation Index (SPI) provided by the International Research Institute for Climate and Society (IRI).
The objective of this analysis is two-fold:
- first, to train a model capable of accurately classifying a household as below or above a given poverty line value; and
- second, to understand whether any relationship (in terms of correlation) exists between the input household, climate, and drought-related data and the target poverty data.
Building on the hypothesis that a single model can achieve both objectives, we explore the use of a decision-tree-based model to predict the probability that a consumption variable (e.g. “pcer” and “pcer_17”) is above or below the poverty line (e.g. “pline685”).
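As a minimal sketch of this setup in R, the snippet below derives a binary poverty label from consumption and the poverty line, then fits a gradient-boosted tree classifier with the `xgboost` package that returns a probability of being poor. The data is synthetic; the feature names `hh_size`, `urban`, and `spei`, the hyperparameters, and the choice to treat consumption strictly below the line as "poor" are all illustrative assumptions. It is not the repository's actual training code.

```r
library(xgboost)

set.seed(1)
n <- 200

# Synthetic household records (hypothetical values, for illustration only).
households <- data.frame(
  pcer_17  = runif(n, min = 300, max = 1200),  # consumption variable
  pline685 = 685,                              # poverty line value
  hh_size  = sample(1:8, n, replace = TRUE),   # assumed feature name
  urban    = rbinom(n, 1, 0.6),                # assumed feature name
  spei     = rnorm(n)                          # assumed drought indicator
)

# Binary target: 1 = "poor", 0 = "not-poor".
# Assumption: a household is poor when consumption is strictly below the line.
households$poverty <- as.integer(households$pcer_17 < households$pline685)

# Consumption itself is left out of the features because it defines the label.
features <- as.matrix(households[, c("hh_size", "urban", "spei")])
dtrain   <- xgb.DMatrix(data = features, label = households$poverty)

# Binary classification objective; predictions are probabilities of "poor".
model <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 3, eta = 0.3),
  data    = dtrain,
  nrounds = 20
)

prob_poor <- predict(model, dtrain)
```

With random features the probabilities carry no real signal; the point is only the shape of the problem, a 0/1 label predicted as a probability from household and geospatial inputs.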
Four sequential scripts have been developed in the R programming language to achieve the above-mentioned objectives.
- Script "00.All_fTrainModel_StataData.R": This script creates the consolidated databases needed to train an ML model and trains the selected model type. Data for the consumption variables (“pcer” and “pcer_17”) is extracted at the household level from the file “drought_adm2_merged.dta”. Household data on welfare measurement and the pricing of basic services, including variables such as level of education, marital status, female head, urban, consumption levels, and number of persons per household (household size), is then combined with the climate- and drought-related information. Each household record is then assigned a Boolean variable called “poverty”, which takes one of two values, “poor” or “not-poor”, by comparing its consumption level (variable “pcer_17”) with the poverty line given by the variable “pline685”. The consolidated database of household and geospatial data is used to feed an eXtreme Gradient Boosting (XGBoost) model. The model is applied to a binary classification problem, with the objective of classifying the variable “poverty” as a function of the household and geospatial data. The script outputs a series of files containing the trained models, along with charts in both .pdf and .png formats.
- Script "01.AugmentDataPCA_ADM2.R": This script bootstraps the climate- and drought-related data. To assess how climate and drought-related information can influence poverty in Iran, a bootstrapped database of 10,000 samples of climate- and drought-related data is created and then used to simulate the poverty classification of each individual household in Iran for each of the 10,000 bootstrapped sets. The goal is a resampled dataset of 10,000 samples that represents, for each household, the variability of the climate and drought-related variables over the period between 2012 and 2019. To ensure that the resampled variables are linearly independent, the climate and drought-related data is first transformed by principal component analysis (PCA) into a feature space of orthogonal principal components, essentially generating a new set of linearly independent variables, and only then bootstrapped.
- Script "02.SimulateBoostrappedData.R": This script feeds the data generated by script "01.AugmentDataPCA_ADM2.R" as new input to the ML models trained in "00.All_fTrainModel_StataData.R" to generate synthetic vulnerability results that show how climate- and drought-related events can affect the vulnerability of households in Iran. The simulations are run for the year 2019 only. This year was chosen because it precedes the COVID-19 pandemic, so less influence from the pandemic is expected in the household dataset.
- Script "03.CreatePlots.R": This script reads the results generated by the previous scripts and produces a set of outputs in the form of images, plots, and tables.
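The PCA-then-bootstrap idea described for "01.AugmentDataPCA_ADM2.R" can be sketched in base R as follows. This is a sketch under stated assumptions, not the script itself: the three indicator columns and their values are synthetic, the data is treated as 96 monthly records for a single ADM2 unit, and resampling each principal component independently before mapping back to the original variables is one plausible reading of the description rather than a confirmed detail.

```r
set.seed(42)

# Synthetic monthly indicators for one hypothetical ADM2 unit
# (96 months, i.e. 2012-2019); temperature and precipitation are correlated.
n_months <- 96
temp <- rnorm(n_months, mean = 18, sd = 8)
climate <- cbind(
  temperature   = temp,
  precipitation = 40 - 1.2 * temp + rnorm(n_months, sd = 6),
  spei          = rnorm(n_months)
)

# 1. PCA: rotate the standardized variables into orthogonal components.
pca    <- prcomp(climate, center = TRUE, scale. = TRUE)
scores <- pca$x  # one column per principal component, mutually uncorrelated

# 2. Bootstrap: resample each component with replacement.
#    Assumption: components are drawn independently of one another.
n_boot <- 10000
boot_scores <- apply(scores, 2, function(pc) {
  sample(pc, size = n_boot, replace = TRUE)
})

# 3. Map the resampled scores back to the original variables:
#    undo the rotation, then the scaling, then the centering.
boot_scaled  <- boot_scores %*% t(pca$rotation)
boot_climate <- sweep(boot_scaled, 2, pca$scale, "*")
boot_climate <- sweep(boot_climate, 2, pca$center, "+")
colnames(boot_climate) <- colnames(climate)
```

Each row of `boot_climate` could then be combined with a household's own characteristics and passed to a trained classifier, giving one simulated poverty classification per bootstrap sample.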