Rice Paddy Methane Emissions Estimation: Part 1


Methane Emissions Estimation Data Part 1: A Comparison between FAOSTAT and University of Malaysia Estimates

This post documents the data exploration phase of a project that determines whether global methane emissions produced by rice paddies are undercounted.

It is fairly code python and pandas heavy.

The code and data exploration follows the summary below.

Inspiration for this work

The University of Malaysia in partnership with Climate TRACE release a paper stating that the UN undercounts rice paddy methane emissions by about 16%. Upon review of their claims, I decided to test the data myself across a 5 year distribution to verify that the claims hold up across multiple distributions.

University of Malaysia Methodological Deficiencies

  • Comparing a 2019 distribution FAOSTAT to a 2020 distribution.
  • Rounding to 4 signficant figures.
  • Testing only one year of emissions data.
  • Did not publish data to hypothesis testing to determine if emission distributions significantly vary annually and between states.
  • Relying only on satellite data may undercount hectares in cultivation at higher altitudes.

FaoStat Methodological Deficiencies

  • Relies upon official government statistics which can be manipulated at any point along the bureaucratic paper chain.
  • There is an incentive for certain nations to reduce their counts in order to receive international aid and to meet emissions standards.

University of Malaysia acknowledging deficiencies (like all good papers should)

“The difference between harvested rice cultivation area from statistical data and remote-sensing estimates can be due to two factors: (i) MODIS data which have moderate spatial resolution lead to mixed pixels, where rice fields and non-rice fields are combined. This can overestimate area, especially in lowland regions and have a low ability to detect small rice field patches in upland regions (Frolking et al 1999, Seto et al 2000); and (ii) political and policy factors (Yan et al., 2019) such as determination of the amount of subsidies for fertilizers and evaluation of achievement of government programs in the agricultural sector. Other factors that contribute to discrepancy in CH4 emission are from different emission and scale factors that are related to water regime and organic amendment. These values give high uncertainty since the availability of these data are limited and quite variable.”

My Initial Impressions and findings

Initial

The percent difference and the tonnage difference do not support each other. I need to recalculate the totals section to ensure that we are doing things correctly.

I need to confirm the values, but I’m initially impressed by the fact that the FAOSTAT data reports higher values than the Malaysia data on average. According to the included paper this should not be the case.

Verified

I calculated differences totals and means per DataFrame to ensure accuracy prior to the join. I also dropped pre-calcuated values when joining to ensure that the aggregation algorithms to not modify the results.

The data is now consistent and supports the findings of the University of Malaysia Paper. With that said it is important to note that the differences recorded are far smaller than suggested.

Import Dependencies

import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import folium
import contextily as cx
from shapely.geometry import Point, LineString, Polygon
import numpy as np
from scipy.spatial import cKDTree
from geopy.distance import distance
import scipy.stats as stats

Dependencies

  • Geopandas
  • pandas
  • openpyxl
  • Shapely
  • geopy Distance
  • numpy

Exploration Plan

Data Imports

  • /Users/jnapolitano/Projects/wattime-takehome/data/ch4_2015-2021.xlsx
  • /Users/jnapolitano/Projects/wattime-takehome/data/emissions_csv_fao_emiss_csv_ch4_fao_2015_2019_tonnes.xlsx

Import Data Frames

Since jupyter caches the data to the notebook json I can import the dataframes that I will be using together.

If I were to build automated scripts to perform the analysis I would only load the data necessary to perform a process.

Experiment with Plots for each Set

I don’t know exactly which plots I want to include in the final report.

I ’ll plot a few for each data set

Calculate differences between the datasets

  • create a differences data frame
  • write to file for use
  • plot

University of Malaysia Emission Estimates

filepath = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/ch4_2015-2021.xlsx"

malaysia_emissions_df = pd.read_excel(filepath)
malaysia_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Calculate Co2 Equivalency

malaysia_emissions_df['tCO2_2015'] = (malaysia_emissions_df['tCH4_2015'] * 25)
malaysia_emissions_df['tCO2_2016'] = (malaysia_emissions_df['tCH4_2016'] * 25)
malaysia_emissions_df['tCO2_2017'] = (malaysia_emissions_df['tCH4_2017'] * 25)
malaysia_emissions_df['tCO2_2018'] = (malaysia_emissions_df['tCH4_2018'] * 25)
malaysia_emissions_df['tCO2_2019'] = (malaysia_emissions_df['tCH4_2019'] * 25)

Calculate Means

malaysia_emissions_df.loc['mean'] = malaysia_emissions_df.loc[(malaysia_emissions_df['country_name'] != "Total")].select_dtypes(np.number).mean()
malaysia_emissions_df.at['mean','country_name'] = 'mean'
malaysia_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Calculate Means and Totals Across Rows

malaysia_emissions_df["Mean_CH4"] = mean_series
malaysia_emissions_df['Total_CH4'] = total_series 
## the select np.number is uncecessary, but i'm including anyways as it doesnt really hurt but for a small calculation penalty
malaysia_emissions_df["Mean_CO2"] = mean_series
malaysia_emissions_df['Total_CO2'] = total_series 
malaysia_emissions_df.reset_index(inplace=True, drop = True)
malaysia_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Write Data to File

outfile = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/TRACE_DATA.csv"

malaysia_emissions_df.to_csv(outfile)

University of Malaysia Plots

University of Malaysia Bar Plot

malaysia_emissions_df.plot(kind = "barh", x = 'country_name', xlabel = "Country Name", ylabel = "CH4 Tonnes", figsize = (10,5))
<AxesSubplot:ylabel='Country Name'>

png

University of Malaysia Density Plot

malaysia_emissions_df.plot(rot = 0, kind = "density", figsize = (15,5)) 
<AxesSubplot:ylabel='Density'>

png

I did not exclude totals or mean from the dataframe, but as we can see the second hump in the density graph shows the distribution of totals annualy. Interestingly the 2020 data is shifted further to the right than other years. This actually questions the validity of the study promoted by the University of malaysia

FAOSTAT Data

filepath = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/emissions_csv_fao_emiss_csv_ch4_fao_2015_2019_tonnes.xlsx"

faostat_emissions_df = pd.read_excel(filepath)
## I didn't write the index to the csv file in the previous step.  IF time permits go back and fix this error
faostat_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Change code to iso3_country


faostat_emissions_df.rename(columns={"code": "iso3_country"}, inplace =True)
faostat_emissions_df.rename(columns={"country": "country_name"}, inplace =True)
# The column title is not a string.  It is understood as an int or a datetime.  
#faostat_emissions_df['2015']
faostat_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Set country_name total to total

faostat_emissions_df.at[23,'country_name'] = 'Total'

Drop Fao Country Code

faostat_emissions_df.drop(labels = ['country_fao'], axis=1, inplace=True)
faostat_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Calculate Co2 Equivalency

faostat_emissions_df['tCO2_2015'] = faostat_emissions_df[2015] * 25
faostat_emissions_df['tCO2_2016'] = faostat_emissions_df[2016] * 25
faostat_emissions_df['tCO2_2017'] = faostat_emissions_df[2017] * 25
faostat_emissions_df['tCO2_2018'] = faostat_emissions_df[2018] * 25
faostat_emissions_df['tCO2_2019'] = faostat_emissions_df[2019] * 25

Calculate Means

faostat_emissions_df.loc['mean'] = faostat_emissions_df.loc[(faostat_emissions_df['country_name'] != "Total")].select_dtypes(np.number).mean()
faostat_emissions_df.at['mean','country_name'] = 'mean'
faostat_emissions_df.reset_index(inplace=True, drop=True)
#faostat_emissions_df.at['mean','country_fao'] = 'mean'

Calculate Means and Totals Across Rows

faostat_emissions_df["Mean_CH4"] = mean_series
faostat_emissions_df['Total_CH4'] = total_series 
## the select np.number is uncecessary, but i'm including anyways as it doesnt really hurt but for a small calculation penalty
faostat_emissions_df["Mean_CO2"] = mean_series
faostat_emissions_df['Total_CO2'] = total_series 
faostat_emissions_df.reset_index(inplace=True, drop=True)
faostat_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

FAOSTAT Data to File

outfile = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/FAOSTAT_DATA.csv"

faostat_emissions_df.to_csv(outfile)

FaoSTAT PLOTS

FAOSTAT Hectare Estimates Bar Plot

faostat_emissions_df.plot(kind = "barh", x = 'country_name', y = [2015, 2016, 2017, 2018, 2019], xlabel = "Country Name", ylabel = "Tonnes CH4", figsize = (10,5))
<AxesSubplot:ylabel='Country Name'>

png

FAOSTAT Density Plot

faostat_emissions_df.plot(rot = 90, kind = "density",y = [2015, 2016, 2017, 2018, 2019], figsize = (15,5)) 
<AxesSubplot:ylabel='Density'>

png

The density plot is fairly consistent. There is nearly no variation between nations and in total. The 2020 data may show otherwise as the Malaysian data shows.

Join Df’s by ISO3 Country

Drop totals and means from the original df.

Because I am joining on iso3 country country code if the totals and means are located at different indexes we may experience merge and calculation errors

faostat_emissions_df = faostat_emissions_df[(faostat_emissions_df["country_name"] != "Total") & (faostat_emissions_df['country_name'] != 'mean')].copy()
malaysia_emissions_df = malaysia_emissions_df[(malaysia_emissions_df["country_name"] != "Total") & (malaysia_emissions_df['country_name'] != 'mean')].copy()
faostat_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

malaysia_emissions_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

merged_df = faostat_emissions_df.merge(malaysia_emissions_df,suffixes=('_FAOSTAT', '_TRACE'), on='iso3_country', how='left', sort=False)

Dropping 2020 and 2021 from the data sets

I will only compare data compiled from the same year.

merged_df.drop([2020, 2021, "tCH4_2020","tCH4_2021"], axis = 1, inplace = True)
merged_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Calculate difference in Ch4 Tonnes Between the Estimates

# Calculate Difference in tons
merged_df['CH4_diff_2015'] = merged_df[2015] - merged_df['tCH4_2015']
merged_df['CH4_diff_2016'] = merged_df[2016] - merged_df['tCH4_2016']
merged_df['CH4_diff_2017'] = merged_df[2017] - merged_df['tCH4_2017']
merged_df['CH4_diff_2018'] = merged_df[2018] - merged_df['tCH4_2018']
merged_df['CH4_diff_2019'] = merged_df[2019] - merged_df['tCH4_2019']
merged_df['CH4_diff_means'] = merged_df['Mean_CH4_FAOSTAT'] - merged_df['Mean_CH4_TRACE']
merged_df['CH4_diff_totals'] = merged_df['Total_CH4_FAOSTAT'] - merged_df['Total_CH4_TRACE']
merged_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Calculate difference in C02 Tonnes Between the Estimates

# Calculate Difference in tons
merged_df['CO2_diff_2015'] = merged_df['tCO2_2015_FAOSTAT'] - merged_df['tCO2_2015_TRACE']
merged_df['CO2_diff_2016'] = merged_df['tCO2_2016_FAOSTAT'] - merged_df['tCO2_2016_TRACE']
merged_df['CO2_diff_2017'] = merged_df['tCO2_2017_FAOSTAT'] - merged_df['tCO2_2017_TRACE']
merged_df['CO2_diff_2018'] = merged_df['tCO2_2018_FAOSTAT'] - merged_df['tCO2_2018_TRACE']
merged_df['CO2_diff_2019'] = merged_df['tCO2_2019_FAOSTAT'] - merged_df['tCO2_2019_TRACE']
merged_df['CO2_diff_means'] = merged_df['Mean_CO2_FAOSTAT'] - merged_df['Mean_CO2_TRACE']
merged_df['CO2_diff_totals'] = merged_df['Total_CO2_FAOSTAT'] - merged_df['Total_CO2_TRACE']

Calculating the CH4 Percent Differences Between the Estimates

## Calculate Percent Differnces on this data set )*100
# With raw data i could have accomplished this with a groupby.aggregate(lambda x ), however the pivot tables given are not easy to apply #vectorized functions across time series
merged_df['CH4_abs_percent_diff_2015'] = ((abs(merged_df[2015] - merged_df['tCH4_2015']))/((merged_df[2015] + merged_df['tCH4_2015'])/2))*100
merged_df['CH4_abs_percent_diff_2016'] = ((abs((merged_df[2016] - merged_df['tCH4_2016']))/((merged_df[2016] + merged_df['tCH4_2016'])/2)))*100
merged_df['CH4_abs_percent_diff_2017'] = ((abs(merged_df[2017] - merged_df['tCH4_2017']))/((merged_df[2017] + merged_df['tCH4_2017'])/2))*100
merged_df['CH4_abs_percent_diff_2018'] = (abs((merged_df[2018] - merged_df['tCH4_2018']))/((merged_df[2018] + merged_df['tCH4_2018'])/2))*100
merged_df['CH4_abs_percent_diff_2019'] = (abs((merged_df[2019] - merged_df['tCH4_2019']))/((merged_df[2019] + merged_df['tCH4_2019'])/2))*100
merged_df['CH4_abs_percent_diff_means'] = (abs((merged_df['Mean_CH4_FAOSTAT'] - merged_df['Mean_CH4_TRACE']))/((merged_df['Mean_CH4_FAOSTAT'] + merged_df['Mean_CH4_TRACE'])/2))* 100
merged_df['CH4_abs_percent_diff_totals'] = (abs((merged_df['Total_CH4_FAOSTAT'] - merged_df['Total_CH4_TRACE']))/((merged_df['Total_CH4_TRACE'] + merged_df['Total_CH4_FAOSTAT'])/2))*100


merged_df['CH4_relative_percent_diff_2015'] = ((merged_df[2015] - merged_df['tCH4_2015'])/(merged_df[2015]))*100
merged_df['CH4_relative_percent_diff_2016'] = ((merged_df[2016] - merged_df['tCH4_2016'])/(merged_df[2016]))*100
merged_df['CH4_relative_percent_diff_2017'] = ((merged_df[2017] - merged_df['tCH4_2017'])/(merged_df[2017]))*100
merged_df['CH4_relative_percent_diff_2018'] = ((merged_df[2018] - merged_df['tCH4_2018'])/(merged_df[2018]))*100
merged_df['CH4_relative_percent_diff_2019'] = ((merged_df[2019] - merged_df['tCH4_2019'])/(merged_df[2019]))*100
merged_df['CH4_relative_percent_diff_means'] = ((merged_df['Mean_CH4_FAOSTAT'] - merged_df['Mean_CH4_TRACE'])/(merged_df["Mean_CH4_FAOSTAT"]))*100
merged_df['CH4_relative_percent_diff_totals'] = ((merged_df['Total_CH4_FAOSTAT'] - merged_df['Total_CH4_TRACE'])/(merged_df["Total_CH4_FAOSTAT"]))*100

Calculate CO2 Differences

## Calculate Percent Differnces on this data set )*100
# With raw data i could have accomplished this with a groupby.aggregate(lambda x ), however the pivot tables given are not easy to apply #vectorized functions across time series
merged_df['CO2_abs_percent_diff_2015'] = (abs((merged_df['tCO2_2015_FAOSTAT'] - merged_df['tCO2_2015_TRACE']))/((merged_df['tCO2_2015_TRACE'] + merged_df['tCO2_2015_FAOSTAT'])/2))*100
merged_df['CO2_abs_percent_diff_2016'] = ((abs(merged_df['tCO2_2016_FAOSTAT'] - merged_df['tCO2_2016_TRACE']))/((merged_df['tCO2_2016_TRACE'] + merged_df['tCO2_2016_FAOSTAT'])/2))*100
merged_df['CO2_abs_percent_diff_2017'] = ((abs(merged_df['tCO2_2017_FAOSTAT'] - merged_df['tCO2_2017_TRACE']))/((merged_df['tCO2_2017_TRACE'] + merged_df['tCO2_2017_FAOSTAT'])/2))*100
merged_df['CO2_abs_percent_diff_2018'] = ((abs(merged_df['tCO2_2018_FAOSTAT'] - merged_df['tCO2_2018_TRACE']))/((merged_df['tCO2_2018_TRACE'] + merged_df['tCO2_2018_FAOSTAT'])/2))*100
merged_df['CO2_abs_percent_diff_2019'] = ((abs(merged_df['tCO2_2019_FAOSTAT'] - merged_df['tCO2_2019_TRACE']))/((merged_df['tCO2_2019_TRACE'] + merged_df['tCO2_2019_FAOSTAT'])/2))*100
merged_df['CO2_abs_percent_diff_means'] = ((abs(merged_df['Mean_CO2_FAOSTAT'] - merged_df['Mean_CO2_TRACE']))/((merged_df['Mean_CO2_FAOSTAT'] + merged_df['Mean_CO2_TRACE'])/2))* 100
merged_df['CO2_abs_percent_diff_totals'] = ((abs(merged_df['Total_CO2_FAOSTAT'] - merged_df['Total_CO2_TRACE']))/((merged_df['Total_CO2_TRACE'] + merged_df['Total_CO2_FAOSTAT'])/2))*100


merged_df['CO2_relative_percent_diff_2015'] = ((merged_df['tCO2_2015_FAOSTAT']  - merged_df['tCO2_2015_TRACE'])/(merged_df['tCO2_2015_FAOSTAT']))*100
merged_df['CO2_relative_percent_diff_2016'] = ((merged_df['tCO2_2016_FAOSTAT']  - merged_df['tCO2_2016_TRACE'])/(merged_df['tCO2_2016_FAOSTAT']))*100
merged_df['CO2_relative_percent_diff_2017'] = ((merged_df['tCO2_2017_FAOSTAT']  - merged_df['tCO2_2017_TRACE'])/(merged_df['tCO2_2017_FAOSTAT']))*100
merged_df['CO2_relative_percent_diff_2018'] = ((merged_df['tCO2_2018_FAOSTAT']  - merged_df['tCO2_2018_TRACE'])/(merged_df['tCO2_2018_FAOSTAT']))*100
merged_df['CO2_relative_percent_diff_2019'] = ((merged_df['tCO2_2019_FAOSTAT']  - merged_df['tCO2_2019_TRACE'])/(merged_df['tCO2_2019_FAOSTAT']))*100
merged_df['CO2_relative_percent_diff_means'] = ((merged_df["Mean_CO2_FAOSTAT"] - merged_df['Mean_CO2_TRACE'])/(merged_df["Mean_CO2_FAOSTAT"]))*100
merged_df['CO2_relative_percent_diff_totals'] = ((merged_df['Total_CO2_FAOSTAT'] - merged_df['Total_CO2_TRACE'])/(merged_df["Total_CO2_FAOSTAT"]))*100
merged_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Recalculate Means

merged_df.loc['mean'] = merged_df.select_dtypes(np.number).mean()
merged_df.at['mean','country_name_FAOSTAT'] = 'mean'
merged_df.at['mean','country_name_TRACE'] = 'mean'
merged_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Recalculate Totals

merged_df.loc['total'] = merged_df[merged_df['country_name_FAOSTAT'] != 'mean'].select_dtypes(np.number).sum()
merged_df.at['total','country_name_FAOSTAT'] = 'total'
merged_df.at['total','country_name_TRACE'] = 'total'
merged_df.reset_index(inplace=True, drop = True)
merged_df
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Merged Data to File

outfile = "/Users/jnapolitano/Projects/wattime-takehome/wattime-takehome/data/MERGED_DATA.csv"

merged_df.to_csv(outfile)

CO2 Difference Plots

merged_df.plot(kind = "barh", x = 'country_name_FAOSTAT', y = ["CO2_diff_2015", "CO2_diff_2016",	"CO2_diff_2017", "CO2_diff_2018", "CO2_diff_2019"], xlabel = "Country Name", ylabel = "Tonnes CH4", figsize = (10,10))
<AxesSubplot:ylabel='Country Name'>

png

Percent Difference Plot

merged_df.plot(kind = "barh", x = 'country_name_FAOSTAT', y = ["CO2_relative_percent_diff_2015", "CO2_relative_percent_diff_2016",	"CO2_relative_percent_diff_2017", "CO2_relative_percent_diff_2018", "CO2_relative_percent_diff_2019"], xlabel = "Country Name", ylabel = "Tonnes CH4", figsize=(10,10))
<AxesSubplot:ylabel='Country Name'>

png