This code takes the georeferenced points and relates them to a fishnet grid at a 1 km x 1 km scale.
This will be done for both the Australian plague locust and the desert locust.
In [1]:
import pandas as pd
import geopandas as gpd
import os
In [3]:
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
os.getcwd()
'/home/datascience/herbivore_nutrient_interactions'
In [4]:
# This function takes the point dataset and creates a sparse fishnet grid that overlaps with all points.
# There is a user-specified cell size and a few other handy parameters.
# It defaults to parallel processing, but this can be turned off.
# It creates a bounding box, builds a coarse grid, filters out unused coarse cells, and then, within each
# remaining coarse cell, builds the fine-scale grid at the specified cell size and again removes unused cells.
with open('scripts/functions/fishnet_grid_function.py') as f:
    code = f.read()
# Execute the code
exec(code)
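The real implementation lives in fishnet_grid_function.py and is executed above; the snippet below is only a minimal sketch of the coarse-then-fine idea it describes, assuming square cells in degrees and the pandas/geopandas imports already made here (it is not the actual create_grids_parallel).
# Illustrative sketch only -- not the real fishnet grid function.
import numpy as np
from shapely.geometry import box

def sparse_fishnet(points_gdf, fine_size=0.01, coarse_size=1.0):
    """Build fine cells only inside coarse cells that contain at least one point."""
    minx, miny, maxx, maxy = points_gdf.total_bounds
    # 1. Coarse grid over the bounding box of all points.
    coarse = gpd.GeoDataFrame(
        geometry=[box(x, y, x + coarse_size, y + coarse_size)
                  for x in np.arange(minx, maxx, coarse_size)
                  for y in np.arange(miny, maxy, coarse_size)],
        crs=points_gdf.crs)
    # 2. Drop coarse cells that contain no points.
    used = coarse.loc[gpd.sjoin(points_gdf, coarse, predicate='within')['index_right'].unique()]
    # 3. Fine grid inside each retained coarse cell, again dropping empty cells.
    cells = [box(x, y, x + fine_size, y + fine_size)
             for cell in used.geometry
             for x in np.arange(cell.bounds[0], cell.bounds[2], fine_size)
             for y in np.arange(cell.bounds[1], cell.bounds[3], fine_size)]
    fine = gpd.GeoDataFrame(geometry=cells, crs=points_gdf.crs)
    return fine.loc[gpd.sjoin(points_gdf, fine, predicate='within')['index_right'].unique()]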
Australian plague locust data
This dataset comprises several species, but the records are biased towards the more economically important species (e.g. the Australian plague locust).
Species identifier
The Species column is the per-record species identifier (a code-to-name mapping is sketched below the list):
- 10 = no locust/grasshopper found
- 11 = Australian plague locust (Chortoicetes terminifera)
- 12 = Spur-throated locust (Austracris guttulosa)
- 15 = Small plague locust (Austroicetes cruciata)
- 19 = Eastern plague grasshopper (Oedaleus australis)
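For readability, the codes can also be kept in a small lookup dictionary; the mapping below is only a convenience sketch and is not used elsewhere in this notebook.
# Convenience lookup of the APLC survey species codes (sketch only; not used below).
SPECIES_CODES = {
    10: 'no locust/grasshopper found',
    11: 'Australian plague locust (Chortoicetes terminifera)',
    12: 'Spur-throated locust (Austracris guttulosa)',
    15: 'Small plague locust (Austroicetes cruciata)',
    19: 'Eastern plague grasshopper (Oedaleus australis)',
}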
Species data management
For each species, we need to combine the records carrying that species' ID with the code-10 observations (nil observations) and with the nil observations recorded under the other species categories. This gives us every record where a specific grasshopper species was present, together with all possible nil observations. A sketch of this assembly is shown below.
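A minimal sketch of that assembly, assuming a raw survey table with the Species, Nymph.Density and Adult.Density columns seen later, and assuming that zero densities mark a nil record under the other species codes; the CT/OA/AG/AC CSVs read in the next cell were produced by the project's own preprocessing, not by this helper.
# Hypothetical helper illustrating the per-species assembly described above.
def assemble_species(raw, focal_code, all_codes=(11, 12, 15, 19)):
    focal = raw[raw['Species'] == focal_code]            # presence records for the focal species
    nil = raw[raw['Species'] == 10]                      # explicit "no locust/grasshopper found" records
    other_nil = raw[raw['Species'].isin([c for c in all_codes if c != focal_code]) &
                    (raw['Nymph.Density'] == 0) &
                    (raw['Adult.Density'] == 0)]         # nil records filed under the other species codes
    return pd.concat([focal, nil, other_nil], ignore_index=True)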
In [7]:
CT = pd.read_csv('data/raw/survey_data/CT.csv')
OA = pd.read_csv('data/raw/survey_data/OA.csv')
AG = pd.read_csv('data/raw/survey_data/AG.csv')
AC = pd.read_csv('data/raw/survey_data/AC.csv')
# Add a species column to each dataframe
CT['species'] = 'CT'
OA['species'] = 'OA'
AG['species'] = 'AG'
AC['species'] = 'AC'
# Combine all the dataframes
combined_df = pd.concat([CT, OA, AG, AC], ignore_index=True).drop(columns=['Longitude2', 'Latitude2','Unnamed: 0'])
combined_df
| | Source | Date | Longitude | Latitude | Species | Data.Quality | Nymph.Stage | Nymph.Density | Adult.Stage | Adult.Density | species |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7-Apr-17 | 143.144968 | -24.760067 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 1 | 1 | 7-Apr-17 | 143.276912 | -24.473368 | 10 | 6 | 0 | 0 | 0 | 0 | CT |
| 2 | 1 | 7-Apr-17 | 143.278365 | -24.643177 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 3 | 1 | 7-Apr-17 | 143.367765 | -24.354893 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 4 | 1 | 7-Apr-17 | 143.459170 | -24.319528 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 1 | 5-Jan-00 | 143.073900 | -25.521900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303830 | 1 | 5-Jan-00 | 143.307100 | -25.209900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303831 | 1 | 5-Jan-00 | 143.839400 | -25.032800 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303832 | 1 | 5-Jan-00 | 143.921900 | -24.971200 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303833 | 1 | 5-Jan-00 | 146.820000 | -31.751700 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
303834 rows × 11 columns
Fishnet grid construction
I first create the sparse fishnet grid (cells of 0.01 degrees, roughly 1 km x 1 km at these latitudes) and then aggregate the species data to the polygons. The aggregation is done separately for each species, not across species.
In [8]:
aplc_survey_dataframe = create_grids_parallel(
combined_df,
longitude_column='Longitude',
latitude_column='Latitude',
csv_crs='EPSG:4326',
grid_crs='EPSG:4326',
transformation_crs='EPSG:4326',
final_cell_size=0.01,
coarse_cell_size=None,
parallel=True
)
Processing Polygons: 100%|██████████| 1315/1315 [00:17<00:00, 73.10it/s]
In [9]:
# Add an index to the polygons
## this will allow us to join tables together (and will make working in GEE easier)
aplc_survey_dataframe = gpd.GeoDataFrame(aplc_survey_dataframe)
aplc_survey_dataframe['polygon_id'] = aplc_survey_dataframe.index
aplc_survey_dataframe.set_crs(epsg=4326, inplace=True)
| | geometry | polygon_id |
|---|---|---|
| 13 | POLYGON ((131.66364 -28.11974, 131.66364 -28.1... | 13 |
| 3072 | POLYGON ((133.20694 -26.35200, 133.20694 -26.3... | 3072 |
| 4563 | POLYGON ((133.18694 -26.24757, 133.18694 -26.2... | 4563 |
| 4620 | POLYGON ((133.19694 -26.07757, 133.19694 -26.0... | 4620 |
| 4626 | POLYGON ((133.19694 -26.01757, 133.19694 -26.0... | 4626 |
| ... | ... | ... |
| 2101240 | POLYGON ((151.10087 -27.06644, 151.10087 -27.0... | 2101240 |
| 2101580 | POLYGON ((151.18087 -26.86644, 151.18087 -26.8... | 2101580 |
| 2101656 | POLYGON ((151.20087 -26.90644, 151.20087 -26.8... | 2101656 |
| 2102441 | POLYGON ((151.00087 -26.66200, 151.00087 -26.6... | 2102441 |
| 2102562 | POLYGON ((151.03087 -26.65200, 151.03087 -26.6... | 2102562 |
67144 rows × 2 columns
Let's summarize the APL point data onto the polygon grid
In [12]:
combined_df
| | Source | Date | Longitude | Latitude | Species | Data.Quality | Nymph.Stage | Nymph.Density | Adult.Stage | Adult.Density | species |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7-Apr-17 | 143.144968 | -24.760067 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 1 | 1 | 7-Apr-17 | 143.276912 | -24.473368 | 10 | 6 | 0 | 0 | 0 | 0 | CT |
| 2 | 1 | 7-Apr-17 | 143.278365 | -24.643177 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 3 | 1 | 7-Apr-17 | 143.367765 | -24.354893 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 4 | 1 | 7-Apr-17 | 143.459170 | -24.319528 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 1 | 5-Jan-00 | 143.073900 | -25.521900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303830 | 1 | 5-Jan-00 | 143.307100 | -25.209900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303831 | 1 | 5-Jan-00 | 143.839400 | -25.032800 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303832 | 1 | 5-Jan-00 | 143.921900 | -24.971200 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303833 | 1 | 5-Jan-00 | 146.820000 | -31.751700 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
303834 rows × 11 columns
In [13]:
columns_to_keep = ['Longitude', 'Latitude', 'Date','Species','Data.Quality','Nymph.Density','Adult.Density']
APL_dat = combined_df[columns_to_keep]
In [14]:
# Convert DataFrame to GeoDataFrame
gdf_points = gpd.GeoDataFrame(
APL_dat,
geometry=gpd.points_from_xy(APL_dat.Longitude, APL_dat.Latitude)
)
gdf_points.set_crs(epsg=4326, inplace=True)
| | Longitude | Latitude | Date | Species | Data.Quality | Nymph.Density | Adult.Density | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 143.144968 | -24.760067 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.14497 -24.76007) |
| 1 | 143.276912 | -24.473368 | 7-Apr-17 | 10 | 6 | 0 | 0 | POINT (143.27691 -24.47337) |
| 2 | 143.278365 | -24.643177 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.27837 -24.64318) |
| 3 | 143.367765 | -24.354893 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.36776 -24.35489) |
| 4 | 143.459170 | -24.319528 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.45917 -24.31953) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 143.073900 | -25.521900 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.07390 -25.52190) |
| 303830 | 143.307100 | -25.209900 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.30710 -25.20990) |
| 303831 | 143.839400 | -25.032800 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.83940 -25.03280) |
| 303832 | 143.921900 | -24.971200 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.92190 -24.97120) |
| 303833 | 146.820000 | -31.751700 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (146.82000 -31.75170) |
303834 rows × 8 columns
In [15]:
joined_gdf = gpd.sjoin(gdf_points, aplc_survey_dataframe, how="left", predicate='within')
In [16]:
# I wanted to ensure every point has an associated polygon_id
## which they do
na_counts = joined_gdf.isna().sum()
print(na_counts)
Longitude 0
Latitude 0
Date 0
Species 0
Data.Quality 0
Nymph.Density 0
Adult.Density 0
geometry 0
index_right 0
polygon_id 0
dtype: int64
In [18]:
# Custom aggregation function to get min, median, and max dates
def date_aggregations(series):
    series = pd.to_datetime(series, format='%d-%b-%y')  # convert to datetime
    return [series.min(), series.median(), series.max()]
# aggregation function to count unique values
def count_unique_values(series):
return series.value_counts().to_dict()
# Group by polygon geometry and aggregate the data
aggregated_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg({
'Date': date_aggregations,
'Longitude': 'median',
'Latitude': 'median',
'Data.Quality': count_unique_values,
'Nymph.Density': count_unique_values,
'Adult.Density': count_unique_values
}).reset_index()
In [19]:
aggregated_gdf
| | Species | polygon_id | Date | Longitude | Latitude | Data.Quality | Nymph.Density | Adult.Density |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 13 | [2014-03-08 00:00:00, 2014-03-08 00:00:00, 201... | 131.655639 | -28.119211 | {6: 4} | {0: 4} | {0: 4} |
| 1 | 10 | 3072 | [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... | 133.206243 | -26.350383 | {6: 4} | {0: 4} | {0: 4} |
| 2 | 10 | 4563 | [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... | 133.186734 | -26.237582 | {6: 4} | {0: 4} | {0: 4} |
| 3 | 10 | 4620 | [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... | 133.195693 | -26.075735 | {6: 4} | {0: 4} | {0: 4} |
| 4 | 10 | 4626 | [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... | 133.194124 | -26.009872 | {6: 4} | {0: 4} | {0: 4} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99172 | 19 | 2089037 | [2008-01-15 00:00:00, 2008-01-15 00:00:00, 200... | 150.841310 | -26.695573 | {6: 1} | {0: 1} | {2: 1} |
| 99173 | 19 | 2089373 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 150.927725 | -26.933713 | {6: 1} | {0: 1} | {2: 1} |
| 99174 | 19 | 2099579 | [2007-11-07 00:00:00, 2007-11-07 12:00:00, 200... | 151.075760 | -27.261202 | {6: 2} | {0: 2} | {1: 2} |
| 99175 | 19 | 2100149 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 151.215045 | -27.163347 | {6: 1} | {0: 1} | {3: 1} |
| 99176 | 19 | 2101084 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 151.054728 | -27.023087 | {6: 1} | {0: 1} | {2: 1} |
99177 rows × 8 columns
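Note that Date now holds a three-element list (min, median, max date) and the quality/density columns hold value-count dictionaries. If flat columns are preferred downstream, the date list could be expanded along these lines (a sketch only, not part of the processing above; the Date_min/Date_median/Date_max names are ours):
# Optional sketch: expand the [min, median, max] Date list into separate columns.
aggregated_gdf[['Date_min', 'Date_median', 'Date_max']] = pd.DataFrame(
    aggregated_gdf['Date'].tolist(), index=aggregated_gdf.index
)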
In [20]:
# Let's count up the total number of observations in the dataframe:
# Define the aggregation functions for each column
agg_funcs = {
'Data.Quality': 'count',
'Adult.Density': 'count',
'Nymph.Density': 'count'
}
# Perform groupby operation and aggregate
grouped_joined_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg(agg_funcs).reset_index()
# Rename the aggregated columns
grouped_joined_gdf.rename(columns={
'Data.Quality': 'Data Quality Total Count',
'Adult.Density': 'Adult Density Total Count',
'Nymph.Density': 'Nymph Density Total Count'
}, inplace=True)
# These three columns should agree with one another...
## if any rows don't, this command filters for them
grouped_joined_gdf[
(grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Adult Density Total Count']) |
(grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Nymph Density Total Count']) |
(grouped_joined_gdf['Adult Density Total Count'] != grouped_joined_gdf['Nymph Density Total Count'])
]
| | Species | polygon_id | Data Quality Total Count | Adult Density Total Count | Nymph Density Total Count |
|---|---|---|---|---|---|
The empty result confirms that the three counts agree for every Species × polygon_id group.
In [21]:
final_df = pd.merge(aggregated_gdf, grouped_joined_gdf, on=['Species', 'polygon_id'], how='left')
final_df = pd.merge(final_df, aplc_survey_dataframe, on='polygon_id', how='left')
final_df.head()
na_counts = final_df.isna().sum()
print(na_counts)
Species 0
polygon_id 0
Date 0
Longitude 0
Latitude 0
Data.Quality 0
Nymph.Density 0
Adult.Density 0
Data Quality Total Count 0
Adult Density Total Count 0
Nymph Density Total Count 0
geometry 0
dtype: int64
The APLC data is complete – let's write it to disk
In [22]:
final_df.to_csv('data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.csv')
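Since to_csv serializes the geometry column as WKT text (and the list- and dict-valued columns as strings), re-reading the file as a GeoDataFrame could look like the sketch below, assuming a GeoPandas version that provides GeoSeries.from_wkt:
# Sketch: read the aggregated CSV back and rebuild the GeoDataFrame from the WKT geometry.
reloaded = pd.read_csv(
    'data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.csv',
    index_col=0
)
reloaded = gpd.GeoDataFrame(
    reloaded,
    geometry=gpd.GeoSeries.from_wkt(reloaded['geometry']),
    crs='EPSG:4326'
)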