3-1 spatial modeling - fishnet grid construction

Point to grid conversion

This notebook takes the georeferenced survey points and relates them to a fishnet grid at roughly a 1 km x 1 km scale.

The same workflow will be applied to both the Australian plague locust and the desert locust datasets.
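
As a rough illustration of the point-to-grid idea (this is not the function used later in the notebook), a point can be assigned to the 0.01-degree cell whose lower-left corner is obtained by flooring its coordinates:

import math

# Minimal sketch only: the real workflow joins points to polygon geometries with a spatial join.
def cell_origin(lon, lat, cell_size=0.01):
    """Return the lower-left corner of the grid cell containing (lon, lat)."""
    return (math.floor(lon / cell_size) * cell_size,
            math.floor(lat / cell_size) * cell_size)

cell_origin(143.144968, -24.760067)  # -> (143.14, -24.77)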

In [1]:
import pandas as pd
import geopandas as gpd
import os
In [3]:
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
os.getcwd()
'/home/datascience/herbivore_nutrient_interactions'
In [4]:
# This function takes the point dataset and creates a sparse fishnet grid that overlaps with all points.
# There is a user-specified cell size and a few other handy parameters.
# It defaults to parallel processing, but this can be turned off.

# It creates a bounding box, builds a coarse grid, filters out coarse cells that contain no points,
# and then, within each remaining coarse cell, builds the fine-scale grid at the specified cell size
# before again removing unused cells.

with open('scripts/functions/fishnet_grid_function.py') as f:
    code = f.read()

# Execute the code
exec(code)
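
For readers without access to scripts/functions/fishnet_grid_function.py, the sketch below illustrates the coarse-then-fine strategy the comments describe. The names and details are illustrative assumptions, not the project's actual implementation.

# Illustrative sketch (NOT the project's fishnet_grid_function.py): build a coarse grid
# over the point bounding box, keep only coarse cells that contain points, then subdivide
# those into fine cells and again keep only the occupied ones.
import numpy as np
from shapely.geometry import box

def sparse_fishnet_sketch(points_gdf, fine_size=0.01, coarse_size=1.0):
    minx, miny, maxx, maxy = points_gdf.total_bounds
    points_union = points_gdf.unary_union  # single multipoint for intersection tests

    def grid_cells(x0, y0, x1, y1, size):
        xs = np.arange(x0, x1, size)
        ys = np.arange(y0, y1, size)
        return [box(x, y, x + size, y + size) for x in xs for y in ys]

    # Coarse pass: drop coarse cells with no points.
    coarse = gpd.GeoDataFrame(geometry=grid_cells(minx, miny, maxx, maxy, coarse_size),
                              crs=points_gdf.crs)
    occupied = coarse[coarse.intersects(points_union)]

    # Fine pass: subdivide each occupied coarse cell, keep only occupied fine cells.
    fine_cells = []
    for cell in occupied.geometry:
        cx0, cy0, cx1, cy1 = cell.bounds
        fine_cells.extend(grid_cells(cx0, cy0, cx1, cy1, fine_size))
    fine = gpd.GeoDataFrame(geometry=fine_cells, crs=points_gdf.crs)
    return fine[fine.intersects(points_union)]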

Australian plague locust data

This dataset comprises several species, but the records are biased towards the more economically important species (e.g. the Australian plague locust).

Species identifier

The Species column encodes which species was recorded in each survey:

  • 10 = no locust/grasshopper found
  • 11 = Australian plague locust (Chortoicetes terminifera)
  • 12 = Spur-throated locust (Austracris guttulosa)
  • 15 = Small plague locust (Austroicetes cruciata)
  • 19 = Eastern plague grasshopper (Oedaleus australis)

Species data management

For each species, we need to combine the species-specific records with the code 10 (nil) observations and with observations in the other species categories that count as nil. This gives us every record in which the focal grasshopper species was present, together with every possible nil observation. A hedged sketch of this filtering follows.
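
The per-species CSVs loaded below appear to have been prepared in advance. The sketch shows how such a file might be assembled for Chortoicetes terminifera (code 11) from a hypothetical master survey table; the file name and the nil criterion are assumptions, not the project's actual preprocessing.

# Hypothetical sketch of how a per-species file such as CT.csv could be assembled.
# 'all_surveys.csv' is an assumed master table with the same columns as the CSVs below.
surveys = pd.read_csv('data/raw/survey_data/all_surveys.csv')  # assumed path

focal_code = 11  # Chortoicetes terminifera

presence = surveys[surveys['Species'] == focal_code]   # focal species present
explicit_nil = surveys[surveys['Species'] == 10]       # surveyed, nothing found
other_nil = surveys[                                    # other species recorded, assumed nil criterion: zero densities
    (~surveys['Species'].isin([focal_code, 10]))
    & (surveys['Nymph.Density'] == 0)
    & (surveys['Adult.Density'] == 0)
]

CT_sketch = pd.concat([presence, explicit_nil, other_nil], ignore_index=True)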

In [7]:
CT = pd.read_csv('data/raw/survey_data/CT.csv')
OA = pd.read_csv('data/raw/survey_data/OA.csv')
AG = pd.read_csv('data/raw/survey_data/AG.csv')
AC = pd.read_csv('data/raw/survey_data/AC.csv')

# Add a species column to each dataframe
CT['species'] = 'CT'
OA['species'] = 'OA'
AG['species'] = 'AG'
AC['species'] = 'AC'

# Combine the dataframes and drop the redundant coordinate and index columns
combined_df = pd.concat([CT, OA, AG, AC], ignore_index=True).drop(columns=['Longitude2', 'Latitude2', 'Unnamed: 0'])
combined_df
Source Date Longitude Latitude Species Data.Quality Nymph.Stage Nymph.Density Adult.Stage Adult.Density species
0 1 7-Apr-17 143.144968 -24.760067 11 6 0 0 7 1 CT
1 1 7-Apr-17 143.276912 -24.473368 10 6 0 0 0 0 CT
2 1 7-Apr-17 143.278365 -24.643177 11 6 0 0 7 1 CT
3 1 7-Apr-17 143.367765 -24.354893 11 6 0 0 7 1 CT
4 1 7-Apr-17 143.459170 -24.319528 11 6 0 0 7 1 CT
... ... ... ... ... ... ... ... ... ... ... ...
303829 1 5-Jan-00 143.073900 -25.521900 10 3 0 0 0 0 AC
303830 1 5-Jan-00 143.307100 -25.209900 10 3 0 0 0 0 AC
303831 1 5-Jan-00 143.839400 -25.032800 10 3 0 0 0 0 AC
303832 1 5-Jan-00 143.921900 -24.971200 10 3 0 0 0 0 AC
303833 1 5-Jan-00 146.820000 -31.751700 10 3 0 0 0 0 AC

303834 rows × 11 columns

Fishnet grid construction

I first create the sparse fishnet grid (0.01-degree cells, roughly 1 km x 1 km at these latitudes) and then aggregate the species data to the polygons. This is done by species, not across species.
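
A quick back-of-the-envelope check on the cell size (111.32 km per degree of latitude is the standard approximation; the east-west extent shrinks with the cosine of latitude):

import math

# Approximate ground size of a 0.01-degree cell near the survey region (latitude ~ 27 S).
km_per_deg_lat = 111.32
km_per_deg_lon = 111.32 * math.cos(math.radians(27))

print(0.01 * km_per_deg_lat)  # ~1.11 km north-south
print(0.01 * km_per_deg_lon)  # ~0.99 km east-west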

In [8]:

aplc_survey_dataframe = create_grids_parallel(
        combined_df,
        longitude_column='Longitude',
        latitude_column='Latitude',
        csv_crs='EPSG:4326',
        grid_crs='EPSG:4326',
        transformation_crs='EPSG:4326',
        final_cell_size=0.01,
        coarse_cell_size=None,
        parallel=True
    )
Processing Polygons: 100%|██████████| 1315/1315 [00:17<00:00, 73.10it/s]
In [9]:
# Add an index to the polygons
## this will allow us to join tables together (and make exports to Google Earth Engine (GEE) easier)

aplc_survey_dataframe = gpd.GeoDataFrame(aplc_survey_dataframe)
aplc_survey_dataframe['polygon_id'] = aplc_survey_dataframe.index
aplc_survey_dataframe.set_crs(epsg=4326, inplace=True)
geometry polygon_id
13 POLYGON ((131.66364 -28.11974, 131.66364 -28.1... 13
3072 POLYGON ((133.20694 -26.35200, 133.20694 -26.3... 3072
4563 POLYGON ((133.18694 -26.24757, 133.18694 -26.2... 4563
4620 POLYGON ((133.19694 -26.07757, 133.19694 -26.0... 4620
4626 POLYGON ((133.19694 -26.01757, 133.19694 -26.0... 4626
... ... ...
2101240 POLYGON ((151.10087 -27.06644, 151.10087 -27.0... 2101240
2101580 POLYGON ((151.18087 -26.86644, 151.18087 -26.8... 2101580
2101656 POLYGON ((151.20087 -26.90644, 151.20087 -26.8... 2101656
2102441 POLYGON ((151.00087 -26.66200, 151.00087 -26.6... 2102441
2102562 POLYGON ((151.03087 -26.65200, 151.03087 -26.6... 2102562

67144 rows × 2 columns

Let's summarize the APL point data onto the polygon grid.

In [12]:
combined_df
Source Date Longitude Latitude Species Data.Quality Nymph.Stage Nymph.Density Adult.Stage Adult.Density species
0 1 7-Apr-17 143.144968 -24.760067 11 6 0 0 7 1 CT
1 1 7-Apr-17 143.276912 -24.473368 10 6 0 0 0 0 CT
2 1 7-Apr-17 143.278365 -24.643177 11 6 0 0 7 1 CT
3 1 7-Apr-17 143.367765 -24.354893 11 6 0 0 7 1 CT
4 1 7-Apr-17 143.459170 -24.319528 11 6 0 0 7 1 CT
... ... ... ... ... ... ... ... ... ... ... ...
303829 1 5-Jan-00 143.073900 -25.521900 10 3 0 0 0 0 AC
303830 1 5-Jan-00 143.307100 -25.209900 10 3 0 0 0 0 AC
303831 1 5-Jan-00 143.839400 -25.032800 10 3 0 0 0 0 AC
303832 1 5-Jan-00 143.921900 -24.971200 10 3 0 0 0 0 AC
303833 1 5-Jan-00 146.820000 -31.751700 10 3 0 0 0 0 AC

303834 rows × 11 columns

In [13]:
# Keep only the columns needed for the grid-level aggregation
columns_to_keep = ['Longitude', 'Latitude', 'Date', 'Species', 'Data.Quality', 'Nymph.Density', 'Adult.Density']
APL_dat = combined_df[columns_to_keep]
In [14]:
# Convert DataFrame to GeoDataFrame

gdf_points = gpd.GeoDataFrame(
    APL_dat,
    geometry=gpd.points_from_xy(APL_dat.Longitude, APL_dat.Latitude)
)

gdf_points.set_crs(epsg=4326, inplace=True)
Longitude Latitude Date Species Data.Quality Nymph.Density Adult.Density geometry
0 143.144968 -24.760067 7-Apr-17 11 6 0 1 POINT (143.14497 -24.76007)
1 143.276912 -24.473368 7-Apr-17 10 6 0 0 POINT (143.27691 -24.47337)
2 143.278365 -24.643177 7-Apr-17 11 6 0 1 POINT (143.27837 -24.64318)
3 143.367765 -24.354893 7-Apr-17 11 6 0 1 POINT (143.36776 -24.35489)
4 143.459170 -24.319528 7-Apr-17 11 6 0 1 POINT (143.45917 -24.31953)
... ... ... ... ... ... ... ... ...
303829 143.073900 -25.521900 5-Jan-00 10 3 0 0 POINT (143.07390 -25.52190)
303830 143.307100 -25.209900 5-Jan-00 10 3 0 0 POINT (143.30710 -25.20990)
303831 143.839400 -25.032800 5-Jan-00 10 3 0 0 POINT (143.83940 -25.03280)
303832 143.921900 -24.971200 5-Jan-00 10 3 0 0 POINT (143.92190 -24.97120)
303833 146.820000 -31.751700 5-Jan-00 10 3 0 0 POINT (146.82000 -31.75170)

303834 rows × 8 columns

In [15]:
joined_gdf = gpd.sjoin(gdf_points, aplc_survey_dataframe, how="left", predicate='within')
In [16]:
# I wanted to ensure every point has an associated polygon_id 

## which they do

na_counts = joined_gdf.isna().sum()

print(na_counts)
Longitude        0
Latitude         0
Date             0
Species          0
Data.Quality     0
Nymph.Density    0
Adult.Density    0
geometry         0
index_right      0
polygon_id       0
dtype: int64
In [18]:

# Custom aggregation function to get min, median, and max dates
def date_aggregations(series):
    series = pd.to_datetime(series, format='%d-%b-%y')  # convert to datetime
    return [series.min(), series.median(), series.max()]

# aggregation function to count unique values
def count_unique_values(series):
    return series.value_counts().to_dict()

# Group by species and polygon_id and aggregate the data
aggregated_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg({
    'Date': date_aggregations,
    'Longitude': 'median',
    'Latitude': 'median',
    'Data.Quality': count_unique_values,
    'Nymph.Density': count_unique_values,
    'Adult.Density': count_unique_values
}).reset_index()
In [19]:
aggregated_gdf
Species polygon_id Date Longitude Latitude Data.Quality Nymph.Density Adult.Density
0 10 13 [2014-03-08 00:00:00, 2014-03-08 00:00:00, 201... 131.655639 -28.119211 {6: 4} {0: 4} {0: 4}
1 10 3072 [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... 133.206243 -26.350383 {6: 4} {0: 4} {0: 4}
2 10 4563 [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... 133.186734 -26.237582 {6: 4} {0: 4} {0: 4}
3 10 4620 [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... 133.195693 -26.075735 {6: 4} {0: 4} {0: 4}
4 10 4626 [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... 133.194124 -26.009872 {6: 4} {0: 4} {0: 4}
... ... ... ... ... ... ... ... ...
99172 19 2089037 [2008-01-15 00:00:00, 2008-01-15 00:00:00, 200... 150.841310 -26.695573 {6: 1} {0: 1} {2: 1}
99173 19 2089373 [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... 150.927725 -26.933713 {6: 1} {0: 1} {2: 1}
99174 19 2099579 [2007-11-07 00:00:00, 2007-11-07 12:00:00, 200... 151.075760 -27.261202 {6: 2} {0: 2} {1: 2}
99175 19 2100149 [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... 151.215045 -27.163347 {6: 1} {0: 1} {3: 1}
99176 19 2101084 [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... 151.054728 -27.023087 {6: 1} {0: 1} {2: 1}

99177 rows × 8 columns
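
Downstream steps may be easier if the three-element Date list is expanded into explicit columns. A small optional sketch; the new column names are illustrative, not part of the original pipeline:

# Optional: split the [min, median, max] date list into separate columns.
aggregated_gdf[['date_min', 'date_median', 'date_max']] = pd.DataFrame(
    aggregated_gdf['Date'].tolist(), index=aggregated_gdf.index
)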

In [20]:
# Lets count up the total number of observations in the dataframe:

# Define the aggregation functions for each column
agg_funcs = {
    'Data.Quality': 'count',  
    'Adult.Density': 'count',    
    'Nymph.Density': 'count'    
}

# Perform groupby operation and aggregate
grouped_joined_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg(agg_funcs).reset_index()

# Rename the aggregated columns
grouped_joined_gdf.rename(columns={
    'Data.Quality': 'Data Quality Total Count',
    'Adult.Density': 'Adult Density Total Count',
    'Nymph.Density': 'Nymph Density Total Count'
}, inplace=True)


# These three columns should agree with one another...
## if any rows don't, this filter returns them (an empty result means all the counts agree)
grouped_joined_gdf[
    (grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Adult Density Total Count']) |
    (grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Nymph Density Total Count']) |
    (grouped_joined_gdf['Adult Density Total Count'] != grouped_joined_gdf['Nymph Density Total Count'])
]
Species polygon_id Data Quality Total Count Adult Density Total Count Nymph Density Total Count
In [21]:
final_df = pd.merge(aggregated_gdf, grouped_joined_gdf, on=['Species', 'polygon_id'], how='left')
final_df = pd.merge(final_df, aplc_survey_dataframe, on='polygon_id', how='left')
final_df.head()
na_counts = final_df.isna().sum()

print(na_counts)
Species                      0
polygon_id                   0
Date                         0
Longitude                    0
Latitude                     0
Data.Quality                 0
Nymph.Density                0
Adult.Density                0
Data Quality Total Count     0
Adult Density Total Count    0
Nymph Density Total Count    0
geometry                     0
dtype: int64

The APLC data is complete. Let's write it to disk.

In [22]:
final_df.to_csv('data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.csv')
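
Note that to_csv stores the polygon geometry column as WKT text inside the CSV. If a spatial format is preferred (for example, for loading into GIS software or Google Earth Engine), one hedged alternative is sketched below; the output path is illustrative, and the list- and dict-valued columns are cast to strings because vector formats only support simple field types.

# Optional alternative: keep geometries in a spatial format rather than as WKT text in a CSV.
final_gdf = gpd.GeoDataFrame(final_df, geometry='geometry', crs='EPSG:4326')
for col in ['Date', 'Data.Quality', 'Nymph.Density', 'Adult.Density']:
    final_gdf[col] = final_gdf[col].astype(str)  # flatten list/dict values for export
final_gdf.to_file('data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.geojson',
                  driver='GeoJSON')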