This code takes the georeferenced points and relates them to a fishnet grid at a 1 km x 1 km scale.
This will be done for both the Australian plague locust and the desert locust.
In [1]:
import pandas as pd
import geopandas as gpd
import os
In [3]:
os.chdir(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
os.getcwd()
'/home/datascience/herbivore_nutrient_interactions'
In [4]:
# This function takes the point dataset and creates a sparse fishnet grid that overlaps with all points.
# There is a user-specified cell size and a few other handy parameters.
# It defaults to parallel processing, but this can be turned off.
# It creates a bounding box, builds a coarse grid, filters out unused coarse cells, and then, within each
# remaining coarse cell, builds the fine-scale grid at the specified cell size and again removes unused cells.
with open('scripts/functions/fishnet_grid_function.py') as f:
    code = f.read()
# Execute the code
exec(code)
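The real implementation lives in fishnet_grid_function.py and is executed above; the snippet below is only a minimal sketch of the coarse-then-fine idea it describes, assuming square cells in degrees and the pandas/geopandas imports already made here (it is not the actual create_grids_parallel).
# Illustrative sketch only -- not the real fishnet grid function.
import numpy as np
from shapely.geometry import box

def sparse_fishnet(points_gdf, fine_size=0.01, coarse_size=1.0):
    """Build fine cells only inside coarse cells that contain at least one point."""
    minx, miny, maxx, maxy = points_gdf.total_bounds
    # 1. Coarse grid over the bounding box of all points.
    coarse = gpd.GeoDataFrame(
        geometry=[box(x, y, x + coarse_size, y + coarse_size)
                  for x in np.arange(minx, maxx, coarse_size)
                  for y in np.arange(miny, maxy, coarse_size)],
        crs=points_gdf.crs)
    # 2. Drop coarse cells that contain no points.
    used = coarse.loc[gpd.sjoin(points_gdf, coarse, predicate='within')['index_right'].unique()]
    # 3. Fine grid inside each retained coarse cell, again dropping empty cells.
    cells = [box(x, y, x + fine_size, y + fine_size)
             for cell in used.geometry
             for x in np.arange(cell.bounds[0], cell.bounds[2], fine_size)
             for y in np.arange(cell.bounds[1], cell.bounds[3], fine_size)]
    fine = gpd.GeoDataFrame(geometry=cells, crs=points_gdf.crs)
    return fine.loc[gpd.sjoin(points_gdf, fine, predicate='within')['index_right'].unique()]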
Australian plague locust data
This dataset comprises several species, but the records are biased towards the more economically important species (e.g. the Australian plague locust).
Species identifier
The Species column is the per-record species identifier (a code-to-name mapping is sketched below the list):
- 10 = no locust/grasshopper found
- 11 = Australian plague locust (Chortoicetes terminifera)
- 12 = Spur-throated locust (Austracris guttulosa)
- 15 = Small plague locust (Austroicetes cruciata)
- 19 = Eastern plague grasshopper (Oedaleus australis)
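For readability, the codes can also be kept in a small lookup dictionary; the mapping below is only a convenience sketch and is not used elsewhere in this notebook.
# Convenience lookup of the APLC survey species codes (sketch only; not used below).
SPECIES_CODES = {
    10: 'no locust/grasshopper found',
    11: 'Australian plague locust (Chortoicetes terminifera)',
    12: 'Spur-throated locust (Austracris guttulosa)',
    15: 'Small plague locust (Austroicetes cruciata)',
    19: 'Eastern plague grasshopper (Oedaleus australis)',
}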
Species data management
For each species, we need to combine the records carrying that species' ID with the code-10 observations (nil observations) and with the nil observations recorded under the other species categories. This gives us every record where a specific grasshopper species was present, together with all possible nil observations. A sketch of this assembly is shown below.
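A minimal sketch of that assembly, assuming a raw survey table with the Species, Nymph.Density and Adult.Density columns seen later, and assuming that zero densities mark a nil record under the other species codes; the CT/OA/AG/AC CSVs read in the next cell were produced by the project's own preprocessing, not by this helper.
# Hypothetical helper illustrating the per-species assembly described above.
def assemble_species(raw, focal_code, all_codes=(11, 12, 15, 19)):
    focal = raw[raw['Species'] == focal_code]            # presence records for the focal species
    nil = raw[raw['Species'] == 10]                      # explicit "no locust/grasshopper found" records
    other_nil = raw[raw['Species'].isin([c for c in all_codes if c != focal_code]) &
                    (raw['Nymph.Density'] == 0) &
                    (raw['Adult.Density'] == 0)]         # nil records filed under the other species codes
    return pd.concat([focal, nil, other_nil], ignore_index=True)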
In [7]:
CT = pd.read_csv('data/raw/survey_data/CT.csv')
OA = pd.read_csv('data/raw/survey_data/OA.csv')
AG = pd.read_csv('data/raw/survey_data/AG.csv')
AC = pd.read_csv('data/raw/survey_data/AC.csv')
# Add a species column to each dataframe
CT['species'] = 'CT'
OA['species'] = 'OA'
AG['species'] = 'AG'
AC['species'] = 'AC'
# Combine all the dataframes
combined_df = pd.concat([CT, OA, AG, AC], ignore_index=True).drop(columns=['Longitude2', 'Latitude2','Unnamed: 0'])
combined_df
| | Source | Date | Longitude | Latitude | Species | Data.Quality | Nymph.Stage | Nymph.Density | Adult.Stage | Adult.Density | species |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7-Apr-17 | 143.144968 | -24.760067 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 1 | 1 | 7-Apr-17 | 143.276912 | -24.473368 | 10 | 6 | 0 | 0 | 0 | 0 | CT |
| 2 | 1 | 7-Apr-17 | 143.278365 | -24.643177 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 3 | 1 | 7-Apr-17 | 143.367765 | -24.354893 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 4 | 1 | 7-Apr-17 | 143.459170 | -24.319528 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 1 | 5-Jan-00 | 143.073900 | -25.521900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303830 | 1 | 5-Jan-00 | 143.307100 | -25.209900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303831 | 1 | 5-Jan-00 | 143.839400 | -25.032800 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303832 | 1 | 5-Jan-00 | 143.921900 | -24.971200 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303833 | 1 | 5-Jan-00 | 146.820000 | -31.751700 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
303834 rows × 11 columns
Fishnet grid construction
I first create the sparse fishnet grid (cells of 0.01 degrees, roughly 1 km x 1 km at these latitudes) and then aggregate the species data to the polygons. The aggregation is done separately for each species, not across species.
In [8]:
aplc_survey_dataframe = create_grids_parallel(
combined_df,
longitude_column='Longitude',
latitude_column='Latitude',
csv_crs='EPSG:4326',
grid_crs='EPSG:4326',
transformation_crs='EPSG:4326',
final_cell_size=0.01,
coarse_cell_size=None,
parallel=True
)
Processing Polygons: 100%|██████████| 1315/1315 [00:17<00:00, 73.10it/s]
In [9]:
# Add an index to the polygons
## this will allow us to join tables together (and will make working in GEE easier)
aplc_survey_dataframe = gpd.GeoDataFrame(aplc_survey_dataframe)
aplc_survey_dataframe['polygon_id'] = aplc_survey_dataframe.index
aplc_survey_dataframe.set_crs(epsg=4326, inplace=True)
| | geometry | polygon_id |
|---|---|---|
| 13 | POLYGON ((131.66364 -28.11974, 131.66364 -28.1... | 13 |
| 3072 | POLYGON ((133.20694 -26.35200, 133.20694 -26.3... | 3072 |
| 4563 | POLYGON ((133.18694 -26.24757, 133.18694 -26.2... | 4563 |
| 4620 | POLYGON ((133.19694 -26.07757, 133.19694 -26.0... | 4620 |
| 4626 | POLYGON ((133.19694 -26.01757, 133.19694 -26.0... | 4626 |
| ... | ... | ... |
| 2101240 | POLYGON ((151.10087 -27.06644, 151.10087 -27.0... | 2101240 |
| 2101580 | POLYGON ((151.18087 -26.86644, 151.18087 -26.8... | 2101580 |
| 2101656 | POLYGON ((151.20087 -26.90644, 151.20087 -26.8... | 2101656 |
| 2102441 | POLYGON ((151.00087 -26.66200, 151.00087 -26.6... | 2102441 |
| 2102562 | POLYGON ((151.03087 -26.65200, 151.03087 -26.6... | 2102562 |
67144 rows × 2 columns
Let's summarize the APL point data onto the polygon grid
In [12]:
combined_df
| | Source | Date | Longitude | Latitude | Species | Data.Quality | Nymph.Stage | Nymph.Density | Adult.Stage | Adult.Density | species |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7-Apr-17 | 143.144968 | -24.760067 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 1 | 1 | 7-Apr-17 | 143.276912 | -24.473368 | 10 | 6 | 0 | 0 | 0 | 0 | CT |
| 2 | 1 | 7-Apr-17 | 143.278365 | -24.643177 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 3 | 1 | 7-Apr-17 | 143.367765 | -24.354893 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| 4 | 1 | 7-Apr-17 | 143.459170 | -24.319528 | 11 | 6 | 0 | 0 | 7 | 1 | CT |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 1 | 5-Jan-00 | 143.073900 | -25.521900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303830 | 1 | 5-Jan-00 | 143.307100 | -25.209900 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303831 | 1 | 5-Jan-00 | 143.839400 | -25.032800 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303832 | 1 | 5-Jan-00 | 143.921900 | -24.971200 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
| 303833 | 1 | 5-Jan-00 | 146.820000 | -31.751700 | 10 | 3 | 0 | 0 | 0 | 0 | AC |
303834 rows × 11 columns
In [13]:
columns_to_keep = ['Longitude', 'Latitude', 'Date','Species','Data.Quality','Nymph.Density','Adult.Density']
APL_dat = combined_df[columns_to_keep]
In [14]:
# Convert DataFrame to GeoDataFrame
gdf_points = gpd.GeoDataFrame(
APL_dat,
geometry=gpd.points_from_xy(APL_dat.Longitude, APL_dat.Latitude)
)
gdf_points.set_crs(epsg=4326, inplace=True)
| | Longitude | Latitude | Date | Species | Data.Quality | Nymph.Density | Adult.Density | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 143.144968 | -24.760067 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.14497 -24.76007) |
| 1 | 143.276912 | -24.473368 | 7-Apr-17 | 10 | 6 | 0 | 0 | POINT (143.27691 -24.47337) |
| 2 | 143.278365 | -24.643177 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.27837 -24.64318) |
| 3 | 143.367765 | -24.354893 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.36776 -24.35489) |
| 4 | 143.459170 | -24.319528 | 7-Apr-17 | 11 | 6 | 0 | 1 | POINT (143.45917 -24.31953) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 303829 | 143.073900 | -25.521900 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.07390 -25.52190) |
| 303830 | 143.307100 | -25.209900 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.30710 -25.20990) |
| 303831 | 143.839400 | -25.032800 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.83940 -25.03280) |
| 303832 | 143.921900 | -24.971200 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (143.92190 -24.97120) |
| 303833 | 146.820000 | -31.751700 | 5-Jan-00 | 10 | 3 | 0 | 0 | POINT (146.82000 -31.75170) |
303834 rows × 8 columns
In [15]:
joined_gdf = gpd.sjoin(gdf_points, aplc_survey_dataframe, how="left", predicate='within')
In [16]:
# I wanted to ensure every point has an associated polygon_id
## which they do
na_counts = joined_gdf.isna().sum()
print(na_counts)
Longitude 0
Latitude 0
Date 0
Species 0
Data.Quality 0
Nymph.Density 0
Adult.Density 0
geometry 0
index_right 0
polygon_id 0
dtype: int64
In [18]:
# Custom aggregation function to get min, median, and max dates
def date_aggregations(series):
    series = pd.to_datetime(series, format='%d-%b-%y')  # convert to datetime
    return [series.min(), series.median(), series.max()]
# aggregation function to count unique values
def count_unique_values(series):
return series.value_counts().to_dict()
# Group by polygon geometry and aggregate the data
aggregated_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg({
'Date': date_aggregations,
'Longitude': 'median',
'Latitude': 'median',
'Data.Quality': count_unique_values,
'Nymph.Density': count_unique_values,
'Adult.Density': count_unique_values
}).reset_index()
In [19]:
aggregated_gdf
| | Species | polygon_id | Date | Longitude | Latitude | Data.Quality | Nymph.Density | Adult.Density |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 13 | [2014-03-08 00:00:00, 2014-03-08 00:00:00, 201... | 131.655639 | -28.119211 | {6: 4} | {0: 4} | {0: 4} |
| 1 | 10 | 3072 | [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... | 133.206243 | -26.350383 | {6: 4} | {0: 4} | {0: 4} |
| 2 | 10 | 4563 | [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... | 133.186734 | -26.237582 | {6: 4} | {0: 4} | {0: 4} |
| 3 | 10 | 4620 | [2006-03-16 00:00:00, 2006-03-16 00:00:00, 200... | 133.195693 | -26.075735 | {6: 4} | {0: 4} | {0: 4} |
| 4 | 10 | 4626 | [2017-04-01 00:00:00, 2017-04-01 00:00:00, 201... | 133.194124 | -26.009872 | {6: 4} | {0: 4} | {0: 4} |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99172 | 19 | 2089037 | [2008-01-15 00:00:00, 2008-01-15 00:00:00, 200... | 150.841310 | -26.695573 | {6: 1} | {0: 1} | {2: 1} |
| 99173 | 19 | 2089373 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 150.927725 | -26.933713 | {6: 1} | {0: 1} | {2: 1} |
| 99174 | 19 | 2099579 | [2007-11-07 00:00:00, 2007-11-07 12:00:00, 200... | 151.075760 | -27.261202 | {6: 2} | {0: 2} | {1: 2} |
| 99175 | 19 | 2100149 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 151.215045 | -27.163347 | {6: 1} | {0: 1} | {3: 1} |
| 99176 | 19 | 2101084 | [2007-03-18 00:00:00, 2007-03-18 00:00:00, 200... | 151.054728 | -27.023087 | {6: 1} | {0: 1} | {2: 1} |
99177 rows × 8 columns
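Note that Date now holds a three-element list (min, median, max date) and the quality/density columns hold value-count dictionaries. If flat columns are preferred downstream, the date list could be expanded along these lines (a sketch only, not part of the processing above; the Date_min/Date_median/Date_max names are ours):
# Optional sketch: expand the [min, median, max] Date list into separate columns.
aggregated_gdf[['Date_min', 'Date_median', 'Date_max']] = pd.DataFrame(
    aggregated_gdf['Date'].tolist(), index=aggregated_gdf.index
)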
In [20]:
# Let's count up the total number of observations in the dataframe:
# Define the aggregation functions for each column
agg_funcs = {
'Data.Quality': 'count',
'Adult.Density': 'count',
'Nymph.Density': 'count'
}
# Perform groupby operation and aggregate
grouped_joined_gdf = joined_gdf.groupby(['Species', 'polygon_id']).agg(agg_funcs).reset_index()
# Rename the aggregated columns
grouped_joined_gdf.rename(columns={
'Data.Quality': 'Data Quality Total Count',
'Adult.Density': 'Adult Density Total Count',
'Nymph.Density': 'Nymph Density Total Count'
}, inplace=True)
# These three columns should agree with one another...
## if any rows don't, this command filters for them
grouped_joined_gdf[
(grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Adult Density Total Count']) |
(grouped_joined_gdf['Data Quality Total Count'] != grouped_joined_gdf['Nymph Density Total Count']) |
(grouped_joined_gdf['Adult Density Total Count'] != grouped_joined_gdf['Nymph Density Total Count'])
]
| | Species | polygon_id | Data Quality Total Count | Adult Density Total Count | Nymph Density Total Count |
|---|---|---|---|---|---|
The empty result confirms that the three counts agree for every Species × polygon_id group.
In [21]:
final_df = pd.merge(aggregated_gdf, grouped_joined_gdf, on=['Species', 'polygon_id'], how='left')
final_df = pd.merge(final_df, aplc_survey_dataframe, on='polygon_id', how='left')
final_df.head()
na_counts = final_df.isna().sum()
print(na_counts)
Species 0
polygon_id 0
Date 0
Longitude 0
Latitude 0
Data.Quality 0
Nymph.Density 0
Adult.Density 0
Data Quality Total Count 0
Adult Density Total Count 0
Nymph Density Total Count 0
geometry 0
dtype: int64
The APLC data is complete – let's write it to disk
In [22]:
final_df.to_csv('data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.csv')
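Since to_csv serializes the geometry column as WKT text (and the list- and dict-valued columns as strings), re-reading the file as a GeoDataFrame could look like the sketch below, assuming a GeoPandas version that provides GeoSeries.from_wkt:
# Sketch: read the aggregated CSV back and rebuild the GeoDataFrame from the WKT geometry.
reloaded = pd.read_csv(
    'data/processed/spatial_modeling/aplc_data_aggregated_to_polygon_grid.csv',
    index_col=0
)
reloaded = gpd.GeoDataFrame(
    reloaded,
    geometry=gpd.GeoSeries.from_wkt(reloaded['geometry']),
    crs='EPSG:4326'
)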