Our data science approach will rely on several data sources. The primary source will be NYPD shooting incident reports, which include details about each shooting, such as the location, time, and victim demographics. We will also incorporate demographic, weather, and socioeconomic data to gain a more comprehensive understanding of the factors that may contribute to shooting incident fatality. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
A mathematical model of access control in big data using confidence interval ...
Nowadays, the concept of big data grows incessantly; recent research has shown that 90% of all the data on the web was created in the last two years. However, this growth runs into critical challenges, generally at the security level: users care about how providers can protect the privacy of their data. Access control, cryptography, and de-identification are the main research areas grouped under a specific domain known as Privacy Preserving Data Publishing. In this paper, we propose a new model for access control over big data using digital signatures and confidence intervals; we first introduce our work by presenting some general concepts used to build our approach, then present the idea of this report, and finally evaluate our system by conducting several experiments and presenting and discussing the results we obtained.
Visual Analytics: Traffic Collisions in Italy
The document describes a visual analytics project analyzing traffic collision statistics in Italy. It uses an interactive dashboard with an Italy map, histograms, and sliders to filter data by year, region, and other factors. Principal component analysis is applied to reduce the dataset dimensions before representation. The dashboard allows users to gain insights through interactive exploration of quantitative relationships between variables like accident rates in different regions.
IRJET - A Survey on Machine Learning Intelligence Techniques for Medical ...
This document discusses machine learning techniques for classifying medical datasets. It provides an overview of various artificial intelligence and machine learning algorithms that have been applied for medical dataset classification, including artificial neural networks, support vector machines, k-nearest neighbors, and decision trees. The document surveys works that have used these techniques for diseases like breast cancer, heart disease, and diabetes. It also describes common pre-processing steps for medical datasets like data normalization and feature selection methods like F-score and PCA that are used to select the most important features for classification. The classification algorithms are then evaluated based on accuracy metrics like sensitivity, specificity, and accuracy.
Pattern recognition using context dependent memory model (cdmm) in multimodal...
Pattern recognition is one of the prime concepts in current technologies in both the private and public sectors. The analysis and recognition of two or more patterns is a complex task due to several factors: considering two or more patterns requires substantial storage space as well as computational resources. Vector logic provides a very good strategy for pattern recognition. This paper proposes pattern recognition in a multimodal authentication system using vector logic, making the computational model hard to attack while lowering the error rate. Using PCA, two to three biometric patterns are fused, and keys of various sizes are then extracted using an LU factorization approach. The selected keys are combined using vector logic, which introduces a memory model called the Context Dependent Memory Model (CDMM) as the computational model in the multimodal authentication system, yielding accurate and effective outcomes for both authentication and verification. In the verification step, Mean Square Error (MSE) and Normalized Correlation (NC) are used as metrics to minimize the error rate of the proposed model, and a performance analysis is presented.
Predictive Modeling for Topographical Analysis of Crime Rate
This document describes a proposed system to use machine learning methods to predict crime rates and types of crimes in specific areas based on historical crime data. The system would analyze crime data collected from websites including date, location, and crime type to identify patterns. Machine learning algorithms would be trained on the data to build predictive models. The goal is to help law enforcement agencies more quickly detect, resolve, and prevent crimes by predicting where and what types of crimes may occur based on the characteristics of past crimes.
Data Mining Approach of Accident Occurrences Identification with Effective M...
Data mining is used in various research domains to identify new cause-and-effect relationships in society across the globe. This article uses data mining for the same reason: to identify accident occurrences in different regions and the most likely reasons accidents happen around the world. Data mining and advanced machine learning algorithms are used in this research, and the article discusses hyperplanes, classification, pre-processing of the data, and training the machine with sample datasets collected from different regions, comprising both structured and semi-structured data. We dive deep into machine learning and data mining classification algorithms to find or predict something novel about accident occurrences across the globe. To keep the task focused, we concentrate on two basic but important classification algorithms: SVM (Support Vector Machine) and the CNB classifier. The discussion uses the WEKA tool for the CNB classifier, bag-of-words identification, word count, and frequency calculation.
AN EFFICIENT FACE RECOGNITION EMPLOYING SVM AND BU-LDP
The document presents a study on an efficient face recognition method employing support vector machines (SVM) and biomimetic uncorrelated local difference projection (BU-LDP). The study proposes using BU-LDP, which is based on uncorrelated local projection but uses a different neighborhood coefficient calculation approach inspired by human perception. Experimental results on several datasets show that BU-LDP and its kernel variant KBU-LDP outperform state-of-the-art methods for face recognition. Future work will focus on addressing the "one sample problem" and applying the approach to unlabeled data.
A Validation of Object-Oriented Design Metrics as Quality Indicators
The document summarizes a research paper that empirically validated several object-oriented design metrics proposed by Chidamber and Kemerer as indicators of fault-prone classes. The study analyzed 6 metrics on 180 classes from a system. Univariate analysis found 5 metrics to be significantly correlated with fault probability. Multivariate analysis using these 5 metrics achieved better prediction of faulty classes than models using traditional code metrics. The research validated that these OO design metrics can help identify fault-prone classes early in the development lifecycle.
Intrusion Detection for HealthCare Network using Machine Learning
1) The document discusses using machine learning techniques for intrusion detection in healthcare networks. It aims to build an effective intrusion detection system that can efficiently detect intrusions and provide safety for sensitive patient health information and medical data.
2) The methodology involves pre-processing the NSL-KDD dataset, training a decision tree classifier model, and using the trained model to predict intrusions. Accuracy of 90.3% was achieved using cross-validation.
3) Future work could include using all dataset features, immediately alerting administrators of attacks, and making the system multi-lingual. The system aims to provide secure access to healthcare data for authorized users and detect unauthorized access attempts.
Performance Comparison of Dimensionality Reduction Methods using MCDR
The recent explosion in dataset size, in the number of records as well as attributes, has triggered the development of various big data platforms as well as parallel data analytics algorithms. At the same time, it has pushed for the use of data dimensionality reduction techniques. Competition in the mobile telecom industry has become increasingly fierce. To improve their services and business in this competitive environment, operators are ready to analyze their stored data with several data mining technologies to retain customers and maintain their relationships with them. A Mobile Call Detail Record (MCDR) contains diverse and complex information, covering voice calls, text messages, video calls, and other data service usage. This work evaluates and compares the performance of different dimensionality reduction methods: the Chi-Square (Chi2) method, Principal Component Analysis (PCA), the Information Gain attribute evaluator, the Gain-Ratio Attribute Evaluator (GRAE), the Attribute Selected Classifier (ASC), and Quantile Regression (QR) methods.
This white paper describes the analysis and models developed to predict crimes in the city of Chicago. The models are compared, and the most effective and simplest model is recommended in the conclusion.
ISSN 2395-650X
The "International Journal of Life Sciences Biotechnology and Pharma Sciences journal appears to be a valuable resource for those interested in staying updated on the latest developments and research in these important scientific fields of Life and science journal.
San Francisco Crime Analysis Classification Kaggle contest
This document presents a project to analyze and predict crime in San Francisco using data mining techniques. The objectives are to analyze the spatial and temporal relationships of crime, predict the category of crime in a location based on variables like location and date, and suggest safest paths between places. The authors describe using naive Bayes, decision tree, random forest and support vector machine classifiers on a dataset of over 878,000 crime incidents to classify crimes and identify patterns. Cross-validation is used to evaluate the classifiers on the training data. The results are intended to help the police department understand crime patterns and deploy resources more efficiently.
IRJET- Ad-Click Prediction using Prediction Algorithm: Machine Learning Approach
This document discusses using machine learning algorithms to predict whether users will click on advertisements based on their characteristics and behavior. It tests several algorithms on a dataset containing information about users, including Logistic Regression, Decision Trees, and Support Vector Machines. The authors preprocess the data by removing identifying location information and expanding timestamp features. They then divide the data into training and test sets to evaluate the algorithms' performance at predicting click behavior. The goal is to identify users likely to click in order to improve targeted advertising.
IRJET- Finger Vein Presentation Attack Detection using Total Variation Decomp...
1) The document proposes a novel finger vein presentation attack detection method called TV-LBP that uses total variation regularization to decompose finger vein images into structure and noise segments, and then extracts local binary pattern descriptors from these segments.
2) An experiment on two finger vein presentation attack databases and one palm vein database shows the TV-LBP method achieves 100% accuracy, outperforming other state-of-the-art methods.
3) The document introduces a new finger vein presentation attack database containing 7,200 images to evaluate finger vein presentation attack detection methods.
The document discusses using k-means clustering on a life insurance customer dataset to predict customer preferences. It first provides background on k-means clustering and its application in data mining. It then describes applying k-means to a dataset of 14,180 customer records with 10 attributes from an Albanian insurance company. This identified 5 clusters characterizing different customer segments based on attributes like gender, age, and preferred insurance product type and amount. The results help the insurance company better understand customer preferences to improve performance.
A MATHEMATICAL MODEL OF ACCESS CONTROL IN BIG DATA USING CONFIDENCE INTERVAL ...
- The document proposes a new model for access control over big data using digital signatures and confidence intervals. It involves a multi-step process of 1) identifying users hierarchically, 2) normalizing identities, 3) computing confidence intervals for each group, 4) computing digital signatures for each user, and 5) defining an access control matrix based on these computations.
- The model utilizes mathematical concepts such as standard deviation, confidence intervals, and primitive roots. Standard deviations are used to compute confidence intervals for each group's identity range. Primitive roots are used to uniquely generate digital signatures for each user.
- The goal is to provide access control while preserving user privacy over large datasets where direct control is lost, by bas
IRJET - Random Data Perturbation Techniques in Privacy Preserving Data Mi...
This document discusses techniques for privacy-preserving data mining, specifically geometric data perturbation techniques. It begins with an introduction to the need for privacy in data mining due to increased data collection. It then discusses different categories of data perturbation techniques, including additive noise perturbation, condensation-based perturbation, random projection perturbation, and geometric data perturbation. Geometric perturbation consists of random rotation, translation, and distance perturbations of data to preserve privacy while maintaining important geometric properties. The document concludes that geometric perturbation introduces challenges in evaluating privacy but can preserve data quality for classification models.
Outlier Detection in Data Mining An Essential Component of Semiconductor Manu...
Outlier detection is a critical research field within data mining due to its vast range of applications including fraud detection, cybersecurity, health diagnostics, and significantly for the semiconductor manufacturing industry. It refers to identifying data points that significantly deviate from expected patterns, providing crucial insights into different aspects of data. However, the ambiguity between outliers and normal behavior, evolving definitions of 'normal', application-specific techniques, and noisy data mimicking outliers, often complicate the outlier detection process. This review article offers an in-depth analysis of the most advanced outlier detection methods, presenting a thorough understanding of future research prospects.
Similar to Classifying Shooting Incident Fatality in New York project presentation (20)
Predict Your Way to Marketing Success: A Data Science Approach to Optimizing Ad Campaign Performance. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Dive into our project presentation by Pavan Kumar: Data Science Takes the Wheel: Predicting F1 Race Outcomes for Engaging Media Content. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation dives into the methodologies and tools used for predicting power consumption. Tailored for students, it covers the importance of power consumption forecasting, various prediction techniques, data requirements, and practical applications in energy management and sustainability.
For more visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Explore common web application vulnerabilities like CSRF and XSS, and learn how ethical hackers use these techniques to identify and fix security weaknesses responsibly. This presentation will also cover best practices for securing web applications and preventing attacks. For more info visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
This presentation provides an in-depth exploration of various tools available in Kali Linux for conducting website scans through IP addresses. Designed for students, the slides cover the functionality, usage, and practical applications of these tools in cybersecurity and ethical hacking. For more visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Discover the cutting-edge integration of artificial intelligence in facial and biometric authentication systems. This presentation examines the technological advancements, implementation strategies, and security benefits of AI-powered authentication methods. Learn how AI enhances accuracy, speed, and reliability in verifying identities, and explore real-world applications and future trends in biometric security.
For more information visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
This presentation provides an in-depth analysis of HTML injection vulnerabilities in web applications. It explores the mechanisms through which these vulnerabilities are introduced, their potential impacts, and effective mitigation strategies. Through case studies and real-world examples, the report highlights the importance of secure coding practices and regular vulnerability assessments to safeguard web applications from malicious exploits.
For more details visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Explore the comprehensive landscape of government-established cybersecurity standards designed to protect digital environments globally. This presentation delves into key international and national frameworks, sector-specific regulations, and best practices for compliance. Ideal for cybersecurity professionals and policymakers, it offers insights into the strategies and requirements essential for maintaining robust cyber defenses.
In the digital age, cybersecurity has become a critical concern for governments worldwide. This presentation explores various government-established standards and regulations designed to foster a secure cyber environment. It covers international, national, and sector-specific standards that aim to protect sensitive information, ensure data privacy, and combat cyber threats.
For more information visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
In this project presentation, we explore the application of machine learning techniques to detect and predict crime hotspots. By analyzing historical crime data, we aim to identify patterns and trends that can help law enforcement agencies allocate resources more efficiently and proactively address crime-prone areas. Key components of the project include data preprocessing, feature engineering, model selection, and evaluation. The presentation will also cover the implementation of visualization tools to highlight crime hotspots on a map, making the findings easily interpretable for stakeholders. This project demonstrates the potential of data science to enhance public safety and support informed decision-making in crime prevention efforts. For more information visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
This presentation delves into the science behind rain forecasting, exploring various techniques used to predict precipitation patterns. Learn how meteorologists use data, models, and technology to forecast rain and make informed decisions. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation explores product cluster analysis, a data science technique used to group similar products based on customer behavior. It delves into a project undertaken at the Boston Institute, where we analyzed real-world data to identify customer segments with distinct product preferences. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Unveiling website security risks! This presentation delves into the findings of a Boston Institute project focused on website analysis. For more details visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
In today's digital age, website security is paramount. This presentation dives into SQL injection in depth. For more visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Learn about session hijacking, a serious cybersecurity threat where attackers steal or manipulate a user's session token to gain unauthorized access to web applications. This comprehensive guide covers the methods used by attackers, the risks involved, and practical steps you can take to secure your online sessions. Whether you're a cybersecurity professional or a regular internet user, this post provides essential insights to help you stay safe online. For more visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Dive into the world of web security with this comprehensive presentation on solving labs for common web vulnerabilities. This hands-on guide is designed to help you understand and mitigate vulnerabilities such as SQL Injection, Cross-Site Scripting (XSS), Cross-Site Request Forgery (CSRF), and more. Perfect for cybersecurity students, professionals, and enthusiasts, this presentation provides practical exercises, detailed explanations, and real-world examples to enhance your web security skills. Equip yourself with the knowledge to protect your web applications from the most prevalent threats. For more details visit https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Explore the critical aspects of session hijacking in this informative presentation. Learn how attackers exploit session vulnerabilities to gain unauthorized access to user accounts and the effective strategies to prevent such breaches. This presentation covers the mechanisms of session hijacking, its impact on security, and best practices for safeguarding your web applications. Ideal for cybersecurity professionals, developers, and IT enthusiasts, this guide will enhance your understanding of online session security. For more visit: https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation explores how K-means clustering can be used to analyze solar production data and identify patterns that can help optimize energy generation. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more.
This presentation dives into the world of data science and explores its application in predicting salary ranges. We'll uncover the secrets hidden within data sets, unveil the power of machine learning algorithms, and shed light on factors that influence salaries in today's job market.
For more, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Airline Satisfaction Project using Azure
This presentation was created as a foundation for understanding and comparing data science/machine learning solutions built in Python notebooks locally and on the Azure cloud, as part of course DP-100 - Designing and Implementing a Data Science Solution on Azure.
### Data Description and Analysis Summary for Presentation
#### 1. **Importing Libraries**
Libraries used:
- `pandas`, `numpy`: Data manipulation
- `matplotlib`, `seaborn`: Data visualization
- `scikit-learn`: Machine learning utilities
- `statsmodels`, `pmdarima`: Statistical modeling
- `keras`: Deep learning models
#### 2. **Loading and Exploring the Dataset**
**Dataset Overview:**
- **Source:** CSV file (`mumbai-monthly-rains.csv`)
- **Columns:**
- `Year`: The year of the recorded data.
- `Jan` to `Dec`: Monthly rainfall data.
- `Total`: Total annual rainfall.
**Initial Data Checks:**
- Displayed first few rows.
- Summary statistics (mean, standard deviation, min, max).
- Checked for missing values.
- Verified data types.
**Visualizations:**
- **Annual Rainfall Time Series:** Trends in annual rainfall over the years.
- **Monthly Rainfall Over Years:** Patterns and variations in monthly rainfall.
- **Yearly Total Rainfall Distribution:** Distribution and frequency of annual rainfall.
- **Box Plots for Monthly Data:** Spread and outliers in monthly rainfall.
- **Correlation Matrix of Monthly Rainfall:** Relationships between different months' rainfall.
#### 3. **Data Transformation**
**Steps:**
- Ensured 'Year' column is of integer type.
- Created a datetime index.
- Converted monthly data to a time series format.
- Created lag features to capture past values.
- Generated rolling statistics (mean, standard deviation) for different window sizes.
- Added seasonal indicators (dummy variables for months).
- Dropped rows with NaN values.
**Result:**
- Transformed dataset with additional features ready for time series analysis.
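As a rough illustration, the transformation steps above might look like the following pandas sketch; the file and column names come from the dataset description, while the lag counts and window sizes are illustrative assumptions.

```python
import pandas as pd

# Load the monthly rainfall table described above.
df = pd.read_csv("mumbai-monthly-rains.csv")
df["Year"] = df["Year"].astype(int)  # ensure 'Year' is integer

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Reshape the wide monthly columns into a single datetime-indexed series.
long = df.melt(id_vars="Year", value_vars=months,
               var_name="Month", value_name="Rainfall")
long["Date"] = pd.to_datetime(long["Year"].astype(str) + long["Month"],
                              format="%Y%b")
ts = long.sort_values("Date").set_index("Date")["Rainfall"]

feats = pd.DataFrame({"Rainfall": ts})
for lag in (1, 2, 3, 12):            # lag features (lags are assumptions)
    feats[f"lag_{lag}"] = ts.shift(lag)
for w in (3, 12):                    # rolling statistics (windows are assumptions)
    feats[f"roll_mean_{w}"] = ts.rolling(w).mean()
    feats[f"roll_std_{w}"] = ts.rolling(w).std()

# Seasonal indicators: dummy variables for months.
feats = feats.join(pd.get_dummies(pd.Series(ts.index.month, index=ts.index),
                                  prefix="m"))
feats = feats.dropna()               # drop rows with NaN from lags/rolling windows
```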
#### 4. **Data Splitting**
**Procedure:**
- Split the data into features (`X`) and target (`y`).
- Further split into training (80%) and testing (20%) sets without shuffling to preserve time series order.
**Result:**
- Training set: `(X_train, y_train)`
- Testing set: `(X_test, y_test)`
#### 5. **Automated Hyperparameter Tuning**
**Tool Used:** `pmdarima`
- Automatically selected the best parameters for the SARIMA model.
- Evaluated using metrics such as AIC and BIC.
**Output:**
- Best SARIMA model parameters and statistical summary.
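A minimal sketch of this tuning step with `pmdarima`, assuming a monthly series `ts` like the one built above; the exact search settings here are assumptions, not the notebook's actual call.

```python
import pmdarima as pm

# Stepwise search over SARIMA orders with yearly seasonality (m=12).
model = pm.auto_arima(
    ts,
    seasonal=True, m=12,             # monthly data, 12-month season
    stepwise=True,                   # faster stepwise search (assumed)
    suppress_warnings=True,
    information_criterion="aic",     # candidates compared by AIC (BIC also reported)
)
print(model.summary())               # best (p,d,q)(P,D,Q,12) plus AIC/BIC
```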
#### 6. **SARIMA Model**
**Steps:**
- Fit the SARIMA model using the training data.
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Indicates accuracy on training data.
- **Test MAE:** Indicates accuracy on unseen data.
- **Train RMSE:** Measures average error magnitude on training data.
- **Test RMSE:** Measures average error magnitude on testing data.
#### 7. **LSTM Model**
**Preparation:**
- Reshaped data for LSTM input.
- Converted data to `float32`.
**Model Building and Training:**
- Built an LSTM model with one LSTM layer and one Dense layer.
- Trained the model on the training data.
**Evaluation:**
- Evaluated on both training and testing sets using MAE and RMSE.
**Output:**
- **Train MAE:** Accuracy on training data.
- **Test MAE:** Accuracy on unseen data.
- **Train RMSE:** Average error magnitude on training data.
- **Test RMSE:** Average error magnitude on testing data.
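The LSTM step might be sketched as below, assuming the `(X_train, y_train)`/`(X_test, y_test)` split from step 4; the layer width and training settings are assumptions.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Reshape to (samples, timesteps, features) and cast to float32.
X_tr = X_train.to_numpy().astype("float32").reshape(len(X_train), 1, X_train.shape[1])
X_te = X_test.to_numpy().astype("float32").reshape(len(X_test), 1, X_test.shape[1])
y_tr = y_train.to_numpy().astype("float32")
y_te = y_test.to_numpy().astype("float32")

# One LSTM layer plus one Dense output layer, as described above.
model = Sequential([
    LSTM(50, input_shape=(1, X_train.shape[1])),  # 50 units is an assumption
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_tr, y_tr, epochs=50, batch_size=16, verbose=0)

# Report MAE and RMSE on both splits.
for name, X_, y_ in [("Train", X_tr, y_tr), ("Test", X_te, y_te)]:
    pred = model.predict(X_, verbose=0).ravel()
    print(name, "MAE:", mean_absolute_error(y_, pred),
          "RMSE:", np.sqrt(mean_squared_error(y_, pred)))
```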
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, scaling relational database workloads in Aurora beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
Classifying Shooting Incident Fatality in New York project presentation
2. Classifying Shooting Incident Fatality in New York City
Leveraging machine learning for predicting shooting incident fatalities.
Presented by Indhu Reddy
3. Introduction
The public safety sector is evolving rapidly, influenced by technological advancements,
changing urban dynamics, and a growing need for data-driven decision-making.
Shooting incidents, particularly the fatal ones, pose significant challenges and
opportunities for law enforcement agencies. When a shooting incident results in a fatality,
it has a profound impact on community safety, public trust, and the strategic allocation of
police resources.
Machine learning, with its predictive capabilities, offers a transformative approach to
understanding and mitigating the challenges posed by shooting incidents.
Through data-driven insights and predictive modeling, this presentation aims to showcase my Machine
Learning Capstone Project focused on predicting shooting incident fatality in New York City.
4. Why Public Safety Domain?
The public safety sector is a unique blend of community well-being, technology, and regulatory
frameworks, presenting its own set of distinct challenges and opportunities. I chose the Public
Safety Domain for my Capstone Project because:
Community Impact: Public safety directly affects the quality of life in communities.
Understanding and predicting incidents can help save lives and enhance community trust.
Confidentiality: Handling sensitive incident data requires utmost care. Ensuring data privacy
and security while analyzing it is a complex but crucial task.
Diverse Incidents: Public safety incidents vary widely. Developing models to manage and
predict such a diverse range of incidents adds another layer of complexity.
5. Project’s Significance and Benefits to Law Enforcement
1 Enhanced Community Safety:
Anticipating fatal incidents allows for proactive strategies,
improving response times and community safety.
2 Resource Optimization:
Predicting and mitigating fatal incidents is cost-effective,
ensuring efficient allocation and utilization of police resources.
3 Risk Mitigation:
Identifying potential fatal incidents mitigates risks, enabling
preventive measures and strategic interventions.
4 Market Competitiveness:
Predictive incident management positions law enforcement
agencies as proactive and community-centric, offering a
competitive edge through improved public trust and safety.
5 Long-Term Community Trust:
By predicting and addressing fatal incidents, my project contributes
not only to public safety but also to the broader objectives of law
enforcement, fostering community trust and ensuring long-term
societal well-being.
6. Dataset Information
Here are the key details about the dataset used in this project:
| Column Name | Description |
|---|---|
| INCIDENT_KEY | Unique identifier for each incident |
| OCCUR_DATE | Date of the incident |
| OCCUR_TIME | Time of the incident |
| BORO | Borough where the incident occurred |
| LOC_OF_OCCUR_DESC | Description of the location of occurrence |
| PRECINCT | Police precinct where the incident was reported |
| JURISDICTION_CODE | Jurisdiction code for the incident |
| LOC_CLASSFCTN_DESC | Location classification description |
| LOCATION_DESC | Detailed location description |
| STATISTICAL_MURDER_FLAG | Indicator of whether the incident was a murder |
| PERP_AGE_GROUP | Age group of the perpetrator |
| PERP_SEX | Sex of the perpetrator |
| PERP_RACE | Race of the perpetrator |
| VIC_AGE_GROUP | Age group of the victim |
| VIC_SEX | Sex of the victim |
| VIC_RACE | Race of the victim |
| X_COORD_CD | X-coordinate of the incident location |
| Y_COORD_CD | Y-coordinate of the incident location |
| Latitude | Latitude of the incident location |
| Longitude | Longitude of the incident location |
| Lon_Lat | Combined longitude and latitude |
Features/Columns: The dataset is
characterized by a diverse set of features,
each providing valuable insights into
shooting incidents, their locations, and
outcomes. In total, there are 21
features/columns that form the basis of our
predictive modeling.
Number of records: Our dataset
comprises a robust collection of data,
consisting of over 23,000 records. Each
record represents a unique shooting
incident, contributing to the richness and
depth of our analysis.
Source of the Data: The dataset is sourced
from the New York Police Department
(NYPD), provided by the institute, ensuring
reliability and relevance. The data's origin
plays a crucial role in shaping the context
and ensuring that our analysis is grounded
in real-world scenarios and industry
dynamics.
7. Preprocessing
1. Initial Data Cleaning
First, we made sure there were no null values and duplicates in
the dataset
Null Values: Verified that there were no null values in the
dataset.
Duplicates: Ensured there were no duplicate records,
maintaining the integrity of our data.
2. Feature Evaluation
Column Relevance: We evaluated all columns to
determine their usefulness for our analysis.
Dropped Columns: Columns like “INCIDENT_KEY” and
“LOC_OF_OCCUR_DESC” weren't contributing much to
the predictions, so we decided to drop them during
preprocessing.
3. Handling Categorical Variables
Categorical to Numerical:
The "STATISTICAL_MURDER_FLAG" column was a categorical variable. We converted categorical features into numerical format using label encoding to make them compatible with our model.
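A minimal pandas/scikit-learn sketch of these preprocessing steps, assuming the data sits in a CSV (the file name here is hypothetical); the column names come from the dataset table above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("nypd_shooting_incidents.csv")  # hypothetical file name

# 1. Initial data cleaning: verify nulls, drop any duplicate records.
print(df.isnull().sum())      # confirm there are no null values
df = df.drop_duplicates()

# 2. Drop columns that contributed little to the predictions.
df = df.drop(columns=["INCIDENT_KEY", "LOC_OF_OCCUR_DESC"])

# 3. Convert the categorical target to numerical form via label encoding.
le = LabelEncoder()
df["STATISTICAL_MURDER_FLAG"] = le.fit_transform(df["STATISTICAL_MURDER_FLAG"])
```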
8. Exploratory Data Analysis (EDA)
EDA Insights (Visualizations):
Visualizations were essential in providing a clear representation of the data. They offered insights into patterns and helped identify
factors contributing to fatal shooting incidents.
• Feature Distribution: Analyzed the distribution of features to understand their characteristics.
• Correlation Analysis: Highlighted correlations between features using Heatmaps
• PCA Scatter Plot: Visualized the data before and after removal of outliers.
By performing a thorough EDA, we ensured our dataset was ready for predictive modeling, providing a solid foundation for developing
our machine learning model.
Columns Worked With:
PRECINCT, JURISDICTION_CODE, STATISTICAL_MURDER_FLAG, X_COORD_CD, Y_COORD_CD, Latitude, Longitude
9. Visualizations
• Feature Distribution: The histograms display the
distribution of key numerical columns in the dataset,
specifically 'PRECINCT', 'X_COORD_CD', and
'Y_COORD_CD'.
• PRECINCT: This histogram shows the distribution of
shooting incidents across different police precincts in New
York City. It helps in identifying precincts with higher or
lower frequencies of incidents.
• X_COORD_CD and Y_COORD_CD: These histograms
illustrate the distribution of the geographical coordinates of
shooting incidents. They provide insights into the spatial
spread and concentration of incidents based on their x and
y coordinates.
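A matplotlib sketch of these histograms, assuming the cleaned DataFrame `df` from the preprocessing sketch above:

```python
import matplotlib.pyplot as plt

# Histograms of the key numerical columns named on the slide.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["PRECINCT", "X_COORD_CD", "Y_COORD_CD"]):
    ax.hist(df[col].dropna(), bins=50)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
plt.tight_layout()
plt.show()
```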
10. Visualizations
Boxplots are essential for visualizing the distribution of numerical features and identifying outliers within the dataset.
Outliers can significantly affect the performance of machine learning models, so it is crucial to detect and handle them
appropriately.
Observations
PRECINCT: The precinct column shows a fairly even distribution with no significant outliers.
JURISDICTION_CODE: The jurisdiction code column has a few noticeable outliers which could indicate special cases or
anomalies in the dataset.
X_COORD_CD and Y_COORD_CD: The X coordinate column shows a large number of outliers, which might indicate data entry errors or rare but valid occurrences, whereas no outliers are seen in the Y coordinate column.
Latitude and Longitude: The longitude column also displays several outliers, which could be due to incorrect data entries or actual rare geographical points, whereas no outliers are seen in the latitude column.
11. Visualizations
By leveraging correlation analysis through a heatmap, we gain valuable insights into the interrelationships between
features, guiding us in building a more robust and accurate predictive model.
Observations
1. PRECINCT and Y_COORD_CD/Latitude: There is a strong negative correlation between PRECINCT and
Y_COORD_CD/Latitude, indicating that certain precincts are more associated with specific latitude positions.
2. X_COORD_CD and Longitude: These features have a perfect positive correlation, as expected, since they represent
the same spatial dimension.
3. Other Features: Most other features show low or moderate correlations with each other, suggesting that they
provide unique information to the model.
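A short seaborn sketch that reproduces this kind of heatmap over the columns worked with, assuming the cleaned DataFrame `df` from the preprocessing sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["PRECINCT", "JURISDICTION_CODE", "STATISTICAL_MURDER_FLAG",
        "X_COORD_CD", "Y_COORD_CD", "Latitude", "Longitude"]

# Pairwise correlations between the numerical features, annotated per cell.
plt.figure(figsize=(8, 6))
sns.heatmap(df[cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix of Numerical Features")
plt.tight_layout()
plt.show()
```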
12. Visualizations
PCA (Principal Component Analysis): PCA was used to reduce the dimensionality of the dataset to two principal
components (PCA1 and PCA2) for easy visualization.
Visualization: The scatter plot visualizes the data points in the new PCA space, highlighting outliers in red and
inliers in blue.
Observations
Before Removal: The left plot shows the dataset with outliers included; red points represent the detected outliers, while blue points represent the inliers.
- Outliers can be observed scattered around the inliers, indicating potential anomalies or errors in the data.
After Removal: The right plot shows the dataset after removing the outliers.
- The cleaned data (in blue) appears more compact and consistent, with fewer scattered points, indicating a more
reliable dataset for model training.
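The deck does not name the outlier detector used, so the sketch below stands in with scikit-learn's IsolationForest on a numerical feature matrix `X`; only the two-component PCA projection and the red/blue before-and-after plots follow the slide.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Project the features onto two principal components for visualization.
X_2d = PCA(n_components=2).fit_transform(X)

# Stand-in detector (the slide does not specify one): -1 marks outliers.
labels = IsolationForest(random_state=42).fit_predict(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_2d[labels == 1, 0], X_2d[labels == 1, 1],
                c="blue", s=5, label="inliers")
axes[0].scatter(X_2d[labels == -1, 0], X_2d[labels == -1, 1],
                c="red", s=5, label="outliers")
axes[0].set_title("Before removal")
axes[0].legend()

clean = X_2d[labels == 1]            # keep only the inliers
axes[1].scatter(clean[:, 0], clean[:, 1], c="blue", s=5)
axes[1].set_title("After removal")
plt.show()
```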
13. Train-Test Split
Splitting the Data into Training and Testing Sets
In this step, we partitioned the dataset into two components: X and y.
Variable X: This includes all the independent variables or features that contribute to our predictions. It encapsulates the
input data for the model.
Variable y: This represents the dependent variable or target variable, which is the outcome we aim to predict. It
encapsulates the output data for the model.
Splitting the data into X and y
To evaluate the performance of our model, we split the dataset into training data and testing data.
Split Ratio: We used an 80:20 split, meaning 80% of our data is used as training data and 20% is used as testing
data. This means the test size was set to 0.2.
Random State: We used a random state of 42 to ensure the reproducibility of our results across different runs. This
means that every time we run the code, we get the same split, ensuring consistency in our evaluations.
Stratify: We used stratify=y to ensure that our target variable (y) is distributed proportionally in both the training
and testing sets.
14. Standard Scaler
Scaling Numerical Features:
To ensure consistent scales for numerical features, we employed Standard Scaler during
preprocessing.
This helped in normalizing the features, ensuring they contribute equally to the model's
predictions.
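These two steps map directly onto scikit-learn; the split parameters below are the ones stated above, and the feature list assumes the numerical columns named in the EDA slide.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Numerical feature columns from the EDA slide; target is the murder flag.
num_cols = ["PRECINCT", "JURISDICTION_CODE",
            "X_COORD_CD", "Y_COORD_CD", "Latitude", "Longitude"]
X = df[num_cols]
y = df["STATISTICAL_MURDER_FLAG"]

# 80:20 split, reproducible, with the target distributed proportionally.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```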
15. Applying Machine Learning Algorithms
This is a binary classification problem; the models used are:
Logistic Regression
When faced with binary classification problems, logistic regression can be used to model the probability of a binary response depending on any number of explanatory variables. For this dataset, it forecasts whether an incident is statistically a murder based on certain features.
Decision Tree
Classification and regression tasks can be accomplished using decision trees, which learn simple decision rules from data features. For this dataset, it identifies the rules and patterns that link features to outcomes in shooting cases.
Random Forest
Random Forest is an ensemble method that combines multiple decision trees for better classification accuracy and to avoid overfitting. For this dataset, its results are more accurate than a single decision tree because it aggregates the results of multiple random trees.
Support Vector Machine (SVM)
For classification purposes, SVM computes the best hyperplane that separates the classes in feature space. For this dataset, the goal is to maximize the margin between the different classes of incidents.
Naive Bayes
Naive Bayes is a classification algorithm based on Bayes' theorem with an assumption of independence between predictors. For this dataset, it provides a basic yet powerful probabilistic classifier for determining the class label.
Gradient Boosting
Gradient Boosting is an ensemble technique that builds models sequentially to correct the errors of the previous models, enhancing accuracy. For this dataset, it incrementally improves classification performance by focusing on difficult-to-classify incidents.
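A sketch of fitting the listed models (plus KNN, which appears in the evaluation tables) and collecting the four reported metrics; default hyperparameters are an assumption, since the deck does not state them.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the scaled training data and score it on the test set.
for name, clf in models.items():
    clf.fit(X_train_scaled, y_train)
    pred = clf.predict(X_test_scaled)
    print(f"{name}: "
          f"acc={accuracy_score(y_test, pred):.6f} "
          f"prec={precision_score(y_test, pred, zero_division=0):.6f} "
          f"rec={recall_score(y_test, pred, zero_division=0):.6f} "
          f"f1={f1_score(y_test, pred, zero_division=0):.6f}")
```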
16. Evaluation Metrics
Before Removal of Outliers
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.814469 | 0.000000 | 0.000000 | 0.000000 |
| Decision Tree | 0.710623 | 0.170732 | 0.145114 | 0.156884 |
| Random Forest | 0.798535 | 0.219355 | 0.033564 | 0.058219 |
| SVM | 0.814469 | 0.000000 | 0.000000 | 0.000000 |
| KNN | 0.775275 | 0.209239 | 0.076012 | 0.111513 |
| Gradient Boosting | 0.813919 | 0.200000 | 0.000987 | 0.001965 |
| Naive Bayes | 0.813370 | 0.000000 | 0.000000 | 0.000000 |
17. Evaluation Metrics
After Removal of Outliers
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic Regression | 0.8175 | 0.0000 | 0.0000 | 0.0000 |
| Decision Tree | 0.7548 | 0.2496 | 0.1711 | 0.2030 |
| Random Forest | 0.7764 | 0.2956 | 0.1626 | 0.2098 |
| Gradient Boosting | 0.8177 | 1.0000 | 0.0011 | 0.0021 |
| SVM | 0.8175 | 0.0000 | 0.0000 | 0.0000 |
| Naive Bayes | 0.8175 | 0.0000 | 0.0000 | 0.0000 |
| KNN | 0.7818 | 0.2559 | 0.1024 | 0.1463 |
18. Changes in the Metrics
Accuracy: The accuracy of most models increased slightly after removing outliers; Random Forest was the exception, decreasing slightly.
Precision: The precision of the Decision Tree, Random Forest, and KNN models improved after removing outliers, while Gradient Boosting's precision jumped to 1.0000 (on very few positive predictions).
Recall: The recall of most models increased after removing outliers, with Decision Tree and Random Forest showing the largest improvements.
F1-Score: The F1-score of most models increased after removing outliers, with Decision Tree and Random Forest showing the largest improvements.
19. Explanation of Model Selection
Why Not Use Accuracy Alone?
Logistic Regression, Support Vector Machine, Naive Bayes:
Accuracy: 0.8175; Precision, Recall, F1 Score: 0.0000
The accuracy is high, but these models fail to predict the positive class at all, leading to zero
precision, recall, and F1 score. This suggests that these models might be predicting all instances
as the negative class, which can still yield high accuracy if the dataset is imbalanced (i.e., the
negative class is much more frequent than the positive class).
Gradient Boosting:
Accuracy: 0.8177
Precision: 1.0000 (likely due to predicting very few positives correctly)
Recall: 0.0011
F1 Score: 0.0021
Gradient Boosting has slightly higher accuracy, but it also has a very low F1 Score, indicating
that its predictions for the positive class are almost negligible.
Importance of the F1 Score
The F1 Score is particularly useful in the context of imbalanced datasets as it balances precision
and recall. A high F1 Score indicates that the model is performing well in predicting both the
positive and negative classes.
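This failure mode is easy to demonstrate with a trivial baseline: a classifier that always predicts the majority (non-fatal) class scores roughly the same accuracy while finding no positives at all. A sketch, assuming the split and scaled features from above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Majority-class baseline: never predicts a fatal incident.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_scaled, y_train)
pred = baseline.predict(X_test_scaled)

print("accuracy:", accuracy_score(y_test, pred))       # ~0.82 given the imbalance
print("f1:", f1_score(y_test, pred, zero_division=0))  # 0.0: no positives predicted
```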
20. Model Selection Considerations
Random Forest Selection
Random Forest:
Accuracy: 0.7764
Precision: 0.2956
Recall: 0.1626
F1 Score: 0.2098
Random Forest does not have the highest accuracy, but it has the highest F1 Score among the
models, indicating a better balance between precision and recall. This suggests that Random
Forest is more capable of identifying the positive class correctly compared to the other models,
making it more reliable for practical use.
Conclusion
The model selection is based on the F1 Score because it provides a more holistic view of the model's
performance in scenarios where the dataset is imbalanced. Random Forest was chosen as the best
model because it has the highest F1 Score, indicating better performance in predicting the positive
class compared to other models that might be overfitting to the majority class.
Choosing a model based solely on accuracy can be misleading in such scenarios, leading to models
that do not effectively address the problem of interest.
21. Technical Implementation
Model Inference Pipeline:
• The predict function loads the trained machine learning model and scales input data for prediction.
• Predictions classify whether a shooting incident is a murder based on the provided coordinates.
User-Friendly Interface:
• Using Gradio, we created an accessible interface for users to input coordinates and receive predictions.
• The interface includes inputs for X_COORD_CD, Y_COORD_CD, Latitude, and Longitude, and outputs a text classification.
• The tool is designed to be shared and utilized easily, promoting wider adoption and usage.
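A minimal Gradio sketch of the described interface; the saved model and scaler file names are hypothetical, and it assumes the deployed model was trained on exactly the four coordinate features shown on the slide.

```python
import gradio as gr
import joblib
import numpy as np

# Hypothetical artifact names for the trained model and fitted scaler.
model = joblib.load("random_forest_model.joblib")
scaler = joblib.load("scaler.joblib")

def predict(x_coord, y_coord, latitude, longitude):
    # Scale the inputs the same way as during training, then classify.
    features = scaler.transform(
        np.array([[x_coord, y_coord, latitude, longitude]]))
    label = model.predict(features)[0]
    return "Murder" if label == 1 else "Not a murder"

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Number(label="X_COORD_CD"), gr.Number(label="Y_COORD_CD"),
            gr.Number(label="Latitude"), gr.Number(label="Longitude")],
    outputs=gr.Textbox(label="Classification"),
    title="Shooting Incident Fatality Classifier",
)
demo.launch(share=True)  # share=True makes the tool easy to distribute
```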
22. Conclusion
Prediction platform:
By integrating Machine Learning with user-friendly tools, we provide valuable insights and
proactive solutions for public safety. This project exemplifies the power of predictive analytics in
addressing complex societal issues and underscores the importance of data-driven strategies in
enhancing operational efficiency and public safety.
By implementing such a solution, we can significantly contribute to making our cities safer through advanced analytics and innovative technology solutions.