Group 21: Ahmed Tamim Sharif | Md Rafat Jitu | Md Istain Ahmed
Objective & Datasets
Objective: To integrate messy, real-world hockey data from multiple sources and
prepare a clean dataset for machine learning.
Datasets Used:
Identity Cards (TSV & CSV)
Performance Metrics (TSV)
Scouting Notes (CSV)
Medical Information (Excel)
Moms' Notes (JSON) & Contracts (Pickle)
Data Integration (Merging Strategy)
Integration Process:
First, we concatenated the two separate identity files into one identity table.
Then, we used international_id as the primary key to perform
left joins from the identity table to the remaining datasets.
Challenge Addressed: Integrated five file formats (JSON, Pickle, Excel, TSV, CSV)
into a single master dataframe without dropping any player from the identity table.
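The integration steps above could look roughly like the following pandas sketch. The small in-memory frames stand in for the real files, and every column name other than international_id is a hypothetical placeholder, not the team's actual schema.

```python
import pandas as pd

# Toy stand-ins for the real files; only international_id comes from the slides.
identity_a = pd.DataFrame({"international_id": [1, 2], "name": ["A", "B"]})
identity_b = pd.DataFrame({"international_id": [3], "name": ["C"]})
performance = pd.DataFrame({"international_id": [1, 3], "goals": [10, 7]})
medical = pd.DataFrame({"international_id": [1, 2, 3], "height_cm": [180, 185, 178]})

# Step 1: stack the two identity files into one table.
identity = pd.concat([identity_a, identity_b], ignore_index=True)

# Step 2: left-join each remaining table on the shared key.
master = identity
for table in (performance, medical):
    master = master.merge(table, on="international_id", how="left")

# Every identity row survives; players missing from a table get NaN there.
print(master)
```

A left join only preserves the row count when the key is unique in each right-hand table. Passing validate="one_to_one" to merge is one way to make pandas raise an error if that assumption does not hold.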
Data Cleaning - The Roman Numeral Issue
Problem: In `identity_card_0`, players' IDs were recorded in Roman numerals (e.g.,
XXII, L).
Solution: Built a custom roman_to_int function to
parse and convert all Roman numerals into standard integers.
Impact: This was a critical step; without it, merging the identity table with the
performance and medical tables would have been impossible.
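The slides name a custom roman_to_int function but do not show it. The version below is a minimal sketch of one common way to write such a converter, not necessarily the team's implementation; it assumes the input is a well-formed Roman numeral.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a well-formed Roman numeral such as 'XXII' to an integer."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one
    # already seen to its right is subtracted (e.g. the I in IV).
    for symbol in reversed(numeral.strip().upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total

print(roman_to_int("XXII"), roman_to_int("L"))
```

Applied to the ID column of identity_card_0 (for example with pandas' Series.map), this would yield integer IDs that can be compared with the keys in the other tables.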
Advanced Cleaning - The Age Anomaly
Observation: Statistical summary revealed impossible age values, with some players
aged over 100 (e.g., 250, 300+).
Logical Deduction: These extreme outliers were most likely entered in months
instead of years (for example, 300 months is 25 years).
Action Taken: Applied a logical condition: if Age > 100,
divide by 12. This brought the age distribution back to a realistic range (18–39 years).
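The rule above can be sketched in pandas as follows. The column name age and the sample values are illustrative assumptions; the threshold of 100 and the division by 12 come from the slide.

```python
import pandas as pd

# Hypothetical sample: two plausible ages and two values recorded in months.
df = pd.DataFrame({"age": [24.0, 250.0, 31.0, 300.0]})

# Values above 100 are assumed to be months; convert only those rows to years.
in_months = df["age"] > 100
df.loc[in_months, "age"] = df.loc[in_months, "age"] / 12

print(df["age"].describe())
```

Using a boolean mask leaves the already valid ages untouched, so the correction cannot distort rows that were entered in years.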
Handling "Unknown" Data (Crucial Step)
Observation: Approximately 2,257 players were missing entirely from the performance
records.
Strategic Decision: Split the dataset into Known Data (with
targets) and Unknown Data (without targets).
Why? Imputing the missing target values with 0 or the median would have
introduced substantial noise and bias, damaging our Phase 2 prediction models.
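A minimal sketch of the split described above, assuming the merged frame is called master and that a missing value in a target column (here the hypothetical column goals) marks a player with no performance record:

```python
import pandas as pd

# Hypothetical merged frame: NaN in the target means no performance record.
master = pd.DataFrame({
    "international_id": [1, 2, 3, 4],
    "goals": [10.0, None, 7.0, None],
})

has_target = master["goals"].notna()
known = master[has_target].copy()     # usable for training and evaluation
unknown = master[~has_target].copy()  # set aside; targets are not imputed

print(len(known), len(unknown))
```

Keeping the two subsets separate means no synthetic target values ever enter the training data.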
Visualization 1 - Total Goals by Position
Insight 1: Surprisingly, players in the Defense position
scored the highest total number of goals (300,000).
Insight 2: Right Wing, Center, and Left Wing show similar
goal-scoring totals (around 150,000 each).
Significance: 'Position' proves to be a highly influential feature for future
predictive modeling.
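The totals behind a chart like this can be computed with a group-by. The sketch below uses made-up rows and assumed column names (position, goals), so its numbers are not the project's results.

```python
import pandas as pd

# Illustrative rows only; not the project's data.
known = pd.DataFrame({
    "position": ["Defense", "Center", "Defense", "Left Wing"],
    "goals": [12, 8, 5, 9],
})

# Sum goals per position and sort from highest to lowest total.
goals_by_position = (
    known.groupby("position")["goals"].sum().sort_values(ascending=False)
)
print(goals_by_position)
```

Because a total grows with the number of players in each group, comparing it against a per-player mean (replacing sum with mean) may help show whether a high total reflects scoring rate or simply group size.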
Visualization 2 - Age vs. Shot Speed
Insight 1: Active players' ages range from 18 to 39 years.
Insight 2: The highest shot speeds are predominantly clustered among players aged
22 to 32.
Insight 3: We observe a slight decline in maximum shot speed once players pass
the age of 35.
Visualization 3 - Correlation Matrix
Insight 1: Correlation values between variables are extremely low (e.g., -0.023, 0.0079).
Insight 2: This suggests that no single metric, or pair of metrics, explains player
performance on its own; it appears to be a combined result of multiple factors.
Modeling Strategy: Because the variables show no strong linear relationships, we
will use non-linear ML models (e.g., Random Forest)
in Phase 2.
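A near-zero Pearson correlation rules out a linear relationship only, which is the reasoning behind preferring non-linear models. The toy example below (synthetic values, not project data) shows a variable that is fully determined by another yet has zero linear correlation with it.

```python
import pandas as pd

# Synthetic example: y depends entirely on x, but not linearly.
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Pearson correlation measures linear association only.
r = x.corr(y)
print(round(r, 6))
```

For the full matrix, DataFrame.corr() computes the same pairwise statistic across all numeric columns.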
Conclusion & Next Steps
Phase 1 Summary: We successfully integrated 7 messy datasets, cleaned logical
outliers, safely isolated unknown data, and performed targeted imputation.
Status: The dataset has been cleaned of the issues identified above and is
model-ready.
Next Steps (Phase 2): We will utilize this high-quality dataset to train machine
learning algorithms to predict future hockey talent.