khalid's e-home
welcome to my virtual abode 👾
🇸🇩♟️🌞🦜🥩💻🏋️🍓🐣🪴🐼🏃🏝️🇪🇬✈️🍳🐈🫐📸🍯📚
about
hi, my name is khalid 👋 i'm a 2nd year mathematics & computer science undergraduate at the university of edinburgh, although i grew up in london. i'm interested in data science, health, islam, bettering myself, and reading – my favourite book is meditations by marcus aurelius. here's a link to my goodreads profile that i recently set up if ur interested 🙂
machine learning / ai
to preface, 2 datasets were provided: athlete_events.csv and teams.csv - they can be found on kaggle
data preparation
import the pandas library:
import pandas as pd
load the dataset into a dataframe:
athletes = pd.read_csv("athlete_events.csv")
display the first few rows of the dataframe:
athletes.head()
filter the dataset to only include summer olympics data:
athletes = athletes[athletes["Season"] == "Summer"]
define a function to summarise each team's data:
def team_summary(data):
    return pd.Series({
        # team code from the first row
        'team': data.iloc[0,:]["NOC"],
        # country name from the last row
        'country': data.iloc[-1,:]["Team"],
        # year from the first row
        'year': data.iloc[0,:]["Year"],
        # number of unique events
        'events': len(data['Event'].unique()),
        # number of rows (athlete entries)
        'athletes': data.shape[0],
        # average age of athletes
        'age': data["Age"].mean(),
        # average height of athletes
        'height': data['Height'].mean(),
        # average weight of athletes
        'weight': data['Weight'].mean(),
        # total number of medals, excluding null values
        'medals': sum(~pd.isnull(data["Medal"]))})
apply the team summary function to each group of NOC and Year:
team = athletes.groupby(["NOC", "Year"]).apply(team_summary)
reset the index to get a clean dataframe:
team = team.reset_index(drop=True)
drop rows with any missing values:
team = team.dropna()
display the data frame:
team
define a function to add columns for previous medals data:
def prev_medals(data):
    # sort data by year in ascending order
    data = data.sort_values("year", ascending=True)
    # previous olympics' medal count
    data["prev_medals"] = data["medals"].shift(1)
    # rolling average of the previous 3 olympics' medal counts
    data["prev_3_medals"] = data["medals"].rolling(3, closed="left", min_periods=1).mean()
    return data
apply the prev medals function to each team:
team = team.groupby(["team"]).apply(prev_medals)
reset the index to get a clean dataframe:
team = team.reset_index(drop=True)
filter the dataframe to include only years after 1960:
team = team[team["year"] > 1960]
round numeric columns to one decimal place:
team = team.round(1)
display data for the team with the code "USA":
team[team["team"] == "USA"]
display the final dataframe:
team
now for the machine learning part
hypothesis:
i predict it's possible to work out how many medals a country will win at the olympics by using a variety of available historical data
the data:
a dataset of how many medals each country won at each olympics; other data, such as the number of athletes competing, age, height & weight, could also prove useful
import the pandas library:
import pandas as pd
load the dataset into a dataframe:
teams = pd.read_csv("teams.csv")
display the dataframe:
teams
select relevant columns from the dataframe:
teams = teams[["team", "country", "year", "athletes", "age", "prev_medals", "medals"]]
calculate and display the correlation of each column with the 'medals' column:
teams.corr(numeric_only=True)["medals"]
import the seaborn library for plotting:
import seaborn as sb
create a scatter plot with a regression line to show the relationship between 'athletes' and 'medals':
sb.lmplot(x='athletes', y='medals', data=teams, fit_reg=True, ci=None)
plot a histogram of the 'medals' column:
teams.plot.hist(y="medals")
display the first 20 rows with any missing values:
teams[teams.isnull().any(axis=1)].head(20)
drop rows with any missing values
teams = teams.dropna()
display the shape of the cleaned dataframe:
teams.shape
split the data into training (years < 2012) and testing (years >= 2012) sets:
train = teams[teams["year"] < 2012].copy()
test = teams[teams["year"] >= 2012].copy()
display the shape of the training set (~80% of the data):
train.shape
display the shape of the testing set (~20% of the data):
test.shape
↑ it's often good practice to stick to a roughly 80/20 split for model training / testing - it helps ensure that the model is trained on a sufficient amount of data while also being evaluated on a separate set to gauge its accuracy and generalisability
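to double-check the proportions of this particular split (a quick sketch using the train and test dataframes above):
print(len(train) / (len(train) + len(test)))  # fraction of rows used for training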
accuracy metric
i'm going to use mean absolute error since it's a good default regression accuracy metric - it's the average of the absolute differences between the actual results and your predictions
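as a tiny illustration of the metric with made-up numbers (not the olympic data):
actual = pd.Series([10, 0, 3])
predicted = pd.Series([12, 1, 3])
(actual - predicted).abs().mean()  # (2 + 1 + 0) / 3 = 1.0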
import the LinearRegression model from scikit-learn:
from sklearn.linear_model import LinearRegression
initialise the LinearRegression model:
reg = LinearRegression()
define predictors for the model:
predictors = ["athletes", "prev_medals"]
fit the model on the training data:
reg.fit(train[predictors], train["medals"])
make predictions on the test data:
predictions = reg.predict(test[predictors])
display the shape of the predictions array:
predictions.shape
add predictions to the test dataframe:
test["predictions"] = predictions
ensure no negative predictions by setting them to 0:
test.loc[test["predictions"] < 0, "predictions"] = 0
round the predictions to the nearest whole number:
test["predictions"] = test["predictions"].round()
import mean_absolute_error from scikit-learn to evaluate model performance:
from sklearn.metrics import mean_absolute_error
calculate the mean absolute error between the actual and predicted medals:
error = mean_absolute_error(test["medals"], test["predictions"])
display the mean absolute error:
error
display statistical summary of the 'medals' column:
teams.describe()["medals"]
display rows where the team is "USA" in the test set:
test[test["team"] == "USA"]
display rows where the team is "IND" in the test set:
test[test["team"] == "IND"]
↑ in terms of the number of medals predicted away from the actual medals attained, it's higher for the USA than India, but in terms of %, our prediction for the USA was more precise. for example, in the case of 2016, our prediction for the USA was ~11% off, but for India it's a whopping ~480%!
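a quick sketch of how those percentages can be worked out per row (pct_error is just an illustrative column name):
test["pct_error"] = (test["predictions"] - test["medals"]).abs() / test["medals"] * 100  # note: teams with 0 actual medals would give infinity here
test[test["team"].isin(["USA", "IND"])][["team", "year", "medals", "predictions", "pct_error"]]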
calculate the absolute error for each prediction:
errors = (test["medals"] - predictions).abs()
calculate mean absolute error by team:
error_by_team = errors.groupby(test["team"]).mean()
calculate mean medals by team:
medals_by_team = test["medals"].groupby(test["team"]).mean()
calculate the ratio of error to medals for each team:
error_ratio = error_by_team / medals_by_team
remove infinite values from the error ratio:
import numpy as np
error_ratio = error_ratio[np.isfinite(error_ratio)]
plot a histogram of the error ratio:
error_ratio.plot.hist()
display the sorted error ratio values:
error_ratio.sort_values()
↑ our prediction seems to fare better for countries that tend to win more medals
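a rough way to sanity-check that claim (just a sketch, reusing the series computed above):
comparison = pd.DataFrame({"error_ratio": error_ratio, "mean_medals": medals_by_team}).dropna()
# a negative rank correlation would suggest the error ratio shrinks as the average medal count grows
comparison.corr(method="spearman")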
define additional predictors including 'height':
predictors = ["athletes", "prev_medals", "height"]
initialise a new LinearRegression model:
reg = LinearRegression()
fit the model with the new predictors:
reg.fit(train[predictors], train["medals"])
make predictions with the new model:
predictions = reg.predict(test[predictors])
import mean_absolute_error from scikit-learn to evaluate model performance:
from sklearn.metrics import mean_absolute_error
add new predictions to the test set:
test["predictions"] = predictions
ensure no negative predictions by setting them to 0:
test["predictions"] = test["predictions"].clip(lower=0)
round the predictions to the nearest whole number:
test["predictions"] = test["predictions"].round()
calculate the mean absolute error with the new model:
error = mean_absolute_error(test["medals"], test["predictions"])
display the mean absolute error with the new model:
print("Mean Absolute Error:", error)
↑ a mean absolute error (MAE) of 3.26 means that, on average, my model's predictions are off by about 3.26 medals from the actual number of medals won
import the RandomForestRegressor model from scikit-learn:
from sklearn.ensemble import RandomForestRegressor
initialise and train the RandomForestRegressor model:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(train[predictors], train["medals"])
get and display feature importance from the RandomForestRegressor model:
importances = rf.feature_importances_
for feature, importance in zip(predictors, importances):
    print(f"{feature}: {importance}")
import seaborn and matplotlib for plotting:
import seaborn as sns
import matplotlib.pyplot as plt
create a scatter plot of 'height' vs. 'medals':
sns.scatterplot(x='height', y='medals', data=teams)
add a title to the scatter plot:
plt.title("Height vs. Medals")
display the scatter plot:
plt.show()
↑ no correlation found between height and medals won across all sports, but perhaps if we filtered for certain sports we could find something interesting?
load the dataset into a dataframe:
athlete_events = pd.read_csv('athlete_events.csv')
display the first few rows of the dataframe:
print(athlete_events.head())
display the column names of the dataframe:
print(athlete_events.columns)
↓ gymnastics has a tendency to favour shorter athletes, so i attempted to filter for gymnastics to see if i could find a correlation
filter the dataframe to include only rows where the sport is gymnastics:
gymnastics_data = athlete_events[athlete_events['Sport'] == 'Gymnastics'].copy()
map the medal names to numeric values:
gymnastics_data['Medal'] = gymnastics_data['Medal'].map({'Gold': 1, 'Silver': 2, 'Bronze': 3})
drop rows with missing values in 'Height' or 'Medal':
gymnastics_data = gymnastics_data.dropna(subset=['Height', 'Medal'])
calculate the correlation between 'Height' and 'Medal':
correlation = gymnastics_data[['Height', 'Medal']].corr().loc['Height', 'Medal']
display the correlation value:
print(f"Correlation between height and medals in gymnastics: {correlation}")
display information about the 'Height' and 'Medal' columns:
print(gymnastics_data[['Height', 'Medal']].info())
↑ wasn't sure if I'd done something wrong here or if there just wasn't enough data, but nonetheless a fun introduction to machine learning!
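one thing that might be worth trying (just a sketch, not part of the original analysis): instead of mapping gold/silver/bronze to 1/2/3, treat "won any medal" as a binary flag and correlate that with height:
# hypothetical alternative check: binary medal flag instead of the ordinal 1/2/3 mapping
gym = athlete_events[athlete_events['Sport'] == 'Gymnastics'].dropna(subset=['Height']).copy()
gym['won_medal'] = gym['Medal'].notnull().astype(int)
print(gym['Height'].corr(gym['won_medal']))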
write-up in progress...
résumé
(my first attempt at creating a résumé; feedback is appreciated!)
i'm currently in search of a data science internship for summer '25, so if you know of any opportunities, i would greatly appreciate any information you could share - thank you very much in advance!
blog
bits and bobs on whatever interests me 🐼
powerlifting pbs
squat:
01/08/24 - 100 kg / 220 lbs
bench:
01/08/24 - 82.5 kg / 182 lbs
deadlift:
01/08/24 - 120 kg / 265 lbs
@ 63.5 kg / 140 lbs
race pbs
1 mile:
05/08/24 - 6:27
5K:
12/07/24 - 22:46 (7:19 / mile average)
10K:
19/06/24 - 49:57 (8:02 / mile average)
half marathon:
pending...
marathon:
also pending 🙂
♔♕♖♗♘♙
i haven't been putting much effort into chess but this changes today (hopefully). starting off at 1000 blitz as of 01/08/2024, which is a nice even number to work from! will update as i progress. goal: 1500 by end of 2024?
here's a link to my chess profile if you'd like to play 🙂 i'm always up to practise
tips to break 1200 elo
1) blunder check:
- step 1: check if your move is giving away a piece.
- step 2: analyse opponent’s last move for any obvious threats.
- step 3: ensure your king is not in danger of checks.
- step 4: check if your queen is in danger.
2) defending the king:
a) delay castling: wait a move or two to assess the opponent's plans.
b) counter-attack: if under attack, consider going on the offensive.
c) counter-attack in the centre: if attacked on the flank, counter in the centre.
d) create a blockade: let opponent push first and then block their pawns.
3) knight forks:
- focus on vulnerable squares: c2, c7, f2, f7.
- plan against forks: if a square is under threat, prepare for additional attacks.
4) backrank mate:
- check for backrank issues: ensure your king has escape squares or avoid backrank mates.
- move the h-pawn: consider moving the h-pawn up one square to provide an escape route.
5) endgame mistakes:
- watch for pawn promotion: pay close attention to pawns that become queens and potential endgame scenarios.
- not planning ahead: don't make pointless one-movers or attacks that risk your position.
📚
currently reading: data science for dummies by lillian pierson. you can find out more on my goodreads profile (though i sometimes forget to update 😅)
healthy food!
smoked salmon + half done eggs on sourdough w/ organic feta cheese + olives
scrambled eggs w/ tomato in pure butter, honey on sourdough, full fat greek yoghurt w/ organic banana, blueberries + kiwi & cinnamon / honey
gf burgers w/ cheddar + chilli, oven baked chips w/ paprika
steak + sausage 🤤
greek yoghurt w/ organic dark chocolate, banana, blueberries, cinnamon, honey + paleo coconut flour cookies
greek yoghurt w/ papaya, pineapple, dark chocolate, honey & cinnamon + fresh orange juice (yes i love greek yoghurt combos)
salmon w/ potato wedges (use extra virgin olive oil in air fryer)
stainless steel / ceramic air fryers are best if you can!
my touch typing progress
August 2021 - 68 wpm average
July 2022 - 80 wpm average
July 2023 - 91 wpm average
August 2024 - 105 wpm average
link to my typeracer profile!
i enjoy playing mental maths games - they're fun
personal bests
(all achieved on default settings)
zetamac: 73
open quant arithmetic game: 67
arcamedics multiplication: 45.40s
arcamedics jet ski addition: 49.98s
arcamedics ducky subtraction: 46.52s
arcamedics drag race division: 51.19s
quantguide addition: 77
quantguide subtraction: 84
quantguide multiplication: 59
quantguide division: 53
try them out for urself and see how u do 🙂 mental maths often comes in handy
🛸☄️🪐🔭🚀👾👽🛰️
the hercules-corona borealis great wall
the hercules-corona borealis great wall, discovered in 2013, is an enormous galaxy supercluster located approximately 10 billion light-years from earth (~9.46 x 10^25 m, i.e. 94,600,000,000,000,000,000,000,000 metres); it is currently the largest known structure in the universe, and challenges our understanding of cosmic scales due to its sheer size and the vast distances involved. to grasp how large the hercules-corona borealis great wall really is, consider the following dimensions:
length: ~94,600,000,000,000,000,000,000,000 metres (~9.46 x 10^25 m)
diameter: ~15,100,000,000,000,000,000,000,000 metres (~1.51 x 10^25 m)
in fact, at its widest, the hercules-corona borealis great wall spans about 1.7% of the diameter of the entire observable universe 😱
here's a comparison of the diameter of the structure to various famous objects:
the sun: diameter ~1.39 x 10^9 metres; comparison: ~1.09 x 10^16 (10,000 trillion) times the diameter of the sun
the earth: diameter ~1.27 x 10^7 metres; comparison: ~1.19 x 10^18 (1 million trillion) times the diameter of the earth
the burj khalifa: height 828 metres; comparison: ~1.82 x 10^22 (10 billion trillion) times the height of the burj khalifa!
a human: height 1.7 metres; comparison: ~8.9 x 10^24 (1000 billion trillion) times the height of a human
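here's a quick python sketch to sanity-check those ratios (using the rough figures quoted above, so the results are only approximate):
# approximate sizes in metres, taken from the figures above
wall_diameter = 1.51e25
objects = {"sun": 1.39e9, "earth": 1.27e7, "burj khalifa": 828, "human": 1.7}
for name, size in objects.items():
    print(f"{name}: {wall_diameter / size:.2e} times larger")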
limitations and possibilities:
• observational limits - our observations are limited by the observable universe, which is about 93 billion light-years (~8.8 x 10^26 m) in diameter; structures larger than what we can currently detect may exist, but they are beyond our observational capabilities
• theoretical structures - there could also be even larger cosmic structures that we just haven't discovered yet due to the limitations of our current technology and observational techniques
contact