khalid's e-home
welcome to my virtual abode 👾
🇸🇩♟️🌞🦜🥩💻🏋️🍓🐣🪴🐼🏃🏝️🇪🇬✈️🍳🐈🫐📸🍯📚
about
hi, my name is khalid 👋 i'm a 2nd year mathematics & computer science undergraduate at the university of edinburgh, although i grew up in london. i'm interested in data science, health, islam, bettering myself, and reading – my favourite book is meditations by marcus aurelius. here's a link to my goodreads profile that i recently set up if ur interested 🙂
machine learning / ai
to preface, 2 datasets were provided: athlete_events.csv and teams.csv - they can be found on kaggle
data preparation
import the pandas library:
import pandas as pd
load the dataset into a dataframe:
athletes = pd.read_csv("athlete_events.csv")
display the first few rows of the dataframe:
athletes.head()
filter the dataset to only include summer olympics data:
athletes = athletes[athletes["Season"] == "Summer"]
define a function to summarise each team's data:
def team_summary(data):
    return pd.Series({
        # team code from the first row
        'team': data.iloc[0,:]["NOC"],
        # country name from the last row
        'country': data.iloc[-1,:]["Team"],
        # year from the first row
        'year': data.iloc[0,:]["Year"],
        # number of unique events
        'events': len(data['Event'].unique()),
        # number of rows (athlete entries)
        'athletes': data.shape[0],
        # average age of athletes
        'age': data["Age"].mean(),
        # average height of athletes
        'height': data['Height'].mean(),
        # average weight of athletes
        'weight': data['Weight'].mean(),
        # total number of medals, excluding null values
        'medals': sum(~pd.isnull(data["Medal"]))})
apply the team summary function to each group of NOC and Year:
team = athletes.groupby(["NOC", "Year"]).apply(team_summary)
reset the index to get a clean dataframe:
team = team.reset_index(drop=True)
drop rows with any missing values:
team = team.dropna()
display the data frame:
team
define a function to add columns for previous medals data:
def prev_medals(data):
    # sort data by year in ascending order
    data = data.sort_values("year", ascending=True)
    # previous olympics' medal count
    data["prev_medals"] = data["medals"].shift(1)
    # rolling average of the previous 3 olympics' medal counts
    data["prev_3_medals"] = data["medals"].rolling(3, closed="left", min_periods=1).mean()
    return data
apply the prev medals function to each team:
team = team.groupby(["team"]).apply(prev_medals)
reset the index to get a clean dataframe:
team = team.reset_index(drop=True)
filter the dataframe to include only years after 1960:
team = team[team["year"] > 1960]
round numeric columns to one decimal place:
team = team.round(1)
display data for the team with the code "USA":
team[team["team"] == "USA"]
display the final dataframe:
team
now for the machine learning part
hypothesis:
i predict it's possible to work out how many medals a country will win at the olympics by using a variety of available historical data
the data:
a dataset of how many medals each country won at each olympics; other data, such as the number of athletes competing, age, height & weight, could also prove useful
import the pandas library:
import pandas as pd
load the dataset into a dataframe:
teams = pd.read_csv("teams.csv")
display the dataframe:
teams
select relevant columns from the dataframe:
teams = teams[["team", "country", "year", "athletes", "age", "prev_medals", "medals"]]
calculate and display the correlation of each column with the 'medals' column:
teams.corr(numeric_only=True)["medals"]
import the seaborn library for plotting:
import seaborn as sb
create a scatter plot with a regression line to show the relationship between 'athletes' and 'medals':
sb.lmplot(x='athletes', y='medals', data=teams, fit_reg=True, ci=None)
plot a histogram of the 'medals' column:
teams.plot.hist(y="medals")
display the first 20 rows with any missing values:
teams[teams.isnull().any(axis=1)].head(20)
drop rows with any missing values
teams = teams.dropna()
display the shape of the cleaned dataframe:
teams.shape
split the data into training (years < 2012) and testing (years >= 2012) sets:
train = teams[teams["year"] < 2012].copy()
test = teams[teams["year"] >= 2012].copy()
display the shape of the training set (~80% of the data):
train.shape
display the shape of the testing set (~20% of the data):
test.shape
↑ it's often good practice to stick to a roughly 80/20 split for model training / testing - it helps ensure that the model is trained on a sufficient amount of data while also being evaluated on a separate set to gauge its accuracy and generalisability
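to double-check the proportions of this particular split (a quick sketch using the train and test dataframes above):
print(len(train) / (len(train) + len(test)))  # fraction of rows used for training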
accuracy metric
i'm going to use mean absolute error since it's a good default regression accuracy metric - it's the average of the absolute differences between the actual results and your predictions
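as a tiny illustration of the metric with made-up numbers (not the olympic data):
actual = pd.Series([10, 0, 3])
predicted = pd.Series([12, 1, 3])
(actual - predicted).abs().mean()  # (2 + 1 + 0) / 3 = 1.0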
import the LinearRegression model from scikit-learn:
from sklearn.linear_model import LinearRegression
initialise the LinearRegression model:
reg = LinearRegression()
define predictors for the model:
predictors = ["athletes", "prev_medals"]
fit the model on the training data:
reg.fit(train[predictors], train["medals"])
make predictions on the test data:
predictions = reg.predict(test[predictors])
display the shape of the predictions array:
predictions.shape
add predictions to the test dataframe:
test["predictions"] = predictions
ensure no negative predictions by setting them to 0:
test.loc[test["predictions"] < 0, "predictions"] = 0
round the predictions to the nearest whole number:
test["predictions"] = test["predictions"].round()
import mean_absolute_error from scikit-learn to evaluate model performance:
from sklearn.metrics import mean_absolute_error
calculate the mean absolute error between the actual and predicted medals:
error = mean_absolute_error(test["medals"], test["predictions"])
display the mean absolute error:
error
display statistical summary of the 'medals' column:
teams.describe()["medals"]
display rows where the team is "USA" in the test set:
test[test["team"] == "USA"]
display rows where the team is "IND" in the test set:
test[test["team"] == "IND"]
↑ in terms of the number of medals predicted away from the actual medals attained, it's higher for the USA than India, but in terms of %, our prediction for the USA was more precise. for example, in the case of 2016, our prediction for the USA was ~11% off, but for India it's a whopping ~480%!
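a quick sketch of how those percentages can be worked out per row (pct_error is just an illustrative column name):
test["pct_error"] = (test["predictions"] - test["medals"]).abs() / test["medals"] * 100  # note: teams with 0 actual medals would give infinity here
test[test["team"].isin(["USA", "IND"])][["team", "year", "medals", "predictions", "pct_error"]]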
calculate the absolute error for each prediction:
errors = (test["medals"] - predictions).abs()
calculate mean absolute error by team:
error_by_team = errors.groupby(test["team"]).mean()
calculate mean medals by team:
medals_by_team = test["medals"].groupby(test["team"]).mean()
calculate the ratio of error to medals for each team:
error_ratio = error_by_team / medals_by_team
remove infinite values from the error ratio:
import numpy as np
error_ratio = error_ratio[np.isfinite(error_ratio)]
plot a histogram of the error ratio:
error_ratio.plot.hist()
display the sorted error ratio values:
error_ratio.sort_values()
↑ our prediction seems to fare better for countries that tend to win more medals
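a rough way to sanity-check that claim (just a sketch, reusing the series computed above):
comparison = pd.DataFrame({"error_ratio": error_ratio, "mean_medals": medals_by_team}).dropna()
# a negative rank correlation would suggest the error ratio shrinks as the average medal count grows
comparison.corr(method="spearman")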
define additional predictors including 'height':
predictors = ["athletes", "prev_medals", "height"]
initialise a new LinearRegression model:
reg = LinearRegression()
fit the model with the new predictors:
reg.fit(train[predictors], train["medals"])
make predictions with the new model:
predictions = reg.predict(test[predictors])
import mean_absolute_error from scikit-learn to evaluate model performance:
from sklearn.metrics import mean_absolute_error
add new predictions to the test set:
test["predictions"] = predictions
ensure no negative predictions by setting them to 0:
test["predictions"] = test["predictions"].clip(lower=0)
round the predictions to the nearest whole number:
test["predictions"] = test["predictions"].round()
calculate the mean absolute error with the new model:
error = mean_absolute_error(test["medals"], test["predictions"])
display the mean absolute error with the new model:
print("Mean Absolute Error:", error)
↑ a mean absolute error (MAE) of 3.26 means that, on average, my model's predictions are off by about 3.26 medals from the actual number of medals won
import the RandomForestRegressor model from scikit-learn:
from sklearn.ensemble import RandomForestRegressor
initialise and train the RandomForestRegressor model:
rf = RandomForestRegressor(n_estimators=100)
rf.fit(train[predictors], train["medals"])
get and display feature importance from the RandomForestRegressor model:
importances = rf.feature_importances_
for feature, importance in zip(predictors, importances):
    print(f"{feature}: {importance}")
import seaborn and matplotlib for plotting:
import seaborn as sns
import matplotlib.pyplot as plt
create a scatter plot of 'height' vs. 'medals':
sns.scatterplot(x='height', y='medals', data=teams)
add a title to the scatter plot:
plt.title("Height vs. Medals")
display the scatter plot:
plt.show()
↑ no correlation found between height and medals won across all sports, but perhaps if we filtered for certain sports we could find something interesting?
load the dataset into a dataframe:
athlete_events = pd.read_csv('athlete_events.csv')
display the first few rows of the dataframe:
print(athlete_events.head())
display the column names of the dataframe:
print(athlete_events.columns)
↓ gymnastics has a tendency to favour shorter athletes, so i attempted to filter for gymnastics to see if i could find a correlation
filter the dataframe to include only rows where the sport is gymnastics:
gymnastics_data = athlete_events[athlete_events['Sport'] == 'Gymnastics'].copy()
map the medal names to numeric values:
gymnastics_data['Medal'] = gymnastics_data['Medal'].map({'Gold': 1, 'Silver': 2, 'Bronze': 3})
drop rows with missing values in 'Height' or 'Medal':
gymnastics_data = gymnastics_data.dropna(subset=['Height', 'Medal'])
calculate the correlation between 'Height' and 'Medal':
correlation = gymnastics_data[['Height', 'Medal']].corr().loc['Height', 'Medal']
display the correlation value:
print(f"Correlation between height and medals in gymnastics: {correlation}")
display information about the 'Height' and 'Medal' columns:
print(gymnastics_data[['Height', 'Medal']].info())
↑ wasn't sure if I'd done something wrong here or if there just wasn't enough data, but nonetheless a fun introduction to machine learning!
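one thing that might be worth trying (just a sketch, not part of the original analysis): instead of mapping gold/silver/bronze to 1/2/3, treat "won any medal" as a binary flag and correlate that with height:
# hypothetical alternative check: binary medal flag instead of the ordinal 1/2/3 mapping
gym = athlete_events[athlete_events['Sport'] == 'Gymnastics'].dropna(subset=['Height']).copy()
gym['won_medal'] = gym['Medal'].notnull().astype(int)
print(gym['Height'].corr(gym['won_medal']))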
write-up in progress...
résumé
(my first attempt at creating a résumé; feedback is appreciated!)
i'm currently in search of a data science internship for summer '25, so if you know of any opportunities, i would greatly appreciate any information you could share - thank you very much in advance!
blog
bits and bobs on whatever interests me 🐼
powerlifting pbs
squat:
01/08/24 - 100 kg / 220 lbs
bench:
01/08/24 - 82.5 kg / 182 lbs
deadlift:
01/08/24 - 120 kg / 265 lbs
@ 63.5 kg / 140 lbs
race pbs
1 mile:
05/08/24 - 6:27
5K:
12/07/24 - 22:46 (7:19 / mile average)
10K:
19/06/24 - 49:57 (8:02 / mile average)
half marathon:
pending...
marathon:
also pending 🙂
♔♕♖♗♘♙
i haven't been putting much effort into chess but this changes today (hopefully). starting off at 1000 blitz as of 01/08/2024, which is a nice even number to work from! will update as i progress. goal: 1500 by end of 2024?
here's a link to my chess profile if you'd like to play 🙂 i'm always up to practise
tips to break 1200 elo
1) blunder check:
- step 1: check if your move is giving away a piece.
- step 2: analyse opponent’s last move for any obvious threats.
- step 3: ensure your king is not in danger of checks.
- step 4: check if your queen is in danger.
2) defending the king:
a) delay castling: wait a move or two to assess the opponent's plans.
b) counter-attack: if under attack, consider going on the offensive.
c) counter-attack in the centre: if attacked on the flank, counter in the centre.
d) create a blockade: let opponent push first and then block their pawns.
3) knight forks:
- focus on vulnerable squares: c2, c7, f2, f7.
- plan against forks: if a square is under threat, prepare for additional attacks.
4) backrank mate:
- check for backrank issues: ensure your king has escape squares or avoid backrank mates.
- move the h-pawn: consider moving the h-pawn up one square to provide an escape route.
5) endgame mistakes:
- watch for pawn promotion: pay close attention to pawns that become queens and potential endgame scenarios.
- not planning ahead: don't make pointless one-movers or attacks that risk your position.
📚
currently reading: data science for dummies by lillian pierson. you can find out more on my goodreads profile (though i sometimes forget to update 😅)
healthy food!
smoked salmon + half done eggs on sourdough w/ organic feta cheese + olives
scrambled eggs w/ tomato in pure butter, honey on sourdough, full fat greek yoghurt w/ organic banana, blueberries + kiwi & cinnamon / honey
gf burgers w/ cheddar + chilli, oven baked chips w/ paprika
steak + sausage 🤤
greek yoghurt w/ organic dark chocolate, banana, blueberries, cinnamon, honey + paleo coconut flour cookies
greek yoghurt w/ papaya, pineapple, dark chocolate, honey & cinnamon + fresh orange juice (yes i love greek yoghurt combos)
salmon w/ potato wedges (use extra virgin olive oil in air fryer)
stainless steel / ceramic air fryers are best if you can!
my touch typing progress
August 2021 - 68 wpm average
July 2022 - 80 wpm average
July 2023 - 91 wpm average
August 2024 - 105 wpm average
link to my typeracer profile!
i enjoy playing mental maths games - they're fun
personal bests
(all achieved on default settings)
zetamac: 73
open quant arithmetic game: 67
arcamedics multiplication: 45.40s
arcamedics jet ski addition: 49.98s
arcamedics ducky subtraction: 46.52s
arcamedics drag race division: 51.19s
quantguide addition: 77
quantguide subtraction: 84
quantguide multiplication: 59
quantguide division: 53
try them out for urself and see how u do 🙂 mental maths often comes in handy
🛸☄️🪐🔭🚀👾👽🛰️
the hercules-corona borealis great wall
the hercules-corona borealis great wall, discovered in 2013, is an enormous galaxy supercluster located approximately 10 billion light-years from earth (~9.46 x 10^25 m, i.e. 94,600,000,000,000,000,000,000,000 metres); it is currently the largest known structure in the universe, and challenges our understanding of cosmic scales due to its sheer size and the vast distances involved. to grasp how large the hercules-corona borealis great wall really is, consider the following dimensions:
length: ~94,600,000,000,000,000,000,000,000 metres (~9.46 x 10^25 m)
diameter: ~15,100,000,000,000,000,000,000,000 metres (~1.51 x 10^25 m)
in fact, at its widest, the hercules-corona borealis great wall spans about 1.7% of the diameter of the entire observable universe 😱
here's a comparison of the diameter of the structure to various famous objects:
the sun: diameter ~1.39 x 10^9 metres; comparison: ~1.09 x 10^16 (10,000 trillion) times the diameter of the sun
the earth: diameter ~1.27 x 10^7 metres; comparison: ~1.19 x 10^18 (1 million trillion) times the diameter of the earth
the burj khalifa: height 828 metres; comparison: ~1.82 x 10^22 (10 billion trillion) times the height of the burj khalifa!
a human: height 1.7 metres; comparison: ~8.9 x 10^24 (1000 billion trillion) times the height of a human
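here's a quick python sketch to sanity-check those ratios (using the rough figures quoted above, so the results are only approximate):
# approximate sizes in metres, taken from the figures above
wall_diameter = 1.51e25
objects = {"sun": 1.39e9, "earth": 1.27e7, "burj khalifa": 828, "human": 1.7}
for name, size in objects.items():
    print(f"{name}: {wall_diameter / size:.2e} times larger")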
limitations and possibilities:
• observational limits - our observations are limited by the observable universe, which is about 93 billion light-years (~8.8 x 10^26 m) in diameter; structures larger than what we can currently detect may exist, but they are beyond our observational capabilities
• theoretical structures - there could also be even larger cosmic structures that we just haven't discovered yet due to the limitations of our current technology and observational techniques
contact