blog

akshitha singareddy
Oct 7, 2022

I have started working on the Titanic machine learning problem on Kaggle. This is my first Kaggle competition and my first time using Kaggle, so I went through the site and the tutorial video before taking part.

PROBLEM: The Titanic sank with people on board, and we are supposed to predict which passengers survived and which did not.

Kaggle provides three data sets:

train dataset
test dataset
gender_submission

Below the data sets, Kaggle also gives basic tutorial code that shows how to get started.

That basic program helped me to begin.

First, I imported all the libraries that I needed, such as matplotlib, seaborn, numpy and pandas.

Then I loaded the train dataset.

The train dataset has 12 attributes and 891 entries in total.

Next, I loaded the test dataset.

The test dataset has shape (418, 11).
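The loading step can be sketched roughly as below. This is only a minimal example: the file name "train.csv" mentioned in the comment and the toy rows are assumptions for illustration, not the real competition data.

```python
import io

import pandas as pd

# Toy stand-in for the training file (made-up rows, NOT the real Kaggle data).
TOY_TRAIN_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,Fare
1,0,3,Passenger A,male,22,7.25
2,1,1,Passenger B,female,38,71.28
3,1,3,Passenger C,female,,7.92
"""


def load_dataset(source):
    """Read a CSV file (a path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


# With the real files this would be something like load_dataset("train.csv");
# that file name is an assumption.
train_df = load_dataset(io.StringIO(TOY_TRAIN_CSV))

# shape is (number of entries, number of attributes)
print(train_df.shape)
```

On the real training file, the same shape check should report the 891 entries and 12 attributes mentioned above.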

Next I checked the null values in the training dataset using isnull, and we can see there are 866 of them.

I also checked the null values in the test dataset, and there are 414 of them.
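A small sketch of counting null values with isnull is shown below. The tiny table is made up for illustration, so its counts are not the 866 and 414 from the real data.

```python
import numpy as np
import pandas as pd

# Made-up toy table with some missing entries (not the real Titanic data).
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
nulls_per_column = df.isnull().sum()

# Summing the per-column counts gives the total for the whole table.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print(total_nulls)
```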

A model cannot predict accurately when the data has many null values, so I filled them in both the training and test datasets with the median to improve the model's predictions.

I then did feature selection and removed all the attributes that are not needed (ticket, pid, name).

With fewer features, the model is less complex and easier to train.
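One possible way to fill nulls with the median in pandas is sketched here. The toy values are made up, and applying the median only to numeric columns is my assumption, since a median is not defined for text columns.

```python
import numpy as np
import pandas as pd

# Made-up toy data (not the real Titanic data).
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, np.nan],
        "Fare": [7.25, 71.28, np.nan, 8.05],
    }
)

# Median of each numeric column, ignoring the missing values.
medians = df.median(numeric_only=True)

# fillna with a Series fills each column with its own median.
filled = df.fillna(medians)

print(filled)
```

The same medians computed on the training data could also be reused for the test data, but that is a design choice and not something stated above.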
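Dropping the unneeded attributes could look like the sketch below. The column names "Ticket", "PassengerId" and "Name" are assumptions based on the usual Kaggle file layout, and the rows are made up.

```python
import pandas as pd

# Made-up toy rows; the column names are assumed, not taken from the post.
df = pd.DataFrame(
    {
        "PassengerId": [1, 2],
        "Name": ["Passenger A", "Passenger B"],
        "Ticket": ["T1", "T2"],
        "Age": [22.0, 38.0],
        "Survived": [0, 1],
    }
)

# Columns treated as not useful for prediction in this sketch.
columns_to_drop = ["Ticket", "PassengerId", "Name"]

# drop returns a new DataFrame without those columns.
reduced = df.drop(columns=columns_to_drop)

print(list(reduced.columns))
```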

How the data is divided is also important, so I tried different train and test splits and got the best results with 80% training and 20% testing.

I got an accuracy of 0.78 with a logistic regression model.

Initially my leaderboard score was 0.77, and it increased to 0.78 after I tried some changes. I will continue working toward a better result by learning more concepts on Kaggle and applying them.
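An 80/20 split can be sketched with scikit-learn's train_test_split as below. The toy arrays and the random_state value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy features and labels: 10 rows, 2 feature columns.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
# random_state only makes the shuffle repeatable; its value is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Changing test_size (for example to 0.3) is one way to try the other split variations.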
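Training and scoring a logistic regression model might look like this sketch. It uses randomly generated toy data, so the accuracy it prints is not the 0.78 reported above, and the max_iter and random_state values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Randomly generated toy data (not the Titanic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# The label depends on the first two features, so it is learnable.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the model on the training part only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy = fraction of test rows predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```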

My Contribution:

I split the data into 80% training and 20% testing for better results.

The 80:20 combination works well for this model.

I checked the null values and found many. I filled them using the median of the data. I also removed pid and name during feature selection, as they are of no use in prediction.

Reference:

https://link.medium.com/0q52zuxSUtb