
Introduction

Background

India is a country in South Asia. It is one of the world's two most populous countries and the seventh-largest by area. Delhi, Mumbai, Chennai and Kolkata are the four most important metro cities in India. Metro cities are highly populated urban centres.

People move to metro cities in search of better jobs, more opportunities and a better life. These cities attract many people because of their superior infrastructure, such as roads, metro rail, safety and good-quality education.

Problem

It is often difficult to choose one of these cities to settle in. The deciding factor is usually the set of facilities each city offers compared with the others. This project aims to help identify the best places to settle in these metro cities.

Interest

People might be interested in an analysis of the different neighborhoods, and of the facilities and opportunities those neighborhoods offer, before settling or investing their money in any of these metro cities.

Data acquisition and Cleaning

Data Sources

For any data science project or analysis, data is the most important ingredient. For this project, the data is available on a government portal. The dataset lists Indian postal codes along with their state names. We have to download the CSV files and then load the data. The dataset does not, however, provide latitude and longitude values, so we will use the Google Geocoding API to fill them in. We will also use the Foursquare API to get the venues in each neighborhood.

Data Preprocessing

The dataset has several problems. It is huge and covers every state, whereas we need data for the four metropolitan cities only. A lot of values are also missing.

Data Cleaning

We will select only those rows whose taluk (administrative district) contains the name of one of the four cities.
Selecting data of four cities only
Several entries also share the same pin code, so we will keep only the first entry for each pin code.
Dropping Pincodes
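
The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual code: the column names (`officename`, `pincode`, `taluk`) and the sample rows are assumptions standing in for the downloaded CSV.

```python
import pandas as pd

# Hypothetical sample standing in for the postal-code CSV.
# The column names are assumptions, not taken from the real file.
raw = pd.DataFrame({
    "officename": ["Office A", "Office B", "Office C", "Office D", "Office E"],
    "pincode": [110001, 110001, 400001, 560001, 700001],
    "taluk": ["New Delhi", "New Delhi", "Mumbai", "Bangalore North", "Kolkata"],
})

CITIES = ["Delhi", "Mumbai", "Chennai", "Kolkata"]


def select_metro_rows(df, cities=CITIES):
    """Keep rows whose taluk mentions a metro city, one row per pin code."""
    pattern = "|".join(cities)
    mask = df["taluk"].str.contains(pattern, case=False, na=False)
    # keep="first" retains the first entry for each repeated pin code.
    deduped = df[mask].drop_duplicates(subset="pincode", keep="first")
    return deduped.reset_index(drop=True)


metro = select_metro_rows(raw)
print(metro)
```

With the real data, `raw` would come from `pd.read_csv` on the downloaded file instead of the inline sample.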

Filling missing data

Our data has no coordinate values. Therefore, we will use the Google Geocoding API to look up the latitude and longitude for each pin code. To make the data easier to read, we will also add a neighborhood column to the dataset (the 'office name' is not very insightful). Some errors might creep in during this step, however.
filling_coordinates_data
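
One way to perform this lookup is sketched below using only the standard library. It assumes the JSON endpoint of the Google Geocoding API and its usual response layout (`status`, then `results[0].geometry.location`). The helper names, the `", India"` suffix on the query and the sample payload are illustrative assumptions, and a valid API key is needed for a real request.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(pincode, api_key):
    """Build the request URL for one pin code (helper name is hypothetical)."""
    query = urllib.parse.urlencode({"address": f"{pincode}, India", "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding response, or (None, None) on failure."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None, None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode_pincode(pincode, api_key):
    """Perform the network call; requires a valid API key."""
    url = build_geocode_url(pincode, api_key)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_lat_lng(json.load(resp))


# Offline illustration with a made-up response body.
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 28.63, "lng": 77.22}}}],
}
coords = extract_lat_lng(sample)
print(coords)
```

Returning `(None, None)` for failed lookups makes the rows that could not be geocoded easy to find and remove later.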

Errors in the data

We will manually remove the rows whose coordinates fall outside India, as well as the rows for which we were unable to fetch coordinates.
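
The removal described above was done by hand. As an alternative, a coarse automated check could look like the sketch below. The bounding box is a rough approximation of India's extent chosen for illustration, and the column names and sample rows are assumptions.

```python
import pandas as pd

# Rough bounding box around India (approximate values, an assumption).
LAT_MIN, LAT_MAX = 6.0, 37.5
LNG_MIN, LNG_MAX = 68.0, 97.5


def drop_bad_coordinates(df):
    """Drop rows with missing coordinates or coordinates outside the box."""
    has_coords = df["latitude"].notna() & df["longitude"].notna()
    in_box = (
        df["latitude"].between(LAT_MIN, LAT_MAX)
        & df["longitude"].between(LNG_MIN, LNG_MAX)
    )
    return df[has_coords & in_box].reset_index(drop=True)


# Hypothetical rows: one valid, one missing, one outside India, one valid.
sample = pd.DataFrame({
    "pincode": [110001, 400001, 600001, 700001],
    "latitude": [28.63, None, 40.71, 22.57],
    "longitude": [77.22, None, -74.01, 88.36],
})
cleaned = drop_bad_coordinates(sample)
print(cleaned)
```

A rectangular box also covers parts of neighbouring countries, so this only catches gross geocoding errors; a manual review is still useful for borderline rows.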

Feature Selection

The dataset has many features, but the analysis needs only the neighborhoods and their coordinates. Therefore, we will keep the Neighborhood, Taluk, Pin code, latitude and longitude columns. Using the Foursquare API, we will then extract the venues in each of the cities.

For each neighborhood, we will extract at most the top 125 venues within a radius of 500 metres.
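
A request of this kind might be built and parsed as in the sketch below. It assumes the legacy Foursquare v2 `venues/explore` endpoint and its `response.groups[0].items` layout, which may no longer be available. The credentials, the version date, the helper names and the sample payload are all placeholders, and the API may cap the number of results below the requested limit.

```python
import urllib.parse

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=125):
    """Build an explore request for one neighborhood (legacy v2 API assumed)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, an arbitrary placeholder
        "ll": f"{lat},{lng}",
        "radius": radius,      # metres
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urllib.parse.urlencode(params)}"


def parse_venues(payload, neighborhood):
    """Flatten an explore response into one row per venue."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "Neighborhood": neighborhood,
            "Venue": venue["name"],
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows


# Offline illustration with a made-up response body.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe", "categories": [{"name": "Cafe"}]}},
    {"venue": {"name": "Example Gym", "categories": []}},
]}]}}
rows = parse_venues(sample, "Example Neighborhood")
url = build_explore_url(28.63, 77.22, "ID", "SECRET")
print(rows)
```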

Data After Preprocessing

After preprocessing, the data for all four metro cities looks like the image below:
data_after_preprocessing

Methodology

Analyzing the data

We one-hot encoded the venue categories and grouped the data by neighborhood. For the analysis, we extract the 10 most common venue categories in each neighborhood.
using_one_hot_encoding
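
The encoding and grouping step can be sketched with pandas as follows. The column names and the tiny sample are assumptions made for illustration; the project's own table is much larger.

```python
import pandas as pd

# Hypothetical venue table: one row per venue found in a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["Area A", "Area A", "Area A", "Area B", "Area B", "Area B"],
    "Venue Category": ["Cafe", "Cafe", "Gym", "Park", "Park", "Cafe"],
})


def top_categories(venues, top_n=10):
    """One-hot encode categories, average per neighborhood, rank the top ones."""
    onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
    # Grouping by the original column avoids a clash if a venue
    # category is itself called "Neighborhood".
    grouped = onehot.groupby(venues["Neighborhood"]).mean()
    top = {
        hood: row[row > 0].sort_values(ascending=False).head(top_n).index.tolist()
        for hood, row in grouped.iterrows()
    }
    return grouped, top


grouped, top = top_categories(venues)
print(grouped)
print(top)
```

Each row of `grouped` holds the share of a neighborhood's venues that fall in each category, and it is this frequency table that a clustering step can take as input.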

Modelling the data

We used the k-means clustering algorithm to group the neighborhoods of each city into five clusters, based on their venues.
k-means is an unsupervised clustering algorithm that works by defining k centroids, one for each cluster, and assigning each data point to its nearest centroid.

K Means Clustering Algorithm
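
To make the idea of k centroids concrete, here is a small pure-Python sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration of the algorithm, not the code used in this project, and the toy points and the choice of k = 2 are assumptions made to keep the example small.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

In the project's setting, each point would be a neighborhood's vector of venue-category frequencies and k would be 5. A library implementation such as scikit-learn's `KMeans` is the more usual choice for real data.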

Visualizing the data

We used the folium library to visualize the data on maps.

Function to create Maps

We will use this function to generate the map of India and the maps of the four metro cities.

Creating Map function

Before Clustering

Map of India showing the four metro cities with their neighborhoods plotted.
Map of Delhi showing all its neighborhoods.
Map of Mumbai showing all its neighborhoods.
Map of Chennai showing all its neighborhoods.
Map of Kolkata showing all its neighborhoods.

After Clustering

Map of Delhi after clustering.
Map of Mumbai after clustering.
Map of Chennai after clustering.
Map of Kolkata after clustering.

Results and Discussion

Predicting Value for each Cluster

In this project, we loaded a dataset of India's metropolitan cities and analyzed the neighborhoods in these cities based on their most popular venues. We used the k-means clustering algorithm to cluster the neighborhoods.

The main aim of this project was to help people relocate to or settle in these metro cities. Given the cluster information for all four cities, we can make the following observations.

Delhi's neighborhoods are strong in arts and crafts venues, playgrounds and malls. Its food venues are mostly Indian restaurants and fast-food places, so it might be less suitable for international visitors.

In Mumbai, cafes, pubs, gyms and spas are popular, which suggests that people there are more health conscious. Because of the city's proximity to the sea, fish markets are also common.

Chennai and its neighborhoods are a great place for foodies. There are many types of restaurants and hotels in Chennai.

Kolkata and its neighborhoods have shopping malls, multiplexes, ATMs and more, as well as parks, gyms, cafes and hotels.

We can also analyze the neighborhoods cluster by cluster to find the best neighborhood for a particular kind of person.

Conclusion

This project gives us a better understanding of the neighborhoods in terms of their most common venues. For example, foodies who like to taste different cuisines should look for a neighborhood in Chennai.

Future work on this project could include data on crime, pricing and more, to gain better insights into the neighborhoods and make better recommendations.

Kapil Bansal

A B.Tech CSE student working on competitive programming, and a cybersecurity devotee working towards strengthening the concepts. I am also working on Python modules, Data Structures and Algorithms etc.
