Battle of Neighborhoods – Chennai

Applied Capstone Project – IBM DataScience

By S, Dharshan

This article is part of the IBM Data Science capstone project. Earlier in the course we saw how data can be grouped into clusters using algorithms such as K-means. Here we apply that model to real-world data: we take location data from different areas of Chennai and use it to decide which area is suitable for opening a particular type of restaurant.

To check out my notebook code on GitHub, click here.


Chennai, also known as Madras (the official name until 1996), is the capital of the Indian state of Tamil Nadu. Located on the Coromandel Coast off the Bay of Bengal, it is one of the largest cultural, economic and educational centers of South India. According to the 2011 Indian census, it is the sixth-most populous city and fourth-most populous urban agglomeration in India. The city together with the adjoining regions constitutes the Chennai Metropolitan Area, which is the 36th-largest urban area by population in the world.

Great tourist attraction

The traditional and de facto gateway of South India, Chennai is among the Indian cities most visited by foreign tourists. It was ranked the 43rd-most visited city in the world for the year 2015. The Quality of Living Survey rated Chennai as the safest city in India. Chennai attracts 45 percent of health tourists visiting India and 30 to 40 percent of domestic health tourists. As such, it is termed “India’s health capital”. Chennai also has the fifth-largest urban economy in India. All of this gives us additional reasons to open a restaurant in this great city.

Problem Statement

Opening a restaurant is a big commitment, and investors need to assess the risk factors before investing in the business. In this project, I analyze the restaurant venues present in the different areas of Chennai and predict which location would be most suitable for opening our restaurant.

Data Source

Web scraping is an easy way to get real-world data from publicly available sources like Wikipedia. For our analysis, I scraped the list of areas from the Chennai Wikipedia page and used it to create a data frame for further analysis. The data comes from the table of areas on that page.
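
As a rough illustration of the scraping step, the sketch below parses a small HTML table using only Python's standard library. The HTML fragment, its column names and the TableParser class are made up for this example. The actual notebook scrapes the live Wikipedia page and may well use a different library.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for the Wikipedia table of areas.
SAMPLE_HTML = """
<table>
  <tr><th>No.</th><th>Area</th></tr>
  <tr><td>1</td><td>Adyar</td></tr>
  <tr><td>2</td><td>Kottivakkam</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)

# The first row holds the column names, the rest are records.
header, *records = parser.rows
areas = [dict(zip(header, record)) for record in records]
print(areas)
```

Each record ends up as a small dictionary keyed by column name, which is a convenient shape for building a data frame afterwards.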

Data preprocessing

Cleaning from Wikipedia

The data source is not clean, and it did not give us the zone name for each area. So we collected all 161 areas of the city of Chennai, used geopy to get the zone name for each area, and added those names to the data frame. Preprocessing data can be done manually, but that takes a lot of time; writing your own Python scripts to clean the data is a real help in the process.

Adding Lat, Long using Geopy

We use the geopy Python library to retrieve location data for each area. geopy is a Python client for several popular geocoding web services. It makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

We added location data based on area name and we created a data frame using all the data. We can check the data frame head below. Click here to check out Geopy

Locations on Map

Getting Venues using Foursquare API

The next step is to get data about the venues using the Foursquare API. We collect the venues present within a radius of 500 meters of each area and limit the number of results returned to 100 per area.

We create a new data frame to put this data in, along with some of the relevant data from the previous one.

Grouping this data by area and calculating the mean (average occurrence) of each venue category gives us information about the presence of venue categories in each area.
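
The grouping step can be sketched in plain Python as follows. The areas and categories are invented placeholders, and the real notebook most likely does this with a pandas group-by. Here the mean occurrence of a category is simply the share of an area's venues that fall into it.

```python
from collections import Counter, defaultdict

# Hypothetical (area, venue category) pairs standing in for the API results.
venues = [
    ("Area A", "Indian Restaurant"),
    ("Area A", "Italian Restaurant"),
    ("Area A", "Indian Restaurant"),
    ("Area A", "Cafe"),
    ("Area B", "Italian Restaurant"),
    ("Area B", "Hotel"),
]

# Count how often each category occurs in each area.
counts = defaultdict(Counter)
for area, category in venues:
    counts[area][category] += 1

# Mean occurrence = share of an area's venues that belong to each category.
frequency = {
    area: {cat: n / sum(c.values()) for cat, n in c.items()}
    for area, c in counts.items()
}

for area, shares in frequency.items():
    print(area, shares)
```

A category that never occurs in an area is simply absent from that area's dictionary, which corresponds to a frequency of zero.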

Too Many Different Venues

Our grouped venue data frame has too many columns. Here we keep only the restaurant categories and drop the other types of venues. Then we choose our restaurant type based on its frequency of occurrence.

Decision to open Italian Restaurant

As we can see, Italian restaurants are exotic places that are found in various neighborhoods. So for our report, we go with Italian restaurants, as they have a better chance of surviving in the city while being exotic enough to attract tourists.

Clustering using K-Means algorithm

We will cluster the areas according to the frequency of occurrence of restaurants in them. To determine the optimal number of clusters, we plot the performance (inertia) against a range of values of ‘K’ and then select the value to use for the clustering.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.

We perform K-Means clustering with values for K from one through ten to find the optimal ‘K’ using the elbow method, which in our case is four.
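
To make the elbow method concrete, here is a minimal sketch of K-means on one-dimensional data written in plain Python. It is not the code used in the notebook, which presumably relies on a library implementation. The kmeans_1d helper, its evenly spread starting centroids and the percentages in shares are all assumptions chosen for illustration, and K only runs from one to six because the toy data set is small.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: k starting points spread evenly over the sorted data.
    centroids = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put every value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    # Inertia: sum of squared distances from each value to its nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, inertia


# Hypothetical share of Italian restaurants per area, in percent (made-up numbers).
shares = [0, 1, 2, 10, 11, 12, 30, 31, 32, 50, 51, 52]

inertias = [kmeans_1d(shares, k)[1] for k in range(1, 7)]
for k, inertia in zip(range(1, 7), inertias):
    print(k, inertia)
```

With this made-up data the inertia falls sharply up to K = 4 and barely changes afterwards, which is the kind of elbow we look for when choosing K.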

Color Coding the Venues based on Cluster Label

We segregate the areas into four clusters and add the cluster labels to our final data frame. We examine the clusters by plotting them on a map. Each color represents the level of concentration of Italian restaurants in that particular area.

Following are the color codes for each cluster:

  • Cluster 0: Green – Least
  • Cluster 1: Violet – High
  • Cluster 2: Yellow – Moderate
  • Cluster 3: Grey – Very high

Let's check out the map showing the concentration of Italian restaurants.

Cluster Analysis

Cluster 0 has the fewest Italian restaurants, so there is little or no competition. But there is also a risk that few customers in the surrounding neighborhoods like Italian cuisine, which might be the reason for the low concentration.

Cluster 2 has a moderate concentration of Italian restaurants. Investors with a unique selling proposition that stands out from the competition can also open new restaurants in the neighborhoods of cluster 2, which have moderate to high competition.

Lastly, restaurants in cluster 3 and cluster 1 are probably suffering from strong competition due to oversupply and a high concentration of restaurants. Hence, investors are advised to avoid the neighborhoods in these clusters, which already have a high concentration of restaurants and intense competition.


We have a winner: Kottivakkam

Kottivakkam belongs to cluster 2, which has a moderate concentration of Italian restaurants. The area is near the beach and surrounded by many tourist attractions. So I conclude that we should choose yellow as our cluster.

We might ask why we should go for a moderately concentrated place instead of an area where no Italian restaurants are found. The reason is that people in the green areas may not prefer Italian restaurants and may not even be aware of Italian cuisine. Conversely, people in the grey areas have too many Italian restaurants to choose from, and our restaurant may go unnoticed.

Our target location is surrounded by places with a high concentration of Italian restaurants, but in Kottivakkam itself the concentration is moderate. This strategy is based on the Nash equilibrium.

Python for machine learning: numPy – Part 1

In this post, we will look into the NumPy library in Python. NumPy is a powerful Python library that adds support for multidimensional arrays, along with functions to manipulate them. We will cover only some basic things here; you can learn more about NumPy from here. Since we are dealing with basic machine learning, I won’t go deep into NumPy.

The NumPy library is well documented and easy to understand, with many examples, so it is worth looking at the documentation.

Let us see how we can create a basic NumPy array. This post is organized as follows:

Part 1: (this post)

  1. Basic 1-D array creation
  2. 2D and 3D array creation

Part 2 : (next post)

  1. Functions for creating arrays
  2. Visualization (using matplotlib library)
  3. Indexing and slicing

If you installed the Anaconda distribution, NumPy is already preinstalled. Otherwise, please look at the NumPy installation guide; it is very easy to install using pip.

Before we proceed to array creation we need to import numPy using

import numpy as np

Basic 1-D array creation

A 1-D array is a simple array with only one dimension. The basic code to create and print a 1-D array is

a = np.array([0, 1, 2, 3])
print(a)

The output will be

[0 1 2 3]

From this post onward, I will not write the code and its output separately. Instead, I will present examples like the one above as follows:

>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> print(a)
[0 1 2 3]

In the above example, the lines that start with “>>>” are the code that we write, and the lines without “>>>” are the output that we get when we execute the code.

Now let's see another example of a 1-D array, this time with words:

>>> import numpy as np
>>> a = np.array(['hello','world','how', 'are','you'])
>>> print(a)
['hello' 'world' 'how' 'are' 'you']

As you can see above, we can also create an array of words with NumPy, but for this topic we will stick to numbers.

Next we will see some basic ways to view the size and dimensions of an array.

>>> import numpy as np
>>> a = np.array([0,1,2,3,4])
>>> print(a)
[0 1 2 3 4]
>>> a.size
5
>>> a.ndim
1
>>> len(a)
5

The size attribute gives the total number of elements in the array. The ndim attribute gives the number of dimensions. The len function returns the length of the first dimension, which for a 1-D array is the same as its size.

2D and 3D array creation

2D and 3D arrays unleash the power of NumPy. Many datasets have several dimensions, so it is best to learn how to create multidimensional arrays. A 2D array can be created as shown below:

>>> b = np.array([[0, 1, 2], [3, 4, 5]])
>>> b
array([[0, 1, 2],
       [3, 4, 5]])

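Now we will look at the same basic attributes and functions. They work as for the 1D array, but the output is somewhat different.

>>> b.ndim
2
>>> b.shape
(2, 3)
>>> len(b)
2

As you can see above, they are used in exactly the same way as for a 1D array. Note that len(b) returns the size of the first dimension (the number of rows), not the total number of elements.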

We will now create a 3D array and see how it is different:

>>> c = np.array([[[1], [2]], [[3], [4]]])
>>> c
array([[[1],
        [2]],

       [[3],
        [4]]])
>>> c.shape
(2, 2, 1)

Now we know how to create arrays and how to explore their dimensions with some basic attributes and functions. In the next post we will see some functions for creating NumPy arrays, how to visualize the data in them, and how to index and slice them.

Happy Coding

Python for machine learning – Collections

Welcome to another post about Python for machine learning. In this post we will look at collections in Python. There are four built-in collection types in Python:

  1. List
  2. Tuple
  3. Set
  4. Dictionary

Let us discuss them one by one.


A list in Python is a collection which is ordered and changeable. Lists are defined by square brackets “[ ]”. A basic list is shown below.

thislist = ["apple", "banana", "cherry"]
print(thislist)

The output of the program is

['apple', 'banana', 'cherry']

It is also useful to know how to retrieve data from a list.

Index and range of indexes

Let us see how to access an element using its index:

thislist = ["apple", "banana", "cherry"]
print(thislist[1])

The output of the above program will be banana, because lists are indexed from 0.

Now let us access a range of indexes with a similar program:

thislist = ["apple", "banana", "cherry", "orange", "kiwi", "melon", "mango"]
print(thislist[2:5])

The output of the above program will be ['cherry', 'orange', 'kiwi']. The range starts at index 2 and stops before index 5.

Since this post covers only some important topics, you can refer to other websites to learn Python completely. I am focusing only on the aspects that are needed for machine learning.


The second collection we are going to see is the tuple. The major difference between a list and a tuple is that a list is changeable and a tuple is unchangeable, but both are ordered.

In Python, tuples are defined by round brackets “( )”, as shown below:

thistuple = ("apple", "banana", "cherry")
print(thistuple)

The output has the same items as the list, shown in round brackets:

('apple', 'banana', 'cherry')

We can use the same indexing as with a list to access the elements of a tuple.
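
The difference between a changeable list and an unchangeable tuple can be seen in a short sketch. The variable names here are only illustrative.

```python
fruits_list = ["apple", "banana", "cherry"]
fruits_tuple = ("apple", "banana", "cherry")

# Indexing works the same way for both.
print(fruits_list[1], fruits_tuple[1])

# A list can be changed in place.
fruits_list[1] = "mango"
print(fruits_list)

# A tuple cannot: assigning to an index raises TypeError.
try:
    fruits_tuple[1] = "mango"
except TypeError as error:
    message = str(error)
    print("Cannot change a tuple:", message)
```

This is why tuples are a good fit for data that should stay fixed, while lists suit data that grows or changes.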


A set is a collection in Python which is unordered and unindexed. Sets are defined by curly brackets “{ }”.

Let us see a basic set program:

thisset = {"apple", "banana", "cherry"}
print(thisset)

The output contains the same three items in curly brackets, but their order may differ from run to run because sets are unordered. Since sets are unindexed, we cannot access the elements by index, so we loop through the set to retrieve them. The program below accesses the data in the set.

thisset = {"apple", "banana", "cherry"}

for x in thisset:
    print(x)

The output is apple, banana and cherry, one per line, in no guaranteed order.
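
Two other properties of sets are worth a quick sketch: duplicates are dropped, and membership can be tested with the in operator. The basket variable is just an example name, and sorted() is used only to get a predictable order when printing.

```python
# The duplicate "apple" is stored only once.
basket = {"apple", "banana", "cherry", "apple"}
print(len(basket))

# Membership tests do not need an index.
has_banana = "banana" in basket
has_kiwi = "kiwi" in basket
print(has_banana, has_kiwi)

# sorted() returns a list, so the printing order is predictable.
for fruit in sorted(basket):
    print(fruit)
```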



A dictionary is a changeable collection of key-value pairs that is indexed by its keys. (Since Python 3.7, a dictionary also keeps its items in insertion order.) Dictionaries are written with the same curly brackets “{ }” as sets, but contain keys and values.

We will see a basic program with a dictionary:

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(thisdict)

The output of the program is

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

Now we shall see how to access data from a dictionary.

We can access the data in a dictionary using the associated key. The program below demonstrates this.

x = thisdict["model"]
print(x)

The output of the above program will be Mustang. Changing the data in a dictionary is done as shown below:

thisdict["year"] = 2018

Now the value stored under the year key is changed to 2018.
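
A few more everyday dictionary operations are sketched below: adding a key, reading a key that may be missing, and looping over the pairs. The color and owner keys are hypothetical and only there for illustration.

```python
car = {"brand": "Ford", "model": "Mustang", "year": 1964}

car["year"] = 2018      # change an existing value
car["color"] = "red"    # assigning to a new key adds it

# .get() returns a default instead of raising KeyError for a missing key.
owner = car.get("owner", "unknown")
print(owner)

# .items() yields (key, value) pairs in insertion order.
for key, value in car.items():
    print(key, "->", value)
```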

We are not spending much time on these because we will be using NumPy arrays and pandas DataFrames for most of our tasks, and will rarely use Python lists, tuples, sets and dictionaries directly. I have posted this because it is still important to know these basics.

In the next post we will look at NumPy and what we can do with it.

Happy Coding

Python for machine learning – Control Structures

Prerequisite for this course

In my previous posts I forgot to mention the prerequisite for learning Python for machine learning: a basic understanding of programming concepts from any programming language such as C, C++ or Java.

In this post we will see one of the most important parts of a programming language: control structures.

  1. Selection
    1. if
    2. if…else
    3. if….elif…else
  2. Repetition
    1. while
    2. for


Selection statements are used for making decisions in a program, branching it in two or more ways. Let us see the selection statements in Python, with one example program for each statement. They follow the same logic as in other programming languages like C and C++.

if – statement

The first selection statement we are going to look at is the if statement. Its logic is the same as in other programming languages. Let us see the syntax of the if statement in Python:

if (expression):
    statement

Looking at the syntax, you will notice there are no curly braces “{ }”. Instead, Python uses “:” and indentation: the statement is indented to let Python know that it is inside the if statement. The parentheses around the expression are optional.

Other than this simple change, the if statement works the same as in other programming languages. Now let us see an example program with an if statement.

a = 10
if a==10:
    print('it is ten')

The output will be “it is ten”.

if…else – statement

The second selection statement is the if…else statement. We will see its syntax and a basic program in Python.

if (expression):
    statement
else:
    statement

Example program

a = 11
if a==10:
    print('it is ten')
else:
    print('it is not ten')

The output will be “it is not ten”.

if…elif…else – statement

The third statement we are going to see is the if…elif…else statement. In Python, else if is written as elif, which is why I refer to it as elif and not elseif. Let us see the syntax of the if…elif…else statement.

if (expression):
    statement
elif (expression):
    statement
else:
    statement

Now we will see the program.

a = 11
if a==10:
    print('it is ten')
elif a==11:
    print('it is eleven')
else:
    print('it is not ten')

The output will be “it is eleven”.
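
To see all three branches in action, the selection logic can be wrapped in a small function and called with different values. The function name describe and the wording of the last message are only illustrative.

```python
def describe(a):
    """Return a message chosen by an if...elif...else chain."""
    if a == 10:
        return "it is ten"
    elif a == 11:
        return "it is eleven"
    else:
        return "it is something else"


for value in (10, 11, 12):
    print(value, describe(value))
```

Only the first branch whose expression is true runs; the else branch is the fallback when none of them match.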

Next we will see repetition statements

while – statement

The while statement works the same as in any other programming language. Its syntax is shown below.

while (expression):
    statement

We will see an example program.

i = 1
while i < 6:
  print(i)
  i += 1

The above code will output “1 2 3 4 5”, with each number on its own line. The loop stops once i reaches 6, so 6 is not printed.

for – statement

The next and last control statement is the for statement. We will see an example below.

fruits = ["apple", "banana", "cherry"]
for x in fruits:
  print(x)

The output of the program is “apple banana cherry”, with each item on its own line.

We will also see another basic program below.

for i in range(1,5):
  print(i)

The output will be “1 2 3 4”, with each number on its own line, because range(1,5) stops before 5.
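
The behaviour of range is easy to get wrong, so here is a short sketch of its three forms. Converting to a list is only done to make the generated numbers visible.

```python
one_to_four = list(range(1, 5))      # start at 1, stop before 5
zero_to_four = list(range(5))        # start defaults to 0
even_numbers = list(range(0, 10, 2)) # the third argument is the step

print(one_to_four)
print(zero_to_four)
print(even_numbers)

# A typical use: add up the numbers produced by range.
total = 0
for i in range(1, 5):
    total += i
print(total)
```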

That’s all for this post, friends. We will see Python collections in the next post.

Happy Coding

Python for machine learning: variables, datatypes


This post covers:

  1. Variables & Datatypes
  2. Basic program

Variables & Datatypes

Variables are basic building blocks of a programming language, and this applies to Python programs too. Variables hold data in memory. Python variables differ slightly from those in other programming languages: in Python, a variable can be declared without specifying its data type, and the data type is determined by the value assigned to it. For example, in C++, declaring a variable “a” and assigning it a value looks like the code below.

int a = 10;

But in python it is declared like

a = 10  # assign the value 10 to variable a

Note: in Python, “;” is not needed to end a statement.

Let us see more examples.

x = 5         # assign variable x the value 5
y = x + 10     # assign variable y the value of x plus 10
z = y         # assign variable z the value of y

As in other programming languages, variable names in Python are case-sensitive and can include letters, numbers and the underscore character ( _ ), but a variable name cannot start with a number.
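
These naming rules can be checked with a small sketch. The candidate names are made up for the example. The check also rejects reserved words such as class, which is an extra rule not mentioned above.

```python
import keyword

candidates = ["total", "Total", "_count", "value_2", "2nd_value", "my-name", "class"]

# A usable variable name must be a valid identifier and not a reserved keyword.
validity = {
    name: name.isidentifier() and not keyword.iskeyword(name)
    for name in candidates
}
for name, ok in validity.items():
    print(name, ok)

# Names are case-sensitive, so these are two different variables.
total = 1
Total = 2
print(total, Total)
```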

Let us see how python assigns the datatype based on the values by an example program.

x = 1
print(type(x)) # outputs: <class 'int'>

x = 1.0
print(type(x)) # outputs: <class 'float'>

Look at the first assignment:

x = 1

The value is an integer, so the type is displayed as

<class 'int'>

Now look at the second assignment, after which we again print the type:

x = 1.0

The value is a decimal number, so its type is float:

<class 'float'>

This applies to other data types too, because Python is a dynamically typed programming language: the type of a variable is determined by the value currently assigned to it.
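
Dynamic typing can be seen with a few more types in the sketch below. The values are arbitrary examples; the point is that the same name can refer to values of different types over time.

```python
value = 10
names = [type(value).__name__]

for new_value in (10.5, "ten", True, [10]):
    value = new_value                    # the same name now holds a different type
    names.append(type(value).__name__)

print(names)
```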

Basic Program

Let us see a basic program about declaring variables and datatypes. The values of x, y and d below are example values.

x = 5
y = 15
print("Data type of variable x and y" + str(type(x)) + str(type(y)))

d = 2.5
print("Data type of variable d is " + str(type(d)))

s = 'Arvin Education'
print(s)
print("Data type of variable s is " + str(type(s)))

The output is as follows

Data type of variable x and y<class 'int'><class 'int'>
Data type of variable d is <class 'float'>
Arvin Education
Data type of variable s is <class 'str'>

That is all for this post. In the next post we will see control statements in Python.

A Tamil (தமிழ்) version of this post is also available.

Happy coding

Getting started with python and Jupyter notebook

To get started with machine learning, we first need to get familiar with Jupyter Notebook. You can also watch this on our YouTube channel, Arvin Education.

What is Jupyter notebook?

Jupyter notebook is an open-source web application which allows you to write code, use visualizations, write narrations, and write equations.

Now let us see how to work with Jupyter Notebook. I will provide a screenshot for each step, from starting Anaconda Navigator to navigating Jupyter Notebook and the features that are available in it.

In the Windows Start menu, select Anaconda3 (64-bit) > Anaconda Navigator (Anaconda3). See the previous posts for how to install the Anaconda Distribution.

Once you’ve clicked Anaconda Navigator, you will see a window that looks like the screenshot below.

Click the Launch button in the Notebook section (see the screenshot above). Once you click the Launch button, Jupyter Notebook will open in your default web browser (for example Mozilla Firefox or Google Chrome).

This is the file explorer section in Jupyter. If you already have a notebook file you can browse and open it from here. The extension of the notebook is .ipynb.

Select the folder in which you want the new notebook file to be created.

Under the New option, select Python 3 in the Notebook group. This opens a new notebook with Python 3 as the kernel.

In the screenshot above, the different options are marked by red arrows and their functions are specified below. You can also see some useful shortcuts.

Now let us write a small Python program and check that it runs. Copy the program below, paste it into the cell and press Shift+Enter to execute the current cell.

print('hello world')

You will see the result as below.

In the next posts we will see some basics of Python like variables and functions, and machine-learning-specific packages like NumPy, Pandas, Sklearn, TensorFlow, Keras, etc. This post is also available in Tamil on the blog below. Be sure to follow that blog if you want to learn machine learning in Tamil.

Blog address

Python for Machine learning

In this series we will look at the Python programming language and how we can use it for machine learning and data science. The series covers:

  1. Installing Anaconda Distribution for Data Science.
  2. Basic Python programming language
  3. numPy
  4. Pandas
  5. Sklearn
  6. Basic programs for Data exploration and Data cleaning.

In the next post I will discuss installing the Anaconda distribution and how to get started with Python.

Essentials for learning Machine learning

To get started with machine learning we need some things.

  1. Python
  2. Pandas
  3. numPy
  4. Scikit-learn
  5. Jupyter notebooks or Jupyter Lab

You can download and install all of these manually, or download the Anaconda Distribution, which sets everything up for you. To download Anaconda, click the link below.

If you have any difficulty with the installation, comment on the forum post below. I have set up a forum so that we can discuss everything that is needed. To go to the forum, click the link below.

Happy Learning