Monday, 10 August 2015

Airline Route Histogram

Creating a Histogram

Histogram: a graphical display of data using bars of different heights.
It is similar to a bar chart, but a histogram groups numbers into ranges.
Histograms are a great way to show results of continuous data, such as:
  • weight
  • height
  • how much time
  • etc.
But when the data is in categories (such as Country or Favorite Movie), we should use a bar chart.
A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson.
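
To make the idea of grouping numbers into ranges concrete, here is a minimal pure-Python sketch. The function name bin_counts, the sample heights, and the bin width of 10 are illustrative assumptions and do not come from the original post.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each range of width bin_width."""
    counts = Counter()
    for value in values:
        # Lower edge of the range (bin) that contains this value.
        lower_edge = (value // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Hypothetical heights in centimetres, made up for illustration.
heights = [152, 158, 161, 165, 167, 170, 171, 178, 183]
histogram = bin_counts(heights, 10)

# Print a simple text histogram, one row per range.
for lower_edge, count in histogram.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```

Each row corresponds to one bar of the histogram, and the bar height is the number of values that fell in that range.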

Reading and Writing Comma-Separated Data

Comma-separated values (CSV) is a way of expressing structured data in flat text files.
The code for this step prints the names of all routes in the file.
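
As a rough sketch of reading comma-separated data with Python's built-in csv module: the rows below are made up for illustration and only imitate the layout of a routes file. The column positions used here are an assumption.

```python
import csv
import io

# Illustrative sample text; these rows are invented for the example
# and are not taken from the real data set.
sample_text = (
    "XX,AAA,BBB,0\n"
    "XX,BBB,CCC,0\n"
    "YY,AAA,CCC,1\n"
)

routes = []
with io.StringIO(sample_text) as f:
    # csv.reader splits each line on commas and yields a list of strings.
    for row in csv.reader(f):
        routes.append(row)
        print(row[1], "->", row[2])
```

To read a real file instead, replace io.StringIO(sample_text) with open("routes.dat", newline="").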

Reading Airport Data


We're going to do some processing of real-world data now, using freely available airline data sets from the OpenFlights project.

Let's get deeper into the code. We are working with two input data sets: airports.dat, which provides the airport details, and routes.dat, which provides the route details. For each route we have to calculate the geo_distance from these two data sets and record it in a list, distance[].
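
The joining step can be sketched as follows. The sample rows are made up, and the column positions (airport ID in column 0, latitude and longitude in columns 6 and 7, source and destination airport IDs in columns 3 and 5) are an assumption about the OpenFlights layout that should be checked against the real files. The distance calculation itself is left out of this sketch.

```python
import csv
import io

# Made-up rows that imitate the assumed file layouts.
airports_text = (
    "1,Airport A,City A,Country A,AAA,AAAA,10.0,20.0\n"
    "2,Airport B,City B,Country B,BBB,BBBB,-5.5,30.25\n"
)
routes_text = (
    "XX,100,AAA,1,BBB,2\n"
    "XX,100,BBB,2,AAA,1\n"
    "YY,200,AAA,1,ZZZ,\\N\n"  # destination airport ID is missing
)

# Build a lookup table: airport ID -> (latitude, longitude).
airport_coords = {}
for row in csv.reader(io.StringIO(airports_text)):
    airport_coords[row[0]] = (float(row[6]), float(row[7]))

# Collect the two endpoints of every route whose airports are both known.
route_endpoints = []
for row in csv.reader(io.StringIO(routes_text)):
    source_id, dest_id = row[3], row[5]
    if source_id in airport_coords and dest_id in airport_coords:
        route_endpoints.append(
            (airport_coords[source_id], airport_coords[dest_id])
        )

print(len(route_endpoints), "routes with known endpoints")
```

Each pair of coordinates can then be passed to a distance function, and the result appended to the list of distances.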



Airline Route Histogram:

The distance we measure is the "great circle distance".
To calculate it, we import a module named geo_distance. Its pre-built function geo_distance.distance() calculates the distance of each airline route.
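
The source of geo_distance.distance() is not shown in this post. As a stand-in, here is a minimal sketch of one common way to compute a great circle distance, the haversine formula. The function name great_circle_km and the mean Earth radius of 6371 km are assumptions chosen for this example, and the real module may work differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
print(round(great_circle_km(0, 0, 0, 90), 1))
```

The haversine formula treats the Earth as a perfect sphere, which is usually accurate enough for plotting a distribution of route lengths.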

Histogram:

Plot a histogram based on the route lengths to show the distribution of different flight distances.
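
A minimal sketch of the plotting step with matplotlib follows. The route lengths are made up for illustration, and the choice of 5 bins and the axis labels are assumptions. With the real data the distances would come from the list built in the previous step.

```python
import matplotlib.pyplot as plt

# Hypothetical route lengths in kilometres, made up for illustration.
distances = [310, 450, 520, 800, 950, 1200, 1250, 2300, 5400, 9100]

fig, ax = plt.subplots()
# hist() groups the values into equal-width bins and draws one bar per bin.
counts, bin_edges, _ = ax.hist(distances, bins=5)
ax.set_xlabel("Route length (km)")
ax.set_ylabel("Number of routes")
ax.set_title("Distribution of flight distances")
```

Calling plt.show() afterwards opens the chart in an interactive session.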











Creating Charts Using Matplotlib

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
Bar Plot of Radish Votes:
Let's see how to generate a bar plot of the data. We will display the individual counts of each radish variety across the data.

Using matplotlib to generate a bar graph


import matplotlib.pyplot as plt

matplotlib can also output charts in other formats like image files.

plt.bar(range(len(counts)), list(counts.values()), align='center')

Bar chart code: use the Python code below to generate a bar graph that displays the vote counts from the radish variety program.
plt.show()

A graph with no information on the x and y axes is of little use, so we label both axes to make the observations readable:
plt.ylabel("Votes")
plt.xticks(range(len(counts)), list(counts.keys()), rotation=90)
We create a range of indexes for the X values in the graph, one for each entry in the "counts" dictionary (i.e. len(counts)), numbered 0, 1, 2, 3, etc. This spreads the bars evenly across the X axis of the plot.
np.arange is a NumPy function like the range() function in Python, except that the result it produces is a "NumPy array".
plt.xticks() specifies a range of values to use as labels ("ticks") for the X axis.
x + 0.5 is a special expression because x is a NumPy array. NumPy arrays have some special capabilities that normal lists or range() objects don't have.
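
The difference between a NumPy array and a normal list can be sketched like this. The length of 4 is arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.arange(4)      # a NumPy array holding 0, 1, 2, 3
shifted = x + 0.5     # the addition is applied to every element
print(shifted)

plain = list(range(4))
try:
    plain + 0.5       # a normal list does not support this
except TypeError as error:
    print("A list cannot be shifted this way:", error)
```

Adding a single number to an array shifts every element at once, which is handy for moving tick positions half a bar to the right.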

Working with String to Solve Voting Problem

Strings are amongst the most popular types in Python. We can create them simply by enclosing characters in quotes. Python treats single quotes the same as double quotes.

Problem: to solve the vote counting problem


As a vegetable retailer, we have to find out which radish variety is most in demand among customers.

The data file contains 300 lines, each holding two character (string) fields separated by a hyphen.
The name of the file is "radishsurvey.txt".


# Main objective of this study is to find out the following:

  • What's the most popular radish variety?
  • What are the least popular?
  • Did anyone vote twice?

# Things covered in the program:

  • Reading the data
  • Operations
  • Looping
  • Use of lists and dictionaries

# Approach Towards the problem

Save the file radishsurvey.txt to your computer. How do we write a program to find out which person voted for each radish preference?

We create an empty dictionary, counts, to store the vote counts and an empty list, voted, to track duplicate voters. Comments in the code explain the purpose of every step.
The optional argument to "split" is the number of times to split the string, not the number of parts to split it into. So splitting it one time creates two strings, splitting it two times creates three strings, etc. A little "gotcha" moment for the unwary Python programmer!
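
A short sketch of how that optional argument behaves. The survey line and the " - " separator (a hyphen with a space on each side) are assumptions made for illustration.

```python
# Hypothetical survey line in the assumed "name - variety" layout.
line = "Jane Example - Red King\n"

# strip() removes the trailing newline, split() cuts at the separator.
name, vote = line.strip().split(" - ")
print(name, "voted for", vote)

text = "a - b - c"
split_once = text.split(" - ", 1)    # one split gives two strings
split_twice = text.split(" - ", 2)   # two splits give three strings
print(split_once)
print(split_twice)
```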

Counting votes for each radish variety separately is time consuming: you have to know all the names in advance and loop through the file multiple times. What if you could automatically find all the varieties that were voted for and count them all in one pass?
You'll need a data structure where you can associate a radish variety with the number of votes counted for it. A dictionary would be perfect!
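
Counting in one pass with a dictionary might look like the sketch below. The three survey lines are invented for illustration, and the " - " separator is an assumption about the file layout.

```python
# Hypothetical survey lines, made up for illustration.
lines = [
    "Voter One - Champion\n",
    "Voter Two - Daikon\n",
    "Voter Three - Champion\n",
]

counts = {}  # radish variety -> number of votes seen so far
for line in lines:
    name, vote = line.strip().split(" - ")
    if vote not in counts:
        counts[vote] = 1      # first vote for this variety
    else:
        counts[vote] += 1     # one more vote for a known variety

print(counts)
```

No variety name has to be known in advance, because each new name becomes a dictionary key the first time it appears.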
The next step is to check whether anyone has voted twice.
You will need to start making a list of the names of everyone who has voted so far. Each time you see a new name, check if it is already in the list of names. Starting with an empty list of names, you can use voterlist.append(newentry) to append a new entry to the end.
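
A minimal sketch of that check. The survey lines are invented, the " - " separator is an assumption, and the list called duplicates exists only so the example can show its result.

```python
# Hypothetical survey lines; "Voter One" appears twice on purpose.
lines = [
    "Voter One - Champion\n",
    "Voter Two - Daikon\n",
    "Voter One - Red King\n",
]

voted = []        # names of everyone who has voted so far
duplicates = []   # kept only to show the result of this example
for line in lines:
    name, vote = line.strip().split(" - ")
    if name in voted:
        print(name, "has already voted")
        duplicates.append(name)
        continue  # skip the repeated vote
    voted.append(name)
```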
Then we need to apply the same data munging techniques to clean up people's names, so that the same voter is recognized regardless of spacing or capitalization.
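
One possible cleaning function is sketched below. The name clean and the exact steps are assumptions, and the original program may clean its input differently. Note that capitalize() lowercases everything after the first character, which matches variety names such as 'April cross' in the output.

```python
def clean(text):
    """Normalize spacing and capitalization so equal entries compare equal."""
    # split() with no argument drops leading, trailing and repeated spaces.
    collapsed = " ".join(text.split())
    return collapsed.capitalize()


print(clean("  RED  king "))
print(clean("red king") == clean("Red King"))
```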
The last step is to find the winner.

So far our program prints the number of votes cast for each radish variety, but it doesn't declare a winner. To do that, we can use a for loop that iterates over all of the keys in the dictionary and keeps track of the highest count.
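
A sketch of that loop, using the vote counts reported in the output of this post. The variable names winner_name, winner_votes, and least_popular are illustrative.

```python
# Vote counts as reported in the program output in this post.
counts = {
    'April cross': 72, 'Bunny tail': 72, 'Champion': 76,
    'Cherry belle': 58, 'Daikon': 63, 'French breakfast': 72,
    'Plum purple': 56, 'Red king': 56, 'Sicily giant': 57,
    'Snow belle': 63, 'White icicle': 64,
}

winner_name = None
winner_votes = 0
for name in counts:                    # iterates over the dictionary keys
    if counts[name] > winner_votes:
        winner_votes = counts[name]    # highest count seen so far
        winner_name = name

# Every variety that shares the lowest count.
lowest = min(counts.values())
least_popular = [name for name in counts if counts[name] == lowest]

print((winner_name, winner_votes))
print(least_popular)
```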

Here is the output for this problem:

Phoebe Barwell has already voted
Procopio Zito has already voted
{'April cross': 72,
 'Bunny tail': 72,
 'Champion': 76,
 'Cherry belle': 58,
 'Daikon': 63,
 'French breakfast': 72,
 'Plum purple': 56,
 'Red king': 56,
 'Sicily giant': 57,
 'Snow belle': 63,
 'White icicle': 64}
('Champion', 76)

Conclusion:

Champion is the most popular variety and Red King and Plum Purple are the least popular.

Reference: http://opentechschool.github.io/python-data-intro/core/strings.html