Making Data Management Decisions

This is the third post in the blog series for the solutions to the assignments in the Data Management and Visualization course offered through Coursera.

The week 3 assignment asks for three things: the program; its output, showing at least three of the data-managed variables as frequency distributions; and an explanation of those distributions covering the values each variable takes, how often it takes them, and the presence of missing data.

For this assignment, I applied the following data management techniques:

  1. Coding missing values in the variables as NaN values.
  2. Binning or grouping the values of the three data-managed variables.
  3. Creating secondary variables from three variables in the dataset.

Python Program and Results of Data Management

I replaced all the empty cells of all the variables selected for the research with NaN values using the replace function. The code snippets and output are below:
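A minimal sketch of this step might look like the following. The column names (`oil`, `electric`) and the sample values are hypothetical stand-ins, and the regular expression is one assumed way of matching empty or whitespace-only cells with the pandas replace function.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names and values are illustrative only.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "oil": ["1.2", " ", "4.5", ""],
    "electric": ["", "2300.5", " ", "800"],
})

cols = ["oil", "electric"]

for col in cols:
    # Replace empty or whitespace-only cells with NaN.
    cleaned = data[col].replace(r"^\s*$", np.nan, regex=True)
    # Convert the remaining strings to numbers; NaN stays NaN.
    data[col] = pd.to_numeric(cleaned)

# Number of missing values per variable after the replacement.
print(data[cols].isna().sum())
```

Converting to numeric after the replacement matters because later steps, such as binning, need numbers rather than strings.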

Next, I created three secondary variables from the three variables selected in the week 2 assignment by grouping the values of each variable into 5 groups (1, 2, 3, 4 and 5). The code snippets for creating the secondary variables are below:
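As a rough sketch of the grouping step, `pandas.cut` can assign each value to one of five labelled groups. The source column name `oil`, the sample values, and the bin edges (steps of 3 tonnes per capita) are assumptions made for illustration; only the variable name 'OilClass' and the group labels 1 to 5 come from this post.

```python
import numpy as np
import pandas as pd

# Hypothetical values in tonnes per capita, including one missing value.
data = pd.DataFrame({"oil": [0.5, 2.9, 4.0, np.nan, 8.2, 14.1]})

# Assumed bin edges; right=False makes each bin include its lower edge,
# so the groups are [0, 3), [3, 6), [6, 9), [9, 12) and [12, 15).
edges = [0, 3, 6, 9, 12, 15]
data["OilClass"] = pd.cut(data["oil"], bins=edges, labels=[1, 2, 3, 4, 5], right=False)

print(data)
```

Missing values stay missing after binning, so they can still be counted in the frequency tables.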

Python Program and Results of the Frequency Distribution

After making data management decisions on the three selected variables, I created the frequency tables of these variables and also included the NaN values in the frequency tables. The code snippets and results of the frequency tables are below:
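One way to build such a table, sketched here with hypothetical values, is `Series.value_counts` with `dropna=False`, which keeps NaN as its own row. Passing `normalize=True` gives proportions over all rows, missing ones included.

```python
import numpy as np
import pandas as pd

# Hypothetical group codes for a secondary variable, with two missing values.
oil_class = pd.Series([1, 1, 2, np.nan, 1, np.nan], name="OilClass")

# Counts per group; dropna=False keeps NaN in the table.
counts = oil_class.value_counts(sort=False, dropna=False)

# Percentages per group, computed over all rows including the missing ones.
percents = oil_class.value_counts(sort=False, dropna=False, normalize=True) * 100

print(counts)
print(percents.round(2))
```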

The above output shows the frequency table for the secondary variable ‘OilClass’, which has a total of 150 missing values. 57 values (26.76%) are less than 3 tonnes per capita, 4 (1.88%) are between 3 and 6 tonnes per capita, 1 (0.47%) is between 7 and 9 tonnes per capita, and 1 (0.47%) is between 13 and 15 tonnes per capita.
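The percentages quoted here are consistent with dividing each count by the total number of rows, missing values included. A short check of that arithmetic, using the counts reported above (the dictionary keys are just shorthand labels):

```python
# Counts reported for 'OilClass' in this post; keys are shorthand labels.
counts = {"< 3": 57, "3 to 6": 4, "7 to 9": 1, "13 to 15": 1, "missing": 150}

# Total number of rows, including the missing values.
total = sum(counts.values())

# Percentage of all rows that falls in each group.
percents = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)
print(percents)
```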

The above output shows the frequency table of the secondary variable ‘Co2Class’, which has 13 missing values. 197 values (92.5%) are less than 70 billion metric tons, 2 (0.94%) fall between 70 billion and 140 billion metric tons, and 1 (0.47%) falls between 280 billion and 350 billion metric tons.

The above output shows the frequency distribution of the secondary variable ‘ElectricClass’, which has 77 missing values. 121 values (56.81%) are less than 2500 kWh, 11 (5.16%) fall between 2500 and 5000 kWh, 2 (0.94%) fall between 5000 and 7500 kWh, 1 (0.47%) falls between 7500 and 9000 kWh, and 1 (0.47%) falls between 9000 and 11500 kWh.

This is the end of the solution to the week 3 assignment (Making Data Management Decisions) for the Data Management and Visualization course. In the next post, I will share the solution to the week 4 assignment on visualizing data.

A Data Scientist and Visual Storyteller with a strong interest in Data Analytics and Business Intelligence.