R Programming Complete Tutorial

Data Pre-processing and Missing Value Analysis

Data Pre-processing

In this chapter, we will discuss Missing Value Analysis and Outlier Analysis and their treatment in detail.

Missing Value Analysis

In general, missing data means incomplete data: whatever the reason, part of the record is absent. In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. In R, missing and other special values are represented by:

NA: Stands for Not Available. In a data set, missing data points are represented by NA.

NULL: Represents an absent object. It is used to indicate that an object itself is missing or empty, rather than a single value inside it.

NaN: Stands for Not a Number and arises in arithmetic. A typical source of NaN is 0/0.

Inf: Like NaN, Inf is produced by numerical computation, for example 1/0. Inf is not an NA. It represents positive infinity (-Inf is negative infinity), not a missing value.
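
As a minimal sketch, the following base R snippet shows how these four values behave and how they can be told apart. The variable names (x_na, x_null, x_nan, x_inf, v) are illustrative only and do not come from any data set in this tutorial.

```r
# The four special values described above (names are illustrative)
x_na   <- NA      # missing value
x_null <- NULL    # absent (empty) object
x_nan  <- 0 / 0   # undefined arithmetic result
x_inf  <- 1 / 0   # positive infinity

is.na(x_na)         # TRUE
is.null(x_null)     # TRUE
is.nan(x_nan)       # TRUE
is.infinite(x_inf)  # TRUE

# NaN counts as missing for is.na(), but NA is not NaN, and Inf is neither
is.na(x_nan)        # TRUE
is.nan(x_na)        # FALSE
is.na(x_inf)        # FALSE

# NA occupies a slot in a vector; NULL does not
length(c(1, NA, 3))    # 3
length(c(1, NULL, 3))  # 2

# Most summaries return NA unless missing values are dropped explicitly
v <- c(4, NA, 6)
mean(v)                # NA
mean(v, na.rm = TRUE)  # 5
sum(is.na(v))          # 1
```

Because is.na() is TRUE for both NA and NaN, it is usually the function to reach for when counting missing values in a column.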


Why Are Values Missing?

Data can go missing for many reasons:

  • Respondents forgot / refused / failed to answer certain questions.
  • A sensor failed.
  • Someone purposefully turned off recording equipment.
  • A data transfer was cut short.
  • A hard drive became corrupt.
  • An internet connection was lost and could not complete the full transaction.
  • An invalid mathematical operation, such as division by zero, produced an impossible value.

Missing data mechanism

A missing data mechanism describes how the chance that a value is missing relates to the data. There are three mechanisms:

Missing Completely at Random (MCAR): Whether a particular value is missing has nothing to do with the value it would have had or with the values of other variables.

Example: 5% of the records in a medical survey are removed at random.

Missing at Random (MAR): The propensity for a data point to be missing is not related to the missing value itself, but it is related to some of the observed data.

Example: People who come from poor families might be less inclined to answer questions about drug use, so whether the drug-use answer is missing is related to family income.

Missing Not at Random (MNAR): The missingness depends on something that was not observed. Two possible reasons are that it depends on the value that would have been recorded, or that it depends on some other variable's value that was not recorded.

Example: Students skipped the question on drug use because they feared that they would be expelled from school.
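
The three mechanisms can be sketched with a small simulation. Everything here is an assumption made for illustration: the variables income and drug_use are simulated rather than taken from a real survey, and the missingness probabilities (0.30, 0.60, 0.05) are arbitrary.

```r
set.seed(42)  # fixed seed so the sketch is reproducible
n <- 1000

# Hypothetical survey variables (simulated, illustrative only)
income   <- rnorm(n, mean = 50, sd = 10)  # always observed
drug_use <- rnorm(n, mean = 5,  sd = 2)   # variable that will lose values

# MCAR: remove 5% of values, chosen independently of every variable
mcar <- drug_use
mcar[sample(n, size = 50)] <- NA          # 50 is 5% of n

# MAR: the chance of missingness depends on the observed variable income
p_mar <- ifelse(income < median(income), 0.30, 0.05)
mar <- drug_use
mar[runif(n) < p_mar] <- NA

# MNAR: the chance of missingness depends on the unobserved value itself
p_mnar <- ifelse(drug_use > quantile(drug_use, 0.75), 0.60, 0.05)
mnar <- drug_use
mnar[runif(n) < p_mnar] <- NA

# Compare the mean of the remaining values with the true mean
c(true = mean(drug_use),
  mcar = mean(mcar, na.rm = TRUE),
  mar  = mean(mar,  na.rm = TRUE),
  mnar = mean(mnar, na.rm = TRUE))
```

In this setup the MNAR version removes mostly high values, so the mean of what remains falls below the true mean, while the MCAR version should stay close to it.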

Problem with Missing data

The main problems we face with missing data in data analysis are:

  • It is hard to predict when missing data will be a problem, because sometimes the results are affected and sometimes they are not.
  • Each variable may have only a small number of missing responses, but in combination the missing data can be numerous.
  • Most statistical procedures automatically eliminate cases with missing data, which reduces the sample size.
  • The analysis might run, but the results may not be statistically significant because too few complete cases remain.
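
A short sketch of how small per-variable gaps add up, using a made-up data frame (df, with columns y, x1, x2 and x3, is hypothetical). Each column has a single NA, yet listwise deletion discards four of the ten rows, and lm() silently fits on the remainder under R's default na.action setting.

```r
set.seed(1)

# Hypothetical data frame: 10 rows, 4 variables (simulated values)
df <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# Put one missing value in each variable, each in a different row
df$y[1]  <- NA
df$x1[3] <- NA
df$x2[5] <- NA
df$x3[7] <- NA

colMeans(is.na(df))        # each variable is only 10% missing
mean(!complete.cases(df))  # but 40% of rows contain at least one NA
nrow(na.omit(df))          # listwise deletion keeps 6 of the 10 rows

# lm() drops incomplete rows by default (na.action = na.omit)
fit <- lm(y ~ x1 + x2 + x3, data = df)
nobs(fit)                  # 6 rows were actually used in the fit
```

Checking complete.cases() before modelling makes this loss of rows visible instead of leaving it to the fitting function.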