--- title: "Introduction to DataExplorer" author: "Boxuan Cui" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: TRUE toc_depth: 6 vignette: > %\VignetteIndexEntry{Introduction to DataExplorer} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} library(rmarkdown) library(knitr) library(parallel) library(DataExplorer) library(data.table) library(ggplot2) library(nycflights13) library(networkD3) set.seed(1) opts_chunk$set( collapse = TRUE, fig.width = 6, fig.height = 6, fig.align = "center", warning = FALSE, screenshot.force = FALSE ) ``` This document introduces the package [DataExplorer](http://boxuancui.github.io/DataExplorer/), and shows how it can help you with different tasks throughout your data exploration process. There are 3 main goals for **DataExplorer**: 1. [Exploratory Data Analysis (EDA)](https://en.wikipedia.org/wiki/Exploratory_data_analysis) 1. [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering) 1. Data Reporting The remaining of this guide will be organized in accordance with the goals. As the package evolves, more content will be added. ## Data We will be using the [nycflights13](https://cran.r-project.org/package=nycflights13) datasets for this document. If you have not installed the package, please do the following: ```{r install-data, eval=FALSE} install.packages("nycflights13") library(nycflights13) ``` There are 5 datasets in this package: * airlines * airports * flights * planes * weather If you want to quickly visualize the structure of all, you may do the following: ```{r plot-str-template, eval=FALSE} library(DataExplorer) data_list <- list(airlines, airports, flights, planes, weather) plot_str(data_list) ``` ```{r plot-str-run, echo=FALSE} data_list <- list(airlines, airports, flights, planes, weather) diagonalNetwork( plot_str(data_list, print_network = FALSE), width = 800, height = 800, fontSize = 20, margin = list( "left" = 50, "right" = 50 ) ) ``` You may also try `plot_str(data_list, type = "r")` for a radial network. --- Now let's merge all tables together for a more robust dataset for later sections. ```{r merge-data} merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE) merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes")) merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin")) final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest")) ``` ## Exploratory Data Analysis Exploratory data analysis is the process to get to know your data, so that you can generate and test your hypothesis. Visualization techniques are usually applied. To get introduced to your newly created dataset: ```{r eda-introduce-template, eval=FALSE} introduce(final_data) ``` ```{r eda-introduce-run, echo=FALSE} kable(t(introduce(final_data)), row.names = TRUE, col.names = "", format.args = list(big.mark = ",")) ``` To visualize the table above (with some light analysis): ```{r eda-plot-intro, fig.width=10, fig.height=8} plot_intro(final_data) ``` You should immediately notice some surprises: 1. 0.3% complete rows: This means only 0.3% of all rows are not completely missing! 1. 5.7% missing observations: Given the 0.3% complete rows, there are only 5.7% total missing observations. Missing values are definitely creating problems. Let's take a look at the missing profiles. ### Missing values Real-world data is messy, and you can simply use `plot_missing` function to visualize missing profile for each feature. ```{r eda-plot-missing-template, eval=FALSE} plot_missing(final_data) ``` ```{r eda-plot-missing, echo=FALSE, fig.width=8, fig.height=8} plot_missing(final_data, geom_label_args = list(size = 3, label.padding = unit(0.1, "lines"))) ``` From the chart, **speed** variable is mostly missing, and probably not informative. Looks like we have found the culprit for the 0.3% complete rows. Let's drop it: ```{r eda-drop-speed} final_data <- drop_columns(final_data, "speed") ``` Note: You may store the missing data profile with `profile_missing(final_data)` for additional analysis. ### Distributions #### Bar Charts To visualize frequency distributions for all discrete features: ```{r eda-plot-bar-template, eval=FALSE} plot_bar(final_data) ``` ```{r eda-plot-bar-run, echo=FALSE, fig.width=8, fig.height=12} plot_bar(final_data, theme_config = list("text" = element_text(size = 6)), nrow = 4L, ncol = 3L) ``` Upon closer inspection of **manufacturer** variable, it is not hard to identify the following duplications: * *AIRBUS* and *AIRBUS INDUSTRIE* * *CANADAIR* and *CANADAIR LTD* * *MCDONNELL DOUGLAS*, *MCDONNELL DOUGLAS AIRCRAFT CO* and *MCDONNELL DOUGLAS CORPORATION* Let's clean it up and look at the **manufacturer** distribution again: ```{r eda-update-manufacturer} final_data[which(final_data$manufacturer == "AIRBUS INDUSTRIE"),]$manufacturer <- "AIRBUS" final_data[which(final_data$manufacturer == "CANADAIR LTD"),]$manufacturer <- "CANADAIR" final_data[which(final_data$manufacturer %in% c("MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")),]$manufacturer <- "MCDONNELL DOUGLAS" plot_bar(final_data$manufacturer) ``` Feature **dst_origin**, **tzone_origin**, **year_flights** and **tz_origin** contains only 1 value, so we should drop them: ```{r eda-drop-bar-features} final_data <- drop_columns(final_data, c("dst_origin", "tzone_origin", "year_flights", "tz_origin")) ``` Frequently, it is very beneficial to look at bi-variate frequency distribution. For example, to look at discrete features by **arr_delay**: ```{r eda-plot-bar-with-template, eval=FALSE} plot_bar(final_data, with = "arr_delay") ``` ```{r eda-plot-bar-with-run, echo=FALSE, fig.width=8, fig.height=10} plot_bar(final_data, with = "arr_delay", theme_config = list("text" = element_text(size = 6))) ``` The resulting distribution looks quite different from the regular frequency distribution. You may choose to break out all frequencies by a discrete variable: ```{r eda-plot-bar-by-template, eval=FALSE} plot_bar(final_data, by = "origin") ``` ```{r eda-plot-bar-by-run, echo=FALSE, fig.width=8, fig.height=10} plot_bar(final_data, by = "origin", theme_config = list("text" = element_text(size = 6))) ``` #### Histograms To visualize distributions for all continuous features: ```{r eda-plot-histogram-template, eval=FALSE} plot_histogram(final_data) ``` ```{r eda-plot-histogram, echo=FALSE, fig.width=8} suppressWarnings(plot_histogram(final_data, nrow = 3L, ncol = 3L)) ``` Immediately, you could observe that there are datetime features to be further treated, e.g., concatenating year, month and day to form date, and/or adding hour and minute to form datetime. For the purpose of this vignette, I will not go deep into the analytical tasks. However, we should treat the following features based on the output of the histograms. * Set **flight** to categorical, since that is the flight number with no mathematical meaning: ```{r eda-update-flight} final_data <- update_columns(final_data, "flight", as.factor) ``` * Remove **year_flights** and **tz_origin** since there is only one value: ```{r eda-drop-histogram-features} final_data <- drop_columns(final_data, c("year_flights", "tz_origin")) ``` #### QQ Plot Quantile-Quantile plot is a way to visualize the deviation from a specific probability distribution. After analyzing these plots, it is often beneficial to apply mathematical transformation (such as log) for models like linear regression. To do so, we can use `plot_qq` function. By default, it compares with normal distribution. Note: The function will take a long time with many observations, so you may choose to specify an appropriate `sampled_rows`: ```{r eda-plot-qq-template, eval=FALSE} qq_data <- final_data[, c("arr_delay", "air_time", "distance", "seats")] plot_qq(qq_data, sampled_rows = 1000L) ``` ```{r eda-plot-qq, echo=FALSE} qq_data <- final_data[, c("arr_delay", "air_time", "distance", "seats")] plot_qq( qq_data, sampled_rows = 1000L, geom_qq_args = list("na.rm" = TRUE), geom_qq_line_args = list("na.rm" = TRUE), nrow = 2L, ncol = 2L ) ``` From the chart, **air_time**, **distance** and **seats** seems skewed on both tails. Let's apply a simple log transformation and plot them again. ```{r eda-plot-qq-log-features-template, eval=FALSE} log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1)) plot_qq(log_qq_data[, 2:4], sampled_rows = 1000L) ``` ```{r eda-plot-qq-log-features, echo=FALSE} log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1)) plot_qq( log_qq_data[, 2:4], sampled_rows = 1000L, geom_qq_args = list("na.rm" = TRUE), geom_qq_line_args = list("na.rm" = TRUE), nrow = 2L, ncol = 2L ) ``` The distribution looks better now! If necessary, you may also view the QQ plot by another feature: ```{r eda-plot-qq-by-template, eval=FALSE} qq_data <- final_data[, c("name_origin", "arr_delay", "air_time", "distance", "seats")] plot_qq(qq_data, by = "name_origin", sampled_rows = 1000L) ``` ```{r eda-plot-qq-by, echo=FALSE, fig.width=8, fig.height=8} qq_data <- final_data[, c("name_origin", "arr_delay", "air_time", "distance", "seats")] plot_qq( qq_data, by = "name_origin", geom_qq_args = list("na.rm" = TRUE), geom_qq_line_args = list("na.rm" = TRUE), sampled_rows = 1000L, nrow = 2L, ncol = 2L ) ``` ### Correlation Analysis To visualize correlation heatmap for all non-missing features: ```{r eda-plot-correlation, fig.width=8, fig.height=8} plot_correlation(na.omit(final_data), maxcat = 5L) ``` You may also choose to visualize only discrete/continuous features with: ```{r eda-plot-correlation-type, eval=FALSE} plot_correlation(na.omit(final_data), type = "c") plot_correlation(na.omit(final_data), type = "d") ``` ### Principal Component Analysis While you can always do `plot_prcomp(na.omit(final_data))` directly, but PCA works better with cleaner data. To perform and visualize PCA on some selected features: ```{r eda-plot-prcomp, fig.width=8, fig.height=8} pca_df <- na.omit(final_data[, c("origin", "dep_delay", "arr_delay", "air_time", "year_planes", "seats")]) plot_prcomp(pca_df, variance_cap = 0.9, nrow = 2L, ncol = 2L) ``` ### Slicing & dicing Often, slicing and dicing data in different ways could be crucial to your analysis, and yields insights quickly. #### Boxplots Suppose you would like to build a model to predict arrival delays, you may visualize the distribution of all continuous features based on arrival delays with a boxplot: ```{r eda-plot-boxplot-template, eval=FALSE} ## Reduce data size for demo purpose arr_delay_df <- final_data[, c("arr_delay", "month", "day", "hour", "minute", "dep_delay", "distance", "year_planes", "seats")] ## Call boxplot function plot_boxplot(arr_delay_df, by = "arr_delay") ``` ```{r eda-plot-boxplot, echo=FALSE, fig.width=8, fig.height=8} arr_delay_df <- final_data[, c("arr_delay", "month", "day", "hour", "minute", "dep_delay", "distance", "year_planes", "seats")] plot_boxplot(arr_delay_df, by = "arr_delay", geom_boxplot_args = list("na.rm" = TRUE), nrow = 3L, ncol = 3L) ``` Among all the subtle changes in correlation with arrival delays, you could immediately spot that planes with 300+ seats tend to have much longer delays (16 ~ 21 hours). You may now drill down further to verify or generate more hypotheses. #### Scatterplots An alternative visualization is scatterplot. For example: ```{r eda-plot-scatterplot-template, eval=FALSE} arr_delay_df2 <- final_data[, c("arr_delay", "dep_time", "dep_delay", "arr_time", "air_time", "distance", "year_planes", "seats")] plot_scatterplot(arr_delay_df2, by = "arr_delay", sampled_rows = 1000L) ``` ```{r eda-plot-scatterplot, echo=FALSE, fig.width=8, fig.height=8} arr_delay_df2 <- final_data[, c("arr_delay", "dep_time", "dep_delay", "arr_time", "air_time", "distance", "year_planes", "seats")] plot_scatterplot(arr_delay_df2, by = "arr_delay", sampled_rows = 1000L, geom_point_args = list("size" = 0.5, "na.rm" = TRUE)) ``` ## Feature Engineering Feature engineering is the process of creating new features from existing ones. Newly engineered features often generate valuable insights. For functions in this section, it is preferred to use [data.table](https://cran.r-project.org/package=data.table) objects as input, and they will be [updated by reference](https://cran.r-project.org/package=data.table/vignettes/datatable-reference-semantics.html). Otherwise, output object will be returned matching the input class. ### Replace missing values Missing values may have meanings for a feature. Other than imputation methods, we may also set them to some logical values. For example, for discrete features, we may want to group missing values to a new category. For continuous features, we may want to set missing values to a known number based on existing knowledge. In **DataExplorer**, this can be done by `set_missing`. The function automatically matches the argument for either discrete or continuous features, i.e., if you specify a number, all missing continuous values will be set to that number. If you specify a string, all missing discrete values will be set to that string. If you supply both, both types will be set. ```{r fe-set-missing, collapse=TRUE} ## Return data.frame final_df <- set_missing(final_data, list(0L, "unknown")) plot_missing(final_df) ## Update data.table by reference # library(data.table) # final_dt <- data.table(final_data) # set_missing(final_dt, list(0L, "unknown")) # plot_missing(final_dt) ``` ### Group sparse categories From the bar charts above, we observed a number of discrete features with sparse categorical distributions. Sometimes, we want to group low-frequency categories to a new bucket, or reduce the number of categories to a reasonable range. `group_category` will do the work. Take **manufacturer** feature for example, suppose we want to group the long tail to another category. We could try with bottom 20% (by count) first: ```{r fe-group-category-count-trial} group_category(data = final_data, feature = "manufacturer", threshold = 0.2) ``` As we can see, manufacturer will be shrinked down to 4 categories, i.e., AIRBUS, BOEING, EMBRAER, and OTHER. If you like this threshold, you may specify `update = TRUE` to update the original dataset: ```{r fe-group-category-count-update, results='hide'} final_df <- group_category(data = final_data, feature = "manufacturer", threshold = 0.2, update = TRUE) plot_bar(final_df$manufacturer) ``` Instead of shrinking categories by frequency, you may also group the categories by another continuous metric. For example, if you want to bucket the carrier with bottom 20% distance traveled, you may do the following: ```{r fe-group-category-metric-trial} group_category(data = final_data, feature = "name_carrier", threshold = 0.2, measure = "distance") ``` Similarly, if you like it, you may add `update = TRUE` to update the original dataset. ```{r fe-group-category-metric-update, results='hide'} final_df <- group_category(data = final_data, feature = "name_carrier", threshold = 0.2, measure = "distance", update = TRUE) plot_bar(final_df$name_carrier) ``` ### Dummify data (one hot encoding) To transform the data into binary format (so that ML algorithms can pick it up), `dummify` will do the job. The function preserves original data structure, so that only eligible discrete features will be turned into binary format. ```{r fe-dummify-template, eval=FALSE} plot_str( list( "original" = final_data, "dummified" = dummify(final_data, maxcat = 5L) ) ) ``` ```{r fe-dummify-run, echo=FALSE} diagonalNetwork( plot_str(list("original" = final_data, "dummified" = dummify(final_data, maxcat = 5L)), print_network = FALSE), width = 800, height = 1500, fontSize = 20, margin = list( "left" = 50, "right" = 50 ) ) ``` Note the `maxcat` argument. If a discrete feature has more categories than `maxcat`, it will not be dummified. As a result, it will be returned untouched. ### Drop features After viewing the feature distribution, you often want to drop features that are insignificant. For example, features like **dst_dest** has mostly one value, and it doesn't provide any valuable information. You can use `drop_columns` to quickly drop features. The function takes either names or column indices. ```{r fe-drop-columns} identical( drop_columns(final_data, c("dst_dest", "tzone_dest")), drop_columns(final_data, c(36, 37)) ) ``` ### Update features To quickly update many features with the same treatment, you may use `update_columns`. A very common use case is to set features to a different type, e.g., continuous to discrete. To quickly set all time related features to discrete: ```{r update-feature-type} temporal_features <- c("month", "day", "hour", "minute", "tz_dest") final_data <- update_columns(final_data, temporal_features, as.factor) str(final_data[, c("month", "day", "hour", "minute", "tz_dest")]) ``` You may also use this function to transform selected features with a customized function: ```{r update-feature-transformation} bin_seat <- function(x) cut(x, breaks = c(0L, 50L, 100L, 150L, 200L, 500L)) transformed_data <- update_columns(final_data, "seats", bin_seat) plot_bar(transformed_data$seats) ``` ## Data Reporting To organize all the data profiling statistics into a report, you may use the `create_report` function. It will run most of the EDA functions and output a html file. ```{r dr-create-report, eval=FALSE} create_report(final_data) ``` To maximize the usage of this function, always supply a response variable (if applicable) to automate various bivariate analyses. For example, ```{r dr-create-report-with-y, eval=FALSE} create_report(final_data, y = "arr_delay") ``` You may also customize each individual section using `configure_report` function. It returns all specified arguments in a list, and then will be passed to `do.call` to be invoked. There are 3 types of arguments in this function: 1. switches (`add_*`): turn on/off a section 1. arguments (`*_args`): customize each section 1. global settings (`global_*`): global theme settings To turn off a few sections and set a global theme: ```{r dr-configure-report} configure_report( add_plot_str = FALSE, add_plot_qq = FALSE, add_plot_prcomp = FALSE, add_plot_boxplot = FALSE, add_plot_scatterplot = FALSE, global_ggtheme = quote(theme_minimal(base_size = 14)) ) ``` The output can be passed directly to `config` argument in `create_repot`, e.g., ```{r dr-create-report-customize, eval=FALSE} config <- configure_report( add_plot_str = FALSE, add_plot_qq = FALSE, add_plot_prcomp = FALSE, add_plot_boxplot = FALSE, add_plot_scatterplot = FALSE, global_ggtheme = quote(theme_minimal(base_size = 14)) ) create_report(final_data, config = config) ``` To configure the report without using `configure_report` function, you may edit the template below and pass it directly to `config` argument: ```{r dr-configure-report-customize, eval=FALSE} config <- list( "introduce" = list(), "plot_intro" = list(), "plot_str" = list( "type" = "diagonal", "fontSize" = 35, "width" = 1000, "margin" = list("left" = 350, "right" = 250) ), "plot_missing" = list(), "plot_histogram" = list(), "plot_density" = list(), "plot_qq" = list(sampled_rows = 1000L), "plot_bar" = list(), "plot_correlation" = list("cor_args" = list("use" = "pairwise.complete.obs")), "plot_prcomp" = list(), "plot_boxplot" = list(), "plot_scatterplot" = list(sampled_rows = 1000L) ) create_report(final_data, config = config) ```