
How to Explore Data: {DataExplorer} Package

Let’s get started by loading our packages and importing a bit of data.

2.1 Load Packages

# Core Packages
library(tidyverse)
library(tidyquant)
library(recipes)
library(rsample)
library(knitr)

# Data Cleaning
library(janitor)

# EDA
library(skimr)
library(DataExplorer)

# ggplot2 Helpers
library(scales)
theme_set(theme_tq())

2.2 Import Data

For our case study, we are using data from the Tidy Tuesday Project archive.

Each record represents bags of coffee that were assessed and “professionally rated on a 0-100 scale.” Each row has a score that originated from assessing X number of bags of coffee beans.

Out of the many features in the data set, there are 10 numeric metrics that, when summed, make up the coffee rating score (total_cup_points).
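As a quick sanity check of that claim, the sketch below sums ten metrics for the first record shown in the glimpse() output later in this post. Which ten columns count as the metrics is my assumption based on the column names; the values are copied from that output.

```r
# First record's values, copied from the glimpse() output in this post.
# Assumption: these ten columns are the metrics that make up the score.
first_record <- data.frame(
  aroma = 8.67, flavor = 8.83, aftertaste = 8.67, acidity = 8.75,
  body = 8.50, balance = 8.42, uniformity = 10, clean_cup = 10,
  sweetness = 10, cupper_points = 8.75
)

metric_sum <- rowSums(first_record)
reported_total <- 90.58  # total_cup_points for the same record

# The two agree up to rounding of the individual metrics
abs(metric_sum - reported_total) < 0.02
```

The small gap between the summed metrics and the reported score is consistent with the individual metrics having been rounded to two decimals.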

tuesdata 
## Downloading file 1 of 1: `coffee_ratings.csv`
coffee_ratings_tbl 
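A minimal sketch of one way to pull this file is shown below. It assumes the {tidytuesdayR} package is installed and that the coffee ratings belong to the Tidy Tuesday release dated 2020-07-07; neither detail is stated in this post, and load_coffee_ratings() is a hypothetical helper name.

```r
# Hypothetical helper: downloads a Tidy Tuesday release and returns the
# coffee ratings table. Requires an internet connection.
# Assumptions: {tidytuesdayR} is installed and the data set belongs to
# the release dated 2020-07-07.
load_coffee_ratings <- function(release_date = "2020-07-07") {
  tuesdata <- tidytuesdayR::tt_load(release_date)
  tuesdata$coffee_ratings
}

# Not run here because it needs network access:
# coffee_ratings_tbl <- load_coffee_ratings()
```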

2.3 Data Caveats

If you have all 10 metrics, you don’t need a model to predict total_cup_points: the score is simply their sum.

That said, this post is about preprocessing data in preparation for analysis and/or predictive modeling. I chose these data for the case study because their many characteristics and features help illustrate the benefits of {DataExplorer}.

To illustrate the benefits, we assume total_cup_points is our target (dependent variable) and that all others are potential predictors (independent variables).
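One way to express that target/predictor split in code is a {recipes} formula. The sketch below uses a three-row toy table (toy_tbl is a hypothetical name; the values are copied from the glimpse() output later in this post) rather than the full data set.

```r
library(recipes)

# Toy stand-in for the coffee data (values copied from the glimpse() output)
toy_tbl <- data.frame(
  total_cup_points = c(90.58, 89.92, 89.75),
  aroma            = c(8.67, 8.75, 8.42),
  flavor           = c(8.83, 8.67, 8.50)
)

# The formula marks total_cup_points as the outcome and all other
# columns as predictors
rec <- recipe(total_cup_points ~ ., data = toy_tbl)

# Inspect the role assigned to each variable
roles <- summary(rec)[, c("variable", "role")]
roles
```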

Let’s get to work!

2.4 Preprocessing Pipeline

As usual, let’s set up our data preprocessing pipeline so that we can add to it as we gain insights.

Read This Post to learn more about my approach to preprocessing data.

coffee_ratings_preprocessed_tbl <- coffee_ratings_tbl

3.0 Case-Study Objectives

  1. Rapidly assess data.
  2. Gain insights that help preprocess data.

Let’s see how {DataExplorer} can expedite the process.

As usual, let’s take a glimpse() of our data to see how we should proceed.

coffee_ratings_preprocessed_tbl %>% glimpse()
## Rows: 1,339
## Columns: 43
## $ total_cup_points  90.58, 89.92, 89.75, 89.00, 88.83, 88.83, 88.75…
## $ species  "Arabica", "Arabica", "Arabica", "Arabica", "Ar…
## $ owner  "metad plc", "metad plc", "grounds for health a…
## $ country_of_origin  "Ethiopia", "Ethiopia", "Guatemala", "Ethiopia"…
## $ farm_name  "metad plc", "metad plc", "san marcos barrancas…
## $ lot_number  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ mill  "metad plc", "metad plc", NA, "wolensu", "metad…
## $ ico_number  "2014/2015", "2014/2015", NA, NA, "2014/2015", …
## $ company  "metad agricultural developmet plc", "metad agr…
## $ altitude  "1950-2200", "1950-2200", "1600 - 1800 m", "180…
## $ region  "guji-hambela", "guji-hambela", NA, "oromia", "…
## $ producer  "METAD PLC", "METAD PLC", NA, "Yidnekachew Dabe…
## $ number_of_bags  300, 300, 5, 320, 300, 100, 100, 300, 300, 50, …
## $ bag_weight  "60 kg", "60 kg", "1", "60 kg", "60 kg", "30 kg…
## $ in_country_partner  "METAD Agricultural Development plc", "METAD Ag…
## $ harvest_year  "2014", "2014", NA, "2014", "2014", "2013", "20…
## $ grading_date  "April 4th, 2015", "April 4th, 2015", "May 31st…
## $ owner_1  "metad plc", "metad plc", "Grounds for Health A…
## $ variety  NA, "Other", "Bourbon", NA, "Other", NA, "Other…
## $ processing_method  "Washed / Wet", "Washed / Wet", NA, "Natural / …
## $ aroma  8.67, 8.75, 8.42, 8.17, 8.25, 8.58, 8.42, 8.25,…
## $ flavor  8.83, 8.67, 8.50, 8.58, 8.50, 8.42, 8.50, 8.33,…
## $ aftertaste  8.67, 8.50, 8.42, 8.42, 8.25, 8.42, 8.33, 8.50,…
## $ acidity  8.75, 8.58, 8.42, 8.42, 8.50, 8.50, 8.50, 8.42,…
## $ body  8.50, 8.42, 8.33, 8.50, 8.42, 8.25, 8.25, 8.33,…
## $ balance  8.42, 8.42, 8.42, 8.25, 8.33, 8.33, 8.25, 8.50,…
## $ uniformity  10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ clean_cup  10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ sweetness  10.00, 10.00, 10.00, 10.00, 10.00, 10.00, 10.00…
## $ cupper_points  8.75, 8.58, 9.25, 8.67, 8.58, 8.33, 8.50, 9.00,…
## $ moisture  0.12, 0.12, 0.00, 0.11, 0.12, 0.11, 0.11, 0.03,…
## $ category_one_defects  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quakers  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ color  "Green", "Green", NA, "Green", "Green", "Bluish…
## $ category_two_defects  0, 1, 0, 2, 2, 1, 0, 0, 0, 4, 1, 0, 0, 2, 2, 0,…
## $ expiration  "April 3rd, 2016", "April 3rd, 2016", "May 31st…
## $ certification_body  "METAD Agricultural Development plc", "METAD Ag…
## $ certification_address  "309fcf77415a3661ae83e027f7e5f05dad786e44", "30…
## $ certification_contact  "19fef5a731de2db57d16da10287413f5f99bc2dd", "19…
## $ unit_of_measurement  "m", "m", "m", "m", "m", "m", "m", "m", "m", "m…
## $ altitude_low_meters  1950.0, 1950.0, 1600.0, 1800.0, 1950.0, NA, NA,…
## $ altitude_high_meters  2200.0, 2200.0, 1800.0, 2200.0, 2200.0, NA, NA,…
## $ altitude_mean_meters  2075.0, 2075.0, 1700.0, 2000.0, 2075.0, NA, NA,…

Wow, 43 columns!

Many of these are clearly unnecessary, so let’s get to work reducing them to something more meaningful.
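To decide which columns are candidates for removal, a quick missingness profile can help. The sketch below runs two {DataExplorer} functions, introduce() and profile_missing(), on a four-row toy table (toy_tbl is a hypothetical name; the values are copied from the glimpse() output above), so the numbers it prints describe the toy data only.

```r
library(DataExplorer)

# Toy stand-in for a few columns (values copied from the glimpse() output)
toy_tbl <- data.frame(
  total_cup_points = c(90.58, 89.92, 89.75, 89.00),
  lot_number       = c(NA, NA, NA, NA),
  mill             = c("metad plc", "metad plc", NA, "wolensu"),
  aroma            = c(8.67, 8.75, 8.42, 8.17)
)

# One-row overview: row/column counts, missing values, complete rows
introduce(toy_tbl)

# Per-column missingness, useful for spotting columns to drop
missing_tbl <- profile_missing(toy_tbl)
missing_tbl
```

In this toy table, lot_number is missing in every row, which is the kind of signal that would flag a column for removal.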

We can begin by removing a few columns, so let’s add that step to our preprocessing.

coffee_ratings_preprocessed_tbl <- coffee_ratings_tbl %>%
    # remove columns
    select(-contains("certification"), -in_country_partner)
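To see what that select() step removes, here is a small sketch on a zero-row toy table that reuses a handful of the column names from the glimpse() output (toy_tbl and trimmed_tbl are hypothetical names). Based on the names listed by glimpse(), the same call on the full data would drop four columns, the three certification_* columns plus in_country_partner, leaving 39 of the 43.

```r
library(dplyr)

# Zero-row toy table with a handful of the column names from glimpse()
toy_tbl <- tibble(
  total_cup_points      = numeric(),
  species               = character(),
  in_country_partner    = character(),
  certification_body    = character(),
  certification_address = character(),
  certification_contact = character()
)

# Drop every column whose name contains "certification",
# plus in_country_partner
trimmed_tbl <- toy_tbl %>%
  select(-contains("certification"), -in_country_partner)

names(trimmed_tbl)
```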
