Jennifer Cheng - Chocolate Bars

Introduction

For the week of January 18, 2022, #TidyTuesday featured the “Chocolate Bar Ratings” dataset from Flavors of Cacao. The reviews span 2006 through 2021, and for each bar the details include the manufacturer and its location, the bean's country of origin, the ingredients, keyword descriptions of the bar’s “most memorable characteristics,” and, of course, the rating.

As noted on the website, the bars in the ratings database are a sampling, not a comprehensive assessment of chocolate bars, and each dark chocolate bar's rating is based on a single bar:

Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch.
(…)
The database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate.

Looking at a glimpse of the dataset…

library(tidyverse)
chocolate_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

glimpse(chocolate_raw)

Rows: 2,530
Columns: 10
$ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2…
$ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150…
$ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.…
$ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2…
$ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma…
$ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat…
$ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8…
$ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "…
$ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, …
$ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3…

…here are some initial observations about the variables:

Variable	Details
Manufacturer	580 companies
Company location	67 countries
Year of review	2006-2021
Country where bean originated	62 countries
Cocoa %	percentage of cocoa, the seeds (or beans) that produce chocolate
Ingredients	B = beans, S = sugar, S* = sweetener (not white cane or beet sugar), C = cocoa butter, V = vanilla, L = lecithin, Sa = salt
Most memorable characteristics	keywords
Rating	1-4
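
The counts in the table above can be reproduced with a few summary calls. The following is a minimal sketch, not the exact code used for the table: `toy_chocolate` is a hypothetical three-row stand-in that borrows the column names of `chocolate_raw`, so swapping in `chocolate_raw` should give the real figures. It also shows one way to turn `cocoa_percent` (stored as text such as "76%") into a number.

```r
library(dplyr)
library(tibble)
library(readr)

# Hypothetical stand-in for chocolate_raw (made-up rows, same column names)
toy_chocolate <- tibble(
  company_manufacturer   = c("A", "A", "B"),
  company_location       = c("U.S.A.", "U.S.A.", "France"),
  review_date            = c(2019, 2021, 2006),
  country_of_bean_origin = c("Tanzania", "Peru", "Peru"),
  cocoa_percent          = c("76%", "68%", "72%"),
  rating                 = c(3.25, 3.5, 2.75)
)

# One row of summary figures: distinct counts and ranges
overview <- toy_chocolate %>%
  summarise(
    manufacturers     = n_distinct(company_manufacturer),
    company_locations = n_distinct(company_location),
    bean_origins      = n_distinct(country_of_bean_origin),
    first_review      = min(review_date),
    last_review       = max(review_date),
    lowest_rating     = min(rating),
    highest_rating    = max(rating)
  )

# cocoa_percent is character; parse_number() drops the "%" sign
cocoa_numeric <- parse_number(toy_chocolate$cocoa_percent)

overview
cocoa_numeric
```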

World Map

After exploring the data, I went with mapping the countries in the dataset, color coding them by whether they:

only appear as the location of a manufacturing company
only appear as a bean origin country
or both

# Are there countries that appear in both columns? (33) Use #C0EDA6
countries_both <- chocolate_raw %>%
  select(company_location, country_of_bean_origin) %>%
  unique() %>%
  # isolate countries in company_location that also appear in country_of_bean_origin
  filter(company_location %in% unique(country_of_bean_origin)) %>%
  # isolate only company_location and rename
  select(country=company_location) %>%
  unique() %>%
  # add column that gives it green color to signify it appears in both vars
  add_column(country_type="both")

# countries that appear only in company_location (34) Use #8FBDD3
countries_manufacturer <- chocolate_raw %>%
  select(company_location, country_of_bean_origin) %>%
  unique() %>%
  filter(!company_location %in% unique(country_of_bean_origin)) %>%
  select(country=company_location) %>%
  unique() %>%
  add_column(country_type="manufacturer")

# countries that appear only in country_of_bean_origin (29) Use #FFF7BC
bean_countries <- chocolate_raw %>%
  select(company_location, country_of_bean_origin) %>%
  unique() %>%
  filter(!country_of_bean_origin %in% unique(company_location)) %>%
  select(country=country_of_bean_origin) %>%
  unique() %>%
  add_column(country_type="bean origin")

# combine above three into one df
chocolate_map <- rbind(countries_both, countries_manufacturer, bean_countries)
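
The same three-way split can also be expressed with set operations. This is an alternative sketch rather than the code used for the map: `toy_chocolate` is a hypothetical stand-in for `chocolate_raw`, and `makers`, `origins`, and `toy_map` are names introduced only for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for chocolate_raw (made-up rows)
toy_chocolate <- tibble(
  company_location       = c("U.S.A.", "France", "Ecuador", "Ecuador"),
  country_of_bean_origin = c("Ecuador", "Ghana", "Ecuador", "U.S.A.")
)

makers  <- unique(toy_chocolate$company_location)
origins <- unique(toy_chocolate$country_of_bean_origin)

# intersect() = in both columns; setdiff() = in one column but not the other
toy_map <- bind_rows(
  tibble(country = intersect(makers, origins), country_type = "both"),
  tibble(country = setdiff(makers, origins),   country_type = "manufacturer"),
  tibble(country = setdiff(origins, makers),   country_type = "bean origin")
)

toy_map
```

Because every country lands in exactly one of the three sets, the result has one row per country without any further de-duplication.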

1. Load world map via `rnaturalearth`

library(rnaturalearth)
world_map <- ne_countries(scale="medium",
                          type="map_units", # to include Mauritania
                          returnclass="sf")

ggplot() +
  geom_sf(data=world_map, size=0.25, fill="#eeeeee") +
  theme_void()

2. Data cleaning

Check whether any country names in chocolate_map do not match those in world_map, i.e., which country names in chocolate_map do not appear in world_map because of spelling differences or because world_map splits a country into more granular divisions. Goal: create a layer of just the countries in chocolate_map to place on top of the world_map basemap.

chocolate_map %>% filter(!country %in% unique(world_map$geounit))

# A tibble: 18 × 2
   country               country_type
   <chr>                 <chr>       
 1 U.S.A.                both        
 2 Sao Tome              both        
 3 St. Lucia             both        
 4 Sao Tome & Principe   both        
 5 St.Vincent-Grenadines both        
 6 U.K.                  manufacturer
 7 Belgium               manufacturer
 8 Amsterdam             manufacturer
 9 U.A.E.                manufacturer
10 Burma                 bean origin 
11 Trinidad              bean origin 
12 Blend                 bean origin 
13 Congo                 bean origin 
14 Tobago                bean origin 
15 Sumatra               bean origin 
16 Principe              bean origin 
17 Sulawesi              bean origin 
18 DR Congo              bean origin
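
The same mismatch check can be written as an anti-join, which keeps only the rows of the first table that have no match in the second. A minimal sketch on made-up data: `toy_chocolate_map` and `toy_world` are hypothetical stand-ins for `chocolate_map` and the `geounit` column of `world_map`.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for chocolate_map and world_map
toy_chocolate_map <- tibble(
  country      = c("U.S.A.", "France", "Burma"),
  country_type = c("both", "manufacturer", "bean origin")
)
toy_world <- tibble(
  geounit = c("United States of America", "France", "Myanmar")
)

# Rows of toy_chocolate_map whose country has no exact match in geounit
unmatched <- toy_chocolate_map %>%
  anti_join(toy_world, by = c("country" = "geounit"))

unmatched
```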

Renaming countries in chocolate_map to how they appear in world_map.

chocolate_map_edit <- chocolate_map %>%
  # rename countries
  mutate(country = str_replace_all(
    country,
    c(
      "U.S.A." = "United States of America", # old = new
      "Sao Tome$" = "Sao Tome and Principe",
      "St. Lucia" = "Saint Lucia",
      "Sao Tome & Principe" = "Sao Tome and Principe",
      "St.Vincent-Grenadines" = "Saint Vincent and the Grenadines",
      "U.A.E." = "United Arab Emirates",
      "Burma" = "Myanmar",
      "^Trinidad$" = "Trinidad and Tobago",
      "^Congo$" = "Republic of Congo",
      "^Tobago$" = "Trinidad and Tobago",
      "^Principe$" = "Sao Tome and Principe",
      "DR Congo" = "Democratic Republic of the Congo"
    )
  )) %>%
  # renaming "non-countries" as the country in which they're located
  mutate(country = str_replace_all(
    country,
    c(
      "Amsterdam" = "Netherlands",
      "Sumatra" = "Indonesia",
      "Sulawesi" = "Indonesia"
    )
  )) %>%
  # remove country of "Blend" and also "Scotland" and "Wales" as the latter two will be represented by "United Kingdom" (all three are only manufacturing countries, plus "England" and "Northern Ireland" were not included in the chocolate data)
  filter(!country %in% c("Blend", "Scotland", "Wales")) %>%
  # renamings will produce duplicates so remove those
  unique()
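
One thing to keep in mind with str_replace_all() is that the names of the replacement vector are treated as regular expressions, so a "." matches any character and anchors such as ^ and $ are needed to avoid partial matches. A sketch of an alternative that matches whole names exactly, using a named lookup vector: `lookup` and `countries` are hypothetical and cover only a few of the renamings above.

```r
library(dplyr)

# Hypothetical lookup: old name = new name (exact, whole-string matches only)
lookup <- c(
  "U.S.A."   = "United States of America",
  "Burma"    = "Myanmar",
  "Trinidad" = "Trinidad and Tobago",
  "Tobago"   = "Trinidad and Tobago"
)

countries <- c("U.S.A.", "Burma", "Trinidad", "Tobago", "France")

# lookup[countries] is NA where no rename exists; coalesce() then
# falls back to the original name
renamed <- coalesce(unname(lookup[countries]), countries)

renamed
unique(renamed)
```

As with the regex version, merging names (here Trinidad and Tobago) produces duplicates, hence the final unique().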

Check: Rerun the check from above, this time on chocolate_map_edit, to see if any of its countries remain off of world_map.

chocolate_map_edit %>% filter(!country %in% unique(world_map$geounit))

Given the renaming above, the rows expected to remain are U.K. and Belgium, which are handled next.

Remove U.K. and Belgium from chocolate_map_edit; they get their own layer below.

chocolate_map_edit <- chocolate_map_edit %>% filter(!country %in% c("U.K.", "Belgium"))

Workaround: Add a third layer for Belgium and the UK. Their geounit values in world_map refer to regions within those countries, so the admin column is used here in place of geounit.

chocolate_map <- chocolate_map %>%
  # fixed() matches "U.K." literally instead of treating "." as a regex wildcard
  mutate(country = str_replace(country, fixed("U.K."), "United Kingdom"))

bel_uk_layer <- world_map %>%
  filter(admin %in% c("Belgium", "United Kingdom")) %>%
  left_join(chocolate_map, by=c("admin"="country"))

right_join() the world_map and chocolate_map_edit dataframes in order to add geographic components to chocolate_map_edit.

combo_map <- world_map %>% right_join(chocolate_map_edit, by=c("geounit"="country"))
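
To see what the right join keeps and drops, here is a small sketch with plain tibbles in place of the sf object. `toy_world`, `toy_chocolate_map`, and the `shape_id` column (standing in for the geometry) are all hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for world_map: one row per country, shape_id
# plays the role of the geometry column
toy_world <- tibble(
  geounit  = c("France", "Ghana", "Japan"),
  shape_id = 1:3
)

# Hypothetical stand-in for chocolate_map_edit
toy_chocolate_map <- tibble(
  country      = c("France", "Ghana"),
  country_type = c("manufacturer", "bean origin")
)

# right_join() keeps every row of the right-hand table and attaches the
# matching world columns; Japan is dropped because it has no match
toy_combo <- toy_world %>%
  right_join(toy_chocolate_map, by = c("geounit" = "country"))

toy_combo
```

Putting world_map on the left means the result starts from the map object, so with the real data it should still carry the geometry needed by geom_sf().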

3. Chocolate map

Adding the three layers: (1) world_map: Base world map + (2) combo_map: Countries from the chocolate data + (3) bel_uk_layer: Belgium and the UK

ggplot() +
  geom_sf(data=world_map, size=0.25, fill="#eeeeee") +
  geom_sf(data=combo_map, aes(fill=country_type), size=0.25, show.legend=FALSE) +
  geom_sf(data=bel_uk_layer, aes(fill=country_type), size=0.25, show.legend=FALSE) +
  scale_fill_manual(values = c("both" = "#C0EDA6", # df value = color
                               "manufacturer" = "#8FBDD3",
                               "bean origin" = "#FFF7BC")) +
  theme_void()

Out of Curiosity

Are there words disproportionately associated with the country where a bean originated or with the country of the manufacturing company?

library(tidytext)

chocolate_raw %>%
  unnest_tokens(word, most_memorable_characteristics) %>%
  count(country_of_bean_origin, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = country_of_bean_origin, n) %>% # calculates TF-IDF
  arrange(desc(tf_idf)) %>%
  #top_n(20, wt = tf_idf) %>%
  filter(n>20)

# A tibble: 36 × 6
   country_of_bean_origin word       n     tf   idf tf_idf
   <chr>                  <chr>  <int>  <dbl> <dbl>  <dbl>
 1 Papua New Guinea       smoke     27 0.155  1.13  0.176 
 2 Ecuador                floral    82 0.112  0.795 0.0893
 3 Madagascar             red       25 0.0394 1.82  0.0719
 4 Madagascar             sour      44 0.0694 0.631 0.0438
 5 Venezuela              nutty     88 0.102  0.389 0.0397
 6 Madagascar             tart      23 0.0363 1.08  0.0393
 7 Blend                  bitter    24 0.0440 0.795 0.0350
 8 Ecuador                bitter    30 0.0411 0.795 0.0327
 9 Blend                  sweet     31 0.0569 0.490 0.0278
10 Madagascar             fruit     36 0.0568 0.490 0.0278
# … with 26 more rows
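
For reference, the tf, idf, and tf_idf columns can be computed by hand: tf is a word's share of all words for that origin, and idf is the natural log of the number of origins divided by the number of origins that use the word. A sketch on made-up counts: `toy_counts` is hypothetical, with `origin` standing in for country_of_bean_origin.

```r
library(dplyr)
library(tibble)

# Hypothetical word counts per bean origin
toy_counts <- tribble(
  ~origin, ~word,   ~n,
  "A",     "smoke", 3,
  "A",     "sweet", 1,
  "B",     "sweet", 2,
  "B",     "nutty", 2
)

n_docs <- n_distinct(toy_counts$origin)

manual_tf_idf <- toy_counts %>%
  group_by(origin) %>%
  mutate(tf = n / sum(n)) %>%                         # share within the origin
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(origin))) %>%  # rarer across origins = higher
  ungroup() %>%
  mutate(tf_idf = tf * idf) %>%
  arrange(desc(tf_idf))

manual_tf_idf
```

A word used by every origin ("sweet" here) gets an idf of 0, so its tf_idf is 0 no matter how often it appears.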

chocolate_raw %>%
  unnest_tokens(word, most_memorable_characteristics) %>%
  #count(word, country_of_bean_origin, sort = TRUE) %>%
  count(word, company_location, sort = TRUE) %>%
  filter(word == "smoke")

# A tibble: 14 × 3
   word  company_location     n
   <chr> <chr>            <int>
 1 smoke U.S.A.              26
 2 smoke France              10
 3 smoke Canada               5
 4 smoke Italy                4
 5 smoke New Zealand          4
 6 smoke Australia            3
 7 smoke U.K.                 3
 8 smoke Austria              2
 9 smoke Colombia             2
10 smoke Japan                2
11 smoke Ecuador              1
12 smoke Germany              1
13 smoke U.A.E.               1
14 smoke Venezuela            1

# instead of going by word units, break up by placement of comma => there will be one word, two words, etc.
chocolate_raw %>%
  select(most_memorable_characteristics, rating) %>%
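  # split at comma; pieces after the first keep a leading space (e.g. " banana")
  separate_rows(most_memorable_characteristics, sep = ',', convert = TRUE) %>%
  #filter(str_detect(most_memorable_characteristics, regex(" "))) %>%
  group_by(most_memorable_characteristics) %>%
  # note: despite its name, mean_rating holds the median rating per characteristic
  summarize(characteristic_count = n(), mean_rating = median(rating)) %>%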
  filter(characteristic_count>20) %>% arrange(desc(mean_rating))

# A tibble: 71 × 3
   most_memorable_characteristics characteristic_count mean_rating
   <chr>                                         <int>       <dbl>
 1 " banana"                                        33         3.5
 2 " cherry"                                        27         3.5
 3 " citrus"                                        21         3.5
 4 " cocoa"                                        210         3.5
 5 " creamy"                                        26         3.5
 6 " dairy"                                         34         3.5
 7 " dried fruit"                                   46         3.5
 8 " fruity"                                        37         3.5
 9 " honey"                                         26         3.5
10 " melon"                                         22         3.5
# … with 61 more rows
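
The leading spaces in the output above come from splitting on the comma alone, which means " fatty" and "fatty" would be counted as different characteristics. One way to avoid that is to let the separator absorb the whitespace after each comma. A sketch on made-up rows: `toy_chocolate` is a hypothetical stand-in for `chocolate_raw`, and the summary column is named `median_rating` here to match what it computes.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for chocolate_raw
toy_chocolate <- tibble(
  most_memorable_characteristics = c("rich cocoa, fatty, bready", "fatty, sweet"),
  rating = c(3.25, 3.5)
)

toy_summary <- toy_chocolate %>%
  # sep is a regular expression: a comma plus any whitespace that follows it
  separate_rows(most_memorable_characteristics, sep = ",\\s*") %>%
  group_by(most_memorable_characteristics) %>%
  summarize(characteristic_count = n(), median_rating = median(rating)) %>%
  arrange(desc(characteristic_count))

toy_summary
```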

chocolate_raw %>%
  separate_rows(most_memorable_characteristics, sep = ',', convert = TRUE) %>%
  mutate(most_memorable_characteristics = str_squish(most_memorable_characteristics)) %>%
  count(country_of_bean_origin, most_memorable_characteristics, sort = TRUE) %>%
  bind_tf_idf(term = most_memorable_characteristics, document = country_of_bean_origin, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(n>20)

# A tibble: 21 × 6
   country_of_bean_origin most_memorable_characteris…¹     n     tf   idf tf_idf
   <chr>                  <chr>                        <int>  <dbl> <dbl>  <dbl>
 1 Ecuador                floral                          72 0.12   0.831 0.0998
 2 Venezuela              nutty                           84 0.116  0.414 0.0481
 3 Madagascar             sour                            27 0.0557 0.795 0.0443
 4 Blend                  sweet                           27 0.0609 0.490 0.0298
 5 Venezuela              creamy                          33 0.0457 0.601 0.0275
 6 Blend                  cocoa                           26 0.0587 0.438 0.0257
 7 Dominican Republic     earthy                          35 0.0546 0.464 0.0253
 8 Ecuador                spicy                           21 0.035  0.693 0.0243
 9 Venezuela              roasty                          33 0.0457 0.516 0.0236
10 Dominican Republic     spicy                           21 0.0328 0.693 0.0227
# … with 11 more rows, and abbreviated variable name
#   ¹most_memorable_characteristics
