Data viz | Amy Mitchell-Whittington

Hot Ones

Tue, 08 Aug 2023 00:00:00 +0000

Hot Ones

The data this week comes from Wikipedia articles: Hot Ones and List of Hot Ones episodes.

Hot Ones is an American YouTube talk show, created by Chris Schonberger, hosted by Sean Evans and produced by First We Feast and Complex Media. Its basic premise involves celebrities being interviewed by Evans over a platter of increasingly spicy chicken wings.

I probably watched way too many episodes of Hot Ones while creating this visualisation, which I’m not mad about. It’s a pretty funny show - I really just skipped to the end of each episode to see how everyone handled #10, the hottest wing.

For my visualisation, I wanted to get an idea of whether the spice levels increased by season. Each season uses the same 10 sauces across each of its episodes, so I didn’t need to worry about variation between episodes, just seasons. I used the sauces data to sum every 10 Scoville ratings (equivalent to the 10 sauces used per season), averaged each sum and then mapped the results to a data frame, which I used to plot my visualisation.

A point to remember: if I want to sum every n consecutive numbers in a vector v (as I did for this), I need to reshape v into a matrix with n rows and use colSums. If I want to sum every nth number, I need to use rowSums on the same matrix, but this will only work if n divides length(v).
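A minimal sketch of that trick, using a made-up vector rather than the real sauces data (the names `v`, `n` and `m` are only for illustration):

```r
# Toy vector standing in for a column of ratings (not the real Scoville data)
v <- 1:12
n <- 3

# Filling a matrix column by column puts each run of n consecutive values in one column
m <- matrix(v, nrow = n)

consecutive_sums <- colSums(m)   # sums of v[1:3], v[4:6], v[7:9], v[10:12]
every_nth_sums   <- rowSums(m)   # sums of v[c(1,4,7,10)], v[c(2,5,8,11)], v[c(3,6,9,12)]
chunk_means      <- colMeans(m)  # average of each consecutive chunk
```

If length(v) is not a multiple of n, matrix() recycles values and warns, which is why the divisibility condition matters.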

I originally wanted to see what the % change of the average scoville score was season on season, but when I mapped it out, it looked a little too complicated. I’m including it here for reference:

GPT Detectors

Tue, 18 Jul 2023 00:00:00 +0000

GPT Detectors

The data this week comes from Simon Couch’s detectors R package, containing predictions from various GPT detectors. The data is based on the pre-print: GPT Detectors Are Biased Against Non-Native English Writers. Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, James Zou.

Language model-based chatbot ChatGPT is reportedly the fastest-growing consumer application in history, after attracting 100 million active users just two months after it was launched. It’s easy to see why. The ability of such models to create large amounts of content in a short period of time has opened the door wide in terms of boosting productivity and creativity. From building cover letters for job applications to setting up company OKRs and everything in between, people are using it in all kinds of ways. AI is even being used to detect AI-generated content, but how accurate has it been so far?

Liang et al. set out to test the accuracy of several GPT detectors:

The study authors carried out a series of experiments passing a number of essays to different GPT detection models. Juxtaposing detector predictions for papers written by native and non-native English writers, the authors argue that GPT detectors disproportionately classify real writing from non-native English writers as AI-generated.

From the data, I was able to determine that GPT detectors correctly identified native English writers in 97% of cases, with only 3% being misclassified as AI-generated. However, the same was not true for non-native English writing samples, which were only correctly identified in 39% of cases.

Interestingly, the detectors incorrectly identified AI-generated content as being created by a human in 69% of cases.

predicted_class	native_English_text	non.native_English_text	AI_text
Human	97	39	69
AI	3	61	31

I used a waffle chart to display the data because I think it’s a great way to visualise the small number of categories and is easy to interpret and understand. This was my first waffle chart, so there was a lot of trial and error! I struggled to work out the correct variables to map at first (I kept getting a reps() error), but once that was sorted it was relatively smooth sailing.

I worked out how to reformat my data by changing from a “wide” format with each variable in its own column to a “long” format, so I could facet the data correctly. Then I changed the titles
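As a rough sketch of that wide-to-long step, the summary table above can be rebuilt by hand and reshaped with tidyr::pivot_longer(). The object names `gpt_wide` and `gpt_long` are made up here; the `measure` and `value` column names are chosen to match the plotting code further down, and this is not necessarily how the original data frame was built.

```r
library(tidyr)

# Wide format: one column per type of writing sample (values from the table above)
gpt_wide <- data.frame(
  predicted_class = c("Human", "AI"),
  native_English_text = c(97, 3),
  non.native_English_text = c(39, 61),
  AI_text = c(69, 31)
)

# Long format: one row per predicted_class / measure pair, ready for facet_wrap(~measure)
gpt_long <- pivot_longer(
  gpt_wide,
  cols = -predicted_class,
  names_to = "measure",
  values_to = "value"
)
```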

I also discovered theme(plot.margin = unit()), which I used to change the margins of my plot.
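For reference, a small sketch of how plot.margin can be set, using the built-in mtcars data rather than the detector data. The four values are the top, right, bottom and left margins, and the sizes here are arbitrary.

```r
library(ggplot2)

# Toy plot with 1 cm top/bottom margins and 2 cm left/right margins
p_margin <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme(plot.margin = unit(c(1, 2, 1, 2), "cm"))
```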

I originally opted to use the iron() function to knit all three waffle charts together, but it didn’t give me as much freedom to tweak and alter the plot as facet_wrap(), so I switched it up. I also chose not to include the AI-generated content waffle chart, because it seemed to make the chart more complicated (with the headline/subline especially) than I wanted it to be.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(ggplot2)
library(waffle)

# human and AI
ggplot(gpt2datalong, aes(values = value, fill = predicted_class)) +
  geom_waffle(rows = 5, na.rm = FALSE, show.legend = TRUE, flip = TRUE, colour = "white") +
  facet_wrap(~measure) +
  coord_equal() +
  # theme_void() is a complete theme, so it goes first; placed after theme() it would reset those tweaks
  theme_void() +
  theme(panel.spacing.x = unit(0, "npc")) +
  theme(strip.text.x = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("sienna3", "royalblue")) +
  labs(
    title = "How accurate are GPT detectors at correctly classifying human-written content vs. AI generated content?",
    subtitle = "",
    fill = "GPT Classification") +
  theme(
    plot.title = element_text(size = 8, hjust = 0),
    plot.subtitle = element_text(size = 5, face = "italic"),
    legend.title = element_text(size = 8))

Scurvy

Tue, 18 Jul 2023 00:00:00 +0000

Scurvy

The data this week comes from the medicaldata R package. This is a data package from Peter Higgins, with 19 medical data sets for teaching Reproducible Medical Research with R.

The specific data set I visualised this week is from a study published in 1757 in A Treatise on the Scurvy in Three Parts, by James Lind. I would suggest taking a read of Peter Higgins’ post on the study; it’s quite interesting.

This data set contains 12 participants with scurvy. In 1757, it was not known that scurvy is a manifestation of vitamin C deficiency. A variety of remedies had been anecdotally reported, but Lind was the first to test different regimens of acidic substances (including citrus fruits) against each other in a randomized, controlled trial. 6 distinct therapies were tested in 12 seamen with symptomatic scurvy, who were selected for similar severity.

Of note, there is some dispute about whether this was truly the first clinical trial, or whether it actually happened, as there are no contemporaneous corroborating accounts.

I really enjoyed reading a little into this study and the data, especially the dosage for each of the treatments. I didn’t include them in the end graph, so as not to crowd out the information too much, but I’ll add them here, just for interest.

Also, worth noting: Based on this study, only one of the 12 seamen showed no symptoms after treatment - he was treated with two lemons and an orange, daily.

treatment	dosing_regimen_for_scurvy
cider	1 quart per day
cider	1 quart per day
dilute sulfuric acid	25 drops of elixir of vitriol, three times a day
dilute sulfuric acid	25 drops of elixir of vitriol, three times a day
vinegar	two spoonfuls, three times daily
vinegar	two spoonfuls, three times daily
sea water	half pint daily
sea water	half pint daily
citrus	two lemons and an orange daily
citrus	two lemons and an orange daily
purgative mixture	a nutmeg-sized paste of garlic, mustard seed, horseradish, balsam of Peru, and gum myrrh three times a day
purgative mixture	a nutmeg-sized paste of garlic, mustard seed, horseradish, balsam of Peru, and gum myrrh three times a day

I saw Nicola Rennie’s wonderful visualisation for this data set and wanted to set out to recreate something similar!

In terms of cleaning the data this week (it came with a lot of “_” and numbers), I familiarised myself with the stringr package, especially the str_replace_all() function, which was fun. That said, I struggled to apply this function to more than one variable at a time, so I had to enter each replacement manually.
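One possible way around the manual replacements, sketched on a made-up data frame: apply the same str_replace_all() call to several columns with lapply(), and pass a named vector when several patterns need replacing in one go. The toy values and column names below are assumptions modelled on the “_” and number prefixes mentioned above, not a copy of the real data.

```r
library(stringr)

# Hypothetical rows imitating the number-and-underscore prefixes
toy <- data.frame(
  treatment = c("cider", "citrus"),
  gum_rot_d6 = c("1_mild", "0_none"),
  skin_sores_d6 = c("2_moderate", "0_none")
)

# Apply one replacement to several columns at once: strip the leading digits and underscore
symptom_cols <- c("gum_rot_d6", "skin_sores_d6")
toy[symptom_cols] <- lapply(
  toy[symptom_cols],
  str_replace_all,
  pattern = "^\\d+_",
  replacement = ""
)

# A named vector applies several pattern = replacement pairs in order
clean_name <- str_replace_all(
  "weakness_of_the_knees_d6",
  c("_d6$" = "", "_" = " ")
)
```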

At first, I thought it would be fun to use the geom_lime() function (each point is the shape of a lime on a graph), but realised I couldn’t map the symptom severity to the size of each lime so I scrapped that idea. I am on the lookout for a data set to use it on in the future!

I wanted to try out different shapes as a way to visualise the severity of symptoms after the treatments, but given there was more than one sailor for each treatment, it meant there was an overlap of shapes, which I thought looked confusing. For example, one sailor being treated with vinegar might still have severe gum rot while another also being treated with vinegar might only have mild gum rot.

While I ended up representing symptom severity by size, I wasn’t 100% happy with the outcome, so I rearranged the y-axis to have cider and citrus (the two best performers) at the top.

Perhaps it would have been better to signify the symptoms of each of the 12 patients. Might be something to consider with a different visualisation.

U.S. Historical Markers

Tue, 27 Jun 2023 00:00:00 +0000

U.S. Historical Markers

The data this week comes from the Historical Marker Database USA Index.

This searchable online catalogue was so fun to explore! I didn’t realise there was such a thing online (probably because there are only 36 Australian markers recorded on the site compared to 183k U.S. markers).

According to the database, it is “an illustrated searchable online catalog of historical information viewed through the filter of roadside and other permanent outdoor markers, monuments, and plaques”. Anyone can contribute, either by adding new markers or updating existing marker pages.

For this Tidy Tuesday project, only the U.S. marker database was provided and what caught my eye the most was the column relating to missing markers.

I wanted to use percentages because I felt they were the best way to represent the data, and I was surprised by the number of markers that were reported or confirmed missing. I wonder what happened to them?

This was also a great way to test out the labs() and annotate() functions and play around with positioning.

U.S. Populated Places

Tue, 27 Jun 2023 00:00:00 +0000

U.S. Populated Places

Data this week comes from the National Map Staged Products Directory from the US Board on Geographic Names.

This was a fun #TidyTuesday data set to work with because it timed well with some friends who flew to the U.S. last week to have an Elvis wedding in Las Vegas 💞

They’re renting an RV and hitting the road, with plans to stop at a few major national parks in the area, so I thought I would use their honeymoon travel itinerary as a guide to visualising this week’s U.S. Populated Places data.

Using the usmap package was a little tricky, as I had to transform my data so I could plot it properly. I think after my attempt at using a different maps package to plot reported UFO sightings in Australia last week, I might need to try out ggmap to see if it’s a little more streamlined.

I had such a hard time trying to work out why there was so much blank space around my map, see here:

But, after A LOT OF GOOGLING, I found this:

The Cartesian coordinate system is the most familiar, and common, type of coordinate system. Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will. Via this ggplot2 ref. site

So I added coord_cartesian() to my ggplot and voila! I now need to do some digging into how EXACTLY this works 😅
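To illustrate the difference the quote describes, here is a small sketch on the built-in mtcars data (not the map itself). Both plots show the same x range, but only the scale version throws away the points outside it before the trend line is fitted.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Zooms the view: every row is still used to fit the line
p_zoom <- base_plot + coord_cartesian(xlim = c(2, 4))

# Limits the scale: rows outside 2-4 are dropped before the line is fitted
p_limits <- base_plot + scale_x_continuous(limits = c(2, 4))
```

Printing the two plots side by side should show slightly different trend lines, since the second one is fitted on fewer cars.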

Reported UFO Sightings in Australia

Tue, 20 Jun 2023 00:00:00 +0000

Reported UFO Sightings

This week’s Tidy Tuesday data comes from NUFORC and includes more than 80,000 recorded UFO sightings around the world. I had a lot of fun with this, especially reading some of the descriptions from reported sightings in Australia:

“It was a huge black round thing and it was leaving a green trail of smoke behind it. It made a buzzing sound.”

“A brilliant blue-white light performs amazing acrobatics for an hour and a half over the city of Brisbane.”

“Fireballs dance in the sky over Sydney, Australia”

I was interested in the shapes people were reporting, as it wasn’t all just discs or flying saucers. People reported seeing eggs, triangles, formations, cigars and more.

To get a better sense of the shapes, I decided to group them into three categories - I am not sure I really did this grouping enough justice, to be honest, but I do think it managed to simplify the shapes for a graph format. Here’s my code for grouping the shapes:

# Sort UFO shapes sighted in Australia into three categories and count them by year:
library(dplyr)

ufo_aus_year <- ufo_sightings %>%
  filter(country == "au") %>%
  mutate(shape = case_when(
    ufo_shape %in% c("cylinder", "cigar", "dome", "circle", "teardrop", "fireball", "egg", "sphere", "disk", "round", "oval", "crescent") ~ "Round",
    ufo_shape %in% c("diamond", "hexagon", "rectangle", "chevron", "triangle", "cross", "pyramid", "delta", "cone") ~ "Angled",
    ufo_shape %in% c("light", "flash", "changed", "formation", "changing", "flare") ~ "Flash or changing light")) %>%
  select(date_time, shape, longitude, latitude) %>%
  # pull the year out of the month/day/year date string
  mutate(year = as.numeric(format(as.Date(date_time, format = "%m/%d/%Y"), "%Y"))) %>%
  # shapes outside the three groups (e.g. "other", "unknown") become NA in case_when(), so drop them
  filter(!is.na(shape)) %>%
  count(shape, year)

Alternatively, I could have put together a ranked list of the most-reported shapes (added below), but I didn’t have time to think about how to properly visualise it.

## ufo_shape n
## 1 light 119
## 2 circle 62
## 3 disk 50
## 4 triangle 43
## 5 fireball 34
## 6 oval 30
## 7 unknown 25
## 8 other 22
## 9 formation 20
## 10 cigar 15
## 11 sphere 15
## 12 egg 12
## 13 diamond 10
## 14 rectangle 10
## 15 teardrop 10
## 16 changing 9
## 17 cylinder 9
## 18 cone 6
## 19 flash 4
## 20 chevron 3
## 21 cross 1

I also attempted to map the sightings of these three shape categories to a map of Australia. I was super excited to learn how to use a map to portray data, but I think I should have used different data to map, made it more interactive, or looked at the data at state level (QLD, NSW, etc.) so it was clearer where these sightings were reported.

Another fun idea I had was to overlay flight path data onto the map to see if any of the reported sightings, especially light/flare sightings, were near a flight path. But that could be for another time.

Here’s my first attempt at using a map in my visualisations!

Food Insecurity in Mozambique and Tanzania

Tue, 13 Jun 2023 00:00:00 +0000

SAFI Survey

This week’s Tidy Tuesday data comes from the SAFI (Studying African Farmer-Led Irrigation) project team. The aim of SAFI is to better understand farming and small-scale irrigation methods used in rural areas of Africa to see if these methods can “offer a model for broad-based economic growth”.

Between 2016 and 2017, SAFI sent out a survey to households in Tanzania and Mozambique to learn more about each household (e.g., number of rooms and household members, type of home) and their agricultural practices (e.g., water usage, number of livestock).

It took me a little time to get back into the swing of things, given this was my first week back on the tools since 2021, so I kept it straightforward. I focused on a question in the survey where households were asked to “indicate which months, In the last 12 months have you faced a situation when you did not have enough food to feed the household?”. I found that across Tanzania and Mozambique, almost 75% of respondents said they’d had to go through at least one month without enough food for their household.

After some initial reading, I found that households in both Mozambique and Tanzania depend heavily on rain-fed agriculture, which can make livelihoods and food security vulnerable to climate change. If I’d had more time, I would have liked to pair my chart with precipitation data for the same period - and provide yearly averages - to see whether there was any correlation, but I ran out of time on this one!

Paralympians excelling in more than one sport

Tue, 03 Aug 2021 21:13:14 -0500

The Paralympic Games

This week’s Tidy Tuesday comes from the International Paralympic Committee. The data details all medal winners from the 1980-2016 Paralympic Games.

The Paralympics is the largest sports competition for athletes with an impairment worldwide. According to the IPC website, “It involves athletes from several impairment categories. The six main disability categories are: amputee, cerebral palsy, intellectual impairment, visually impaired, spinal injuries and Les Autres (French for “the others”, a category that includes conditions that do not fall into the categories mentioned before.)”

Researching the Paralympics, I found that a number of athletes competed not just in more than one event within a sport (e.g. butterfly and breaststroke) but also competed and won medals in more than one sport (e.g. athletics and table tennis).

When I started to analyse the data, I found 692 athletes won medals in more than one sport across the Paralympic Games between 1980 and 2016, compared to just seven Olympic athletes in that same time period. I decided to put together a simple connected scatterplot to showcase the percentage of Paralympic athletes who won medals in more than one sport at each of the Games, to best reflect any changes over the years.

Interestingly, Kyung Mook Kim, representing the Republic of Korea, has won medals at every Paralympic Games from 1992 to 2016, in table tennis and wheelchair tennis.

It seems many Paralympic athletes compete at elite levels in multiple sports at the Games.

Tug-of-War at the Olympics

Tue, 27 Jul 2021 21:13:14 -0500

Olympic medals

This week’s Tidy Tuesday data comes from Kaggle.

It’s an historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. As I went through the list of sports, I was surprised to see Tug-Of-War, so I decided to focus on that for my data visualization.

The event was part of the Games from 1900 to 1920.

The rules

The first team to pull the other team over a line six feet from their starting point was named the winner.
Matches had a five minute time limit. If neither team was pulled across the line, then the team that got the other team closest to that point was declared the winner.

The most successful country was Great Britain, which won five medals in total.

At the 1908 Games, held in London, three of the five competing teams were police departments - Liverpool Police, City of London Police, and Metro Police “K” Division. The other two teams were United States and Sweden.

According to the Tug of War Association London, the American team protested its first-round loss to the Liverpool Police team, claiming the Liverpool team’s service boots were “…so heavy in fact that it was only with great effort that they could lift their feet from the ground”. The protest was dismissed, and the American team withdrew from the competition.

The three police teams representing Great Britain went on to win gold, silver, and bronze that year.

I struggled quite a bit trying to work out how to fix the spacing issues between points on the graph.

People living in severe, extreme, or exceptional drought in the western states of the U.S.

Tue, 20 Jul 2021 21:13:14 -0500

U.S. drought: The number of people living in the western states of the U.S. impacted by drought

This week’s Tidy Tuesday data comes from the U.S. Drought Monitor.

The dataset details the drought level across U.S. states from 2001-2021. I wanted to look at the number of people who have experienced severe, extreme, and/or exceptional drought conditions over this time in the western states of the U.S.

This one was a bit tricky. I wanted to show how many people were affected by severe+ drought and decided to use the western states as an example because, at first glance, that area seemed to be hit quite hard by drought over the years.

Using the pop_pct would have been a lot easier to read at graph level, but I struggled to work out how I could add % of pop. for each drought level by each week to create the graph I wanted.
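One way this could be approached, sketched on made-up numbers: add up the people affected at each drought level per week, then divide by the combined population of the states. The `state_pop` column is hypothetical (I am not assuming the Drought Monitor data includes it), `valid_start`, `state_abb`, `drought_lvl` and `pop_total` are assumed column names, and the figures are invented.

```r
library(dplyr)

# Toy data: two states, two weeks, two drought levels (populations in millions, invented)
drought_toy <- data.frame(
  valid_start = as.Date(rep(c("2021-07-06", "2021-07-13"), each = 4)),
  state_abb = rep(c("CA", "CA", "NV", "NV"), 2),
  drought_lvl = rep(c("D2", "D3"), 4),
  pop_total = c(10, 20, 1, 2, 12, 24, 1, 3),
  state_pop = rep(c(40, 40, 3, 3), 2)  # hypothetical total population per state
)

# Share of the combined population at each drought level, per week
weekly_pct <- drought_toy %>%
  group_by(valid_start, drought_lvl) %>%
  summarise(
    pop_affected = sum(pop_total),
    region_pop = sum(state_pop),
    .groups = "drop"
  ) %>%
  mutate(pop_pct = 100 * pop_affected / region_pop)
```

Averaging the per-state percentages directly would treat a small state the same as a large one, which is why this sketch weights by population instead.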

The outcome was a stacked graph that ended up looking quite messy. This was a good learning curve.

Data Reference: The U.S. Drought Monitor is jointly produced by the National Drought Mitigation Center at the University of Nebraska-Lincoln, the United States Department of Agriculture, and the National Oceanic and Atmospheric Administration. Map courtesy of NDMC.

Scooby Doo monster motives

Tue, 13 Jul 2021 21:13:14 -0500

My first Tidy Tuesday submission

These past few months I’ve been playing around a lot in RStudio.

I really wanted to get stuck into data visualization, so I figured the best way would be to challenge myself weekly with TidyTuesday, a weekly data project in R from the R4DS community.

So, here’s my first of hopefully many submissions. Luckily, I started off with a fun one - data from every single Scooby Doo episode and movie since 1969.
The dataset came from Kaggle, manually aggregated by plummye.

For this one, I wanted to see whether the motives changed for the Scooby Doo monsters over the decades using treemaps.

I also decided to try out gganimate for the first time with my treemaps - it’s a little hectic 😄.