DAMN., That’s Offensive

A f’ing fun introduction to tidytext analysis with geniusR

Josiah Parry
16 min read · Feb 2, 2018

My recent package geniusR was created with a tidytext analysis of song lyrics in mind. I now wish to introduce you to the concepts and application of tidytext analysis through the use of geniusR. If you would like an introduction to geniusR, please read my Introduction to geniusR. Additionally, I recommend that you give Text Mining with R: A Tidy Approach by Julia Silge and David Robinson a read.

Initially I wanted to perform an exploratory text analysis of Kendrick Lamar’s recent album DAMN. (2017) and compare it to his older album Section.80 (2011). During my first pass I could not help but notice that many of the most common words are swear words. For this reason, I want to embark on a small exploratory analysis of profanity and offensive language in Kendrick Lamar’s music.

With that in mind, please be aware that this analysis will produce plots with multiple expletives and words that may be considered NSFW (not safe for work). Though, if you are familiar with Kendrick and his contemporaries you may be more accustomed to the language.

Continue with caution.

Song Lyrics with geniusR

To begin any analysis, we need to prep our working environment. We will be loading four packages to be used throughout this analysis.

Setting up your workspace

library(geniusR) 
library(tidytext)
library(tidyverse)
library(ggthemr)
## Personal ggplot theme
josi_theme <- function() {
  ggthemr::ggthemr("fresh")
  theme_minimal() +
    theme(text = element_text(family = "Lato"),
          plot.title = element_text(face = "bold"),
          plot.subtitle = element_text(color = "#333333"),
          legend.position = "bottom",
          strip.text.x = element_text(face = "bold"))
}

If you do not have these packages installed, run the code below:

install.packages("devtools") devtools::install_github("josiahparry/geniusR") install.packages("tidyverse") 
install.packages("tidytext")
install.packages("ggthemr")

The first task is simple. It is time to get the lyrics. We could go the old school way of downloading the lyrics one album at a time and then binding the rows, but we have the power of purrr at our fingertips.

We will make a tibble (an improvement upon the data frame) with two columns: artist and album.

kdot_albums <- tibble(
artist = "Kendrick Lamar",
album = c("Section 80", "DAMN."))
kdot_albums
## # A tibble: 2 x 2
## artist album
## <chr> <chr>
## 1 Kendrick Lamar Section 80
## 2 Kendrick Lamar DAMN.

This tibble will let us iteratively download each album. We can now feed these columns to the genius_album() function by using purrr::map2(). map2() allows you to “iterate over multiple arguments” at the same time and apply a function. We will use map2() within a mutate() call to create a nested tibble column. This means that the values of our tracks column will be another tibble.
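
If map2() is new to you, here is a tiny, self-contained sketch of how it works, using toy vectors that have nothing to do with our albums tibble:

library(purrr)

# map2() walks two vectors in parallel and applies a function to each
# pair of elements, returning a list with one result per pair:
# here, list(11, 22, 33)
map2(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)

With that in mind, we can apply genius_album() across our albums tibble: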

album_lyrics <- kdot_albums %>% 
mutate(tracks = map2(artist, album, genius_album))
head(album_lyrics)
## # A tibble: 2 x 3
## artist album tracks
## <chr> <chr> <list>
## 1 Kendrick Lamar Section 80 <tibble [16 × 3]>
## 2 Kendrick Lamar DAMN. <tibble [14 × 3]>

We can now access the tracklist by unnesting the tracks column. But the tracks column also contains another nested tibble containing the actual lyrics. Let us first unnest tracks and view the song lists.
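
If nested list-columns are new to you, here is a tiny, self-contained sketch (with made-up data) of what nesting and unnesting look like:

library(dplyr)
library(tidyr)

# A toy nested tibble: the `data` column is a list of tibbles, one per row
nested <- tibble(group = c("a", "b")) %>%
  mutate(data = list(tibble(x = 1:2), tibble(x = 3:4)))

# unnest() expands the list-column so each inner row becomes its own row,
# repeating `group` as needed
nested %>%
  unnest(data)

Back to the lyrics: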

# Unnest `tracks` to view the tracklist of each album
album_lyrics %>% 
unnest(tracks) %>%
select(title, album, everything())
## # A tibble: 30 x 5
## title album artist track_n lyrics
## <chr> <chr> <chr> <int> <list>
## 1 Fuck Your Ethnicity Section… Kendrick … 1 <tibble [54…
## 2 Hol' Up Section… Kendrick … 2 <tibble [66…
## 3 A.D.H.D Section… Kendrick … 3 <tibble [93…
## 4 No Makeup (Her Vice) (Ft. Col… Section… Kendrick … 4 <tibble [78…
## 5 Tammy's Song (Her Evils) Section… Kendrick … 5 <tibble [49…
## 6 Chapter Six Section… Kendrick … 6 <tibble [23…
## 7 Ronald Reagan Era (His Evils)… Section… Kendrick … 7 <tibble [63…
## 8 Poe Man's Dreams (His Vice) (… Section… Kendrick … 8 <tibble [13…
## 9 The Spiteful Chant (Ft. ScHoo… Section… Kendrick … 9 <tibble [95…
## 10 Chapter Ten Section… Kendrick … 10 <tibble [15…
## # ... with 20 more rows

You can tell by the first song title on Section 80, Fuck Your Ethnicity, that Kendrick doesn’t shy away from “offensive language”.

If we unnest one step further, we can access all of the lyrics and start prepping for a tidytext analysis.

lyrics <- album_lyrics %>% 
unnest(tracks) %>%
unnest(lyrics)
head(lyrics)
## # A tibble: 6 x 6
## artist album title track_n text line
## <chr> <chr> <chr> <int> <chr> <int>
## 1 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 Gather arou… 1
## 2 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 Now everybo… 2
## 3 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 If you don'… 3
## 4 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 Now I don't… 4
## 5 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 Black, Whit… 5
## 6 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 That don't … 6

The resultant tibble, lyrics, has 6 variables. Let’s get to know them before we dive into the tidytext analysis.

  • artist: I bet you figured that one out
  • album: The album title, but you knew that
  • title: The song title
  • track_n: The track number on the album. This is the position on the tracklist.
  • text: The line as written on Genius
  • line: The line number associated with text

Tidytext principles

The tidy approach was formally introduced in Hadley Wickham’s 2014 paper, Tidy Data. Julia Silge and David Robinson are the progenitors of the tidytext approach, so I’d like to steal from their book Text Mining with R:

We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

Okay, but what does this mean? It means that we’re going to split each line into its individual words and that each word will be its own row.

Let’s get into the thick of it and start using some functions from tidytext. The first function we will use is unnest_tokens(). This function will turn our text column into a tidy data frame format that fits the above description. After piping (the %>% thing) our data to the unnest_tokens() function, we need to supply it with two arguments:

  • output: the new column name, unquoted
  • input : the column which will be transformed, unquoted

Tokenizing

Since we are using the default arguments, it will split the text into one-word units, also called unigrams. This comes from n-grams, where n is the number of words and the gram is the unit. But don’t trip on that right now.
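
Before running this on the full lyrics, here is a minimal, self-contained sketch on a made-up two-line tibble, just to show the shape of the output:

library(dplyr)
library(tidytext)

toy <- tibble(line = 1:2,
              text = c("Gather around", "I'm glad everybody came out"))

# One row per word; tokens are lowercased and punctuation is stripped,
# and the `line` column is carried along:
# gather, around, i'm, glad, everybody, came, out
toy %>%
  unnest_tokens(word, text)

Now for the real thing: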

unigrams <- lyrics %>% 
unnest_tokens(word, text)
unigrams
## # A tibble: 17,704 x 6
## artist album title track_n line word
## <chr> <chr> <chr> <int> <int> <chr>
## 1 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 gather
## 2 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 around
## 3 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 i'm
## 4 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 glad
## 5 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 everybody
## 6 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 came
## 7 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 out
## 8 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 tonight
## 9 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 as
## 10 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 we
## # ... with 17,694 more rows

Now since this is an analysis of only profanities, we have to get rid of all the other, less offensive words. We can do this by loading the profanities data set that comes with geniusR and joining the two data sets.

The profanities data set was acquired from Luis von Ahn’s (founder of reCAPTCHA & Duolingo) research group at Carnegie Mellon University.

The description of the data set is:

“A list of 1,300+ English terms that could be found offensive. The list contains some words that many people won’t find offensive, but it’s a good start for anybody wanting to block offensive or profane terms on their Site.”

Isolating words using a join

In order to isolate all of the offensive words, we can perform an inner_join(). An inner join is a type of table join (that is, the combining of one table with another based on a shared feature). To steal from the dplyr documentation, given two tables, x & y, an inner join “return[s] all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.”
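
As a quick, self-contained sketch with made-up tables (not our lyrics), an inner join keeps only the rows whose key appears in both tables:

library(dplyr)

x <- tibble(word = c("gather", "around", "fuck"), line = c(1, 1, 3))
y <- tibble(word = c("fuck", "shit"))

# Only "fuck" appears in both tables, so only that row of x survives,
# carrying its `line` column along with it
inner_join(x, y, by = "word")

Applying the same idea to our unigrams and the profanities data set: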

data("profanities") bad_words <- unigrams %>% 
inner_join(profanities)
bad_words
## # A tibble: 929 x 6
## artist album title track_n line word
## <chr> <chr> <chr> <int> <int> <chr>
## 1 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 fire
## 2 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 color
## 3 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 fuck
## 4 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 1 shit
## 5 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 3 fuck
## 6 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 4 fuck
## 7 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 5 black
## 8 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 5 asian
## 9 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 5 goddammit
## 10 Kendrick Lamar Section 80 Fuck Your Ethnicity 1 6 shit
## # ... with 919 more rows

Now that we have isolated only offensive words within each album, we can begin exploring these data.

Word Frequency

An initial exploratory analysis of text often involves looking at word frequency. Word frequency can give insight into what sort of language might be prominent. This portion of an analysis often leads to identifying things of interest.

A simple first question to ask is “which album has more?” This can be explored by aggregating by album and summing the number of occurrences of each “offensive” word.

Which Album Is More Profane?

We will first count the number of offensive words in each album and create a new tibble called bad_count.

bad_count <- bad_words %>% 
count(album, word)
bad_count
## # A tibble: 168 x 3
## album word n
## <chr> <chr> <int>
## 1 DAMN. african 1
## 2 DAMN. american 3
## 3 DAMN. ass 24
## 4 DAMN. asshole 1
## 5 DAMN. babe 1
## 6 DAMN. babies 2
## 7 DAMN. beast 1
## 8 DAMN. bible 2
## 9 DAMN. bitch 48
## 10 DAMN. bitches 7
## # ... with 158 more rows
bad_count %>%
group_by(album) %>%
summarise(n_bad = sum(n)) %>%
ggplot(aes(album, n_bad, fill = album)) +
geom_col() + josi_theme()

It is important to note that raw counts can sometimes be misleading. What if one album has twice as many words as the other? That would invariably lead to a larger count of profanities in one than the other. To account for this we can use the proportion, also known as the relative frequency, which in this case is the number of profanities divided by the total number of words.
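
As a quick illustration with made-up numbers:

# Hypothetical counts, purely to show the arithmetic:
# 60 profanities out of 8,000 total words gives a relative frequency of
60 / 8000
# which is 0.0075, a number we can compare fairly across albums of different lengths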

total_words <- unigrams %>% 
count(album) %>%
select(-album)
bad_count %>%
group_by(album) %>%
summarise(n_bad = sum(n)) %>%
bind_cols(total_words) %>%
mutate(prop_bad = n_bad / n) %>%
ggplot(aes(album, prop_bad, fill = album)) +
geom_col() + josi_theme()

It looks like there has been a reduction in the amount of offensive words used between these two albums.

Most Common Offensive Words by Album

From this point, we can dive even deeper by isolating the most frequent words per album. This can be done with a lot of helpful functions from dplyr.

We will select the top 10 most frequent words in each album by grouping the bad_count tibble by album and using the top_n() function. We will then pipe the resultant tibble into a nice plot.

bad_count %>% 
group_by(album) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = album)) +
geom_col() +
facet_wrap(~album, scales = "free_y") +
coord_flip() +
labs(title = "Most Common Offensive Words") +
josi_theme()

How does offensive language progress through each album?

Similar to the above, we can group by the album and then the title and count up the number of occurrences of offensive words by song. You will see that the code below has the line group_by(album, track_n, title). We group by both track_n and title to maintain the numeric ordering from the track number. These are grouped identically, so it will not throw any wrenches into the analysis.

bad_words %>% 
count(album, track_n,title, word) %>%
group_by(album, track_n, title) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
mutate(title = reorder(title, -track_n)) %>%
ggplot(aes(title, n, fill = album)) +
geom_col() +
josi_theme() +
facet_wrap(~album, scales = "free") +
labs(title = "Offensive Words",
subtitle = "Count of offensive words by track",
x = NULL, y = NULL) +
coord_flip()

From this we can see that the most offensive songs on DAMN. and Section.80 are “FEAR.” and “The Spiteful Chant”, respectively.

Bigrams

Moving past the unigram, we can evaluate word pairings. This will require a bit more wrangling than the unigrams. We will utilize the tidyr::separate() and tidyr::unite() functions to help select only offensive words.

We will do this in a few steps that we will chain together. Run each step independently and print it to the console to get an understanding of what is happening.

What’s happening:

  • The plain lyrics (the text column) are fed, using the pipe %>%, into unnest_tokens(), which separates the text into every word pair, or bigram.
  • Note: the prefix bi means 2, and gram is the text unit (word). Thus a bigram can be understood as a word pair in this setting.
  • Those word pairs are then separated into their constituent parts. This way we can identify which bigrams have an offensive word in them.
  • We then filter to keep bigrams with an offensive word using dplyr::filter().
  • This filter statement in plain language says:
  • “If word1 is in profanities, give me that row. OR (the single bar, or pipe depending on your background, |) if word2 is in profanities, give me that row.”
bigrams <- lyrics %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams_sep <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter((word1 %in% profanities$word) | (word2 %in% profanities$word))

Now that we have all of the rows which contain an offensive word, we can evaluate the most common word pairings. We can then identify, for example, the word that most often follows “fuck”.

bigrams_sep %>% 
count(word1, word2) %>%
arrange(-n)
## # A tibble: 771 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 fuck with 21
## 2 fuck that 20
## 3 like hoes 19
## 4 my dick 19
## 5 die to 16
## 6 lil bitch 16
## 7 ho ho 15
## 8 my nigga 15
## 9 suck my 14
## 10 yo ass 14
## # ... with 761 more rows

From this we can tell that the most common offensive word in the first position (that is, word1) is “fuck”, and that “fuck” is most often followed by “with” or “that”.

We can plot words pairwise.

bigrams_sep %>% 
  # Count word pairs within each album
  count(album, word1, word2) %>%
  # Group by album to select the top 10 word pairings in each album
  group_by(album) %>%
  top_n(10, n) %>%
  # Reorder word1 based on the number of occurrences for prettier plotting
  mutate(word1 = reorder(word1, n)) %>%
  # Plot word1 on the x axis, word2 on the y axis, size by count, color by album
  ggplot(aes(word1, word2, size = n, color = album)) +
  # Make the points a little transparent
  geom_point(alpha = .7) +
  # Flip the axes
  coord_flip() +
  # Make a plot for each album
  facet_wrap(~album, scales = "free") +
  # Angle the text to look prettier, then add the personal style
  theme(axis.text.x = element_text(angle = 90)) +
  josi_theme()

We could also visualize this a little differently by looking at the count of bigrams using a fun adaptation of the bar chart called the lollipop chart.

bigrams %>% 
count(album, title, bigram) %>%
filter(n > 1) %>%
group_by(album) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n, size = n, color = album, label = n)) +
geom_segment(aes(y = 0, x = bigram, yend = n, xend = bigram, size = 1)) +
geom_point() +
facet_wrap(~album, scales = "free_y") +
geom_text(color = "white", size = 2) +
coord_flip() + josi_theme()

Looking at Section 80, we can tell that “fuck” appears most frequently with “that” and “with”. If you are familiar with the song “A.D.H.D” you might be able to guess that the “fuck that” occurrences come mostly from that song. Let’s check that! We will do this by uniting the word1 and word2 columns back into a single string that contains both words, using the tidyr::unite() function.

bigrams <- bigrams_sep %>% 
unite(bigram, word1, word2, sep = " ")

Now we can look at just the song “A.D.H.D” by filtering and counting.

bigrams %>% 
filter(title == "A.D.H.D") %>%
count(bigram, sort = T) %>%
filter(n > 1) %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(bigram, n, label = paste("n =", n))) +
geom_col() +
josi_theme() +
geom_text(vjust = 3, color = "white")

From this we see that the phrase “fuck that” occurs 11 times in “A.D.H.D” alone.

Document Level Frequency

There are a number of statistics we can use to measure occurrences of a word or a phrase and its relationship to other documents. In this case, we can treat each album as a unique document and compare between them.

The tidytext package comes with a function called bind_tf_idf(). In your head you should hear bind term-frequency inverse-document-frequency. Before we can get into what this is, we need to be able to manipulate our bigrams tibble into a format that can be used for this function. The data frame that we feed the function needs to be in a tidytext format. This means that there will be one row per token (bigram) per document with a count of how many times that token occurs.

Thus the data frame will have three columns:

  • album : the document
  • bigram: the token
  • n: the count

We will make our bigrams tibble into this structure by counting the bigrams by album (what we are considering a document).

bigram_tidy <- bigrams %>%
count(album, bigram)
head(bigram_tidy)
## # A tibble: 6 x 3
## album bigram n
## <chr> <chr> <int>
## 1 DAMN. a asshole 1
## 2 DAMN. a bitch 9
## 3 DAMN. a blind 1
## 4 DAMN. a color 1
## 5 DAMN. a damn 2
## 6 DAMN. a fuck 4

Making a tf-idf data frame

What is tf_idf, then? Again, it stands for term-frequency inverse-document-frequency. The bind_tf_idf() function will create a tibble with 6 columns. It will maintain the three that we created (document, token, n) but will append 3 other columns to it. These are:

  • tf: term-frequency
  • Term frequency is the relative frequency with which the token occurs. That is, the number of times the token occurs divided by the total number of tokens in the document.
  • idf: inverse-document-frequency

In essence, this statistic is a measure of how rare a term is across a collection of documents.

  • The larger the number, the more unique the term is to a document.
  • It’s calculated by taking the number of documents in the collection, dividing it by the number of documents that contain a specific term, and taking the log of that ratio.
  • The formula is generally written as idf_j = log(N / df_j), where N is the number of documents and df_j is the number of documents containing term j (a quick numeric check follows this list).
  • For a good introduction to the statistic I recommend reading this post
  • tf_idf: term-frequency inverse-document-frequency
  • This is the term-frequency adjusted for how rarely a term is used
  • tf × idf
  • The larger the number, the more unique
  • From the tidytext book:
  • “The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.”
  • Note: use this as only a heuristic, not a golden rule.
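
To make the idf piece concrete, here is a quick back-of-the-envelope check for our two-album corpus:

# With only two documents (albums), a bigram that appears in just one album has
# idf = log(2 / 1) ≈ 0.693, while a bigram that appears in both albums has
# idf = log(2 / 2) = 0, which zeroes out its tf-idf no matter how common it is
log(2 / 1)
log(2 / 2)

Those are exactly the two idf values you will see in the table below.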

If we feed bigram_tidy to the bind_tf_idf() function, the resultant table will have all of the above statistics. We can take the tf-idf score and try to identify the word pairings that are most unique to each album.

As a note, be aware that we are working with a very small corpus of text. This means that the statistics that we are generating are not very informative.

bigram_tf_idf <- bigram_tidy %>% 
bind_tf_idf(bigram, album, n) %>%
arrange(-n)
head(bigram_tf_idf)
## # A tibble: 6 x 6
## album bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Section 80 fuck that 20 0.0246 0.693 0.0171
## 2 Section 80 fuck with 20 0.0246 0 0
## 3 Section 80 like hoes 19 0.0234 0.693 0.0162
## 4 Section 80 my dick 18 0.0222 0 0
## 5 DAMN. lil bitch 16 0.0242 0.693 0.0168
## 6 Section 80 die to 16 0.0197 0.693 0.0137

Plot tf-idf score by album

bigram_tf_idf %>% 
group_by(album) %>%
top_n(12, tf_idf) %>%
ungroup() %>%
mutate(bigram = reorder(bigram, tf_idf)) %>%
ggplot(aes(bigram, tf_idf, fill = album, label = round(tf_idf, 3))) +
geom_col() +
facet_wrap(~album, scales = "free") +
coord_flip() +
josi_theme() +
geom_text(size = 2, color = "white", hjust = 1.5)

This plot shows us that the phrases “lil bitch” and “fuck that” are the most unique to each album with similar scores of 0.017.

Wrapping Up

I hope that this has helped you grasp tidytext principles in an enjoyable manner. I also recommend that you try adapting the lessons learned here to your favorite albums or songs. The same principles apply to topics other than offensive language.

Future writings will explore topic modelling of song lyrics in a tidy format.

If you have any questions please feel free to reach out to me on twitter. I try to be as responsive as possible.

Originally published at josiahparry.com on February 2, 2018.
