Chapter 5 Genre Classification — An NLP approach
There may be marked differences between the musical features of country and rock songs, but using audio features by themselves would ignore another important aspect of music: the lyrics. Lyrics contain information that audio features cannot capture; they may even betray a feeling at odds with the mood the sound portrays. We can use natural language processing (NLP) techniques to quantify song lyrics. This will enable us to identify the sentiment of a song from its lyrics alongside the mood set by its tone. In addition, we can identify clusters of related words, as we will see later. All of these techniques can be used to distinguish songs, artists, and genres from one another.
The following will outline a process for creating clusters of related words and using them as the basis for a lyric-based genre classification model.
To do this we will use three additional packages:
- genius: acquiring song lyrics
- tidytext: text mining with tidy tools
- topicmodels: package containing R implementations of topic models
In order to best understand the following code I recommend reading Tidy Text Mining by Julia Silge and David Robinson.
From here we frame the issue of genre classification as one of language rather than strict musical features. In creating this model we are attempting to see if there are linguistic differences between the lyrics of rock and country songs that can be captured, quantified, and classified using NLP and predictive modeling.
The steps that we will take to create this classification model are roughly as follows:
- Retrieve song lyrics
- Tokenization
- Removal of stop words
- Word stemming
- Create a document term matrix
- Create an LDA topic model
- Calculate LDA topic probabilities
- Create a model using class probabilities as features
Latent Dirichlet Allocation (LDA) is an unsupervised topic model that takes the term frequencies (tf) of each document as input, following a bag-of-words approach. To create this LDA model, we first need to calculate the term frequency of our documents (songs). And to do that, we need to process the song lyrics by tokenizing, removing stop words, and stemming.
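As a tiny illustration of those three steps, here is a sketch that runs them on a single made-up lyric line (the line itself is hypothetical; the real pipeline is applied to the full set of lyrics below):
# a made-up lyric line, just to show tokenizing, stop word removal, and stemming
library(dplyr)
library(tidytext)

tibble::tibble(line = 1, lyric = "I was driving down an old dirt road") %>%
  unnest_tokens(word, lyric) %>%                # split into unigrams
  anti_join(get_stopwords(), by = "word") %>%   # drop stop words such as "i" and "an"
  mutate(word = SnowballC::wordStem(word))      # stem, e.g. "driving" -> "drive"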
But we can’t get ahead of ourselves. We need the lyrics first.
5.1 Retrieving song lyrics
We will use add_genius() with the charts table to get the song lyrics for each song. Below we are specifying the columns that contain the names of the artist and track, respectively. The type argument tells genius whether it will be fetching lyrics for a song or an album. Here we specify that we are interested in "lyrics".
This can be rather time-consuming, so maybe go for a walk around the block. Make some tea. Have a moment to yourself.
library(genius)

# fetch the lyrics for every song in the charts table
charts_lyrics <- add_genius(charts, artist, title, type = "lyrics")
Now that we have the charts and their lyrics, let’s see if any songs were missed by add_genius().
anti_join(charts, charts_lyrics %>% count(year, artist, title)) %>%
  distinct(artist, title)
## Joining, by = c("year", "artist", "title")
## # A tibble: 16 x 2
## artist title
## <chr> <chr>
## 1 Florida Georgia Line H.O.L.Y.
## 2 Dan + Shay From The Ground Up
## 3 Chris Young Duet With Cassadee Pope Think Of You
## 4 Dan + Shay Nothin' Like You
## 5 Dan + Shay How Not To
## 6 Dan + Shay Tequila
## 7 Dan + Shay Speechless
## 8 David Lee Murphy & Kenny Chesney Everything's Gonna Be …
## 9 Brantley Gilbert Ones That Like Me
## 10 Lil Wayne, Wiz Khalifa & Imagine Dragons With L… Sucker For Pain
## 11 Nathaniel Rateliff & The Night Sweats S.O.B.
## 12 Coldplay Up&Up
## 13 The Dirty Heads Vacation
## 14 Florence + The Machine Hunger
## 15 Sir Sly &Run
## 16 Imagine Dragons + Khalid Thunder/Young Dumb & B…
There appear to be some inconsistencies in naming. We can either fix these manually or omit them. For this case, I will omit them (as fixing them would invariably mean making a GitHub issue; if you want to tackle it, please do!).
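If any of these missed songs do appear in charts_lyrics, they will have missing lyrics. A minimal sketch of the omission, assuming that failed fetches leave the lyric column as NA (songs that returned no rows at all are already absent):
# drop any rows where genius returned no lyrics
# (assumes failed fetches leave lyric as NA)
charts_lyrics <- charts_lyrics %>%
  filter(!is.na(lyric))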
5.2 Lyric Preprocessing
To use song lyrics in a model, we need to quantify them in some manner. NLP allows us to add quantitative structure to inherently unstructured text data. We will follow the tidy text mining approach outlined in Tidy Text Mining in R.
We will impose structure on song lyrics by tokenizing words and estimating document topics (groups of related words).
The code below takes the charts_lyrics data frame and deduplicates the songs, then splits the lyric column into unigrams. Next, stop words are removed. Finally, the unigrams are stemmed using wordStem() from the SnowballC package (for more on word stemming, see the SnowballC documentation).
library(tidytext)

# create unigrams
lyric_unigrams <- charts_lyrics %>%
  # if a song appears more than once, keep only one observation
  distinct(artist, title, line, .keep_all = TRUE) %>%
  # create unigrams
  unnest_tokens(word, lyric) %>%
  # remove stop words
  anti_join(get_stopwords()) %>%
  # stem each word
  mutate(word = SnowballC::wordStem(word))
## Joining, by = "word"
lyric_unigrams
## # A tibble: 79,885 x 9
## rank year chart artist featured_artist title track_title line word
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 1 babi
## 2 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 1 last
## 3 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 1 night
## 4 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 1 hand
## 5 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 2 on
## 6 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 2 best
## 7 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 2 night
## 8 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 2 doubt
## 9 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 3 bottl
## 10 2 2016 Hot C… Thomas… <NA> Die … Die a Happ… 3 wine
## # … with 79,875 more rows
We’re almost there. We want to create an LDA model, but LDA models are based on term frequency, and at the moment we only have terms. We now need to count the number of times each word occurs in each document (song).
To accomplish this we will count() by each unique combination of song and word. Below we create a common identifier as a combination of the artist name and title, then count the number of times each word occurs in the song.
# create word counts and an id column to simplify counting for me
lyric_counts <- lyric_unigrams %>%
  mutate(id = glue::glue("{artist}....{title}")) %>%
  count(id, word, sort = TRUE)
lyric_counts
## # A tibble: 33,662 x 3
## id word n
## <glue> <chr> <int>
## 1 Eric Church....Desperate Man boo 147
## 2 Imagine Dragons....Thunder thunder 113
## 3 Dierks Bentley....Woman, Amen oh 103
## 4 The Head And The Heart....All We Ever Knew la 103
## 5 Vance Joy....Saturday Sun ba 99
## 6 Eric Church....Mr. Misunderstood na 96
## 7 Fall Out Boy....Hold Me Tight Or Don't na 95
## 8 alt-J....In Cold Blood la 81
## 9 CHVRCHES....Get Out get 74
## 10 twenty one pilots....Message Man eh 64
## # … with 33,652 more rows
Now these counts can be cast into a DocumentTermMatrix using tidytext::cast_dtm(). A document term matrix takes the structure of one document per row and one column per term. This is important when we are thinking about the models that we will be creating.
For our model, we want to know the genre of each song. This means there needs to be only one row per song rather than a row per song and word pairing, as is the case with our tidy lyric_counts object.
lyric_dtm <- lyric_counts %>%
  cast_dtm(id, word, n)
lyric_dtm
## <<DocumentTermMatrix (documents: 513, terms: 4638)>>
## Non-/sparse entries: 33662/2345632
## Sparsity : 99%
## Maximal term length: NA
## Weighting : term frequency (tf)
Notice how the term weighting is done with term-frequency (tf)? This is ideal as LDA requires term-frequency.
5.3 Topic Modeling
Now that we have our DocumentTermMatrix we can create our LDA model. We will use the LDA() function from topicmodels to do this. LDA will create as many topics as we specify; much like a k-means model, we choose the number of groups, k, in advance. In this case we will use 5.
library(topicmodels)

# fit a 5-topic LDA model; fixing the seed makes the topics reproducible
lda_5 <- LDA(lyric_dtm, k = 5, control = list(seed = 0))
lda_5
## A LDA_VEM topic model with 5 topics.
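Before using the topic probabilities, it can help to peek at which stems characterize each topic. A quick, optional sketch using tidytext’s tidy() method for topicmodels objects (the per-topic word probabilities are called beta):
# optional: the ten most probable stems in each of the five topics
tidy(lda_5, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))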
Now that the model has been fit, we can use it to calculate the posterior probabilities of each document’s membership in the k topics. posterior() creates a list object with two matrices: the first, terms, contains the per-topic term distributions from the model; the second, topics, contains the topic probabilities for each document. We will extract the topic probabilities and coerce that matrix into a tibble so we can join it back onto the original charts table.
# calculate the posterior probabilities for each document's classification
lda_inf <- posterior(lda_5, lyric_dtm)

# extract document class probabilities
chart_lda <- lda_inf[[2]] %>%
  as_tibble(rownames = "id")
chart_lda
## # A tibble: 513 x 6
## id `1` `2` `3` `4` `5`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Eric Church....Desperate Man 0.999 0.000187 0.000187 1.87e-4 1.87e-4
## 2 Imagine Dragons....Thunder 0.000200 0.000200 0.999 2.00e-4 2.00e-4
## 3 Dierks Bentley....Woman, Amen 0.000208 0.999 0.000208 2.08e-4 2.08e-4
## 4 The Head And The Heart....Al… 0.999 0.000288 0.000288 2.88e-4 2.88e-4
## 5 Vance Joy....Saturday Sun 0.000189 0.999 0.000189 1.89e-4 1.89e-4
## 6 Eric Church....Mr. Misunders… 0.999 0.000155 0.000155 1.55e-4 1.55e-4
## 7 Fall Out Boy....Hold Me Tigh… 0.999 0.000200 0.000200 2.00e-4 2.00e-4
## 8 alt-J....In Cold Blood 0.999 0.000241 0.000241 2.41e-4 2.41e-4
## 9 CHVRCHES....Get Out 0.999 0.000312 0.000312 3.12e-4 3.12e-4
## 10 twenty one pilots....Message… 0.999 0.000200 0.000200 2.00e-4 2.00e-4
## # … with 503 more rows
To join back onto the original charts data we need to recreate the id column in the charts tibble. We then join only the unique songs to avoid any duplication and clean the column headers.
# join back on to charts to get the genre
chart_topics <- charts %>%
  mutate(id = glue::glue("{artist}....{title}")) %>%
  distinct(chart, id) %>%
  right_join(chart_lda) %>%
  janitor::clean_names()
## Joining, by = "id"
## Warning: Column `id` has different attributes on LHS and RHS of join
head(chart_topics)
## # A tibble: 6 x 7
## chart id x1 x2 x3 x4 x5
## <chr> <glue> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Hot Count… Eric Church....Despe… 0.999 1.87e-4 1.87e-4 1.87e-4 1.87e-4
## 2 Rock Songs Imagine Dragons....T… 0.000200 2.00e-4 9.99e-1 2.00e-4 2.00e-4
## 3 Hot Count… Dierks Bentley....Wo… 0.000208 9.99e-1 2.08e-4 2.08e-4 2.08e-4
## 4 Rock Songs The Head And The Hea… 0.999 2.88e-4 2.88e-4 2.88e-4 2.88e-4
## 5 Rock Songs Vance Joy....Saturda… 0.000189 9.99e-1 1.89e-4 1.89e-4 1.89e-4
## 6 Hot Count… Eric Church....Mr. M… 0.999 1.55e-4 1.55e-4 1.55e-4 1.55e-4
From here we have a dataset that can be used to create a genre classification model.
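Before modeling, a quick, optional look at how many songs fall into each chart can be useful as a sanity check on class balance:
# how many songs fall into each chart (genre)?
count(chart_topics, chart)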
5.4 Genre Classification of Topics
Below we follow the same steps as were taken for the song feature classification. The only differences here are that we specify the formula of the recipe as chart ~ x1 + x2 + x3 + x4 + x5, and that we use the randomForest engine rather than ranger.
#------------------------------- pre-processing -------------------------------#
set.seed(0)

init_split <- initial_split(chart_topics, strata = "chart")
train_df <- training(init_split)
test_df <- testing(init_split)

# create recipe
chart_rec <- recipe(chart ~ x1 + x2 + x3 + x4 + x5, data = train_df) %>%
  prep()

# bake the training and testing to have clean dfs
baked_train <- bake(chart_rec, train_df)
baked_test <- bake(chart_rec, test_df)

#--------------------------------- model fit ----------------------------------#
rf_fit <- rand_forest(mode = "classification") %>%
  set_engine("randomForest") %>%
  fit(chart ~ ., data = baked_train)

lyric_classifier <- rf_fit

#------------------------------ model evaluation ------------------------------#
rf_estimates <- predict(rf_fit, baked_test) %>%
  bind_cols(baked_test) %>%
  yardstick::metrics(truth = chart, estimate = .pred_class)

rf_estimates
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.567
## 2 kap binary 0.131
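Accuracy alone does not tell us where the mistakes are made. To see which chart is misclassified more often, one could also compute a confusion matrix; a quick, optional sketch using yardstick::conf_mat():
# optional: confusion matrix of predicted vs. true chart
predict(rf_fit, baked_test) %>%
  bind_cols(baked_test) %>%
  yardstick::conf_mat(truth = chart, estimate = .pred_class)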
This is not a very good model. But what happens if we combine it with the audio feature model?