
Predicting the authorship of Beatles lyrics

One of my favorite things about working in a public library is the opportunity for serendipitous book recommendations—I've discovered, read, and enjoyed many of my favorite titles[1] on trips to and from the bathroom, or just when gazing into space during a moment of contemplation.

And so it was that, one afternoon this summer when I sought refuge among the temperature-controlled desks of the Mountain View Public Library, I found myself staring at a thick red spine bearing the title, "All the Songs: The Story Behind Every Beatles Release".

I dropped whatever I had been in the middle of, grabbed the tome before some other fan could, and started flipping through—the authors, Margotin and Guesdon, had certainly done their research, providing copious details[2] about each song's genesis, development, recording, and eventual release. Crucially, they did their best to pinpoint who ultimately deserved authorship for the countless "Lennon-McCartney" (and less frequent, early-album "McCartney-Lennon") credits, citing statements given by John, Paul, and others involved in various interviews and biographies. Of course, many songs were highly collaborative efforts, especially where the big two were concerned, with Paul having come up with some verses and John later supplying a chorus, or Paul helping to fill in gaps in John's initial lyrics. But setting aside these messier cases, Margotin and Guesdon provided a path to the cleanest dataset of Beatles lyrics labeled for writer that I had ever encountered.

Armed with this trove of data, I set out to see how well a classifier could predict a (Beatles) song's authorship based on its lyrics, and which lyrical features might be most associated with each Beatle.

Assembling the dataset

Working from Margotin and Guesdon, I manually assembled metadata on Beatles songs into a .csv file, including a verdict on the authorship of each song's lyrics. In cases where the verdict is not a single Beatle, I indicate this in the songwriter column and provide some details in the notes column. I also regularized each song's title by removing bracketed words and non-alphanumeric characters, splitting hyphenated compounds into two separate tokens, and lowercasing all characters.
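For concreteness, here's a minimal sketch of that title regularization (the function name is mine):

```python
import re

def regularize_title(title: str) -> str:
    """Normalize a song title: drop bracketed words, split hyphenated
    compounds, strip non-alphanumerics, and lowercase."""
    title = re.sub(r"\[.*?\]", "", title)        # remove words in brackets
    title = title.replace("-", " ")              # hyphenated compounds -> two tokens
    title = re.sub(r"[^A-Za-z0-9 ]", "", title)  # remove non-alphanumeric characters
    return re.sub(r"\s+", " ", title).strip().lower()

print(regularize_title("Ob-La-Di, Ob-La-Da"))  # -> "ob la di ob la da"
```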

As Fig. 1 below indicates, a number of songs defy straightforward authorship attribution. I removed these songs from all subsequent analyses. Fig. 2 shows the number of songs each Beatle wrote over time, consistent with what we know about the asymmetry in each member's contributions.

Processing lyrics

I scraped the lyrics for each non-instrumental song from the website A-Z Lyrics using BeautifulSoup. I excluded from all subsequent analyses any songs whose lyrics were not available on the site (N=).
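The scraping step looked roughly like this (a sketch: the request headers and the heuristic for locating the lyrics div are assumptions about the site's markup):

```python
import requests
from bs4 import BeautifulSoup

def fetch_lyrics(url: str) -> str | None:
    """Fetch a song's page and extract its lyrics, or None if unavailable."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    # Heuristic: the lyrics tend to sit in a <div> with no class or id;
    # take the longest such div's text.
    candidates = [
        d for d in soup.find_all("div")
        if not d.attrs.get("class") and not d.attrs.get("id")
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: len(d.get_text())).get_text().strip()
```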

One immediately noticeable aspect of the lyrics is the heavy use of repetition, both of individual words and of entire lines (both within and outside of choruses).

Since it's not particularly interesting to find that Paul uses, e.g., "raccoon" in his lyrics way more than the other Beatles, I also created deduplicated versions of each song's lyrics by removing repeats of adjacent words and repeated lines (even non-adjacent ones), ignoring punctuation when comparing.

Here is a demo of how the deduplication script works (red tokens are the ones being removed):
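In code, the deduplication amounts to something like the following sketch (my reimplementation of the logic described above, not the original script):

```python
import re
import string

def dedupe_lyrics(lyrics: str) -> str:
    """Drop adjacent repeated words and any repeated lines (keeping the first
    occurrence), comparing after stripping punctuation, case, and whitespace."""
    def norm(text: str) -> str:
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text).strip().lower()

    seen, kept_lines = set(), []
    for line in lyrics.splitlines():
        key = norm(line)
        if not key or key in seen:  # skip blank lines and repeated lines
            continue
        seen.add(key)
        kept = []
        for word in line.split():
            if not kept or norm(word) != norm(kept[-1]):  # collapse "yeah yeah yeah"
                kept.append(word)
        kept_lines.append(" ".join(kept))
    return "\n".join(kept_lines)

print(dedupe_lyrics("She loves you, yeah yeah yeah\nShe loves you, yeah yeah yeah"))
# -> "She loves you, yeah"
```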

Next, I used spaCy to preprocess all the deduplicated lyrics by applying tokenization, lemmatization, POS-tagging, and dependency parsing. Let's load the serialized results into a dictionary keyed by song title, song_title2doc:
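Assuming the docs were serialized with spaCy's DocBin alongside a parallel list of titles (the file names here are placeholders):

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # assumed to match the preprocessing pipeline
doc_bin = DocBin().from_disk("lyrics_docs.spacy")  # placeholder path
titles = json.load(open("song_titles.json"))       # parallel list of titles

song_title2doc = dict(zip(titles, doc_bin.get_docs(nlp.vocab)))
print(len(song_title2doc), "songs loaded")
```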

And inspect some sample preprocessed output:
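For example (keys are the regularized, lowercased titles):

```python
doc = song_title2doc["yesterday"]
for token in doc[:8]:
    print(f"{token.text:<12} {token.lemma_:<12} {token.pos_:<6} {token.dep_}")
```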

After deduplication, the full dataset of 164 songs contains a total of 23.4K tokens and 1.8K unique lemma types.

Now, we can move on to training some classifiers!

Training classifiers

I experimented with both a deep learning model (a fine-tuned BERT) and a non-deep model (a multi-class logistic regression over hand-engineered features) to classify lyrics by author.

I also implemented the following baseline models for comparison:

Finally, since Ringo is represented a grand total of 2 times in the dataset, I limited the classification task to the other three. Sorry, Ringo!!

Next, I extracted the following features for each song's lyrics (a sketch of how some of them can be computed follows this list):

- ngram: n-gram features over the lyrics
- tf-idf: tf-idf-weighted term counts
- ttr: type-token ratio
- pronoun counts: first.sg, first.pl, second, third.sg.f, third.sg.m, third.sg.n, third.pl
- VAD scores: valence, arousal, dominance
- negation: count of negations
- surface statistics: total.num.lines, total.num.words, total.num.lemmas, mean.words.per.line, mean.chars.per.word
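Here's a sketch of how the scalar features can be computed from a spaCy doc (the pronoun sets and the VAD lookup are my own stand-ins; the ngram and tf-idf features would come from separate vectorizers):

```python
# Placeholder VAD lexicon: lemma -> (valence, arousal, dominance). In practice
# this would be loaded from a resource such as the NRC-VAD lexicon.
vad_lexicon: dict[str, tuple[float, float, float]] = {}

def extract_features(doc) -> dict:
    lemmas = [t.lemma_.lower() for t in doc if not t.is_punct]
    words = [t for t in doc if not t.is_punct]
    lines = doc.text.splitlines()
    feats = {
        "ttr": len(set(lemmas)) / len(lemmas),
        "first.sg": sum(t.lower_ in {"i", "me", "my", "mine"} for t in doc),
        "first.pl": sum(t.lower_ in {"we", "us", "our", "ours"} for t in doc),
        "second": sum(t.lower_ in {"you", "your", "yours"} for t in doc),
        # (third-person pronoun counts omitted here for brevity)
        "negation": sum(t.dep_ == "neg" for t in doc),
        "total.num.lines": len(lines),
        "total.num.words": len(words),
        "total.num.lemmas": len(set(lemmas)),
        "mean.words.per.line": len(words) / max(len(lines), 1),
        "mean.chars.per.word": sum(len(t.text) for t in words) / len(words),
    }
    for i, dim in enumerate(("valence", "arousal", "dominance")):
        feats[dim] = sum(vad_lexicon.get(lem, (0, 0, 0))[i] for lem in lemmas)
    return feats
```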

Train and evaluate the performance of the multi-class logistic regression model:
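Something along these lines, using scikit-learn (here `titles` and `title2songwriter` stand in for the dataset assembled earlier):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X = [extract_features(song_title2doc[t]) for t in titles]
y = [title2songwriter[t] for t in titles]  # "john" / "paul" / "george"

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```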

Ablation analysis to see which features are important:
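The ablation itself is just a loop that drops one feature at a time and re-runs the cross-validation:

```python
full_acc = cross_val_score(model, X, y, cv=5).mean()
for feat in sorted(X[0]):
    X_ablated = [{k: v for k, v in x.items() if k != feat} for x in X]
    acc = cross_val_score(model, X_ablated, y, cv=5).mean()
    print(f"{feat:<22} change = {acc - full_acc:+.2f}  absolute = {acc:.2f}")
```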

The summary of ablation scores indicates that total.num.lemmas and ngram are the most important features: removing either results in a 10% decrease in accuracy. valence and dominance are also important, with decreases of 7% and 4%, respectively. The second-person pronoun feature actually hurts performance, with accuracy increasing by 3% when it is dropped.

| Ablated feature | Change in accuracy | Absolute accuracy (with remaining features) |
| --- | --- | --- |
| ngram | **-10%** | 0.42 |
| tf-idf | 0 | 0.52 |
| ttr | 0 | 0.52 |
| first.sg | 0 | 0.52 |
| first.pl | 0 | 0.52 |
| second | **+3%** | 0.55 |
| third.sg.f | 0 | 0.52 |
| third.sg.m | 0 | 0.52 |
| third.sg.n | 0 | 0.52 |
| third.pl | 0 | 0.52 |
| valence | **-7%** | 0.45 |
| arousal | 0 | 0.52 |
| dominance | **-4%** | 0.48 |
| negation | 0 | 0.52 |
| total.num.lines | 0 | 0.52 |
| total.num.words | 0 | 0.52 |
| total.num.lemmas | **-10%** | 0.42 |
| mean.words.per.line | 0 | 0.52 |
| mean.chars.per.word | 0 | 0.52 |

The best-performing model/feature set includes all features except the second-person pronoun feature, and attains an accuracy of 54.5%.

Let's also examine the feature weights of the best performing model to see which features are most associated with each class (i.e., Beatle).
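Refitting the best feature set on all songs and reading off the per-class coefficients (a sketch):

```python
import numpy as np

X_best = [{k: v for k, v in x.items() if k != "second"} for x in X]
model.fit(X_best, y)
vec = model.named_steps["dictvectorizer"]
clf = model.named_steps["logisticregression"]
names = np.array(vec.get_feature_names_out())
for beatle, coefs in zip(clf.classes_, clf.coef_):
    top5 = names[np.argsort(coefs)[-5:][::-1]]  # highest-weighted features
    print(f"{beatle}: {', '.join(top5)}")
```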

It looks like **John's** lyrics tend to have **longer lines** (in average number of tokens) and contain words with **high dominance and arousal**. On the other hand, **Paul's** lyrics tend to feature **high-valence (positive-sentiment)** words, and he also uses a lot of **"he," "she," and "we" pronouns**. Finally, **George** uses a greater **diversity of token types**, as indicated by the high weight on total.num.lemmas, and he also tends to use a lot of **"they" pronouns and negation**.

Can we beat this performance with a fine-tuned BERT? It seems like the feature-based LR model actually performs better on this dataset!
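For reference, the fine-tuning setup can look roughly like this with Hugging Face Transformers (a sketch: the base checkpoint, split, and hyperparameters are assumptions, and `lyrics_texts` stands in for the list of deduplicated lyrics):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = sorted(set(y))  # ["george", "john", "paul"]
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ds = Dataset.from_dict({"text": lyrics_texts,
                        "label": [labels.index(l) for l in y]})
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length"),
            batched=True)
ds = ds.train_test_split(test_size=0.2, seed=42)

bert = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))
args = TrainingArguments(output_dir="bert-beatles", num_train_epochs=5,
                         per_device_train_batch_size=8)
trainer = Trainer(model=bert, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
print(trainer.evaluate())
```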

But at the same time, the best-performing LR classifier (without the second-person feature) is comparable to our baselines:

Let's examine the misclassifications to get a sense of the classifiers' shortcomings:
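For the LR model, cross-validated predictions make this easy (a sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(model, X_best, y, cv=5)
ConfusionMatrixDisplay.from_predictions(y, y_pred)
plt.show()

# List each misclassified song with its gold and predicted writer
for title, gold, pred in zip(titles, y, y_pred):
    if gold != pred:
        print(f"{title:<35} gold={gold:<8} pred={pred}")
```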

Observations about the songs misclassified by BERT:

The BERT confusion matrix shows some serious majority-class behavior, guessing only Paul!

How well do the classifiers do on the Beatles' solo-career lyrics?

Does augmenting the training data with solo-career data improve performance on Beatles-era lyrics?

I used this neat tool to bypass the need for scraping, combined with the following Wikipedia lists:

Now we'll filter to songs that had a single credit, i.e., that were not written in collaboration with other artists (such as Bob Dylan).
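Assuming the combined solo-career metadata lives in a DataFrame with a songwriter column (file and column names are placeholders), the filter is simple:

```python
import pandas as pd

solo = pd.read_csv("solo_songs.csv")  # placeholder file name
# Keep only songs credited to exactly one writer, dropping co-writes
# (e.g., entries like "Harrison, Dylan")
solo = solo[~solo["songwriter"].str.contains(r",|&|\band\b")]
print(solo["songwriter"].value_counts())
```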

Let's downsample the Paul datapoints, as he is over-represented by almost a factor of 3 compared to both George and John in the training data, and the confusion matrix suggests the classifier is learning to mostly predict Paul.
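A simple way to do that (here `train` is the combined Beatles + solo training DataFrame, with lowercase songwriter labels assumed):

```python
# Downsample Paul's rows to match the next-largest class
n_target = train["songwriter"].value_counts().drop("paul").max()
paul = train[train["songwriter"] == "paul"].sample(n=n_target, random_state=42)
train_balanced = pd.concat([paul, train[train["songwriter"] != "paul"]])
print(train_balanced["songwriter"].value_counts())
```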

The best accuracy, 60.6%, is obtained when dropping the third.sg.f ("she") pronoun feature.

Downsampling the training data does seem to improve the model's predictions: now there is a clearer diagonal, at least for John and Paul. Moreover, the absolute accuracy and F1 score both improve slightly compared to before augmentation and downsampling.

What about BERT? Does augmentation (+ downsampling) improve performance in this case?

For BERT, it seems that augmenting the training data with solo career lyrics hurts rather than helps performance!

Conclusion + other ideas

Overall, it appears that authorship attribution is a difficult task, with both feature-based and deep learning models performing at around the level of a moderate-difficulty baseline. The difficulty stems in part from the sparse nature of song lyrics: even for musicians as prolific as The Beatles, each song provides only around 150 tokens after removing duplication in the form of choruses and repeated lines. (Fig. 5 shows the distribution of the number of unique token types (lemmas) and the total number of tokens across all Beatles and solo-career songs.)

Augmentation with solo-career lyrics (plus downsampling Paul's) helps a bit for the feature-based model, but not by much, given that the augmented training set is almost three times the size of the original! Meanwhile, augmentation results in a large drop-off in performance for BERT.

It would appear that BERT is learning aspects of each Beatle's songwriting that are not captured by any of the features fed to the logistic regression model, and that crucially differ between their group and solo careers. This is suggested by the drop in BERT's performance when trained on the augmented dataset, in contrast to the slight increase in the LR model's performance when trained on the same additional solo-career data.

In other words, the stylistic features input to the LR model have stayed relatively consistent across each Beatle's group and solo careers, so the additional data helps.[3] However, there are features beyond counts of ngrams, pronouns, negation, and VAD sums that evolved as they embarked on their solo careers, and BERT is thrown off track when trained on this additional data.

It would be interesting to conduct quantitative as well as qualitative comparative analyses to discover what these evolved features might be. For instance, we have not considered non-BOW-style features such as syntactic structure and thematic roles—e.g., John might be writing about himself more positively but about the rest of the world more negatively in his solo career, or about having more power over society compared to the earlier helplessness he felt in "Help!", but this turning of the tables can't be detected from individual token counts or sums across token categories.

Footnotes