Screenplay Analysis Tool using Natural Language Processing

The source code of this app can be found in my github link given below:
https://github.com/Birkbeck/msc-project-source-code-files-22-23-hakkam10

This app can be used in the following link:

https://huggingface.co/spaces/hakkam10/Screenplay_Emotion_Analysis

You can also try out the app here:

This app was developed as my dissertation project for my Masters in Data Science degree. This app attempts to study a new approach in screenplay analysis and movie recommendations using the transformers model and new visualizing techniques.

This project started with three questions:

What if you can identify the overall emotion pattern of a screenplay without reading it?

What if you can make a machine understand a screenplay by vectorising the emotions contained in the screenplay?

What if you can predict what movie a person might like using the emotions contained in the screenplay?

The screenplays were scrapped from IMSdb website and split into 100 segments of equal length. I have trained an emotion classification model and used the model to classify each segment in each screenplay into one of six emotion categories and stored it as a vector. The emotion categories are joy, sadness, surprise, love, fear and anger.

Users can select a movie of their interest and select a similarity measure of their choice to see movies that follow the same emotional pattern as the selected movie.

Data Collection and Modelling

Datasets for training emotion classification model

Ideally, to train an emotion classification model specific for screenplay analysis, we would need a dataset of scenes from a screenplay annotated by people to the emotion that the scene evokes. But this was out of this project’s scope as this is a project done by a single person on a small budget.

Hence, three datasets from various sources were used to train the model. One of them was twitter tweets emotion classification dataset that can be accessed in this link: https://huggingface.co/datasets/dair-ai/emotion

The other one was a dataset containing reddit comments and was classified based on 16 emotions. It can be accessed here: https://huggingface.co/datasets/go_emotions

The next one was crowdflower dataset which can be accessed here: https://data.world/crowdflower/sentiment-analysis-in-text

But the crowdflower dataset added more noise to the training model and hence it was rejected and only the first two datasets were used. Both the datasets followed different type of taxonomy. This raised the need to map the emotions to the same ekman taxonomy of 6 basic emotions: Joy, Fear, Anger, Sadness, Disgust, Surprise where disgust was replaced with love to adapt to the movie scenario since love emotion plays a major part in movies.

Collecting and segmenting screenplays

The screenplays were collected from the IMSdB website which is a collection of more than 2000 screenplays. The screenplays were scrapped from the website using Beautiful Soup python package.

The screenplays were split into 100 segments to be classified by the trained model. The classified emotions are stored as a vector of length 100 for each movie. This vector is then used to carry out further visualization and analysis.

I have uploaded the screenplays, their segments and the emotion vectors as a dataset which can be accessed in this website: https://huggingface.co/datasets/hakkam10/screenplay_emotions

Training the emotion classifier model

DistilBERT uncased language model has been used to train the modelled emotion dataset. It was trained using the huggingface’s transformers library. The trained model can be accessed in this link: https://huggingface.co/hakkam10/ekman-emotion-classifier

This model was then used to classify the segments in the screenplay.

Deployment

The app was designed to let the users choose a movie of interest and analyse their emotion pattern. For each movie the app visualizes the emotion pattern using the emotion vectors and displays the ’emotion barcode’ of the film which is a sequence of color segments with each color representing a emotion at that point in the screenplay.

For example, if you choose La La Land the emotion barcode will be displayed as given below:

Here each color represents the emotion evoked at a particular point in the film. The colors representing the emotions is given below:

Joy – yellow

Love – Pink

sadness – dark blue

surprise – light green

fear – dark green

anger – red

In contrast if you choose a darker film like Joker, where the dominating emotion is fear and anger you would get a barcode like this:

Through this you can get a glimpse of what the movie is about without watching or reading the whole screenplay.

Similarity Measures

Another functionality of the app is to let users to choose a similarity measure to output top 5 similar movie screenplays that follows a similar emotion sequence. This is still not an accurate representation of similarity between screenplays, but it serves as an initiation in the direction where it can be further developed to study screenplay similarities and used in movie recommendation systems. Currently there are six similarity measures deployed in this app: Euclidean distance, Cosine similarity, Needleman-Wunsch algorithm, Jaro metric, Hamming Metric and Levenshtein Distance.

Analysing a new screenplay

Another functionality of the app is that users can also analyse a new screenplay or their own screenplay to visualize the emotion trajectory of their screenplay and find movies that follow a similar trajectory. The screenplay should be uploaded in a text file format with utf-8 encoding. Once uploaded the same classifier model used to classify other screenplays will classify this new screenplay and output the visualization and similar screenplays.

Further Development

This is currently the initiation of a more ambitious project where the model would be trained on a custom dataset of human annotated dataset of scenes from a screenplay mapped to the emotions they evoke with a more fine grained taxonomy of emotions. The model should also be able to handle large chunk of text to reduce the number of segments which will result in a more accurate similarity measures. This will result in a more accurate depiction of movie emotions and will aid in further research in this area of screenplay analysis.