Using Scrapy to Analyse British Airways reviews from Skytrax

Github Repository

Scrapy is a powerful Python library for crawling websites and extracting data. I used Scrapy to collect the British Airways reviews from the Skytrax review website and store them in a JSON file. Regular expressions are used to clean the review text, and matplotlib is used to explore how reviewers responded to the various aircraft in the British Airways fleet.

Starting the Scrapy project:

Install Scrapy with the command pip install Scrapy. Then type the command scrapy startproject BritishAirways in the terminal to create the Scrapy project. Navigate to the file spiders/__init__.py inside the project. This is where we are going to implement the crawler and parser that scrape the reviews.

Analysing the structure of the review website

This is what a review block on the Skytrax website looks like. We need to scrape the overall rating, the review text, the name of the reviewer, the date, and all the rows of the table below them. We will look at how to scrape each element of this review block.

We can look at the HTML code for this review block to work out how to extract each piece of data in the table. Right-click on the page and select ‘inspect’. A pane on the right side of the page then shows the HTML code for each block of the website. The image below shows the HTML code of the table, with categories such as Type of Traveller set to Family Leisure, Seat set to Business class, and so on. Each piece of text is enclosed in a <td> element with a class that can be used to extract the text we want.

Code to extract the text from the website

Using this information I have written the following code to extract all the information from a review block and yield it as a dictionary, which Scrapy can then export to JSON.

import scrapy

#create a class review to crawl the review website
class review(scrapy.Spider):
    name = "review"

    #provide the start url

    start_urls = ["https://www.airlinequality.com/airline-reviews/british-airways/?sortby=post_date%3ADesc&pagesize=50"]

    #function to parse the reviews

    def parse(self, response):
        #storing the reviews in a variable
        reviews = response.xpath("//article[@itemprop='review']") 

        #parsing the individual reviews
        for review in reviews:
            review_data = {
                "name": review.css("span[itemprop='name']::text").get(),
                "rating": review.css("span[itemprop='ratingValue']::text").get(),
                #::attr() with get() returns None instead of raising KeyError when the element is missing
                "date": review.css("time[itemprop='datePublished']::attr(datetime)").get(),
                "text": review.css("div.text_content::text")[-1].get().replace(" |  ","").replace(" | ","")
            }
            #extracting data from the table element
            for row in review.css("tr"):
                column = row.css("td")
                if len(column) < 2:
                    continue    #skip rows that do not hold a label/value pair
                if column[1].css("::text").getall() == ['1', '2', '3', '4', '5']:
                    review_data[column[0].css("::text").get()] = len(column[1].css("span.fill"))    #extracting the star ratings in the review
                else:
                    review_data[column[0].css("::text").get()] = column[1].css("::text").get()

            yield review_data

        # finding the element that contains the link to the next page
        pagination_items = response.css("article.querylist-pagination.position-").css("li")

        # slicing and get() give None when there is no link, instead of raising an error
        next_page = pagination_items[-1:].css("a::attr(href)").get()

        #crawling and parsing the next page
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

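This is the whole code needed to run the scraper. Running the spider from the terminal, for example with scrapy crawl review -o reviews.json (the output file name here is only an example), gives us the JSON file to work with.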

The extracted dataset looks like the image shown below. This dataset needs a lot of cleaning and transformation, which is explained in my GitHub repository.
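
The actual cleaning steps live in the repository. Purely as an illustration, a regex-based pass over one scraped record might look like the sketch below. It assumes the record keys produced by the spider (text, rating, date) and that the date is an ISO string such as 2023-05-01. The helper names clean_text and clean_record are hypothetical.

```python
import re
from datetime import date

LEADING_SEPARATOR = re.compile(r"^\s*\|\s*")  # stray "|" left before the review body
WHITESPACE = re.compile(r"\s+")               # runs of spaces, tabs and newlines

def clean_text(raw):
    """Drop a leading '|' separator and collapse repeated whitespace."""
    if raw is None:
        return ""
    return WHITESPACE.sub(" ", LEADING_SEPARATOR.sub("", raw)).strip()

def clean_record(record):
    """Return a copy of one scraped record with typed, tidied fields."""
    cleaned = dict(record)
    cleaned["text"] = clean_text(record.get("text"))

    # The rating is scraped as a string; keep None when it is missing or not numeric.
    rating = record.get("rating")
    rating_text = str(rating).strip() if rating is not None else ""
    cleaned["rating"] = int(rating_text) if rating_text.isdigit() else None

    try:
        # Assumption: the date is an ISO string (YYYY-MM-DD).
        cleaned["date"] = date.fromisoformat(record.get("date") or "")
    except ValueError:
        cleaned["date"] = None
    return cleaned
```

Keeping missing values as None rather than dropping the record leaves the decision about incomplete reviews to the analysis step.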

Analysing the data

After cleaning the data I was able to identify some useful insights about the reviews.

This graph shows the mean rating of different aircraft in the various categories found on the review website. As you can see, the A380 and Boeing 787 consistently have higher ratings than the other aircraft.
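
A grouped mean of this kind can be computed with the standard library alone. The sketch below is one possible way to do it, not the code behind the graph. The key names (Aircraft, Seat Comfort) are assumed to match the table labels in the scraped records, and the sample records are invented for the example, not real review data.

```python
from collections import defaultdict
from statistics import mean

def mean_by_aircraft(records, category):
    """Average one star-rating category per aircraft, skipping incomplete records."""
    grouped = defaultdict(list)
    for record in records:
        aircraft = record.get("Aircraft")
        score = record.get(category)
        if aircraft and isinstance(score, int):
            grouped[aircraft].append(score)
    return {aircraft: mean(scores) for aircraft, scores in grouped.items()}

# Invented sample records, for illustration only.
sample_reviews = [
    {"Aircraft": "A380", "Seat Comfort": 4},
    {"Aircraft": "A380", "Seat Comfort": 5},
    {"Aircraft": "Boeing 777", "Seat Comfort": 2},
    {"Aircraft": "Boeing 777"},  # no score for this category, so it is ignored
]
print(mean_by_aircraft(sample_reviews, "Seat Comfort"))
```

The keys and values of the returned dictionary can be passed to a matplotlib bar chart as the labels and heights.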

This was supported by another bar graph that shows the number of recommendations for the various aircraft:

This shows that while most of the aircraft have a comparable ratio of ‘yes’ and ‘no’ recommendations, the Boeing 747 and Boeing 777 have the highest number of ‘no’ recommendations.
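
Counting the answers per aircraft is a small tallying job. The following sketch assumes each record has an Aircraft key and a Recommended key holding 'yes' or 'no'; both key names are assumptions about the scraped data, and the sample records are invented.

```python
from collections import Counter

def recommendation_counts(records):
    """Count 'yes' and 'no' recommendations for every aircraft."""
    counts = Counter()
    for record in records:
        aircraft = record.get("Aircraft")
        answer = (record.get("Recommended") or "").strip().lower()
        if aircraft and answer in {"yes", "no"}:
            counts[(aircraft, answer)] += 1
    return counts

# Invented sample records, for illustration only.
sample_answers = [
    {"Aircraft": "A380", "Recommended": "yes"},
    {"Aircraft": "A380", "Recommended": "Yes "},
    {"Aircraft": "Boeing 747", "Recommended": "no"},
    {"Aircraft": "Boeing 747", "Recommended": None},
]
print(recommendation_counts(sample_answers))
```

Because an aircraft with many reviews will have many of both answers, comparing the share of 'no' answers alongside the raw counts gives a fairer picture.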

The above graph shows the median rating in each year. We can see that 2018 is the year with the lowest median rating. During the COVID period the rating increased, which may reflect the lower number of passengers and hence the smaller number of reviews.
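
Since the number of reviews matters when reading a yearly median, it helps to compute both together. The sketch below is a rough illustration that assumes each record has an ISO date string and a numeric rating string, as scraped by the spider. The function name and the sample records are made up for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import median

def median_rating_by_year(records):
    """Map each year to (median overall rating, number of reviews)."""
    by_year = defaultdict(list)
    for record in records:
        try:
            year = date.fromisoformat(record["date"]).year
            rating = int(record["rating"])
        except (KeyError, TypeError, ValueError):
            continue  # skip records with a missing or malformed date or rating
        by_year[year].append(rating)
    return {year: (median(ratings), len(ratings))
            for year, ratings in sorted(by_year.items())}

# Invented sample records, for illustration only.
sample_by_date = [
    {"date": "2018-03-01", "rating": "3"},
    {"date": "2018-07-15", "rating": "7"},
    {"date": "2021-01-10", "rating": "2"},
    {"date": None, "rating": "9"},
]
print(median_rating_by_year(sample_by_date))
```

A year with only a handful of reviews can swing the median sharply, so the count is worth showing next to the rating.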