Scrapy | Crawl WhoScored For Football Stats

Earlier, I wrote code to crawl Google Play, the iTunes App Store and Goal.com. But each time I had to rewrite the code to fetch content from the website, parse it with BeautifulSoup, and maintain a list of already-crawled URLs to avoid crawling pages twice. This was a lot of work.

A while ago, I discovered Scrapy. It's a Python-based framework that makes it super easy to set up a crawler and lets one focus on extracting the data that needs processing. In this post I will walk through installing Scrapy, writing a crawler for WhoScored, and extracting match information from it.

Installation is super easy. On a Mac with pip installed, one just needs to run the following command to install Scrapy along with all its dependencies:

pip install Scrapy

Now we need to set up a Scrapy project. Let's call our project soccerstats.

scrapy startproject soccerstats

This will create a soccerstats directory for us. To set up the crawler, we need to define an item, which is a container for the crawled data, and write a spider that actually extracts the data. Additionally, we can define a custom LinkExtractor and a processing Pipeline.
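For orientation, the layout that scrapy startproject generates looks roughly like this (the spider and link-extractor modules under spiders/ are the ones we add ourselves):

```
soccerstats/
    scrapy.cfg              # project configuration
    soccerstats/
        __init__.py
        items.py            # item definitions
        pipelines.py        # processing pipelines
        settings.py         # project settings
        spiders/
            __init__.py
```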

The SoccerStats project can be found at — GitHub SoccerStats.

Let's first look at the crawler configuration in scrapy.cfg.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/en/latest/topics/scrapyd.html

[settings]
default = soccerstats.settings

[deploy]
#url = http://localhost:6800/
project = soccerstats

The above tells Scrapy that the settings file is present under the Python package soccerstats and is named settings.py. Let’s now see the settings file.

# -*- coding: utf-8 -*-

# Scrapy settings for soccerstats project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'soccerstats'

SPIDER_MODULES = ['soccerstats.spiders']
NEWSPIDER_MODULE = 'soccerstats.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Anuvrat Singh (+http://singhanuvrat.com)'

# WhoScored settings
WHOSCORED_MATCH_FEED = 'http://www.whoscored.com/Matches/%s/Live'
WHOSCORED_FEED_URL = 'http://www.whoscored.com/tournamentsfeed/%s/Fixtures/?d=%s%s&isAggregate=false'
TOURNAMENTS = [9155,    # England-Premier-League-2014-2015
               11369,   # Italy-Serie-A-2014-2015
               11363,   # Spain-La-Liga-2014-2015
               9192,    # Germany-Bundesliga-2014-2015
               9105,    # France-Ligue-1-2014-2015
               9121,    # Netherlands-Eredivisie-2014-2015
               9145,    # Russia-Premier-League-2014-2015
               4185,    # Fixtures/Brazil-Brasileiro-2014
               8358,    # USA-Major-League-Soccer-2014
               11306,   # Turkey-Super-Lig-2014-2015
               9156,    # England-Championship-2014-2015
               9189,    # Europe-UEFA-Champions-League-2014-2015
               9187,    # Europe-UEFA-Europa-League-2014-2015
               10274    # International-FIFA-World-Cup-2014
               ]
TOURNAMENT_YEARS = [2014, 2015]    # Years for the current season

CONCURRENT_ITEMS = 200
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False
LOG_ENABLED = False

LOG_FILE = 'application.log'

The SPIDER_MODULES setting tells Scrapy that spiders can be found in the soccerstats.spiders module. We have also defined a USER_AGENT so that website owners can know who has been crawling them. Then there are a few settings that configure concurrent request counts and logging.

One can add any configuration to this file and access it anywhere in the project via get_project_settings().get(CONFIG_KEY_NAME). I have added the match and feed URLs and the list of tournaments we are interested in. I prefer extracting application configuration into a separate file rather than hardcoding everything in code.
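As a quick illustration of how the configured templates are used (with the values hard-coded here instead of going through get_project_settings()), the feed URL template expands like this:

```python
# The %-template from settings.py, expanded for one tournament and month.
# 9155 is the England-Premier-League tournament id; 2014 + '07' means July 2014.
WHOSCORED_FEED_URL = 'http://www.whoscored.com/tournamentsfeed/%s/Fixtures/?d=%s%s&isAggregate=false'

url = WHOSCORED_FEED_URL % (9155, 2014, '07')
# → 'http://www.whoscored.com/tournamentsfeed/9155/Fixtures/?d=201407&isAggregate=false'
```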

Next, we need to define the Item that we will extract from each page. Each item class needs to extend scrapy.Item. The code below is pretty self-explanatory.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WhoScoredRatingsItem(scrapy.Item):
    """
        Define the item that will store the 
        player ratings for a team in a match
    """
    match_id = scrapy.Field()
    venue_name = scrapy.Field()
    referee_name = scrapy.Field()
    start_time = scrapy.Field()
    man_of_the_match = scrapy.Field()
    competition = scrapy.Field()

    home_team_id = scrapy.Field()
    home_team_name = scrapy.Field()
    home_team_average_age = scrapy.Field()
    home_team_manager = scrapy.Field()
    home_team_rating = scrapy.Field()
    home_team_formation = scrapy.Field()
    home_team_players = scrapy.Field()
    home_team_score_halftime = scrapy.Field()
    home_team_score_fulltime = scrapy.Field()

    away_team_id = scrapy.Field()
    away_team_name = scrapy.Field()
    away_team_average_age = scrapy.Field()
    away_team_manager = scrapy.Field()
    away_team_rating = scrapy.Field()
    away_team_formation = scrapy.Field()
    away_team_players = scrapy.Field()
    away_team_score_halftime = scrapy.Field()
    away_team_score_fulltime = scrapy.Field()

Now we need to define a CrawlSpider and a LinkExtractor. We provide a list of start_urls to the CrawlSpider; Scrapy fetches each of those pages and extracts the URLs to crawl next, discarding any that fall outside allowed_domains. In our particular case, the start URLs are tournament feed URLs. Each feed is a list of lists in which the first element of every inner list is the matchId. Using this matchId we can build the link to the match page. To extract the matchIds and form the links to match pages, we will write a custom LinkExtractor. Let's start with that.
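To see what the extractor is dealing with, here is the parsing step in isolation, run on a made-up feed body (the matchIds and columns below are hypothetical; the real feed carries many more fields per row):

```python
import ast

# A made-up tournament feed body: a list of lists whose first element is the matchId.
# Empty fields appear as ',,', which literal_eval cannot parse, so they get padded with 0.
body = '[[829535,,"FT"],[829536,,"FT"]]'
rows = ast.literal_eval(body.replace(',,', ',0,'))
match_ids = [row[0] for row in rows]
# → [829535, 829536]

# Each matchId then plugs into the match-page template from settings.py.
match_page = 'http://www.whoscored.com/Matches/%s/Live'
links = [match_page % match_id for match_id in match_ids]
```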

# -*- coding: utf-8 -*-

"""
Link extractors for WhoScored web pages.
"""

import ast

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.link import Link
from scrapy.utils.project import get_project_settings

class WhoScoredLinkExtractor(LinkExtractor):
    """
    Extract links from WhoScored Tournament Feed pages
    """

    def extract_links(self, response):
        """
        Given the response from the tournament feed page, extract all the matches and return URLs to their page.
        :param response: The contents of the tournament feed page
        :return: The list of URL to match pages
        """
        match_page = get_project_settings().get('WHOSCORED_MATCH_FEED')
        
        return [Link(match_page % match[0]) for match in ast.literal_eval(response.body.replace(',,', ',0,'))]

To implement a link extractor, one needs to extend the LinkExtractor class and define the extract_links(response) method. This method is called for every start URL. It returns a list of Link objects that the CrawlSpider can crawl next.

Let’s look at the actual spider now.

# -*- coding: utf-8 -*-

"""
Spider to crawl WhoScored web pages.
"""

import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings
from soccerstats.spiders.whoscored_linkextractor import WhoScoredLinkExtractor
from soccerstats.items import WhoScoredRatingsItem
import json


class WhoScoredSpider(CrawlSpider):
    """
    Define the crawler to crawl WhoScored web pages and extract player ratings
    """
    name = 'WhoScored'
    allowed_domains = ['www.whoscored.com']
    rules = (
        Rule(WhoScoredLinkExtractor(), callback='parse_item'),
    )

    def __init__(self, tournament=None, year=None, month=None, *args, **kwargs):
        super(WhoScoredSpider, self).__init__(*args, **kwargs)
        self.start_urls = self.__prepare_seed_list(tournament, year, month)

    def parse_item(self, response):
        """
        Given the match feed page, extract ratings for both the teams
        :param response: The contents of the match page
        :return: The rating items for both the teams.
        """
        
        match_data_element = response.xpath('//script[contains(., "matchCentreData")]/text()').extract()
        if len(match_data_element) == 0:
            return
        
        match_data = match_data_element[0]
        match_center_data = json.loads(re.search("matchCentreData = (.+?);", match_data).group(1))

        item = WhoScoredRatingsItem()

        item['match_id'] = re.search("matchId = (.+?);", match_data).group(1)
        item['venue_name'] = match_center_data['venueName'] if 'venueName' in match_center_data else ''
        item['referee_name'] = match_center_data['refereeName'] if 'refereeName' in match_center_data else '' 
        item['start_time'] = match_center_data['startTime']
        item['competition'] = response.xpath('//div[@id="breadcrumb-nav"]/a/text()').extract()
        
        for pos in ['home', 'away']:
            team_data = match_center_data[pos]
            item[pos + '_team_id'] = team_data['teamId']
            item[pos + '_team_name'] = team_data['name']
            item[pos + '_team_average_age'] = team_data['averageAge']
            item[pos + '_team_manager'] = team_data['managerName']
            item[pos + '_team_formation'] = team_data['formations'][0]['formationName']
            item[pos + '_team_score_halftime'] = team_data['scores']['halftime']
            item[pos + '_team_score_fulltime'] = team_data['scores']['fulltime']
            
            players = {}
            team_rating = 0.0
            players_involved = 0
            for player in team_data['players']:
                player_id = player['playerId']
                
                ratings_array = player['stats']['ratings'] if 'ratings' in player['stats'] else None
                if ratings_array:
                    rating = ratings_array[max(ratings_array, key = int)]
                    team_rating += rating
                    players_involved += 1
                else: 
                    rating = -1
                
                players[player_id] = {'age': player['age'], 
                                      'height': player['height'], 
                                      'shirt': player['shirtNo'] if 'shirtNo' in player else -1, 
                                      'position': player['position'], 
                                      'name': player['name'], 
                                      'started': 'isFirstEleven' in player and player['isFirstEleven'], 
                                      'rating': rating
                                      }
                
                if player['isManOfTheMatch']:
                    item['man_of_the_match'] = {'id': player_id, 'name': player['name']}
            
            item[pos + '_team_players'] = players
            item[pos + '_team_rating'] = team_rating / players_involved 
        
        return item
        

    def __prepare_seed_list(self, tournament, year, month):
        whoscored_feed_url = get_project_settings().get('WHOSCORED_FEED_URL')

        if tournament and year and month:
            return [whoscored_feed_url % (tournament, year, month.zfill(2))]

        tournaments = get_project_settings().get('TOURNAMENTS')
        years = get_project_settings().get('TOURNAMENT_YEARS')

        dates = [(years[0], month) for month in xrange(6, 12)]
        #dates.extend([(years[1], month) for month in xrange(1, 6)])

        return [whoscored_feed_url % (tournament, year, str(month).zfill(2)) for tournament in tournaments for (year, month) in dates]

The spider class needs to extend CrawlSpider. We have defined allowed_domains, which in our case is just www.whoscored.com; Scrapy will only follow whoscored.com links and ignore the rest. The method __prepare_seed_list(tournament, year, month) is responsible for generating the list of start_urls.

We have also defined a rule that uses WhoScoredLinkExtractor() to extract links from the start_urls pages. Notice that we have defined a callback for this rule, which means the method parse_item will be called for every URL returned by the link extractor.

The method parse_item() does the main job of extracting data from the page by parsing the appropriate XPaths. It fills the Item we created earlier and returns it. A point to remember: if one wants to return multiple items from a page, one simply yields each item object.
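The heart of parse_item is pulling the embedded JavaScript data out of the page's script tag, which can be sketched in isolation. The script text below is a hypothetical, heavily trimmed stand-in for what a real match page embeds:

```python
import json
import re

# Hypothetical, trimmed stand-in for the inline <script> on a match page.
script_text = ('var matchId = 829535; '
               'var matchCentreData = {"venueName": "Old Trafford", '
               '"startTime": "2014-08-16T12:45:00"};')

# The same non-greedy regexes used in parse_item: grab everything up to the
# first semicolon after each assignment.
match_id = re.search("matchId = (.+?);", script_text).group(1)
data = json.loads(re.search("matchCentreData = (.+?);", script_text).group(1))
# match_id → '829535', data['venueName'] → 'Old Trafford'
```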

To run the crawler we have a couple of options:

# To write the output into ratings.json file.
scrapy crawl WhoScored -o ratings.json -t json -L INFO

# To write the output in the console, and also provide inputs to the crawler.
scrapy crawl WhoScored -a tournament=9155 -a year=2014 -a month=7

Output from a run of the crawler:

2015-01-16 22:15:14+0530 [scrapy] INFO: Scrapy 0.24.4 started (bot: soccerstats)
2015-01-16 22:15:14+0530 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-01-16 22:15:14+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'soccerstats.spiders', 'FEED_URI': 'ratings.json', 'LOG_LEVEL': 'INFO', 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'CONCURRENT_REQUESTS': 32, 'SPIDER_MODULES': ['soccerstats.spiders'], 'BOT_NAME': 'soccerstats', 'CONCURRENT_ITEMS': 200, 'ROBOTSTXT_OBEY': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'Anuvrat Singh (+http://singhanuvrat.com)', 'FEED_FORMAT': 'json', 'LOG_FILE': 'application.log'}
2015-01-16 22:15:14+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-01-16 22:15:14+0530 [scrapy] INFO: Enabled downloader middlewares: RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-01-16 22:15:14+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-01-16 22:15:14+0530 [scrapy] INFO: Enabled item pipelines: 
2015-01-16 22:15:14+0530 [WhoScored] INFO: Spider opened
2015-01-16 22:15:14+0530 [WhoScored] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-16 22:16:14+0530 [WhoScored] INFO: Crawled 136 pages (at 136 pages/min), scraped 103 items (at 103 items/min)
2015-01-16 22:17:14+0530 [WhoScored] INFO: Crawled 258 pages (at 122 pages/min), scraped 224 items (at 121 items/min)
2015-01-16 22:18:14+0530 [WhoScored] INFO: Crawled 387 pages (at 129 pages/min), scraped 354 items (at 130 items/min)
2015-01-16 22:19:14+0530 [WhoScored] INFO: Crawled 513 pages (at 126 pages/min), scraped 478 items (at 124 items/min)
2015-01-16 22:20:14+0530 [WhoScored] INFO: Crawled 638 pages (at 125 pages/min), scraped 605 items (at 127 items/min)
2015-01-16 22:21:14+0530 [WhoScored] INFO: Crawled 783 pages (at 145 pages/min), scraped 730 items (at 125 items/min)
2015-01-16 22:22:14+0530 [WhoScored] INFO: Crawled 920 pages (at 137 pages/min), scraped 867 items (at 137 items/min)
2015-01-16 22:23:14+0530 [WhoScored] INFO: Crawled 1059 pages (at 139 pages/min), scraped 1006 items (at 139 items/min)
2015-01-16 22:24:14+0530 [WhoScored] INFO: Crawled 1280 pages (at 221 pages/min), scraped 1122 items (at 116 items/min)
2015-01-16 22:25:14+0530 [WhoScored] INFO: Crawled 1392 pages (at 112 pages/min), scraped 1234 items (at 112 items/min)
2015-01-16 22:26:14+0530 [WhoScored] INFO: Crawled 1537 pages (at 145 pages/min), scraped 1379 items (at 145 items/min)
2015-01-16 22:26:51+0530 [WhoScored] INFO: Closing spider (finished)
2015-01-16 22:26:51+0530 [WhoScored] INFO: Stored json feed (1432 items) in: ratings.json
2015-01-16 22:26:51+0530 [WhoScored] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 623577,
     'downloader/request_count': 1880,
     'downloader/request_method_count/GET': 1880,
     'downloader/response_bytes': 310691128,
     'downloader/response_count': 1880,
     'downloader/response_status_count/200': 1702,
     'downloader/response_status_count/403': 178,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 16, 16, 56, 51, 113574),
     'item_scraped_count': 1432,
     'log_count/INFO': 19,
     'request_depth_max': 1,
     'response_received_count': 1880,
     'scheduler/dequeued': 1879,
     'scheduler/dequeued/memory': 1879,
     'scheduler/enqueued': 1879,
     'scheduler/enqueued/memory': 1879,
     'start_time': datetime.datetime(2015, 1, 16, 16, 45, 14, 402884)}
2015-01-16 22:26:51+0530 [WhoScored] INFO: Spider closed (finished)

I want to add here that if WhoScored.com were to change their pages and move things around, the only classes we would need to revisit are WhoScoredLinkExtractor and WhoScoredSpider. On the other hand, if we want to add a new crawler, we just need to define a new link extractor and spider. Everything else is taken care of by Scrapy.