UFC Stats Modules

ufcscraper.base module

Base classes for the UFC scraper.

This module defines BaseFileHandler and BaseScraper classes, meant to be inherited by specific scraper or file handler modules.

class ufcscraper.base.BaseFileHandler(data_folder: Path | str)[source]

Bases: ABC

Base class for file handlers associated with a CSV table.

This class provides the basic functionality to manage data stored in a CSV file. It handles checking the existence of the file, initializing it with columns if it’s missing, removing duplicates, and loading the data into a pandas DataFrame.

dtypes

A dictionary mapping column names to their data types.

Type:: Dict[str, type | pd.core.arrays.integer.Int64Dtype]

sort_fields

A list of column names used for sorting the data.

Type:: List[str]

data_folder

The folder where the CSV file is stored.

Type:: Path

filename

The name of the CSV file. This should be defined in subclasses.

Type:: str

data: A pandas DataFrame that holds the data loaded from the CSV file.

check_data_file() → None[source]

Checks if the CSV file exists in the specified data folder.

If the file does not exist, it creates a new file with the specified columns. Logs the status of the file (whether new or existing) using the logger.
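The check-then-create behaviour described above can be sketched with the standard library alone. This is an illustration, not the library's implementation: the function name ensure_csv, its parameters, and the log messages are assumptions, and the real method works on the handler's own data_file and dtypes attributes.

```python
import csv
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def ensure_csv(data_folder: Path, filename: str, columns: list[str]) -> Path:
    """Create the CSV with a header row if it is missing; return its path."""
    data_file = Path(data_folder) / filename
    if data_file.is_file():
        logger.info("Using existing file %s", data_file)
    else:
        with data_file.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(columns)
        logger.info("Created new file %s", data_file)
    return data_file


with tempfile.TemporaryDirectory() as folder:
    path = ensure_csv(Path(folder), "example.csv", ["event_id", "event_name"])
    header = path.read_text(encoding="utf-8").strip()
    # A second call finds the file and leaves it untouched.
    ensure_csv(Path(folder), "example.csv", ["ignored"])
    header_after = path.read_text(encoding="utf-8").strip()
```

Because the existing file is never rewritten, calling the check repeatedly is safe.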

data = Empty DataFrame Columns: [] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype]

filename: str

load_data() → None[source]

Loads the data from the CSV file into the data DataFrame.

This method reads the CSV file, removes duplicates, and stores the data in the data attribute for further processing.

remove_duplicates_from_file() → None[source]

Removes duplicate rows from the CSV file.

This method reads the CSV file, removes any duplicate rows, and then saves the cleaned data back to the same file.
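As a rough sketch of the read, de-duplicate, and write-back cycle, the following uses only the csv module. The helper name remove_duplicate_rows is hypothetical, and the library itself presumably works through pandas rather than plain csv.

```python
import csv
import tempfile
from pathlib import Path


def remove_duplicate_rows(data_file: Path) -> int:
    """Rewrite data_file without repeated rows; return how many were dropped."""
    with data_file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return 0
    header, body = rows[0], rows[1:]
    # dict.fromkeys keeps the first occurrence of each row, in order.
    unique = list(dict.fromkeys(tuple(row) for row in body))
    with data_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(unique)
    return len(body) - len(unique)


with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.csv"
    path.write_text("fight_id,round\na1,1\na1,1\na1,2\n", encoding="utf-8")
    dropped = remove_duplicate_rows(path)
    cleaned = path.read_text(encoding="utf-8").splitlines()
```

Only rows that match in every column count as duplicates here; whether the library compares all columns or a subset is not stated in this reference.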

sort_fields: List[str]

class ufcscraper.base.BaseHTMLReader(html_file: Path | str, data_folder: Path | str)[source]

Bases: BaseFileHandler

Base class for HTML readers associated with a CSV file.

This class provides basic functionality for reading HTML files and storing the data in a CSV file. It includes methods to read HTML content and convert it into a pandas DataFrame.

read_html() → str[source]

Reads the HTML content from the specified HTML file.

Returns:: The HTML content as a string.
Return type:: str

class ufcscraper.base.BaseScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: BaseFileHandler

Base class for web scrapers associated with a CSV file.

This class provides basic functionality for scraping data from specific websites and storing it in a CSV file. It includes default settings for web scraping such as the base URL, the number of concurrent sessions, and the delay between requests.

web_url

The base URL for the website to scrape.

Type:: str

n_sessions

Number of concurrent sessions for scraping.

Type:: int

delay

Delay between requests to avoid being blocked.

Type:: float
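To illustrate how n_sessions and delay might interact, here is a small sketch that limits concurrency with a thread pool and pauses after each request. Everything in it is an assumption made for illustration: fetch_all and fake_fetch are hypothetical names, and the library may schedule its requests differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(
    urls: List[str],
    fetch: Callable[[str], str],
    n_sessions: int = 1,
    delay: float = 0.1,
) -> List[str]:
    """Fetch every URL with at most n_sessions workers, pausing after each request."""

    def polite_fetch(url: str) -> str:
        result = fetch(url)
        time.sleep(delay)  # pause this worker before its next request
        return result

    with ThreadPoolExecutor(max_workers=n_sessions) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(polite_fetch, urls))


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request, so the sketch runs offline."""
    return f"<html>{url}</html>"


pages = fetch_all(["a", "b", "c"], fake_fetch, n_sessions=2, delay=0.01)
```

With n_sessions workers each sleeping for delay seconds, the overall request rate is roughly n_sessions / delay per second, so raising either setting changes the load placed on the site.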

data_file: Path

data_folder: Path

delay: float = 0.1

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype]

filename: str

static id_from_url(url: str) → str[source]

Extracts and returns the ID from a given URL.

Parameters:: url – The URL from which to extract the ID.
Returns:: The extracted ID as a string.
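This reference does not say where in the URL the ID sits. A minimal sketch, assuming the ID is the final path segment of a page URL such as the 'event-details' links; the ID value below is made up and the function is a stand-in, not the library's code.

```python
from urllib.parse import urlparse


def id_from_url(url: str) -> str:
    """Return the last non-empty path segment of url (assumed to be the ID)."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


# Illustrative URL; the ID value is made up.
example_id = id_from_url("http://www.ufcstats.com/event-details/0123456789abcdef")
```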

n_sessions: int = 1

sort_fields: List[str]

web_url: str = 'http://www.ufcstats.com'

ufcscraper.fighter_scraper module

This module defines a FighterScraper class for scraping and processing fighter data from UFCStats.

The FighterScraper class inherits from the BaseScraper class and is designed to retrieve detailed information about UFC fighters, including personal details, physical attributes, and fight records. The scraped data is processed and saved into a CSV file for later analysis. The module also provides methods for parsing and converting specific attributes like height, weight, reach, and more from the scraped HTML content.

class ufcscraper.fighter_scraper.FighterScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: BaseScraper

Scrapes and stores fighter data from UFCStats.

This class handles scraping fighter details from UFCStats, including personal information, physical attributes, and fight records. The data is saved to a CSV file for further analysis.

add_name_column() → None[source]

Adds a combined name column to the DataFrame.

The new column is created by concatenating the fighter’s first and last names.

data = Empty DataFrame Columns: [fighter_id, fighter_f_name, fighter_l_name, fighter_nickname, fighter_height_cm, fighter_weight_lbs, fighter_reach_cm, fighter_stance, fighter_dob, fighter_w, fighter_l, fighter_d, fighter_nc_dq] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype] = {'fighter_d': Int64Dtype(), 'fighter_dob': 'datetime64[ns]', 'fighter_f_name': <class 'str'>, 'fighter_height_cm': <class 'float'>, 'fighter_id': <class 'str'>, 'fighter_l': Int64Dtype(), 'fighter_l_name': <class 'str'>, 'fighter_nc_dq': Int64Dtype(), 'fighter_nickname': <class 'str'>, 'fighter_reach_cm': <class 'float'>, 'fighter_stance': <class 'str'>, 'fighter_w': Int64Dtype(), 'fighter_weight_lbs': <class 'float'>}

filename: str = 'fighter_data.csv'

get_fighter_urls() → List[str][source]

Retrieves the URLs for fighter profiles.

Returns:: A list of URLs to fighter profiles.

static parse_dob(dob: bs4.element.Tag) → str[source]

Parses and formats the fighter’s date of birth.

Parameters:: dob – BeautifulSoup tag containing the date of birth.
Returns:: The date of birth in YYYY-MM-DD format, or “” if not available.
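The date conversion could look like the sketch below. It takes a plain string rather than a BeautifulSoup tag, and it assumes the page shows dates in a form like "Jul 14, 1988"; both the input format and the name format_dob are assumptions.

```python
from datetime import datetime


def format_dob(raw: str) -> str:
    """Convert text like 'Jul 14, 1988' to '1988-07-14'; return '' if unparseable."""
    try:
        return datetime.strptime(raw.strip(), "%b %d, %Y").strftime("%Y-%m-%d")
    except ValueError:
        return ""


dob = format_dob("Jul 14, 1988")
missing = format_dob("--")
```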

static parse_height(height: bs4.element.Tag) → str[source]

Parses and converts fighter’s height from feet and inches to cm.

Parameters:: height – BeautifulSoup tag containing the height in feet and inches.
Returns:: The height in centimeters, or “” if not available.
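The conversion itself is simple arithmetic: total inches = feet × 12 + inches, and 1 inch = 2.54 cm. The sketch below works on a plain string instead of a BeautifulSoup tag and assumes input shaped like 5' 11"; the helper name and the two-decimal output are illustrative choices.

```python
import re


def height_to_cm(raw: str) -> str:
    """Convert text like 5' 11" to centimeters as a string; '' if unparseable."""
    match = re.search(r"(\d+)'\s*(\d+)\"", raw)
    if match is None:
        return ""
    feet, inches = int(match.group(1)), int(match.group(2))
    total_inches = feet * 12 + inches
    return f"{total_inches * 2.54:.2f}"


height = height_to_cm("5' 11\"")
unknown = height_to_cm("--")
```

The same 2.54 factor applies to reach, which is given in inches only.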

static parse_l_name(name: List[str]) → str[source]

Parses the last name from a list of name parts.

Parameters:: name – List of name parts.
Returns:: The parsed last name, or “” if it cannot be determined.

static parse_nickname(nickname: bs4.element.Tag) → str[source]

Parses the fighter’s nickname.

Parameters:: nickname – BeautifulSoup tag containing the nickname.
Returns:: The parsed nickname, or “” if not available.

static parse_reach(reach: bs4.element.Tag) → str[source]

Parses and converts fighter’s reach from inches to cm.

Parameters:: reach – BeautifulSoup tag containing the reach in inches.
Returns:: The reach in centimeters, or “” if not available.

static parse_stance(stance: bs4.element.Tag) → str[source]

Parses the fighter’s stance.

Parameters:: stance – BeautifulSoup tag containing the stance.
Returns:: The stance, or “” if not available.

static parse_weight(weight_element: bs4.element.Tag) → str[source]

Parses the fighter’s weight.

Parameters:: weight_element – BeautifulSoup tag containing the weight.
Returns:: The weight in pounds, or “” if not available.

scrape_fighters() → None[source]

Scrapes fighter details from URLs and saves the data to a CSV file.

This method retrieves fighter URLs, scrapes details from each URL, and appends the data to the CSV file. Handles errors and logs progress.

sort_fields: List[str] = ['fighter_l_name', 'fighter_f_name', 'fighter_id']

classmethod url_from_id(id_: str) → str[source]

Constructs the URL for a fighter’s details page based on their ID.

Parameters:: id_ – The fighter’s unique identifier.
Returns:: The URL for the fighter’s details page.
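Declaring this as a classmethod lets the URL be built from class-level settings without creating a scraper instance. A hedged sketch of the idea: SketchScraper is a hypothetical class, and the "fighter-details" path segment is an assumption not confirmed by this reference; only web_url mirrors a documented attribute.

```python
class SketchScraper:
    """Hypothetical class; only web_url mirrors a documented attribute."""

    web_url: str = "http://www.ufcstats.com"
    # Assumed path segment for fighter pages; not confirmed by this reference.
    details_path: str = "fighter-details"

    @classmethod
    def url_from_id(cls, id_: str) -> str:
        """Join the base URL, the page type, and the ID."""
        return f"{cls.web_url}/{cls.details_path}/{id_}"


fighter_url = SketchScraper.url_from_id("0123456789abcdef")
```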

ufcscraper.event_scraper module

This module contains the EventScraper class, which is responsible for scraping event data from the UFCStats website.

The EventScraper class inherits from BaseScraper and provides functionality to retrieve and process event details such as event name, date, city, state, and country. The scraped data is stored in a CSV file (event_data.csv) and can be used for further analysis.

class ufcscraper.event_scraper.EventScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: BaseScraper

Scrapes event data from the UFCStats website.

This class handles scraping event details such as event name, date, city, state, and country, and stores them in a CSV file. It inherits basic scraping functionality from BaseScraper.

data = Empty DataFrame Columns: [event_id, event_name, event_date, event_city, event_state, event_country] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype] = {'event_city': <class 'str'>, 'event_country': <class 'str'>, 'event_date': 'datetime64[ns]', 'event_id': <class 'str'>, 'event_name': <class 'str'>, 'event_state': <class 'str'>}

event_type = 'completed'

filename: str = 'event_data.csv'

get_event_urls() → List[str][source]

Retrieves the URLs of all completed events from UFCStats.

This method scrapes the UFCStats website for event URLs that contain the keyword ‘event-details’. It returns a list of these URLs.

Returns:: A list of URLs for completed events.

get_fight_urls_from_event_urls(event_urls: List[str]) → List[str][source]

Extracts fight URLs from a list of event URLs.

Parameters:: event_urls – A list of event URLs from which to extract fight URLs.
Returns:: A list of fight URLs extracted from the provided event URLs.

scrape_events() → None[source]

Scrapes event data and saves it to a CSV file.

This method compares existing event URLs with those available on the UFCStats website, scrapes details of new events, and appends them to the CSV file. Logs the progress and any errors encountered.
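The comparison step amounts to a set difference between what the site lists and what is already stored. A minimal sketch, assuming the stored table keeps event IDs and that each ID is the last path segment of its URL; new_urls is a hypothetical helper and the IDs are made up.

```python
from typing import Iterable, List


def new_urls(available_urls: Iterable[str], scraped_ids: Iterable[str]) -> List[str]:
    """Return the available URLs whose trailing ID has not been scraped yet."""
    seen = set(scraped_ids)
    return [
        url
        for url in available_urls
        if url.rstrip("/").rsplit("/", 1)[-1] not in seen
    ]


base = "http://www.ufcstats.com/event-details"
pending = new_urls([f"{base}/aaa", f"{base}/bbb", f"{base}/ccc"], ["aaa", "ccc"])
```

Skipping events that are already stored keeps repeated runs cheap, since only new events trigger requests.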

sort_fields: List[str] = ['event_date', 'event_name']

classmethod url_from_id(id_: str) → str[source]

Constructs the event URL using the event ID.

Parameters:: id_ – The unique identifier for the event.
Returns:: The full URL to the event’s details page on UFCStats.

class ufcscraper.event_scraper.UpcomingEventScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: EventScraper

data_file: Path

data_folder: Path

event_type = 'upcoming'

filename: str = 'upcoming_event_data.csv'

get_fight_urls_from_event_urls(event_urls: List[str]) → List[str][source]

Extracts fight URLs from a list of event URLs.

Parameters:: event_urls – A list of event URLs from which to extract fight URLs.
Returns:: A list of fight URLs extracted from the provided event URLs.

ufcscraper.fight_scraper module

This module defines classes for scraping fight and round data from the UFCStats website.

Classes:

FightScraper: Inherits from BaseScraper and is responsible for scraping detailed fight statistics, such as fighter information, results, referees, and more. The data is stored in a CSV file named fight_data.csv. It also interacts with the RoundsHandler to scrape and store round-specific statistics.

RoundsHandler: Inherits from BaseFileHandler and manages the collection and storage of round-specific fight data. The data is saved in a CSV file named round_data.csv. It handles statistics like strikes, takedowns, control time, and more.

class ufcscraper.fight_scraper.BaseFightScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: BaseScraper, ABC

Base class for fight scrapers.

This class provides the basic functionality to scrape fight data from the UFCStats website. It should be inherited by specific fight scraper classes.

event_scraper: alias of EventScraper

get_fight_urls(get_all_events: bool = False) → List[str][source]

Retrieves URLs of all fights from UFCStats.

Parameters:: get_all_events – If False, only gets URLs for fights from events not already scraped.
Returns:: A list of URLs for fights.

static get_fighters(fight_details: bs4.element.ResultSet, fight_soup: bs4.BeautifulSoup) → Tuple[str, str][source]

Extracts fighter IDs from the fight details.

Parameters:

fight_details – A ResultSet containing fight detail information.
fight_soup – The BeautifulSoup object containing the fight page.

Returns:

A tuple containing the IDs of the two fighters.

static get_title_fight(fight_type: bs4.element.ResultSet) → str[source]

Determines if the fight is a title fight.

Parameters:: fight_type – A ResultSet containing fight type information.
Returns:: ‘T’ if it’s a title fight, ‘F’ otherwise.

static get_weight_class(fight_type: bs4.element.ResultSet) → str[source]

Extracts the weight class of the fight.

Parameters:: fight_type – A ResultSet containing fight type information.
Returns:: The weight class of the fight, or ‘’ if not found.
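Both get_title_fight and get_weight_class read the same fight-type text. The sketch below shows one way such a lookup could work on a plain string; the function names, the keyword test for "title", and the list of weight classes are all assumptions, and the library's actual matching rules may differ.

```python
from typing import List

# Longer names first so "Light Heavyweight" is not reported as "Heavyweight".
WEIGHT_CLASSES: List[str] = [
    "Light Heavyweight", "Strawweight", "Flyweight", "Bantamweight",
    "Featherweight", "Lightweight", "Welterweight", "Middleweight",
    "Heavyweight",
]


def title_fight_flag(fight_type_text: str) -> str:
    """Return 'T' when the bout description mentions a title, else 'F'."""
    return "T" if "title" in fight_type_text.lower() else "F"


def weight_class_from_text(fight_type_text: str) -> str:
    """Return the first known weight class named in the text, or ''."""
    lowered = fight_type_text.lower()
    for name in WEIGHT_CLASSES:
        if name.lower() in lowered:
            return name
    return ""


flag = title_fight_flag("UFC Lightweight Title Bout")
weight = weight_class_from_text("Light Heavyweight Bout")
```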

classmethod url_from_id(id_: str) → str[source]

Constructs the fight URL using the fight ID.

Parameters:: id_ – The unique identifier for the fight.
Returns:: The full URL to the fight’s details page on UFCStats.

class ufcscraper.fight_scraper.FightScraper(*args: Any, **kwargs: Any)[source]

Bases: BaseFightScraper

Scrapes fight data from the UFCStats website.

This class inherits from BaseFightScraper and handles scraping detailed fight statistics including fighters, referees, results, and more. It saves the scraped data into two CSV files: one for fights and one for rounds (through the companion class RoundsHandler).

data = Empty DataFrame Columns: [fight_id, event_id, referee, fighter_1, fighter_2, winner, num_rounds, title_fight, weight_class, gender, result, result_details, finish_round, finish_time, time_format, scores_1, scores_2] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype] = {'event_id': <class 'str'>, 'fight_id': <class 'str'>, 'fighter_1': <class 'str'>, 'fighter_2': <class 'str'>, 'finish_round': Int64Dtype(), 'finish_time': <class 'str'>, 'gender': <class 'str'>, 'num_rounds': Int64Dtype(), 'referee': <class 'str'>, 'result': <class 'str'>, 'result_details': <class 'str'>, 'scores_1': Int64Dtype(), 'scores_2': Int64Dtype(), 'time_format': <class 'str'>, 'title_fight': <class 'str'>, 'weight_class': <class 'str'>, 'winner': <class 'str'>}

filename: str = 'fight_data.csv'

static get_gender(fight_type: bs4.element.ResultSet) → str[source]

Determines the gender of the fight.

Parameters:: fight_type – A ResultSet containing fight type information.
Returns:: ‘F’ if it’s a women’s fight, ‘M’ otherwise.

static get_referee(overview: bs4.element.ResultSet) → str[source]

Extracts the referee’s name from the fight overview.

Parameters:: overview – A ResultSet containing fight overview information.
Returns:: The referee’s name, or ‘’ if not found.

static get_result(select_result: bs4.element.ResultSet, select_result_details: bs4.element.ResultSet) → Tuple[str, str][source]

Extracts the result and details of the fight.

Parameters:

select_result – A ResultSet containing the fight result.
select_result_details – A ResultSet containing additional result details.

Returns:

A tuple with the result type and result details.

static get_scores(overview: bs4.element.ResultSet, select_result: bs4.element.ResultSet, select_result_details: bs4.element.ResultSet) → Tuple[str, str][source]

Extracts the scores of the fight if the fight went the distance.

Parameters:

overview – A ResultSet containing the fight overview.
select_result – A ResultSet containing the fight result.
select_result_details – A ResultSet containing additional result details.

Returns:

A tuple with the scores of the fight, as strings so they can be written to the CSV file.

static get_winner(fighter_1: str, fighter_2: str, win_lose: bs4.element.ResultSet) → str[source]

Determines the winner of the fight based on the win/lose status.

Parameters:

fighter_1 – The ID of the first fighter.
fighter_2 – The ID of the second fighter.
win_lose – A ResultSet containing win/lose status for the fighters.

Returns:

The ID of the winner, ‘Draw’ if it’s a draw, ‘NC’ if it’s a no contest, or ‘’ if not determined.
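The four possible return values can be sketched as a small decision function. It takes plain status strings instead of a ResultSet, and the status codes "W", "L", "D", and "NC" are assumptions about what the page shows; pick_winner is a hypothetical name.

```python
from typing import Sequence


def pick_winner(fighter_1: str, fighter_2: str, statuses: Sequence[str]) -> str:
    """Map two per-fighter status codes to a winner ID, 'Draw', 'NC', or ''."""
    if len(statuses) != 2:
        return ""
    first, second = (status.strip().upper() for status in statuses)
    if first == "W":
        return fighter_1
    if second == "W":
        return fighter_2
    if first == second == "D":
        return "Draw"
    if first == second == "NC":
        return "NC"
    return ""


winner = pick_winner("id_a", "id_b", ["L", "W"])
```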

scrape_fights(get_all_events: bool = False) → None[source]

Scrapes fight data and saves it to CSV files.

This method scrapes fight details and round statistics. It saves the fight details and round statistics to separate CSV files.

Parameters:: get_all_events – If False, only scrapes fights from events not already scraped.

sort_fields: List[str] = ['event_id', 'fight_id']

class ufcscraper.fight_scraper.RoundsHandler(data_folder: Path | str)[source]

Bases: BaseFileHandler

Handles the manipulation and storage of round statistics.

This class inherits from BaseFileHandler and manages round-specific statistics, including strikes, takedowns, and control time. It formats and saves the data to a CSV file.

data = Empty DataFrame Columns: [fight_id, fighter_id, round, knockdowns, strikes_att, strikes_succ, head_strikes_att, head_strikes_succ, body_strikes_att, body_strikes_succ, leg_strikes_att, leg_strikes_succ, distance_strikes_att, distance_strikes_succ, ground_strikes_att, ground_strikes_succ, clinch_strikes_att, clinch_strikes_succ, total_strikes_att, total_strikes_succ, takedown_att, takedown_succ, submission_att, reversals, ctrl_time] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype] = {'body_strikes_att': Int64Dtype(), 'body_strikes_succ': Int64Dtype(), 'clinch_strikes_att': Int64Dtype(), 'clinch_strikes_succ': Int64Dtype(), 'ctrl_time': <class 'str'>, 'distance_strikes_att': Int64Dtype(), 'distance_strikes_succ': Int64Dtype(), 'fight_id': <class 'str'>, 'fighter_id': <class 'str'>, 'ground_strikes_att': Int64Dtype(), 'ground_strikes_succ': Int64Dtype(), 'head_strikes_att': Int64Dtype(), 'head_strikes_succ': Int64Dtype(), 'knockdowns': Int64Dtype(), 'leg_strikes_att': Int64Dtype(), 'leg_strikes_succ': Int64Dtype(), 'reversals': Int64Dtype(), 'round': Int64Dtype(), 'strikes_att': Int64Dtype(), 'strikes_succ': Int64Dtype(), 'submission_att': Int64Dtype(), 'takedown_att': Int64Dtype(), 'takedown_succ': Int64Dtype(), 'total_strikes_att': Int64Dtype(), 'total_strikes_succ': Int64Dtype()}

filename: str = 'round_data.csv'

static get_stats(fight_stats: bs4.element.ResultSet, fighter: int, round_: int, finish_round: int) → Tuple[str, ...][source]

Extracts round statistics for a specific fighter in a given fight.

Parameters:

fight_stats – A ResultSet containing fight statistics.
fighter – The index of the fighter (0 or 1).
round_ – The round number.
finish_round – The total number of rounds.

Returns:

A tuple of statistics for the specified fighter in the given round. Returns “” for all fields if an error occurs.

Raises:

ValueError – If fighter is not 0 or 1.
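The error handling described above has two tiers: an invalid fighter index raises, while unreadable statistics fall back to empty strings. The sketch below shows that split for a single attempted/landed pair. It assumes each table cell holds text like "12 of 30" for each fighter, which this reference does not confirm, and strikes_for is a hypothetical helper.

```python
from typing import Sequence, Tuple


def strikes_for(cells: Sequence[str], fighter: int) -> Tuple[str, str]:
    """Return (attempted, landed) for fighter 0 or 1 from a two-item cell pair."""
    if fighter not in (0, 1):
        # Raised outside the try block so it is never swallowed below.
        raise ValueError("fighter must be 0 or 1")
    try:
        landed, attempted = cells[fighter].split(" of ")
        return str(int(attempted)), str(int(landed))
    except (IndexError, ValueError):
        # Missing or malformed cells become empty CSV fields.
        return "", ""


stats = strikes_for(["12 of 30", "7 of 19"], 1)
```

The values stay strings because they are written straight to the CSV file; the order follows the strikes_att, strikes_succ column order.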

sort_fields: List[str] = ['fight_id', 'fighter_id', 'round']

class ufcscraper.fight_scraper.UpcomingFightScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)[source]

Bases: BaseFightScraper

Scrapes fight data for upcoming events from the UFCStats website.

This class inherits from BaseFightScraper and is specifically designed to scrape fight data for upcoming events. It uses the UpcomingEventScraper to get event URLs and then scrapes fight details from those events.

data = Empty DataFrame Columns: [fight_id, event_id, fighter_1, fighter_2, title_fight, weight_class] Index: []

data_file: Path

data_folder: Path

dtypes: Dict[str, type] = {'event_id': <class 'str'>, 'fight_id': <class 'str'>, 'fighter_1': <class 'str'>, 'fighter_2': <class 'str'>, 'title_fight': <class 'str'>, 'weight_class': <class 'str'>}

event_scraper: alias of UpcomingEventScraper

filename: str = 'upcoming_fight_data.csv'

remove_rows_from_table(fight_ids: List[str]) → None[source]

Removes rows from the fight data table based on fight IDs.

Parameters:: fight_ids – A list of fight IDs to be removed from the data.
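Removing rows by ID is a simple filter. The sketch below uses a list of dictionaries in place of the DataFrame so that it runs without pandas; drop_fights is a hypothetical helper and the IDs are made up.

```python
from typing import Dict, Iterable, List


def drop_fights(rows: List[Dict[str, str]], fight_ids: Iterable[str]) -> List[Dict[str, str]]:
    """Return the rows whose 'fight_id' is not in fight_ids."""
    unwanted = set(fight_ids)
    return [row for row in rows if row["fight_id"] not in unwanted]


table = [
    {"fight_id": "f1", "event_id": "e1"},
    {"fight_id": "f2", "event_id": "e1"},
    {"fight_id": "f3", "event_id": "e2"},
]
remaining = drop_fights(table, ["f2"])
```

This kind of removal is useful for upcoming fights, since a bout can be cancelled or moved after it was first scraped.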

scrape_fights() → None[source]

Scrapes fight data for upcoming events and saves it to a CSV file.

This method scrapes fight details and saves them to a CSV file.

sort_fields: List[str] = ['event_id', 'fight_id']