ufcscraper.base module

Base classes for the UFC scraper.

This module defines the BaseFileHandler, BaseHTMLReader, and BaseScraper classes, which are meant to be inherited by specific scraper or file handler modules.

class ufcscraper.base.BaseFileHandler(data_folder: Path | str)

Bases: ABC

Base class for file handlers associated with a CSV table.

This class provides the basic functionality to manage data stored in a CSV file. It handles checking the existence of the file, initializing it with columns if it’s missing, removing duplicates, and loading the data into a pandas DataFrame.

dtypes

A dictionary mapping column names to their data types.

Type: Dict[str, type | pd.core.arrays.integer.Int64Dtype]

sort_fields

A list of column names used for sorting the data.

Type: List[str]

data_folder

The folder where the CSV file is stored.

Type: Path

filename

The name of the CSV file. This should be defined in subclasses.

Type: str

data

A pandas DataFrame that holds the data loaded from the CSV file.

Type: pd.DataFrame

check_data_file() → None

Checks if the CSV file exists in the specified data folder.

If the file does not exist, it creates a new file with the specified columns. Logs the status of the file (whether new or existing) using the logger.
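
The behavior described above can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the function signature, the return value, the log messages, and the column names are all assumptions made for the example, and it uses the standard-library csv module instead of pandas to stay dependency-free.

```python
import csv
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def check_data_file(data_file: Path, columns: list[str]) -> bool:
    """Create data_file with a header row if it does not exist.

    Returns True when a new file was created, False when it already existed.
    (Hypothetical helper; the documented method returns None.)
    """
    if data_file.is_file():
        logger.info("Using existing file %s", data_file)
        return False

    data_file.parent.mkdir(parents=True, exist_ok=True)
    with data_file.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(columns)  # header only, no data rows
    logger.info("Created new file %s", data_file)
    return True


# Usage with a throwaway folder; the column names are made up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"
    created_first = check_data_file(path, ["fighter_id", "fighter_name"])
    created_second = check_data_file(path, ["fighter_id", "fighter_name"])
    header = path.read_text(encoding="utf-8").strip()
```

Calling the helper twice shows the idempotent behavior: the first call writes the header, the second leaves the file untouched.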

data = Empty DataFrame (Columns: [], Index: [])

data_folder: Path

dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype]

filename: str

load_data() → None

Loads the data from the CSV file into the data DataFrame.

This method reads the CSV file, removes duplicates, and stores the data in the data attribute for further processing.

remove_duplicates_from_file() → None

Removes duplicate rows from the CSV file.

This method reads the CSV file, removes any duplicate rows, and then saves the cleaned data back to the same file.
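
As a rough illustration of the read, deduplicate, and write-back cycle, the sketch below does the same thing with the standard-library csv module. This is a stand-in written for the example, not the library's code (which most likely relies on pandas); the function signature and the returned count are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def remove_duplicates_from_file(data_file: Path) -> int:
    """Rewrite data_file without repeated rows; return how many were dropped.

    (Hypothetical helper; the documented method takes no arguments and
    returns None.)
    """
    with data_file.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return 0

    header, body = rows[0], rows[1:]
    # dict.fromkeys keeps the first occurrence of each row and preserves order.
    unique = list(dict.fromkeys(tuple(row) for row in body))

    with data_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(unique)
    return len(body) - len(unique)


# Usage with made-up rows: the third data row repeats the first.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.csv"
    path.write_text("id,name\n1,a\n2,b\n1,a\n", encoding="utf-8")
    dropped = remove_duplicates_from_file(path)
    remaining = path.read_text(encoding="utf-8").splitlines()
```

Only exact, whole-row duplicates are removed here; whether the library compares whole rows or a subset of columns is not stated in this page.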

sort_fields: List[str]

class ufcscraper.base.BaseHTMLReader(html_file: Path | str, data_folder: Path | str)

Bases: BaseFileHandler

Base class for HTML readers associated with a CSV file.

This class provides basic functionality for reading HTML files and storing the data in a CSV file. It includes methods to read HTML content and convert it into a pandas DataFrame.

read_html() → str

Reads the HTML content from the specified HTML file.

Returns: The HTML content as a string.
Return type: str

class ufcscraper.base.BaseScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)

Bases: BaseFileHandler

Base class for web scrapers associated with a CSV file.

This class provides basic functionality for scraping data from specific websites and storing it in a CSV file. It includes default settings for web scraping such as the base URL, the number of concurrent sessions, and the delay between requests.

web_url

The base URL for the website to scrape.

Type: str

n_sessions

Number of concurrent sessions for scraping.

Type: int

delay

Delay between requests to avoid being blocked.

Type: float
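
The constructor accepts n_sessions and delay as optional arguments while the class also carries defaults for both. One plausible way these fit together is sketched below: a None argument leaves the class-level default in place, and any other value overrides it on the instance. SketchScraper is a hypothetical stand-in written for this example, and the fallback logic is an assumption, not a confirmed description of BaseScraper.

```python
from pathlib import Path
from typing import Optional, Union


class SketchScraper:
    """Hypothetical stand-in, not the library's BaseScraper."""

    # Class-level defaults, matching the values listed in this page.
    web_url: str = "http://www.ufcstats.com"
    n_sessions: int = 1
    delay: float = 0.1

    def __init__(
        self,
        data_folder: Union[Path, str],
        n_sessions: Optional[int] = None,
        delay: Optional[float] = None,
    ) -> None:
        self.data_folder = Path(data_folder)
        # Only shadow the class-level default when a value was passed in.
        if n_sessions is not None:
            self.n_sessions = n_sessions
        if delay is not None:
            self.delay = delay


default = SketchScraper("data")
tuned = SketchScraper("data", n_sessions=4, delay=0.5)
```

With this pattern, `default` reads its settings from the class, while `tuned` carries its own values without affecting other instances.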

delay: float = 0.1

static id_from_url(url: str) → str

Extracts and returns the ID from a given URL.

Parameters: url – The URL from which to extract the ID.
Returns: The extracted ID as a string.
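
This page does not say which part of the URL is treated as the ID. The sketch below assumes the ID is the last non-empty path segment, which is a guess for illustration only; the example URLs and ID values are made up.

```python
from urllib.parse import urlparse


def id_from_url(url: str) -> str:
    """Return the last non-empty path segment of url.

    Assumption: the ID is the final path component. The library's actual
    rule may differ.
    """
    path = urlparse(url).path  # drops scheme, host, query, and fragment
    return path.rstrip("/").rsplit("/", 1)[-1]


# Made-up URLs for illustration.
plain = id_from_url("http://www.ufcstats.com/fighter-details/abc123")
trailing = id_from_url("http://www.ufcstats.com/fighter-details/abc123/")
with_query = id_from_url("http://www.ufcstats.com/event-details/xyz?foo=1")
```

Parsing the URL first means a trailing slash or a query string does not end up in the returned ID.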

n_sessions: int = 1

web_url: str = 'http://www.ufcstats.com'