ufcscraper.base module
Base module for the UFC scraper.
This module defines BaseFileHandler and BaseScraper classes, meant to be inherited by specific scraper or file handler modules.
- class ufcscraper.base.BaseFileHandler(data_folder: Path | str)
Bases: ABC
Base class for file handlers associated with a CSV table.
This class provides the basic functionality to manage data stored in a CSV file. It handles checking the existence of the file, initializing it with columns if it’s missing, removing duplicates, and loading the data into a pandas DataFrame.
- dtypes
A dictionary mapping column names to their data types.
- Type:
Dict[str, type | pd.core.arrays.integer.Int64Dtype]
- sort_fields
A list of column names used for sorting the data.
- Type:
List[str]
- data_folder
The folder where the CSV file is stored.
- Type:
Path
- filename
The name of the CSV file. This should be defined in subclasses.
- Type:
str
- data
A pandas DataFrame that holds the data loaded from the CSV file.
- check_data_file() → None
Checks if the CSV file exists in the specified data folder.
If the file does not exist, it creates a new file with the specified columns. Logs the status of the file (whether new or existing) using the logger.
- data = Empty DataFrame Columns: [] Index: []
- data_folder: Path
- dtypes: Dict[str, type | pd.core.arrays.integer.Int64Dtype]
- filename: str
- load_data() → None
Loads the data from the CSV file into the data DataFrame.
This method reads the CSV file, removes duplicates, and stores the data in the data attribute for further processing.
- remove_duplicates_from_file() → None
Removes duplicate rows from the CSV file.
This method reads the CSV file, removes any duplicate rows, and then saves the cleaned data back to the same file.
- sort_fields: List[str]
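The file-handling behavior described for BaseFileHandler can be sketched as follows. This is a minimal illustration and not the library's implementation: the class name CsvTableSketch, the data_file attribute, the example column names, and the choice to sort by sort_fields while removing duplicates are all assumptions made for the example.

```python
from pathlib import Path

import pandas as pd


class CsvTableSketch:
    """Hypothetical stand-in that mimics the documented behavior."""

    # Example values only; real subclasses define their own.
    dtypes = {"fighter_id": str, "wins": pd.Int64Dtype()}
    sort_fields = ["fighter_id"]
    filename = "fighters.csv"

    def __init__(self, data_folder):
        self.data_folder = Path(data_folder)
        self.data_file = self.data_folder / self.filename  # assumed helper attribute
        self.data = pd.DataFrame()
        self.check_data_file()
        self.load_data()

    def check_data_file(self):
        # Create a header-only CSV when the file is missing.
        if not self.data_file.is_file():
            self.data_folder.mkdir(parents=True, exist_ok=True)
            header = pd.DataFrame(columns=list(self.dtypes))
            header.to_csv(self.data_file, index=False)

    def remove_duplicates_from_file(self):
        # Read, drop exact duplicate rows, sort (an assumption), write back.
        table = pd.read_csv(self.data_file, dtype=self.dtypes)
        table = table.drop_duplicates().sort_values(self.sort_fields)
        table.to_csv(self.data_file, index=False)

    def load_data(self):
        # Clean the file first, then keep the result in memory.
        self.remove_duplicates_from_file()
        self.data = pd.read_csv(self.data_file, dtype=self.dtypes)
```

Keeping dtypes on the class lets every read of the file use the same column types, and an explicit nullable integer type such as pd.Int64Dtype() keeps missing values from turning integer columns into floats.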
- class ufcscraper.base.BaseHTMLReader(html_file: Path | str, data_folder: Path | str)
Bases: BaseFileHandler
Base class for HTML readers associated with a CSV file.
This class provides basic functionality for reading HTML files and storing the data in a CSV file. It includes methods to read HTML content and convert it into a pandas DataFrame.
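As a rough illustration of turning HTML content into tabular rows, the sketch below parses a simple table with the standard library only. The names TableRowParser and html_table_to_records are hypothetical, and the layout of the HTML files that BaseHTMLReader subclasses actually read is not documented here, so the single-table input is an assumption.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed
        self._cell = None  # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_records(html):
    """Return one dict per body row, keyed by the first row's cells."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

A list of records in this shape can be passed directly to pandas.DataFrame to obtain a table with one column per header cell.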
- class ufcscraper.base.BaseScraper(data_folder: Path | str, n_sessions: int | None = None, delay: float | None = None)
Bases: BaseFileHandler
Base class for web scrapers associated with a CSV file.
This class provides basic functionality for scraping data from specific websites and storing it in a CSV file. It includes default settings for web scraping such as the base URL, the number of concurrent sessions, and the delay between requests.
- web_url
The base URL for the website to scrape.
- Type:
str
- n_sessions
Number of concurrent sessions for scraping.
- Type:
int
- delay
Delay between requests to avoid being blocked.
- Type:
float
- delay: float = 0.1
- static id_from_url(url: str) → str
Extracts and returns the ID from a given URL.
- Parameters:
url – The URL from which to extract the ID.
- Returns:
The extracted ID as a string.
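One plausible way to extract an ID from a URL is to take the last non-empty path segment. The sketch below assumes that convention; the function name id_from_url_sketch and the example URL path are hypothetical, and the extraction rule used by BaseScraper.id_from_url may differ.

```python
from urllib.parse import urlparse


def id_from_url_sketch(url: str) -> str:
    """Return the last non-empty path segment of ``url``."""
    path = urlparse(url).path
    # Dropping empty segments makes a trailing slash harmless.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no path segment found in {url!r}")
    return segments[-1]


# Hypothetical URL, used only to show the call shape.
example_id = id_from_url_sketch("http://www.ufcstats.com/fighter-details/abc123/")
```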
- n_sessions: int = 1
- web_url: str = 'http://www.ufcstats.com'
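The interplay of web_url, n_sessions, and delay can be sketched with a thread pool. This is an illustration under stated assumptions, not the library's code: the class ScraperSettingsSketch, its injected fetch callable, and the fetch_all method are hypothetical, and treating a None argument as "keep the class default" is inferred from the constructor signature rather than documented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ScraperSettingsSketch:
    """Hypothetical stand-in showing how the three settings might be used."""

    web_url = "http://www.ufcstats.com"
    n_sessions = 1
    delay = 0.1

    def __init__(self, fetch, n_sessions=None, delay=None):
        # fetch is an injected callable (url -> result), so no network is needed.
        self.fetch = fetch
        # Assumption: None means "fall back to the class-level default".
        if n_sessions is not None:
            self.n_sessions = n_sessions
        if delay is not None:
            self.delay = delay

    def _fetch_then_wait(self, path):
        result = self.fetch(f"{self.web_url}/{path.lstrip('/')}")
        time.sleep(self.delay)  # pause before this worker takes another URL
        return result

    def fetch_all(self, paths):
        # One worker thread per session; map() returns results in input order.
        with ThreadPoolExecutor(max_workers=self.n_sessions) as pool:
            return list(pool.map(self._fetch_then_wait, paths))
```

In this arrangement the delay applies per worker, so raising n_sessions increases the overall request rate even when delay is unchanged.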