Sourcing, Cleaning, and Exploring Collective Investment Schemes in Kenya
The dataset compiles Effective Annual Rates for Money Market Funds (KES) and Assets Under Management, sourced from Capital Markets Authority reports and Cytonn Research publications. Utilizing web crawling techniques, data cleaning, and basic exploratory analysis, we’ve created a standardized dataset suitable for in-depth financial research. The dataset compiles Effective Annual Rates for Money Market Funds (KES) and Assets Under Management for the schemes.
data-science
Author
Kabui, Charles
Published
4 October 2024
Modified
6 October 2024
Abstract
This study presents a comprehensive dataset of Kenyan Collective Investment Schemes and their performance from 2014 to 2024. By leveraging web crawling techniques on Capital Markets Authority (CMA) and Cytonn Research reports, we compiled data on Effective Annual Rates for Money Market Funds (KES) and Assets Under Management for the schemes (Money Market Funds, Equity Funds, Fixed Income Funds, Balanced Funds, and Special Funds). The research process involved data sourcing, cleaning, and basic exploratory analysis, resulting in a standardized dataset suitable for further research. We provide two primary visualizations: plots of Annual Rates for oney Market Funds (KES) and Assets Under Management over time. This dataset aims to serve as a foundation for researchers, investors, and industry stakeholders to conduct more in-depth analyses of Kenya’s collective investment landscape. By archiving and publishing this data, we contribute to the accessibility and transparency of financial information in the Kenyan market.
In recent years, Kenya’s financial landscape has witnessed significant growth and diversification, with Collective Investment Schemes playing an increasingly prominent role. Among these, Money Market Funds (MMFs) have emerged as a particularly popular investment vehicle, offering a unique blend of benefits that appeal to a wide range of investors in the Kenyan market.
Money Market Funds operate by pooling capital from numerous investors, which professional fund managers then invest collectively in short-term, highly liquid financial instruments. This structure allows MMFs to offer several key advantages in the Kenyan context:
Higher Returns: MMFs typically provide superior interest rates compared to standard savings accounts, making them an attractive option for investors seeking to maximize their returns on short-term investments.
Lower Entry Barriers: With the ability to start investing with smaller amounts, MMFs have democratized access to professional fund management, opening up opportunities for a broader range of Kenyan investors.
Compound Interest: Unlike most traditional bank deposits that offer simple interest, MMFs generally provide compound interest. This feature can lead to accelerated wealth accumulation over time, particularly benefiting long-term investors.
Enhanced Liquidity: MMFs maintain high liquidity, allowing investors to access their funds quickly when needed, typically within one to two business days after a withdrawal request. This flexibility is crucial in a dynamic economy like Kenya’s, where financial needs can change rapidly.
Diversification: By investing in a variety of short-term securities, MMFs offer a level of diversification that can be challenging for individual investors to achieve on their own, especially with limited capital.
The growing popularity of MMFs in Kenya reflects broader trends in the country’s financial sector, including increased financial literacy, a growing middle class, and the expansion of digital financial services. However, despite their importance, comprehensive and accessible data on the performance and characteristics of these funds has been limited.
This study aims to address this gap by creating a robust, clean dataset of Kenyan Collective Investment Schemes, with a particular focus on Money Market Funds. Our objectives are:
To source and gather relevant data from authoritative sources, including the Capital Markets Authority (CMA) and published financial reports.
To clean and standardize the collected data, ensuring consistency and reliability for analytical purposes.
To archive and publish the resulting dataset, facilitating further research and analysis by academics, industry professionals, and policymakers.
By undertaking this data-centric approach, we aim to contribute to the broader understanding of Kenya’s financial market dynamics, particularly in the realm of collective investments. This dataset will serve as a foundation for more in-depth analyses, potentially informing investment strategies, policy decisions, and academic research in the field of Kenyan finance.
In the following sections, we will detail our methodology for data collection and cleaning, and present the structure of the resulting dataset. Through this effort, we hope to not only shed light on the current state of Money Market Funds in Kenya but also to set a precedent for transparent, reproducible financial data curation in emerging markets.
Sourcing and Gathering Data
Before we begin, lets prepare our enviroment with some important python packages and reusable functions
collective_scheme_type =\dict[Literal['Scheme'], str] |dict[Literal['Funds'], list[str]]def strip_start_end(s1: str, last_acceptable_characters =')'):""" Cleans a given string by removing specific patterns and non-alphabet characters at the start and end of a string. Args: s1 (str): The input string to be cleaned. last_acceptable_characters (str, optional): Characters that are acceptable at the end of the string. Defaults to ')'. Returns: str: The cleaned string. The function performs the following steps: 1. Removes the phrase "comprising of" or "which comprises of". 2. Removes the word "and" followed by any non-alphabet characters at the end of the string. 3. Removes any non-alphabet characters from the start of the string. 4. Removes any non-alphabet characters from the end of the string, except those specified in `last_acceptable_characters`. 5. Replaces multiple spaces with a single space. 6. Strips leading and trailing whitespace. 7. Recursively applies the function if any of the patterns still match the string. 8. Removes non-ASCII characters. Example: >>> strip_start_end("comprising of example and123") 'example' """iftype(s1) !=stror s1 isNone:return''# Define a regex pattern to match 'and' followed by any non-alphabet # characters at the end of the string and_pattern =r'\band[^a-zA-Z]*$'# Define a regex pattern to match any non-alphabet characters at the start of the string non_alphabet_start =r'^[^a-zA-Z]+'# Define a regex pattern to match any non-alphabet characters at the end of the string non_alphabet_end =f'[^a-zA-Z{last_acceptable_characters}]+$'# Define a regex pattern to match the phrase "comprising of|which comprises of" comprising_of_pattern =r'comprising of|which comprises of'# Replace multiple spaces with a single space multiple_white_space =r'\s+' s2 = re.sub(comprising_of_pattern, '', s1) s3 = re.sub(and_pattern, '', s2) s4 = re.sub(non_alphabet_start, '', s3) s5 = re.sub(non_alphabet_end, '', s4) s6 = re.sub(multiple_white_space, ' ', s5) s7 = s6.strip()# Recursively apply the function if any of the patterns still match the stringwhileany(re.match(p, s7) for p in [ and_pattern, non_alphabet_start, non_alphabet_end, comprising_of_pattern]):return strip_start_end(s5)# remove non ASCII characters s8 = s7.encode('ascii', errors='ignore').decode()# Return the cleaned stringreturn s8def hacky_normalizer(val: str):""" Normalizes a given string by performing the following operations: 1. Strips leading and trailing whitespace. 2. Converts the string to uppercase. 3. Replaces special characters with underscores, (non-alphanumeric, non-percent, non-parentheses, non-underscore). 4. Replaces multiple consecutive underscores with a single underscore. Args: val (str): The input string to be normalized. Returns: str: The normalized string. """ val = val.strip().upper()# Replace special characters with underscore modified_string = re.sub(r'[^a-zA-Z0-9\%()_]', '_', val)# Replace multiple consecutive underscores with a single underscore modified_string = re.sub(r'_+', '_', modified_string)return modified_stringdef dynamic_callback(callback, *args):""" Dynamically calls a callback function with the appropriate number of arguments. This function inspects the signature of the provided callback function to determine the number of parameters it accepts. It then calls the callback with the corresponding number of arguments from the provided *args. Args: callback (Callable): The function to be called. *args: Variable length argument list to be passed to the callback. Returns: The result of the callback function call. Raises: TypeError: If the callback is not callable. """ sig = inspect.signature(callback) param_count =len(sig.parameters)if param_count ==0:return callback()return callback(*args[:param_count])
asyncdef collective_investment_schemes_click_fn(page: Page):await page.wait_for_selector('ul.module-accordion') elements =await page.query_selector_all('li .accordion-title')for element in elements: text_content =await element.text_content()if'APPROVED COLLECTIVE INVESTMENT SCHEMES'in text_content:await element.click() accordion_element =await page.wait_for_selector('li.current.builder-accordion-active')await page.evaluate(""" document.querySelector('#headerwrap').style.display = 'none'; document.querySelector('.pojo-a11y-toolbar-toggle').style.display = 'none'; """)await asyncio.sleep(1)return accordion_elementprint('Element not found')# Take a screenshotawait web_screenshot_async("https://www.cma.or.ke/licensees-market-players/", action = collective_investment_schemes_click_fn, width =1000, screenshot_options =None, crop_options = { 'bottom': 600, 'right': 600 })
licensees.cma.or.ke
Show Code
asyncdef collective_investment_schemes_2(page: Page):returnawait page.query_selector('table')# Take a screenshotawait web_screenshot_async(# Fund manager URL"https://licensees.cma.or.ke/licenses/15/", action = collective_investment_schemes_2, width =2000, screenshot_options =None, crop_options = { 'bottom': 500, 'right': 700 },)
Crawling
Next, let’s try grab the schemes table into a dataframe that we can work with. Below is the list of all the certified schemes in Kenya by CMA. 34
Show Code
def extract_collective_scheme_name(para: Tag): full_name =' '.join([i.get_text(strip=True) for i in para.find_all('strong')])return strip_start_end(full_name)def make_collective_unit_obj(tbody_tr_td: Tag) -> collective_scheme_type:return {'Scheme': extract_collective_scheme_name(tbody_tr_td.find('p') or tbody_tr_td),'Funds': [ strip_start_end(i.get_text(separator=' ', strip=True)) for i in tbody_tr_td.select('ul li') ] }def fetch_collective_schemes_1(): CMA_market_players_html: str= urlopen("https://www.cma.or.ke/licensees-market-players/").read() investment_schemes_table_html = BeautifulSoup(CMA_market_players_html, "html.parser")\ .find('span', string="APPROVED COLLECTIVE INVESTMENT SCHEMES")\ .find_parent('li')\ .find('table')return [ make_collective_unit_obj(tbody_tr_td)for tbody_tr_td in investment_schemes_table_html.select('tbody tr td') ]def fetch_collective_schemes_2(): CMA_market_players_html: str= urlopen("https://licensees.cma.or.ke/licenses/15/").read() investment_schemes_table_html = BeautifulSoup(CMA_market_players_html, "html.parser")\ .find('table')return [ make_collective_unit_obj(tbody_tr_td)for tbody_tr_td in investment_schemes_table_html.select('tbody tr > :first-child') ]# For example: # Orient Umbrella Collective Investment Scheme (formerly Alphafrica Umbrella Fund) => # Orient Umbrella Collective Investment Schemedef remove_quoted_str(str1: str): return re.sub(r'\(.*?(?!\)).*?$', '', str1 or'').strip()def remove_rendadant_words(str1: str):return re.sub(r'\b(scheme|schemes|trust|trusts|specialized|special|funds|fund|unit|units|collective|investment)\b\s*', '', str1 or'', flags=re.IGNORECASE).strip()def remove_special_words(str1: str):return re.sub(r'\b(specialized|special)\b\s*', '', str1 or'', flags=re.IGNORECASE).strip()def make_merge_key(str1: str): return hacky_normalizer(remove_rendadant_words(remove_quoted_str(str1)))def merge_collective_schemes(schemes_list: list[collective_scheme_type]) -> collective_scheme_type: all_names: dict[str, list[str]] = groupby( make_merge_key, [unit_obj['Scheme'] for unit_obj in schemes_list]) all_schemes: dict[str, list[str]] = groupby( make_merge_key, [scheme for unit_obj in schemes_list for scheme in unit_obj['Funds']])return {'Scheme': sorted( [name for values in all_names.values() for name in values], key =lambda x: len(remove_special_words(remove_quoted_str(x))), reverse=True )[0],'Funds': [sorted(schemes, key =lambda x: len(x), reverse=True)[0]for schemes in all_schemes.values() ] }collective_schemes_1 = fetch_collective_schemes_1()collective_schemes_2 = fetch_collective_schemes_2()collective_schemes_1_2 = collective_schemes_1 + collective_schemes_2collective_schemes_grouped_by_name = groupby(lambda x: make_merge_key(x['Scheme']), collective_schemes_1_2)collective_schemes = [ merge_collective_schemes(collective_schemes) for collective_schemes in collective_schemes_grouped_by_name.values()iflen(collective_schemes) >0andlen(collective_schemes[0]['Scheme']) >0]collective_schemes_df = pd.DataFrame(collective_schemes)collective_schemes_df
Scheme
Funds
0
African Alliance Kenya Unit Trust Scheme
[African Alliance Kenya Money Market Fund (For...
1
British-American Unit Trust Scheme
[British-American Money Market Fund, British-A...
2
NCBA Unit Trust Funds
[NCBA Fixed Income Fund, NCBA Equity Fund, NCB...
3
Zimele Unit Trust Scheme
[Zimele Balanced Fund, Zimele Money Market Fun...
4
ICEA Unit Trust Scheme
[ICEA Money Market Fund, ICEA Equity Fund, ICE...
5
Standard Investment Trust Funds
[Standard Investment Equity Growth Fund, Stand...
6
CIC Unit Trust Scheme
[CIC Money Market Fund, CIC Balanced Fund, CIC...
7
Madison Unit Trust Fund
[Madison Money Market Fund, Madison Fixed Inco...
8
Dyer and Blair Unit Trust Scheme
[Dyer and Blair Diversified Fund, Dyer and Bla...
9
Amana Unit Trust Funds Scheme
[Amana Money Market Fund, Amana Balanced Fund,...
10
Diaspora Unit Trust Scheme
[Diaspora Money Market Fund, Diaspora Bond Fun...
11
First Ethical Opportunities Fund
[]
12
Genghis Unit Trust Funds
[GenCap Hazina Fund (Bond Fund), GenCap Eneza ...
13
Mali Money Market Fund
[]
14
Sanlam Unit Trust Scheme
[Sanlam Money Market Fund (Sanlam Pesa Plus Fu...
15
Nabo Africa Funds
[Nabo Africa Money Market Fund, Nabo Africa Ba...
16
Old Mutual Unit Trust Scheme
[Old Mutual Equity Fund, Old Mutual Money Mark...
17
Equity Investment Bank Collective Investment S...
[Equity Investment Bank Money Market Fund, Equ...
18
Dry Associates Unit Trust Scheme
[Dry Associates Money Market Fund (Kenya Shill...
19
Co-op Trust Fund
[Co-op Balanced Fund, Co-op Equity Fund, Co-op...
20
Apollo Unit Trust Scheme
[Apollo Money Market Fund, Apollo Balanced Fun...
21
Cytonn Unit Trust Scheme
[Cytonn Money Market Fund (USD), Cytonn Balanc...
22
Orient Umbrella Collective Investment Scheme (...
[Orient Hifadhi Fixed Income Fund (formerly Al...
23
Wanafunzi Investment Unit Trust Fund
[]
24
Absa Unit Trust Funds
[Absa Balanced Fund, Absa Bond Fund, Absa Doll...
25
Jaza Unit Trust Fund
[Boda Riders Money Market Fund, Jaza Premier M...
26
Masaru Unit Trust Scheme
[Masaru Wealth Management Fund, Masaru Money M...
27
ADAM Unit Trust Scheme
[ADAM Balanced Fund, ADAM Equities Fund, ADAM ...
28
KCB Unit Trust Scheme (formerly Natbank Unit T...
[KCB Money Market Fund (USD) (formerly Natbank...
29
GenAfrica Unit Trust Scheme
[GenAfrica Money Market Fund, GenAfrica Equity...
30
Amaka Unit Trust (Umbrella) Scheme
[Amaka HOSP Fixed Income Fund, Amaka Qualified...
31
Jubilee Unit Trust Collective Investment Scheme
[Jubilee Balanced Fund, Jubilee Equity Fund, J...
32
Enwealth Capital Unit Trust Scheme
[Enwealth Balanced Fund, Enwealth Equity Fund,...
33
Kuza Asset Management Unit Trust Scheme
[Kuza Fixed Income Fund, Kuza Money Market Fun...
34
Etica Unit Trust Funds which has the following...
[Etica Money Market Fund(USD), Etica Fixed Inc...
35
Lofty Corban Unit Trust Scheme
[Lofty Corban Equity Fund, Lofty Corban Specia...
36
Faida Unit Trust Funds
[Hazina Bond Fund, Angaza Money Market Fund, K...
37
Taifa Unit Trust Funds
[Taifa Money Market Fund (KES), Taifa Miney Ma...
38
Stanbic Unit Trust Funds
[Stanbic Money Market Fund, Stanbic Fixed Inco...
39
Spearhead Africa Infrastructure (Special) Fund
[Spearhead Africa Infrastructure (Special) Fund]
40
Rencap Unit Trust Scheme
[Rencap Money Market Fund(KES), Rencap Balance...
41
Mayfair umbrella Collective investment scheme
[Mayfair Money Market Fund, Mayfair Fixed Inco...
42
Investcent Partners Trust Fund
[Investcent Multi Asset Special Fund(KES), Inv...
43
Investcent Alternative Investment Fund
[Investcent Hedge Fund (KES)]
44
ICEA LION Collective Investment Scheme
[ICEA LION Money Market Fund, ICEA LION Equity...
45
GCIB Unit Trust Scheme
[GCIB Money Market Fund, GCIB Fixed Income Fun...
46
CPF Unit Trust Funds
[CPF Money Market Fund, CPF Bond Fund, CPF USD...
47
Arvocap Unit Trust Scheme
[Arvocap Money Market Fund, Arvocap Ngao Fixed...
48
MyXENO Unit Trust Scheme
[Xeno Kenya Money Market Fund, Xeno Kenya Bond...
As of 6 October 2024, there are 49 unique and approved unit trust schemes in Kenya, regulated by the Capital Markets Authority (CMA). The management of these schemes involves a complex ecosystem of financial institutions, each playing a distinct role:
Approved Fund Managers: These are entities specifically licensed by the CMA to manage collective investment schemes. They are responsible for making investment decisions and managing the day-to-day operations of the funds 5.
Investment Banks: Investment banks are not the traditional commercial banks, but rather CMA-approved institutions 6 that can engage in activities such as underwriting, market making, and fund management. For example, Genghis Capital Limited is listed by the CMA as an investment bank and manages its own unit trust fund, the Gencap Hela Imara Money Market Fund 7.
Commercial Banks with Asset Management Arms: Traditional banks may establish separate entities for asset management. For instance, KCB Bank has KCB Asset Management, which is approved by the CMA to manage unit trusts8. Please note that there is also KCB Investment Bank910.
Non-Financial Companies with Investment Products: Some companies outside the traditional financial sector have entered the investment market. A notable example is the Mali Money Market Fund 1112, owned wholly or in part by Safaricom PLC, Kenya’s largest telecommunications company 13. While Safaricom is not a licensed fund manager, they have partnered with Genghis Capital Limited to administer the Mali MMF 14.
This complex landscape can sometimes lead to potential conflicts of interest. For instance, when an investment bank like Genghis Capital manages both its own funds and third-party funds like the Mali MMF, it raises questions about prioritization and fair treatment of all clients. An additional layer of complexity arises when commercial banks (or their subsidiaries), such as KCB15, offer unit trust investment options alongside traditional savings and fixed deposit accounts. This dual offering presents a potential conflict of interest. Banks typically earn higher profits from traditional deposit accounts compared to the fees generated from managing unit trusts. This raises questions about how banks advise their clients on savings options. While money market funds often provide better returns for savers, banks might have an inherent incentive to promote their own deposit products. This situation underscores the importance of financial literacy and independent advice for consumers navigating these choices. Investors should be aware of this potential conflict and critically evaluate the recommendations they receive, considering whether the advice aligns more with their own financial interests or those of the bank.
The investment landscape is constantly evolving, with fund managers occasionally modifying their product offerings. A notable example is Zimele Asset Management’s decision to convert its Money Market Fund into a Fixed Income Fund 16. This transition underscores the fluid nature of investment products in Kenya. Zimele’s clients were not given the option to retain their investments in the original Money Market Fund structure. Some investors expressed dismay at discovering their Money Market Accounts had been altered without their knowledge 17, suggesting a lack of comprehensive public communication about the change. This situation emphasizes the importance of regularly monitoring one’s investments to stay informed about any modifications that may affect them.
Given this complexity, potential investors should exercise caution and conduct thorough due diligence before committing their funds. This includes:
Verifying the regulatory status of the fund and its manager with the CMA
Understanding the fund’s investment strategy and associated risks
Reviewing the fund’s performance history and fee structure
Investigating any potential conflicts of interest
Seeking independent financial advice if necessary
Warning
Always approach investments with caution, especially when important information is missing, unclear, or overly complicated. Remember that higher returns often come with higher risks, and past performance does not guarantee future results.
Scheme Performance Data Collection
Challenges in Data Accessibility
The Kenyan financial regulatory environment mandates that unit trust schemes publish their daily yields in two national newspapers. However, this requirement presents several challenges for comprehensive data collection and analysis:
Limited Digital Presence: Many newspapers lack a substantial digital archive, necessitating physical access to print copies for data retrieval.
Cost Barriers: Accessing historical data often involves purchasing old newspaper records, making large-scale data collection financially prohibitive.
Time-Intensive Process: Manually gathering data from physical newspapers is a labor-intensive task, impractical for long-term, comprehensive analysis.
Inconsistent Reporting: Not all fund managers consistently publish their yields, leading to gaps in the data.
These factors collectively create a significant barrier to accessing and analyzing comprehensive, historical performance data for Kenyan unit trust schemes.
Cytonn Research: A Valuable Data Source
In light of these challenges, Cytonn Fund Managers’ research publications have emerged as an invaluable resource. Since 2014, Cytonn has been conducting and freely publishing market research at https://cytonnreport.com/18. Key aspects of this data source include:
Comprehensive Coverage: Over 600 reports covering various aspects of the Kenyan financial market.
Historical Data: Consistent reporting since 2014, providing a substantial historical dataset.
Free Access: Public availability of the reports, removing financial barriers to data access.
Aggregated Information: Cytonn’s reports often include compiled data from multiple sources, offering a more comprehensive view of the market.
While some fund managers publish current yield data on their websites19, the lack of historical data limits the usefulness of these sources for trend analysis and comprehensive research.
Data Collection Methodology
Given the richness and accessibility of Cytonn’s research, we adopted the following approach for data collection:
Ethical Considerations: Our crawling process was designed to respect Cytonn’s server resources, avoiding any disruption to their services.
Data Extraction: We implemented a process to extract relevant tables and data points from each report.
Data Aggregation: The extracted information was compiled into a structured dataset suitable for analysis.
Compliance with Terms of Service: We carefully reviewed Cytonn’s terms of service to ensure our use of the data aligns with their fair use policy22.
Screenshots of Cytonn Reports
cytonn.com page
Show Code
# Take a screenshotawait web_screenshot_async("https://cytonn.com/researches", width =1500, height =1200)
cytonnreport.com page
Show Code
asyncdef cytonnreport_fn(page: Page):await page.wait_for_selector('.grid-x > .pagination')await asyncio.sleep(1)# Take a screenshotawait web_screenshot_async("https://cytonnreport.com/research", action = cytonnreport_fn, width =1500, height =1200)
Money Market Fund Yield Table
Show Code
# Define a function that selects a table by its header textdef select_table_by_title(target_header_text: str):# Define a nested asynchronous function that takes a Page object as an argumentasyncdef fn(page: Page):# Wait for any table element to be present on the pageawait page.wait_for_selector('table')# Query and collect all table elements on the page table_elements =await page.query_selector_all('table')# Iterate through each table elementfor table_element in table_elements:# Query and collect all header cells in the current table table_headers =await table_element.query_selector_all('thead tr td')# Iterate through each header cellfor table_header in table_headers:# Extract the text content of the current header cell header_text:str=await page.evaluate('(element) => element.textContent', table_header)# Check if the header text starts with the target textif header_text.startswith(target_header_text):# If a match is found, return the current table elementawait asyncio.sleep(1)return table_elementreturn fnawait web_screenshot_async(# URL to take a screenshot of"https://cytonnreport.com/research/cytonn-monthly-",# Action deciding WHAT (element) or WHEN (eg: click) to take the screenshot action = select_table_by_title('Cytonn Report: Money Market Fund Yield'), width =500, crop_options = { 'bottom': 600 })
asyncdef get_all_cytonn_reports(per_page_count: int=10):""" Retrieves all Cytonn reports from the Cytonn Report website. Args: per_page_count (int, optional): The number of reports to retrieve per page. Defaults to 10. Returns: list: A list of all the retrieved reports. """ page, close_playwright =await get_browser_page_async() reports_url ="https://cytonnreport.com/get/allreports" reports_headers: dict=None reports_method: str=Noneasyncdef handle_route(route: Route):nonlocal reports_headersnonlocal reports_method reports_headers = route.request.headers.copy() reports_method = route.request.methodawait route.continue_()asyncdef get_cytonn_reports(current_page: int): javascript_fetch_fn =f''' async () => {{ try {{ const response = await fetch( "{reports_url}", {{ "headers": {json.dumps(reports_headers)}, "method": "{reports_method}", "body": {json.dumps(json.dumps( {"pagination": {"per_page": per_page_count, "current_page": current_page } }))}, "referrer": "https://cytonnreport.com/research", "referrerPolicy": "no-referrer-when-downgrade", "mode": "cors", "credentials": "include"}}); if (!response.ok) {{ throw new Error(`HTTP error! status: ${{response.status}}`);}} const json = await response.json(); return json;}} catch (error) {{ console.error('Fetch error:', error); throw error; // Re-throw to allow calling code to handle it}}}} '''returnawait page.evaluate(javascript_fetch_fn)await page.route(reports_url, handle_route)# Navigate to the desired URLawait page.goto("https://cytonnreport.com/research")whilenot reports_headers:await asyncio.sleep(1) current_page =1 all_reports = [] pbar: tqdm =NonewhileTrue: reports_response =await get_cytonn_reports(current_page) reports = reports_response['data'] if reports_response else []iflen(reports) >0: total = reports_response['total'] pbar = pbar or tqdm(total=total) pbar.update(len(reports)) all_reports.extend(reports) last_page = reports_response['last_page']if last_page == current_page:break current_page +=1else:breakawait asyncio.sleep(0.4)await close_playwright()if pbar: pbar.close()return all_reportsall_cytonn_reports =await get_all_cytonn_reports()print(f'There are {len(all_cytonn_reports)} reports')
100%|██████████| 654/654 [03:18<00:00, 3.29it/s]
There are 654 reports
Show Code
# converting the JSON into dataframeall_cytonn_reports_df = pd.DataFrame(all_cytonn_reports)all_cytonn_reports_df.head(3)
id
slug
name
author
researchdate
active
created_by
series_id
campaign
sent
...
title
category
summary
body
date_holder
date
creator
status
series
topics
0
737
social-health-insurance
Social Health Insurance Fund (SHIF) Review, an...
Research Team
2024-10-13
1
95
1
2500
1
...
Social Health Insurance Fund (SHIF) Review, an...
Investment Reports
<p><strong>Money Markets, T-Bills Primary Auct...
<p><strong>Money Markets, T-Bills Primary Auct...
13
13 October, 2024
Fredrick Maore
Active
{'id': 1, 'name': 'Cytonn Weekly', 'descriptio...
[{'id': 2480, 'name': 'Fixed Income', 'slug': ...
1
736
cytonn-q3-2024-markets-review
Cytonn Q3’2024 Markets Review
Cytonn Research
2024-10-06
1
89
1
2491
1
...
Cytonn Q3’2024 Markets Review
Investment Reports
<p><strong>Global Economic Growth: </strong></...
<p><strong>Global Economic Growth: </strong></...
6
6 October, 2024
James Kirira
Active
{'id': 1, 'name': 'Cytonn Weekly', 'descriptio...
[{'id': 2473, 'name': 'Global Markets Review',...
2
735
cytonn-note-on-1
Cytonn Note on the 8th October 2024 Monetary P...
Research Team
2024-10-03
1
92
3
None
0
...
Cytonn Note on the 8th October 2024 Monetary P...
Investment Reports
<p>The Monetary Policy Committee (MPC) is set ...
<p>The Monetary Policy Committee (MPC) is set ...
3
3 October, 2024
Edna Dande
Active
{'id': 3, 'name': 'MPC Notes', 'description': ...
[{'id': 2472, 'name': 'Cytonn Note on the 8th ...
3 rows × 30 columns
As can be observed, the dataset above is a bit complex and diffucult to uderstand or analyze. This is because alot of information is contained in the reports.
Show Code
# This cofirms all the records have are uniquelen(all_cytonn_reports_df), len(all_cytonn_reports), len(all_cytonn_reports_df['id'].unique())
(654, 654, 654)
Explore and Clean the Dataset
The goal here is to extract the details of Effective/Norminal/Annual Rate of Money Market Funds (KES) and Assets Under Management for the entire schemes
Preview the Columns
Show Code
all_cytonn_reports_df.iloc[0]
id 737
slug social-health-insurance
name Social Health Insurance Fund (SHIF) Review, an...
author Research Team
researchdate 2024-10-13
active 1
created_by 95
series_id 1
campaign 2500
sent 1
thumbnail None
created_at 2024-10-13 13:11:00
meta_title None
meta_keywords None
meta_description None
og_description None
url https://cytonnreport.com/research/social-healt...
og_image https://cytonnreport.com/storage/research/tmpp...
updated_at 2024-10-13 14:33:46
deleted_at None
title Social Health Insurance Fund (SHIF) Review, an...
category Investment Reports
summary <p><strong>Money Markets, T-Bills Primary Auct...
body <p><strong>Money Markets, T-Bills Primary Auct...
date_holder 13
date 13 October, 2024
creator Fredrick Maore
status Active
series {'id': 1, 'name': 'Cytonn Weekly', 'descriptio...
topics [{'id': 2480, 'name': 'Fixed Income', 'slug': ...
Name: 0, dtype: object
Below is a tree structure of one record, to visualize the objects and their inner properties
Show Code
json_structure = json2txttree(all_cytonn_reports[:1])min_topics =min(len(i.get('topics', [])) for i in all_cytonn_reports)max_topics =max(len(i.get('topics', [])) for i in all_cytonn_reports)json_structure = json_structure.replace('└─ (array)', f'└─ (array) [{len(all_cytonn_reports)} items]')json_structure = json_structure.replace('"topics" (array)', f'"topics" (array) [between {min_topics} - {max_topics} items]')print(json_structure)
A full report is formed by articles. Each topics is a subsection, with title being the header and body being the content. We will merge all bodies from the articles to form the entire report HTML, which we will parse to extract the Money Market Funds yields tables. In addition, we are also going to add the main body and main summary and topics summary to encure we capture any table we might miss.
Show Code
CYTONN_RECORD_LITERALS = Literal['summary', 'body', 'topics', 'researchdate']def get_report_HTML(report: dict[CYTONN_RECORD_LITERALS, Any]) ->str: summary_html = report['summary'] body_html = report['body'] topics_html =''.join([f"{i['summary']}\n\n{i['body']}"for i in report['topics']])returnf"{summary_html}\n{body_html}\n{topics_html}"# from IPython.display import HTML# HTML(get_report_HTML(all_cytonn_reports[0]))
Parsing Dates
There are some summary tables that have dates such as Q1'2023, Q1'2023 (%), FY'2023, FY'2023 (%), Q1'2024, Q1'2024 (%)
Show Code
await web_screenshot_async(# URL to take a screenshot of"https://cytonnreport.com/research/q1-2024-unit-trust-funds-performance-note",# Action deciding WHAT (element) or WHEN (eg: click) to take the screenshot action = select_table_by_title('Cytonn Report: Assets Under Management (AUM) for the Approved Collective Investment Schemes'), width =700, screenshot_options =None, crop_options = { 'bottom': 400 })
Below function will help parse such time ranges:
Show Code
def get_date_range(month, year):# Convert month name to number month_num = [i.lower() for i in month_abbr].index(month.lower())\iflen(month) >0andany([i.lower() == month.lower() for i in month_abbr])\else datetime.strptime(month, '%B').month# Get the last day of the month _, last_day = monthrange(int(year), month_num)# Create date objects for the first and last day of the month start_date = date(int(year), month_num, 1) end_date = date(int(year), month_num, last_day)return start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')parse_date_pattern_months = ("JAN|JANUARY|FEB|FEBRUARY|MAR|MARCH|APR|APRIL|MAY|""JUN|JUNE|JUL|JULY|AUG|AUGUST|SEP|SEPTEMBER|""OCT|OCTOBER|NOV|NOVEMBER|DEC|DECEMBER")parse_date_pattern =rf'(?:(\d{{2}})[_|\s|-]*)?({parse_date_pattern_months})[_|\s|-]*(\d{{4}})'def parse_fiscal_period_dates(date_string: str) -> (tuple[str, str] |None):""" This function parses a date string representing a fiscal period (Fiscal/Financial Year, Quarter, or Half-year), year, or date and returns the corresponding start and end dates. """ extracted_date = re.search('^'+ parse_date_pattern +'$', date_string, re.IGNORECASE)if extracted_date: search_date, search_month, search_year = extracted_date.groups()if search_date: month_num = datetime.strptime(search_month, '%B').monthreturn datetime(int(search_year), month_num, int(search_date)).strftime('%Y-%m-%d')else:return get_date_range(search_month, search_year)if re.match(r"^\d{4}$", date_string, re.IGNORECASE): date_string =f"FY'{date_string}"# Define a regex pattern to match fiscal periods (FY, Q1-Q4, H1-H2) followed by a year, or just an year pattern =r"^(FY|Q[1-4]|H[1-2])['|_|\s]*(\d{4})$"# Try to match the input string against the pattern match = re.match(pattern, date_string, re.IGNORECASE)# If no match is found, return Noneifnot match:returnNone# Extract the period type and year from the match period, year = match.groups() year =int(year)# Handle Fiscal Year (FY) caseif period.upper() =='FY': start_date = datetime(year, 1, 1) end_date = datetime(year, 12, 31)# Handle Quarter (Q1-Q4) caseselif period.upper().startswith('Q'): quarter =int(period[1]) start_month = (quarter -1) *3+1 start_date = datetime(year, start_month, 1)# Calculate end date of the quarter end_date = start_date.replace(month=start_month +2) + timedelta(days=32) end_date = end_date.replace(day=1) - timedelta(days=1)# Handle Half-year (H1-H2) caseselif period.upper().startswith('H'): half =int(period[1]) start_month = (half -1) *6+1 start_date = datetime(year, start_month, 1)# Calculate end date of the half-year end_date = start_date.replace(month=start_month +5) + timedelta(days=32) end_date = end_date.replace(day=1) - timedelta(days=1)# Return start and end dates formatted as stringsreturn (start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d'))def TEST_parse_fiscal_period_dates():# Test the function test_dates = ["FY'2019", "Q1'2020", "H1'2019", "fy 2018", "q32021", "h2_2022", "2020", '2019','JUNE_2020', '01_NOVEMBER_2017', "H3'2020" ]for expanding_value in test_dates: result = parse_fiscal_period_dates(expanding_value)if result:print(f"{expanding_value}: {result}")else:print(f"{expanding_value}: Invalid format")TEST_parse_fiscal_period_dates()
Parsing a Effective Annual Rate(KES Money Market Fund) and Total Assets Under Management (Collective Investment Schemes)
The Extracted_Scheme_Entry class below represents and validates a financial record entry. It validates record type (Assets Under Management or Effective Annual Rate), date, value, and scheme. The class also maintains lists of non-existent schemes and invalid records. Assets under management (AUM) is the market value of the investments managed by the fund manager on behalf of clients, inluding MMF, FIXED, balanced, equity, etc.
Show Code
class Extracted_Scheme_Entry:""" A class to represent and validate financial entry information. Class Attributes: INVALID_FUNDS (list[str]): Stores funds not found in the mapping. INVALID_DATES (list[str]): Stores entry dates not valid. INVALID_VALUES (list[str]): Stores entry values not valid. TYPE_ASSETS_UNDER_MANAGEMENT (str): Constant for Assets Under Management type. TYPE_EFFECTIVE_ANNUAL_RATE (str): Constant for Effective Annual Rate type. """ INVALID_SCHEMES: list[str] = [] INVALID_DATES: list[str] = [] INVALID_VALUES: list[str] = [] TYPE_ASSETS_UNDER_MANAGEMENT: str='ASSETS_UNDER_MANAGEMENT'# Assets Under Management TYPE_EFFECTIVE_ANNUAL_RATE: str='EFFECTIVE_ANNUAL_RATE'# Effective Annual Ratedef__init__(self, entry_type: Literal['ASSETS_UNDER_MANAGEMENT', 'EFFECTIVE_ANNUAL_RATE'], entry_date: str, entry_value: str, entry_scheme: str, scheme_filter_function: Callable[[str], list[str]]):""" Initialize a RecordEntry instance with validated attributes. Args: entry_type (str): Type of the record (TYPE_ASSETS_UNDER_MANAGEMENT or TYPE_EFFECTIVE_ANNUAL_RATE). entry_date (str): Date of the record (2024-03-01) or Financial period (H1'2024). entry_value (str): Value of the record. entry_scheme (str): Name of the MMF(KES) fund fund_manager_filter_predicate (Callable): A predicate to filter and return matched MMF(KES) fund for validation. """self.entry_type = Extracted_Scheme_Entry.validate_type(entry_type)self.entry_date = Extracted_Scheme_Entry.validate_date(entry_date)self.entry_value = Extracted_Scheme_Entry.validate_value(entry_value)self.entry_scheme = Extracted_Scheme_Entry.validate_scheme(entry_scheme, scheme_filter_function)ifself.entry_date isNone: Extracted_Scheme_Entry.INVALID_DATES.append(entry_date)ifself.entry_value isNone: Extracted_Scheme_Entry.INVALID_VALUES.append(entry_value)ifself.entry_scheme isNone: Extracted_Scheme_Entry.INVALID_SCHEMES.append(entry_scheme)def is_valid(self) ->bool:""" Check if the record is valid (all attributes are non-empty). """ is_valid =\bool(self.entry_type) \andbool(self.entry_date) \andbool(self.entry_value) \andbool(self.entry_scheme)return is_valid@staticmethoddef validate_scheme(value: str, filter_predicate: Callable[[str], list[str]]) ->str|None:""" Validate and standardize the date or financial period. """try: value =str(value or'').lower()# These represent USD MMF's EXCLUDES = ['Dollar', 'USD'] is_USD_MMF =any((exclude.lower() in value) for exclude in EXCLUDES)ifnot is_USD_MMF: names = filter_predicate(value)iflen(names) ==1:return names[0]iflen(names) >1:print(f'"{value}" has more that two matches! {names}')returnNoneexcept:returnNone@staticmethoddef validate_date(value: str) ->str|tuple[str, str]|None:""" Validate and standardize the date """try:return parse_fiscal_period_dates(value)\or datetime.strptime(value, "%Y-%m-%d").strftime('%Y-%m-%d')\orNoneexcept:returnNone@staticmethoddef validate_value(value: str|float) ->str|None:""" Validate and clean the entry value. """try:iftype(value) ==float:return value# remove percentage sign value = value.rstrip('%')# remove comma and white space value =''.join([i for i in value if i notin [' ', ',', '-']])returnfloat(value) iflen(value) >0elseNoneexcept:returnNone@staticmethoddef validate_type(value: str) -> Literal['ASSETS_UNDER_MANAGEMENT', 'EFFECTIVE_ANNUAL_RATE']:""" Validate the record type. Args: value (str): The record type to validate. Raises: TypeError exception. """ value = (value or'').upper()if value in [Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT, Extracted_Scheme_Entry.TYPE_EFFECTIVE_ANNUAL_RATE]:return valueraiseTypeError(f"{value} is not proper entry Type!")def TEST_MoneyMarketFund_KES_RecordEntry():# Test the class test_cases = [ {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "2024-03-01","entry_value": "1,000,000","entry_scheme": "britam", }, {"entry_type": Extracted_Scheme_Entry.TYPE_EFFECTIVE_ANNUAL_RATE,"entry_date": "H1'2024","entry_value": "5.5%","entry_scheme": "old mutual", }, {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "invalid-date","entry_value": "1,000,000","entry_scheme": "sanlam","invalid": "invalid date" }, {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "2024-03-01","entry_value": "invalid-value","entry_scheme": "britam","invalid": "invalid value" }, {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "2024-03-01","entry_value": "1,000,000","entry_scheme": "unknown scheme","invalid": "unmapped scheme" }, {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "2024-03-01","entry_value": "1,000,000","entry_scheme": "britam sanlam","invalid": "2 funds matched" }, {"entry_type": Extracted_Scheme_Entry.TYPE_ASSETS_UNDER_MANAGEMENT,"entry_date": "2024-03-01","entry_value": "1,000,000","entry_scheme": "Britam USD Dollar Fund","invalid": "USD MMF" }, ]# Define the fund filter function test_fund_map = [ ('Britam MMF(KES)', ['britam', 'british-american', 'british', 'american'] ), ('UAP Old Mutual MMF(KES)', ['old mutual', 'uap old mutual', 'uap'] ), ('Sanlam MMF(KES)', ['sanlam', 'sanlam investments'] ) ]def test_fund_filter(value: str): value = value.lower() names = [name for name, aliases in test_fund_map ifany(alias in value for alias in aliases)]return names# Run testsfor test_case in test_cases: entry = Extracted_Scheme_Entry( test_case["entry_type"], test_case["entry_date"], test_case["entry_value"], test_case["entry_scheme"], test_fund_filter ) cases = [entry.entry_date, entry.entry_value, entry.entry_scheme] invalid =f" ({test_case.get('invalid')})"if test_case.get('invalid') else''print(f"Valid: {entry.is_valid()}{invalid}, {cases}")# Print invalid entriesprint("\nInvalid Funds:", Extracted_Scheme_Entry.INVALID_SCHEMES)print("\nInvalid Dates:", Extracted_Scheme_Entry.INVALID_DATES)print("\nInvalid Values:", Extracted_Scheme_Entry.INVALID_VALUES) Extracted_Scheme_Entry.INVALID_SCHEMES = [] Extracted_Scheme_Entry.INVALID_DATES = [] Extracted_Scheme_Entry.INVALID_VALUES = []TEST_MoneyMarketFund_KES_RecordEntry()
We then create a fund collective schemes map with a tuple of name and aliases because the records don’t have a simple or stardard naming in the Cytonn reports. As such, we need to use very unique and simple names that we can use to match abitrary Money market funds names from the crawled data.
Show Code
SCHEME_NAME_ALIAS_MAP = [# The African Alliance (AA) Kenya Shillings Fund is a money market fund by # African Alliance Kenya Investment Bank Limited (the fund manager) # launched on 27th April 2015.# https://centwarrior.com/aa-kenya-shillings-fund/# https://www.linkedin.com/posts/centwarrior_aa-kenya-shillings-fund-explained-in-2024-activity-7169322082814705664-8nwu?utm_source=share&utm_medium=member_desktop# https://cytonn.com/topicals/investment-risk-analysis ('African Alliance Kenya Unit Trust Scheme', ['african', 'alliance', 'aa kenya'] ), ('British-American Unit Trust Scheme', ['britam', 'british-american', 'british', 'american'] ), ('NCBA Unit Trust Funds', ['ncba', 'cba', 'commercial bank of africa'] ), ('Zimele Unit Trust Scheme', ['zimele'] ), ('ICEA Unit Trust Scheme', ['icea'] ), ('Standard Investment Trust Funds', ['standard', 'mansa'] ), ('CIC Unit Trust Scheme', ['cic'] ), ('Madison Unit Trust Fund', ['Madison', 'madisson'] ), ('Dyer and Blair Unit Trust Scheme', ['dyer', 'blair'] ), ('Amana Unit Trust Funds Scheme', ['amana'] ), ('Diaspora Unit Trust Scheme', ['diaspora'] ), ('First Ethical Opportunities Fund', ['ethical', # 'first', 'opportunities' ] ),# https://www.cma.or.ke/licensees-market-players/# https://genghis-capital.com/asset-management/money-market-fund/ ('Genghis Unit Trust Funds', ['hela','genghis', 'hazina', 'hisa', 'iman', 'gencap', 'compliant', 'eneza', 'genCap', 'imara'] ),# https://www.businessdailyafrica.com/bd/markets/capital-markets/safaricom-s-mali-unit-trust-asset-base-hits-sh1-4bn--4582142 ('Mali Money Market Fund', ['mali'] ), ('Sanlam Unit Trust Scheme', ['sanlam'] ), ('Nabo Africa Funds', ['nabo'] ), ('Old Mutual Unit Trust Scheme', ['mutual', 'old', 'Faulu'] ),# https://equitygroupholdings.com/ke/investor-relations/eib# https://www.cma.or.ke/licensees-market-players/ ('Equity Investment Bank Collective Investment Scheme', ['equity'] ),# https://www.cma.or.ke/licensees-market-players/ ('Dry Associates Unit Trust Scheme', ['dry associates', 'dry', 'associates'] ), ('Co-op Trust Fund', ['co-op', 'gratuity', 'Coop'] ), ('Apollo Unit Trust Scheme', ['aggressive', 'apollo'] ), ('Cytonn Unit Trust Scheme', ['cytonn'] ), ('Orient Umbrella Collective Investment Scheme (formerly Alphafrica Umbrella Fund)', ['orient', 'kasha', 'alpha', 'alphafrica'] ), ('Wanafunzi Investment Unit Trust Fund', ['wanafunzi'] ), ('Absa Unit Trust Funds', ['absa'] ), ('Jaza Unit Trust Fund', ['jaza'] ), ('Masaru Unit Trust Scheme', ['masaru'] ), ('ADAM Unit Trust Scheme', ['adam'] ), ('KCB Unit Trust Scheme (formerly Natbank Unit Trust Scheme)', ['kcb', 'natbank'] ), ('GenAfrica Unit Trust Scheme', ['genafrica'] ), ('Amaka Unit Trust (Umbrella) Scheme', ['amaka'] ), ('Jubilee Unit Trust Collective Investment Scheme', ['jubilee'] ),# Previusly "Liberty Pension Services Limited"# https://enwealth.co.ke/about/#governance# https://www.linkedin.com/company/enwealth-kenya/?originalSubdomain=ke# https://enwealth.co.ke/capital/enwealth-money-market-fund/ ('Enwealth Capital Unit Trust Scheme', ['enwealth'] ), ('Kuza Asset Management Unit Trust Scheme', ['kuza', 'momentum'] ),# https://www.linkedin.com/company/arvocap-asset-managers/# https://www.businessdailyafrica.com/bd/markets/avocarp-latest-to-enter-kenya-s-asset-management-market-4644586 ('Arvocap Unit Trust Scheme', ['arvocap'] ), ('Etica Capital Limited', ['etica'] ),# https://licensees.cma.or.ke/licenses/15/ ('Mayfair umbrella Collective investment scheme', ['mayfair'] ), ('Lofty Corban Unit Trust Scheme', ['lofty-corban', 'lofty', 'corban'] ), ('CPF Unit Trust Funds', ['cpf', 'cpof'] ), ('Stanbic Unit Trust Funds', ['stanbic'] ), ('MyXENO Unit Trust Scheme', ['myxeno'] ),################################################## UNVERIFIED COLLECTIVE INVESTMENTS ################################################## ('Metropolitan Canon Asset Managers Limited', ['metropolitan'] ), ('FCB Capital Limited', ['fcb'] ), ('Fusion Investment Management Limited', ['fusion'] ), ('Altree Capital Kenya Limited', ['altree'] ), ('CFS Asset Management Limited', ['cfs'] ), ('I&M Capital Limited', ['i&m'] ), ('Globetec Asset Managers Limited', ['globetec'] ), (# https://cytonnreport.com/research/cytonn-q3-2024-markets-review# https://www.businessdailyafrica.com/bd/markets/capital-markets/fintech-start-up-ndovu-gets-cma-nod-to-set-up-money-market-fund-4548970'Waanzilishi Capital Limited', ['waanzilishi', 'ndovu'] ), ('Star Capital Management Limited', ['star'] ),# Unverified and NO online presense! ('Stanlib Kenya', ['stanlib'] ),]SCHEME_NAME_ALIAS_MAP
Show Code
def scheme_filter(value: str) ->list[str]: value = value.lower() names = [ name for name, aliasesin SCHEME_NAME_ALIAS_MAP ifany((alias.lower() in value) for alias in aliases) ]return names# Testdef TEST_scheme_filter(fund_manager: str): name = scheme_filter(fund_manager)print(f"{fund_manager} => {name}")TEST_scheme_filter('KCB Fund Managers')TEST_scheme_filter('Cytonn Fund Mangers')TEST_scheme_filter('Nabo')TEST_scheme_filter('madison')
A list of functions that process the matched tables. Each function should essentially process a single column. The function receives three parameters: a dataframe row data (a table row entry), the entire record from which the table was extracted from, and any other_params. This first parameter is useful to capture the entry value, and the second is important to capture the date of the record if not provided in the table. The table row entry values are a dictionally named with the table names used to match the tables. Callback function returns a Extracted_Scheme_Entry
def compare_table_and_given_column_names( given_table_columns, header_column_names, *, use_optional_columns =True): table_columns = [ table_column for table_column in given_table_columns iftype(table_column) ==stror use_optional_columns ornot table_column.get('optional', False) ] table_column_strs = [ table_column iftype(table_column) ==strelse table_column['column_name']for table_column in table_columns ] is_match_new =len(header_column_names) ==len(table_columns) and\all([ normalize_and_compare_two_strs(header_column_name, table_column) iftype(table_column) ==strelse table_column['predicate'](hacky_normalizer(header_column_name))for header_column_name, table_column inzip(header_column_names, table_columns) ])if is_match_new ornot use_optional_columns: other_params = { other_param_key: other_param_value(hacky_normalizer(header_column_name))for header_column_name, table_column inzip(header_column_names, table_columns)iftype(table_column) !=strand table_column.get('other_params', None)for other_param_key, other_param_valuein table_column['other_params'].items() } if is_match_new else { }return is_match_new, table_column_strs, other_paramsreturn compare_table_and_given_column_names( given_table_columns, header_column_names, use_optional_columns =False)def get_table(table: Tag, extraction_map):for tag in table.find_all(True): tag.attrs = {} # remove tags such as colspan and rowspan# Iterate through predefined extraction mappingsfor (given_table_columns, extractor_callbacks) in extraction_map: clean_up_tasks: list[Callable[[], None]] = [] header_tr_s: list[Tag] = table.select('thead tr') is_match =False is_match_columns = [] is_match_other_params = {}# Check if table headers match the expected columnsfor header_tr in header_tr_s: header_column_names: list[str] = [i.get_text(strip=True) for i in header_tr.find_all('td')] is_match_new, pure_str_columns, other_params = compare_table_and_given_column_names( given_table_columns, header_column_names)if is_match_new: is_match_columns = pure_str_columns is_match_other_params = other_params or {}# If not a match, add to cleanup tasks.# We add cleap tasks here to delay deleting the table headers before we decide # that this table is matched. When the given columns are matched, the other columns# are deleted to ensure the dataframe has one column.ifnot is_match_new: clean_up_tasks.append(header_tr.extract) is_match = is_match or is_match_new# If a match is found, process the tableif is_match:try:# Execute cleanup tasks [clean_up_task() for clean_up_task in clean_up_tasks]# Convert table to DataFrame table_df = pd.read_html(io.StringIO(str(table)))[0] table_df.columns = is_match_columnsreturn (table_df, extractor_callbacks, is_match_other_params)exceptExceptionas e:print('error', e, table)continuereturn (None, None, None)def is_valid_dataframe(df: pd.DataFrame |None) ->bool:return df isnotNoneandnot df.emptyDEBUG_OPTIONS =dict[ Literal['log_unmatched_table', 'log_invalid_columns', 'log_extracted_valid','log_extracted_invalid','log_extractor_count', ], Callable[[str], None]]def get_tables(html: str, extraction_map, *, debug_options: DEBUG_OPTIONS = {}): log_unmatched_table = debug_options.get('log_unmatched_table')# Parse the HTML content using BeautifulSoup parsed_html = BeautifulSoup(html, "html.parser")# Find all <table> elements in the parsed HTML and store them in a list# remove duplicates tables: list[Tag] =list({ hacky_normalizer(str(table)): table for table in parsed_html.find_all('table')}.values())# Iterate over each table found in the HTMLfor table in tables:# Generate a DataFrame and a list of extractor callbacks for each table table_df, extractor_callbacks, other_params = get_table(copy(table), extraction_map)# Check if the DataFrame is valid and not Noneif is_valid_dataframe(table_df) and table_df isnotNone:# Yield the DataFrame and the associated callbacksyield (table_df, extractor_callbacks, other_params)else:if log_unmatched_table: log_unmatched_table(str(table))def extract_frame_by_column_names( report: dict[CYTONN_RECORD_LITERALS, str], tables_html: str, extraction_map,*, debug_options: DEBUG_OPTIONS = {}, ): log_invalid_columns = debug_options.get('log_invalid_columns') log_extractor_count = debug_options.get('log_extractor_count') log_extracted_valid = debug_options.get('log_extracted_valid') log_extracted_invalid = debug_options.get('log_extracted_invalid')# Generate tables and callbacks using the get_tables function table__callback__generator = get_tables(tables_html, extraction_map, debug_options=debug_options)# Iterate over each table DataFrame and its extractor callbacksfor table_df, extractor_callbacks, other_params in table__callback__generator:if log_extractor_count: log_extractor_count(len(extractor_callbacks))iflen(extractor_callbacks) >0:# Apply each callback function to the rows of the tablefor extractor_callback in extractor_callbacks: table_rows = [ dynamic_callback(extractor_callback, raw_table_row.to_dict(), report, other_params)for _, raw_table_row in table_df.iterrows() ]# Convert the processed rows into a new DataFrame extracted_df = pd.DataFrame([vars(i) for i in table_rows if i.is_valid()])# Check if the extracted DataFrame is valid and yield itif is_valid_dataframe(extracted_df):if log_invalid_columns: __invalid_columns = [vars(i) for i in table_rows ifnot i.is_valid()]iflen(__invalid_columns) >0: log_invalid_columns(pd.DataFrame(__invalid_columns))if log_extracted_valid: log_extracted_valid(extracted_df)yield extracted_dfelif log_extracted_invalid: log_extracted_invalid(table_df)def extract_table_by_column_names( report: dict[CYTONN_RECORD_LITERALS, str], *, debug_options: DEBUG_OPTIONS = {}):# Get the HTML content of the report report_html = get_report_HTML(report)yieldfrom extract_frame_by_column_names( report, report_html, EXTRACTION_MAP, debug_options = debug_options)
Validating the Parsing
Below code extracts and parses entries from all the records and stores various metrics for validating.
Show Code
# Stores a tuple of the index of the record, the key for the log and the value loggeddebug_log_store: list[tuple[str, str, Any]] = []def extract_all_records():for index, report in tqdm(all_cytonn_reports_df.iterrows(), total=len(all_cytonn_reports_df)): debug_options = {'log_unmatched_table': lambda value_str: debug_log_store.append((index, 'log_unmatched_table', value_str)),'log_invalid_columns': lambda value_df: debug_log_store.append((index, 'log_invalid_columns', value_df)),'log_extracted_valid': lambda value_tuple_df: debug_log_store.append((index, 'log_extracted_valid', value_tuple_df)),'log_extracted_invalid': lambda value_tuple_df: debug_log_store.append((index, 'log_extracted_invalid', value_tuple_df)),'log_extractor_count': lambda value_int: debug_log_store.append((index, 'log_extractor_count', value_int)), }yieldfrom extract_table_by_column_names(report, debug_options = debug_options)extracted_records_df = pd.concat(objs = extract_all_records(), ignore_index =True)extracted_records_df
array(['Absa Money Market Fund USD', 'Average',
'Average of Top 5 Money Market Funds',
'Benchmark ( Average 91 day T- Bill + 1.0% point)',
'Benchmark (Average 182 day T- Bill + 5.0% points)',
'CIC Dollar Fund', 'Cytonn Money Market Fund USD',
'Dry Associates Money Market Fund USD', 'Industrial Average',
'Industry average', 'KCB Money Market Fund USD',
'Kuza Money Market Fund USD', 'Lofty-Corban Money Market Fund USD',
'Nabo Africa Money Market Fund USD',
'Old Mutual Dollar Money Market Fund', 'Sanlam Dollar Fund',
'Total', 'Weighted Average Growth', 'nan'], dtype='<U49')
Show Code
all( re.match(r".*(total|usd|average|dollar).*|nan", str(i), flags=re.IGNORECASE) for i in Extracted_Scheme_Entry.INVALID_SCHEMES)
EAR_fig = px.line(EAR_pivot, x=EAR_pivot.index, y=EAR_pivot.columns)EAR_fig.update_layout( height=800, margin=dict(t=100), title=dict( text="Effective Annual Rate (Percentage)", # Your title here y=0.98, # Adjust the title's vertical position x=0.5, # Center the title xanchor='center', yanchor='top' ), xaxis=dict( side="top", # This moves the x-axis to the top title="Date"# This sets the title for the x-axis ), yaxis=dict( title="Effective Annual Rate"# This sets the title for the x-axis ), legend=dict( orientation="h", # horizontal orientation yanchor="bottom", y=-4.5, # move the legend below the plot xanchor="center", x=0.5 ))EAR_fig.update_traces( hovertemplate="<br>".join(["scheme=%{fullData.name}","date=%{x|%Y-%m-%d}","annual_rate=%{y}%",# removes any additional trace information that Plotly might add by default."<extra></extra>" ]))EAR_fig.show()