{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Explore the datasets used in this cookbook\n", "\n", "These datasets includes financial data in the text format to fine-tune language model for specific financial understanding tasks.\n", "\n", "**Note:** This is an exploratory data analysis notebook. We will not perform any data cleaning or preprocessing. This is just to understand the datasets and get some insights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the Libraries\n", "\n", "Let's import the necessary libraries to explore the datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "import pandas as pd\n", "from io import StringIO\n", "from datasets import load_dataset\n", "from tqdm.notebook import tqdm" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: ../../data: File exists\n" ] } ], "source": [ "!mkdir ../../data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Financial-PhraseBank\n", "\n", "Homepage: [Financial PhraseBank](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10)\n", "\n", "Three-class (positive, negative, neutral) sentiment dataset of sentences from financial news. It consists a total of 4840 sentences from English language. The dataset is divided by agreement rate of 5-8 annotators. We used the [HF repository](https://huggingface.co/datasets/financial_phrasebank) to import the dataset.\n", "\n", "**Note:** The sentiment analysis is a well-established research area and financial phrasebank is only one dataset of this domain. If you would like to use more datasets to train the model, you can check the following links: [Financial News Dataset](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news), [Financial Tweets](https://www.kaggle.com/davidwallach/financial-tweets), [Financial News Sentiment](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news)\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Load the financial sentiment dataset\n", "fin_sentiment = load_dataset(\"financial_phrasebank\", \"sentences_50agree\") # 2.26K rows with text and labels, all agreed by annotators. \n", "# Check the training data\n", "fin_sentiment_df = pd.DataFrame(fin_sentiment['train'])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentencelabel
0According to Gran , the company has no plans t...1
1Technopolis plans to develop in stages an area...1
2The international electronic industry company ...0
3With the new production plant the company woul...2
4According to the company 's updated strategy f...2
\n", "
" ], "text/plain": [ " sentence label\n", "0 According to Gran , the company has no plans t... 1\n", "1 Technopolis plans to develop in stages an area... 1\n", "2 The international electronic industry company ... 0\n", "3 With the new production plant the company woul... 2\n", "4 According to the company 's updated strategy f... 2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the first few rows of the dataset\n", "fin_sentiment_df.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label\n", "1 2879\n", "2 1363\n", "0 604\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the distribution of the labels\n", "fin_sentiment_df['label'].value_counts()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label\n", "1 0.594098\n", "2 0.281263\n", "0 0.124639\n", "Name: proportion, dtype: float64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check the normalised distribution of the labels\n", "fin_sentiment_df['label'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# If you want to save the dataset to a csv file\n", "fin_sentiment_df.to_csv('../../data/financial_phrasebank_50agree.csv', index=False) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Financial Q&A Dataset\n", "\n", "Homepage: [FinQA](https://github.com/czyssrs/FinQA)\n", "\n", "A large scale dataset to develop analysis capability on business financials, numerical reasoning, and understand heterogenous representation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir ../../data/finqa\n", "!wget -O ../../data/finqa/train.json https://raw.githubusercontent.com/czyssrs/FinQA/main/dataset/train.json # Approximately 15MB\n", "!wget -O ../../data/finqa/dev.json https://raw.githubusercontent.com/czyssrs/FinQA/main/dataset/dev.json # Approximately 2MB\n", "!wget -O ../../data/finqa/test.json https://raw.githubusercontent.com/czyssrs/FinQA/main/dataset/test.json # Approximately 2MB" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "finqa_train = pd.read_json('../../data/finqa/train.json')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pre_textpost_textfilenametable_oritableqaidtable_retrievedtext_retrievedtable_retrieved_alltext_retrieved_all
0[interest rate to a variable interest rate bas...[fair value of forward exchange contracts afte...ADI/2009/page_49.pdf[[, October 31, 2009, November 1, 2008], [Fair...[[, october 31 2009, november 1 2008], [fair v...{'question': 'what is the the interest expense...ADI/2009/page_49.pdf-1[{'score': -0.620767951011657, 'ind': 'table_1...[{'score': 1.251369595527649, 'ind': 'text_1'}...[{'score': -0.620767951011657, 'ind': 'table_1...[{'score': 1.251369595527649, 'ind': 'text_1'}...
1[abiomed , inc ., and subsidiaries notes to co...[the remaining unrecognized compensation expen...ABMD/2012/page_75.pdf[[, Number of Shares (in thousands), Weighted ...[[, number of shares ( in thousands ), weighte...{'question': 'during the 2012 year , did the e...ABMD/2012/page_75.pdf-1[{'score': 1.944458127021789, 'ind': 'table_2'}][{'score': 1.835455536842346, 'ind': 'text_15'...[{'score': 1.944458127021789, 'ind': 'table_2'...[{'score': 1.835455536842346, 'ind': 'text_15'...
2[the following table shows annual aircraft fue...[as of december 31 , 2018 , we did not have an...AAL/2018/page_13.pdf[[Year, Gallons, Average Priceper Gallon, Airc...[[year, gallons, average priceper gallon, airc...{'question': 'what was the total operating exp...AAL/2018/page_13.pdf-2[{'score': 1.610554456710815, 'ind': 'table_1'...[{'score': -1.64792251586914, 'ind': 'text_9'}...[{'score': 1.610554456710815, 'ind': 'table_1'...[{'score': -1.64792251586914, 'ind': 'text_9'}...
3[the fair value of our grants receivable is de...[in the third quarter of 2013 , we sold our sh...INTC/2013/page_71.pdf[[(In Millions), Dec 28,2013, Dec 29,2012], [A...[[( in millions ), dec 282013, dec 292012], [a...{'question': 'what percentage of total cash an...INTC/2013/page_71.pdf-4[{'score': 2.9937365055084233, 'ind': 'table_8...[{'score': -2.141725540161133, 'ind': 'text_9'...[{'score': 2.9937365055084233, 'ind': 'table_8...[{'score': -2.141725540161133, 'ind': 'text_9'...
4[entergy louisiana , llc management's financia...[the retail electric price variance is primari...ETR/2008/page_313.pdf[[, Amount (In Millions)], [2007 net revenue, ...[[, amount ( in millions )], [2007 net revenue...{'question': 'what is the growth rate in net r...ETR/2008/page_313.pdf-3[{'score': 3.095985174179077, 'ind': 'table_6'...[{'score': -0.5041980147361751, 'ind': 'text_2'}][{'score': 3.095985174179077, 'ind': 'table_6'...[{'score': -0.5041980147361751, 'ind': 'text_2...
\n", "
" ], "text/plain": [ " pre_text \\\n", "0 [interest rate to a variable interest rate bas... \n", "1 [abiomed , inc ., and subsidiaries notes to co... \n", "2 [the following table shows annual aircraft fue... \n", "3 [the fair value of our grants receivable is de... \n", "4 [entergy louisiana , llc management's financia... \n", "\n", " post_text filename \\\n", "0 [fair value of forward exchange contracts afte... ADI/2009/page_49.pdf \n", "1 [the remaining unrecognized compensation expen... ABMD/2012/page_75.pdf \n", "2 [as of december 31 , 2018 , we did not have an... AAL/2018/page_13.pdf \n", "3 [in the third quarter of 2013 , we sold our sh... INTC/2013/page_71.pdf \n", "4 [the retail electric price variance is primari... ETR/2008/page_313.pdf \n", "\n", " table_ori \\\n", "0 [[, October 31, 2009, November 1, 2008], [Fair... \n", "1 [[, Number of Shares (in thousands), Weighted ... \n", "2 [[Year, Gallons, Average Priceper Gallon, Airc... \n", "3 [[(In Millions), Dec 28,2013, Dec 29,2012], [A... \n", "4 [[, Amount (In Millions)], [2007 net revenue, ... \n", "\n", " table \\\n", "0 [[, october 31 2009, november 1 2008], [fair v... \n", "1 [[, number of shares ( in thousands ), weighte... \n", "2 [[year, gallons, average priceper gallon, airc... \n", "3 [[( in millions ), dec 282013, dec 292012], [a... \n", "4 [[, amount ( in millions )], [2007 net revenue... \n", "\n", " qa id \\\n", "0 {'question': 'what is the the interest expense... ADI/2009/page_49.pdf-1 \n", "1 {'question': 'during the 2012 year , did the e... ABMD/2012/page_75.pdf-1 \n", "2 {'question': 'what was the total operating exp... AAL/2018/page_13.pdf-2 \n", "3 {'question': 'what percentage of total cash an... INTC/2013/page_71.pdf-4 \n", "4 {'question': 'what is the growth rate in net r... ETR/2008/page_313.pdf-3 \n", "\n", " table_retrieved \\\n", "0 [{'score': -0.620767951011657, 'ind': 'table_1... \n", "1 [{'score': 1.944458127021789, 'ind': 'table_2'}] \n", "2 [{'score': 1.610554456710815, 'ind': 'table_1'... \n", "3 [{'score': 2.9937365055084233, 'ind': 'table_8... \n", "4 [{'score': 3.095985174179077, 'ind': 'table_6'... \n", "\n", " text_retrieved \\\n", "0 [{'score': 1.251369595527649, 'ind': 'text_1'}... \n", "1 [{'score': 1.835455536842346, 'ind': 'text_15'... \n", "2 [{'score': -1.64792251586914, 'ind': 'text_9'}... \n", "3 [{'score': -2.141725540161133, 'ind': 'text_9'... \n", "4 [{'score': -0.5041980147361751, 'ind': 'text_2'}] \n", "\n", " table_retrieved_all \\\n", "0 [{'score': -0.620767951011657, 'ind': 'table_1... \n", "1 [{'score': 1.944458127021789, 'ind': 'table_2'... \n", "2 [{'score': 1.610554456710815, 'ind': 'table_1'... \n", "3 [{'score': 2.9937365055084233, 'ind': 'table_8... \n", "4 [{'score': 3.095985174179077, 'ind': 'table_6'... \n", "\n", " text_retrieved_all \n", "0 [{'score': 1.251369595527649, 'ind': 'text_1'}... \n", "1 [{'score': 1.835455536842346, 'ind': 'text_15'... \n", "2 [{'score': -1.64792251586914, 'ind': 'text_9'}... \n", "3 [{'score': -2.141725540161133, 'ind': 'text_9'... \n", "4 [{'score': -0.5041980147361751, 'ind': 'text_2... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finqa_train.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pre_textpost_textfilenametable_oritableqaidtable_retrievedtext_retrievedtable_retrieved_alltext_retrieved_all
count62516251625162516251625162516251625162516251
unique20901862211020982097620362515758558061926194
top[.][.]CME/2017/page_83.pdf[[Current assets, $1,922], [Long-term assets, ...[[current assets, $ 1922], [long-term assets, ...{'question': 'what percent did purchase issuan...ADI/2009/page_49.pdf-1[][][{'score': -0.5908989310264581, 'ind': 'table_...[{'score': 2.742092132568359, 'ind': 'text_17'...
freq206396992144062022
\n", "
" ], "text/plain": [ " pre_text post_text filename \\\n", "count 6251 6251 6251 \n", "unique 2090 1862 2110 \n", "top [.] [.] CME/2017/page_83.pdf \n", "freq 20 639 6 \n", "\n", " table_ori \\\n", "count 6251 \n", "unique 2098 \n", "top [[Current assets, $1,922], [Long-term assets, ... \n", "freq 9 \n", "\n", " table \\\n", "count 6251 \n", "unique 2097 \n", "top [[current assets, $ 1922], [long-term assets, ... \n", "freq 9 \n", "\n", " qa \\\n", "count 6251 \n", "unique 6203 \n", "top {'question': 'what percent did purchase issuan... \n", "freq 2 \n", "\n", " id table_retrieved text_retrieved \\\n", "count 6251 6251 6251 \n", "unique 6251 5758 5580 \n", "top ADI/2009/page_49.pdf-1 [] [] \n", "freq 1 440 620 \n", "\n", " table_retrieved_all \\\n", "count 6251 \n", "unique 6192 \n", "top [{'score': -0.5908989310264581, 'ind': 'table_... \n", "freq 2 \n", "\n", " text_retrieved_all \n", "count 6251 \n", "unique 6194 \n", "top [{'score': 2.742092132568359, 'ind': 'text_17'... \n", "freq 2 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finqa_train.describe()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pre_text object\n", "post_text object\n", "filename object\n", "table_ori object\n", "table object\n", "qa object\n", "id object\n", "table_retrieved object\n", "text_retrieved object\n", "table_retrieved_all object\n", "text_retrieved_all object\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "finqa_train.dtypes" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "what is the the interest expense in 2009?\n", "interest rate to a variable interest rate based on the three-month libor plus 2.05% ( 2.05 % ) ( 2.34% ( 2.34 % ) as of october 31 , 2009 ) . if libor changes by 100 basis points , our annual interest expense would change by $ 3.8 million . dollar , would have on the fair value of our forward exchange contracts as of october 31 , 2009 and november 1 , 2008: . \n", "if libor changes by 100 basis points , our annual interest expense would change by $ 3.8 million . \n", "[['', 'october 31 2009', 'november 1 2008'], ['fair value of forward exchange contracts asset ( liability )', '$ 6427', '$ -23158 ( 23158 )'], ['fair value of forward exchange contracts after a 10% ( 10 % ) unfavorable movement in foreign currency exchange rates asset ( liability )', '$ 20132', '$ -9457 ( 9457 )'], ['fair value of forward exchange contracts after a 10% ( 10 % ) favorable movement in foreign currency exchange rates liability', '$ -6781 ( 6781 )', '$ -38294 ( 38294 )']]\n" ] } ], "source": [ "entry = finqa_train.iloc[0]\n", "question = entry[\"qa\"][\"question\"]\n", "\n", "print(question)\n", "\n", "context = \"\"\n", "for ind, each_sent in entry[\"qa\"][\"model_input\"]:\n", " context += each_sent\n", " context += \" \"\n", "\n", "print(context)\n", "\n", "context = \"\"\n", "for each_con in entry[\"qa\"][\"gold_inds\"]:\n", " context += entry[\"qa\"][\"gold_inds\"][each_con]\n", " context += \" \"\n", "\n", "print(context)\n", "\n", "table = entry[\"table\"]\n", "print(table)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Flare FinQA\n", "\n", "Part of the [PIXIU](https://github.com/The-FinAI/PIXIU) project, researchers and practitioners from China, UK and US created datasets that can be ready-to-feed in the LLM prompting format. Called FLARE Evaluation datasets, they present a set of prompt-based datasets: [see HF Repo](https://huggingface.co/collections/ChanceFocus/flare-evaluation-datasets-english-6529286a147d9119a64689c0). In the Flare-FinQA, they convert the FinQA dataset to work with prompt-based strategies." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatasetDict({\n", " train: Dataset({\n", " features: ['id', 'query', 'answer', 'text'],\n", " num_rows: 6251\n", " })\n", " test: Dataset({\n", " features: ['id', 'query', 'answer', 'text'],\n", " num_rows: 1147\n", " })\n", " valid: Dataset({\n", " features: ['id', 'query', 'answer', 'text'],\n", " num_rows: 883\n", " })\n", "})\n" ] } ], "source": [ "dataset = load_dataset(\"ChanceFocus/flare-finqa\") #Approximate Size: 16MB\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Please answer the given financial question based on the context.\n", "Context: the significant changes from december 31 , 2008 to december 31 , 2009 in level 3 assets and liabilities are due to : a net decrease in trading securities of $ 10.8 billion that was driven by : 2022 net transfers of $ 6.5 billion , due mainly to the transfer of debt 2013 securities from level 3 to level 2 due to increased liquidity and pricing transparency ; and net settlements of $ 5.8 billion , due primarily to the liquidations of 2013 subprime securities of $ 4.1 billion . the change in net trading derivatives driven by : 2022 a net loss of $ 4.9 billion relating to complex derivative contracts , 2013 such as those linked to credit , equity and commodity exposures . these losses include both realized and unrealized losses during 2009 and are partially offset by gains recognized in instruments that have been classified in levels 1 and 2 ; and net increase in derivative assets of $ 4.3 billion , which includes cash 2013 settlements of derivative contracts in an unrealized loss position , notably those linked to subprime exposures . the decrease in level 3 investments of $ 6.9 billion primarily 2022 resulted from : a reduction of $ 5.0 billion , due mainly to paydowns on debt 2013 securities and sales of private equity investments ; the net transfer of investment securities from level 3 to level 2 2013 of $ 1.5 billion , due to increased availability of observable pricing inputs ; and net losses recognized of $ 0.4 billion due mainly to losses on non- 2013 marketable equity securities including write-downs on private equity investments . the decrease in securities sold under agreements to repurchase of 2022 $ 9.1 billion is driven by a $ 8.6 billion net transfers from level 3 to level 2 as effective maturity dates on structured repos have shortened . the decrease in long-term debt of $ 1.5 billion is driven mainly by 2022 $ 1.3 billion of net terminations of structured notes . transfers between level 1 and level 2 of the fair value hierarchy the company did not have any significant transfers of assets or liabilities between levels 1 and 2 of the fair value hierarchy during 2010 . items measured at fair value on a nonrecurring basis certain assets and liabilities are measured at fair value on a nonrecurring basis and therefore are not included in the tables above . these include assets measured at cost that have been written down to fair value during the periods as a result of an impairment . in addition , these assets include loans held-for-sale that are measured at locom that were recognized at fair value below cost at the end of the period . the fair value of loans measured on a locom basis is determined where possible using quoted secondary-market prices . such loans are generally classified as level 2 of the fair value hierarchy given the level of activity in the market and the frequency of available quotes . if no such quoted price exists , the fair value of a loan is determined using quoted prices for a similar asset or assets , adjusted for the specific attributes of that loan . the following table presents all loans held-for-sale that are carried at locom as of december 31 , 2010 and 2009 : in billions of dollars aggregate cost fair value level 2 level 3 .\n", "|in billions of dollars|aggregate cost|fair value|level 2|level 3|\n", "|december 31 2010|$ 3.1|$ 2.5|$ 0.7|$ 1.8|\n", "|december 31 2009|$ 2.5|$ 1.6|$ 0.3|$ 1.3|\n", ".\n", "Question: what was the growth rate of the loans held-for-sale that are carried at locom from 2009 to 2010\n", "Answer:\n", "0.97656\n" ] } ], "source": [ "entry = dataset['train'][5]\n", "query = entry[\"query\"]\n", "answer = entry[\"answer\"]\n", "\n", "print(query)\n", "print(answer)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "flare_finqa_df = pd.DataFrame(dataset['train'])\n", "flare_finqa_df.to_csv('../../data/flare_finqa_train.csv', index=False) \n", "flare_finqa_df = pd.DataFrame(dataset['valid'])\n", "flare_finqa_df.to_csv('../../data/flare_finqa_valid.csv', index=False)\n", "flare_finqa_df = pd.DataFrame(dataset['test'])\n", "flare_finqa_df.to_csv('../../data/flare_finqa_test.csv', index=False) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also directly download the HF datasets, using Github LFS and read the parquet files. See the example below:\n", "\n", "```python\n", "import pyarrow.parquet as pq\n", "\n", "# columns=['col1', 'col2'] to restrict loaded columns\n", "pds = pq.read_pandas('../../data/flare-finqa/data/train.parquet', columns=None).to_pandas()\n", "\n", "# path_or_buf='output.jsonl.gz' to output to a file instead of stdout\n", "pds.to_json(path_or_buf=\"../../data/flare_finqa_train.json\", orient='records', lines=True, date_format='iso', date_unit='us', compression='gzip')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ConvFinQA Dataset\n", "\n", "Homepage: \n", "\n", "From the creators of the FinQA, this dataset still aims to improve \"numerical reasoning\" skills of language models, presented in a conversational format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir ../../data/convfinqa\n", "!wget -O ../../data/convfinqa/data.zip https://github.com/czyssrs/ConvFinQA/raw/main/data.zip #Approximate Size: 17MB\n", "!unzip ../../data/convfinqa/data.zip -d data/convfinqa\n", "!rm ../../data/convfinqa/data.zip" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pre_textpost_textfilenametable_oritableqaidannotationqa_0qa_1
0[26 | 2009 annual report in fiscal 2008 , reve...[year ended june 30 , cash provided by operati...JKHY/2009/page_28.pdf[[, Year ended June 30, 2009], [2008, 2007], [...[[2008, year ended june 30 2009 2008, year end...{'question': 'what was the percentage change i...Single_JKHY/2009/page_28.pdf-3{'amt_table': '<table class='wikitable'><tr><t...NaNNaN
1[substantially all of the goodwill and other i...[the above unaudited pro forma financial infor...RSG/2008/page_114.pdf[[, Year Ended December 31, 2008 (Unaudited), ...[[, year ended december 31 2008 ( unaudited ),...{'question': 'what was the percent of the grow...Single_RSG/2008/page_114.pdf-2{'amt_table': '<table class='wikitable'><tr><t...NaNNaN
2[in a new business model such as the retail se...[.]AAPL/2002/page_23.pdf[[, 2002, 2001, 2000], [Net sales, $5,742, $5,...[[, 2002, 2001, 2000], [net sales, $ 5742, $ 5...{'question': 'what was the percentage change i...Single_AAPL/2002/page_23.pdf-1{'amt_table': '<table class='wikitable'><tr><t...NaNNaN
3[( 1 ) includes shares repurchased through our...[.]UPS/2009/page_33.pdf[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...{'question': 'what was the difference in perce...Single_UPS/2009/page_33.pdf-2{'amt_table': '<table class='wikitable'><tr><t...NaNNaN
4[( 1 ) includes shares repurchased through our...[.]UPS/2009/page_33.pdf[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...[[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12...NaNDouble_UPS/2009/page_33.pdf{'amt_table': '<table class='wikitable'><tr><t...{'question': 'what is the roi of an investment...{'question': 'what was the difference in perce...
\n", "
" ], "text/plain": [ " pre_text \\\n", "0 [26 | 2009 annual report in fiscal 2008 , reve... \n", "1 [substantially all of the goodwill and other i... \n", "2 [in a new business model such as the retail se... \n", "3 [( 1 ) includes shares repurchased through our... \n", "4 [( 1 ) includes shares repurchased through our... \n", "\n", " post_text filename \\\n", "0 [year ended june 30 , cash provided by operati... JKHY/2009/page_28.pdf \n", "1 [the above unaudited pro forma financial infor... RSG/2008/page_114.pdf \n", "2 [.] AAPL/2002/page_23.pdf \n", "3 [.] UPS/2009/page_33.pdf \n", "4 [.] UPS/2009/page_33.pdf \n", "\n", " table_ori \\\n", "0 [[, Year ended June 30, 2009], [2008, 2007], [... \n", "1 [[, Year Ended December 31, 2008 (Unaudited), ... \n", "2 [[, 2002, 2001, 2000], [Net sales, $5,742, $5,... \n", "3 [[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12... \n", "4 [[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12... \n", "\n", " table \\\n", "0 [[2008, year ended june 30 2009 2008, year end... \n", "1 [[, year ended december 31 2008 ( unaudited ),... \n", "2 [[, 2002, 2001, 2000], [net sales, $ 5742, $ 5... \n", "3 [[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12... \n", "4 [[, 12/31/04, 12/31/05, 12/31/06, 12/31/07, 12... \n", "\n", " qa \\\n", "0 {'question': 'what was the percentage change i... \n", "1 {'question': 'what was the percent of the grow... \n", "2 {'question': 'what was the percentage change i... \n", "3 {'question': 'what was the difference in perce... \n", "4 NaN \n", "\n", " id \\\n", "0 Single_JKHY/2009/page_28.pdf-3 \n", "1 Single_RSG/2008/page_114.pdf-2 \n", "2 Single_AAPL/2002/page_23.pdf-1 \n", "3 Single_UPS/2009/page_33.pdf-2 \n", "4 Double_UPS/2009/page_33.pdf \n", "\n", " annotation \\\n", "0 {'amt_table': '
12008year ended june 30 2009 2008year ended june 30 2009 2008year ended june 30 2009
2net income$ 103102$ 104222$ 104681
3non-cash expenses743977042056348
4change in receivables21214-2913 ( 2913 )-28853 ( 28853 )
5change in deferred revenue21943510024576
6change in other assets and liabilities-14068 ( 14068 )417217495
7net cash from operating activities$ 206588$ 181001$ 174247
year ended june 30 , cash provided by operations increased $ 25587 to $ 206588 for the fiscal year ended june 30 , 2009 as compared to $ 181001 for the fiscal year ended june 30 , 2008 . this increase is primarily attributable to a decrease in receivables compared to the same period a year ago of $ 21214 . this decrease is largely the result of fiscal 2010 annual software maintenance billings being provided to customers earlier than in the prior year , which allowed more cash to be collected before the end of the fiscal year than in previous years . further , we collected more cash overall related to revenues that will be recognized in subsequent periods in the current year than in fiscal 2008 . cash used in investing activities for the fiscal year ended june 2009 was $ 59227 and includes $ 3027 in contingent consideration paid on prior years 2019 acquisitions . cash used in investing activities for the fiscal year ended june 2008 was $ 102148 and includes payments for acquisitions of $ 48109 , plus $ 1215 in contingent consideration paid on prior years 2019 acquisitions . capital expenditures for fiscal 2009 were $ 31562 compared to $ 31105 for fiscal 2008 . cash used for software development in fiscal 2009 was $ 24684 compared to $ 23736 during the prior year . net cash used in financing activities for the current fiscal year was $ 94675 and includes the repurchase of 3106 shares of our common stock for $ 58405 , the payment of dividends of $ 26903 and $ 13489 net repayment on our revolving credit facilities . cash used in financing activities was partially offset by proceeds of $ 3773 from the exercise of stock options and the sale of common stock ( through the employee stock purchase plan ) and $ 348 excess tax benefits from stock option exercises . during fiscal 2008 , net cash used in financing activities for the fiscal year was $ 101905 and includes the repurchase of 4200 shares of our common stock for $ 100996 , the payment of dividends of $ 24683 and $ 429 net repayment on our revolving credit facilities . cash used in financing activities was partially offset by proceeds of $ 20394 from the exercise of stock options and the sale of common stock and $ 3809 excess tax benefits from stock option exercises . beginning during fiscal 2008 , us financial markets and many of the largest us financial institutions have been shaken by negative developments in the home mortgage industry and the mortgage markets , and particularly the markets for subprime mortgage-backed securities . since that time , these and other such developments have resulted in a broad , global economic downturn . while we , as is the case with most companies , have experienced the effects of this downturn , we have not experienced any significant issues with our current collection efforts , and we believe that any future impact to our liquidity will be minimized by cash generated by recurring sources of revenue and due to our access to available lines of credit. .\n", "Conversations: \n", "Question: what is the net cash from operating activities in 2009?\n", "Answer:\n", "206588.0\n" ] } ], "source": [ "entry = dataset['train'][0]\n", "query = entry[\"query\"]\n", "answer = entry[\"answer\"]\n", "\n", "print(query)\n", "print(answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# IBM FinTabNet\n", "\n", "Homepage: [IBM FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/)\n", "Data Exploration Notebook: [IBM - Fintabnet - EDA](https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/f57cf3f6-e972-48ff-ab7b-3771ba7b9683/view?access_token=317644327d84f5d75b4782f97499146c78d029651a7c7ace050f4a7656033c30)\n", "\n", "The IBM FinTabNet dataset is a large scale dataset for financial table to text generation. It has been created by IBM Research AI. You can download the dataset by uncommenting the following commands. The dataset size is 16GB. And the sister project SynthTabNet (synthetically generated documents) is 10GB in size." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!wget -O ../../data/fintabnet.tar.gz https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/fintabnet.tar.gz #Approximate Size: 16GB\n", "\n", "# FinTabNet consists of real public documents. SynthTabNet is another dataset from synthetically generated table layouts with annotations in jsonl files. If you want to work with synthetic data, you can download the SynthTabNet dataset from the following link:\n", "\n", "#!wget -O ../../data/synthtabnet.tar.gz https://ds4sd-public-artifacts.s3.eu-de.cloud-object-storage.appdomain.cloud/datasets/synthtabnet_public/v2.0.0/fintabnet.zip #Approximate Size: 10GB" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the following code to extract the dataset\n", "\"\"\"\n", "import tarfile\n", "from os import path\n", "\n", "#Extracting the dataset\n", "tar = tarfile.open(\"../../data/fintabnet.tar.gz\") \n", "if hasattr(tarfile, 'data_filter'):\n", " tar.extractall(filter='data')\n", "else:\n", " # remove this when no longer needed\n", " print('Extracting may be unsafe; consider updating Python')\n", " tar.extractall()\n", "tar.close()\n", "\n", "# Verifying the file was extracted properly\n", "data_path = \"examples/\"\n", "path.exists(data_path)\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Parse the JSON file and read all the images and labels\n", "import json\n", "\n", "# Download the example file from using:\n", "# !wget https://dax-cdn.cdn.appdomain.cloud/dax-fintabnet/1.0.0/examples.tar.gz ./data/fintabnet-examples.tar.gz\n", "\n", "with open('../../data/fintabnet-examples/FinTabNet_1.0.0_table_example.jsonl', 'r') as fp:\n", " images = {}\n", " for line in fp:\n", " sample = json.loads(line)\n", " # Index images\n", " if sample['filename'] in images:\n", " annotations = images[sample['filename']][\"annotations\"]\n", " html = images[sample['filename']][\"html\"]\n", " else:\n", " annotations = []\n", " html = \"\"\n", " for t, token in enumerate(sample[\"html\"][\"cells\"]):\n", " if \"bbox\" in token:\n", " annotations.append({\"category_id\":2, \"bbox\": token[\"bbox\"]})\n", " #Build html table\n", " cnt = 0\n", " for t, token in enumerate(sample[\"html\"][\"structure\"][\"tokens\"]):\n", " html += token\n", " if token==\"\":\n", " html += \"\".join(sample[\"html\"][\"cells\"][cnt][\"tokens\"])\n", " cnt += 1\n", " annotations.append({\"category_id\": 1, \"bbox\": sample[\"bbox\"]})\n", " images[sample['filename']] = {'filepath': '../../data/fintabnet-examples/pdf/' + sample[\"filename\"], 'html': html, 'annotations': annotations}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #0\n" ] }, { "data": { "text/html": [ "
201720162015
Minimum rentals$2,814$2,394$2,249
Contingent rentals(1)178214194
$2,992$2,608$2,443
Operating LeasesAircraftand RelatedEquipmentFacilitiesand Other
TotalOperatingLeases2018$398$2,047
$2,44520193431,887
2,23020202611,670
1,93120212031,506
1,70920221851,355
1,540Thereafter1757,844
8,019Total$1,565$16,309
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #1\n" ] }, { "data": { "text/html": [ "
Amount Reclassified from AOCI
Affected Line Item in theIncome Statement201720162015
Amortization of retirement plans prior servicecredits, before tax$120$121$115
Salaries and employee benefitsIncome tax benefit(44)(45)(43)
Provision for income taxesAOCI reclassifications, net of tax$76$76$72
201720162015
Foreign currency translation gain (loss):
Balance at beginning of period$(514)$(253)$81
Translation adjustments(171)(261)(334)
Balance at end of period(685)(514)(253)
Retirement plans adjustments:
Balance at beginning of period345425425
Prior service credit and other arising during period1(4)72
Reclassifications from AOCI(76)(76)(72)
Balance at end of period270345425
Accumulated other comprehensive (loss) income at end of period$(415)$(169)$172
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #2\n" ] }, { "data": { "text/html": [ "
2018$81
201971
202055
202144
202241
20172016GrossCarryingAmountAccumulated AmortizationNet BookValueGrossCarryingAmount
Accumulated AmortizationNet BookValueCustomer relationships$656$(203)$453$912
$(156)$756Technology54(26)28123
(16)107Trademarks and other136(88)48202
(57)145Total$846$(317)$529$1,237
20172016
Accrued Salaries and Employee Benefits
Salaries$431$478
Employee benefits, including variable compensation781804
Compensated absences702690
$1,914$1,972
Accrued Expenses
Self-insurance accruals$976$837
Taxes other than income taxes283311
Other1,9711,915
$3,230$3,063
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #3\n" ] }, { "data": { "text/html": [ "
(in millions, except per share amounts)FirstQuarterSecondQuarterThirdQuarterFourth Quarter
2017(1)
Revenues$14,663$14,931$14,997$15,728
Operating income1,2641,1671,0251,581
Net income7157005621,020
Basic earnings per common share(2)2.692.632.113.81
Diluted earnings per common share(2)2.652.592.073.75
2016(3)
Revenues$12,279$12,453$12,654$12,979
Operating income (loss)1,1441,137864(68)
Net income (loss)692691507(70)
Basic earnings (loss) per common share(2)2.452.471.86(0.26)
Diluted earnings (loss) per common share(2)2.422.441.84(0.26)
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #4\n" ] }, { "data": { "text/html": [ "
201720162015
Low3.25%2.75%4.50%
High4.504.507.00
Weighted-average4.033.825.90
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #5\n" ] }, { "data": { "text/html": [ "
Aircraft andAircraft RelatedOther(1)Total
2018$1,777$1,440$3,217
20191,7295082,237
20201,9334002,333
20211,3413091,650
20221,2761981,474
Thereafter2,8954993,394
Total$10,951$3,354$14,305
B767FB777FTotal
201814418
201915217
202016319
202110313
202210414
Thereafter6-6
Total711687
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #6\n" ] }, { "data": { "text/html": [ "
2017Percent of Revenue 2017
Revenues$7,401100.0%
Operating expenses:
Salaries and employee benefits2,07728.1
Purchased transportation3,04941.2
Rentals3534.8
Depreciation and amortization2393.2
Fuel2253.1
Maintenance and repairs1431.9
Intercompany charges170.2
Other1,21416.4
Total operating expenses7,31798.9%
Operating income$84
Operating margin1.1%
Package:
Average daily packages1,022
Revenue per package (yield)$24.77
Freight:
Average daily pounds3,608
Revenue per pound (yield)$0.56
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #7\n" ] }, { "data": { "text/html": [ "
20172016TotalNumber ofSharesPurchasedAveragePrice Paidper ShareTotalPurchasePriceTotalNumber ofSharesPurchased
AveragePrice Paidper ShareTotalPurchasePriceCommon stock repurchases2,955,000$172.13$50918,225,000
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #8\n" ] }, { "data": { "text/html": [ "
20172016
Funded Status of Plans:
Projected benefit obligation (PBO)$29,913$29,602
Fair value of plan assets26,31224,271
Funded status of the plans$(3,601)$(5,331)
Cash Amounts:
Cash contributions during the year$2,115$726
Benefit payments during the year$2,310$912
MeasurementDateDiscount Rate
5/31/20174.08%
5/31/20164.13
5/31/20154.42
5/31/20144.60
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Table HTML for page #9\n" ] }, { "data": { "text/html": [ "
Net Book Value at May 31,Range2017
2016Wide-body aircraft and related equipment15 to 30 years$9,103
$8,356Narrow-body and feeder aircraft and related equipment5 to 18 years3,099
3,180Package handling and ground support equipment3 to 30 years3,862
3,249Information technology2 to 10 years1,114
1,051Vehicles3 to 15 years3,400
3,084Facilities and other2 to 40 years5,403
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import fitz # PyMuPDF\n", "\n", "for i, (filename, image) in enumerate(images.items()):\n", " pdf_document = fitz.open(image[\"filepath\"])\n", " pdf_page = pdf_document[0] # Assuming you want to work with the first page\n", " pdf_width = int(pdf_page.rect.width)\n", " pdf_height = int(pdf_page.rect.height)\n", " img = pdf_page.get_pixmap()\n", " \n", " print(\"Table HTML for page #{}\".format(i))\n", " display(HTML(image['html']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# FinTabNet OTSL\n", "\n", "An alternative format is OTSL, which is published by the same development team. https://huggingface.co/datasets/ds4sd/FinTabNet_OTSL\n", "This dataset is a conversion of the original FinTabNet into the OTSL format." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the test dataset to explore the data (even the smaller split is 300MB)\n", "dataset = load_dataset(\"ds4sd/FinTabNet_OTSL\", split=\"test\") # Approximate size: 300MB" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 2 }