The Case Scenario
The CFPB COMPLAINTS data set was obtained from the Consumer Financial Protection Bureau (CFPB). The data are augmented for education purposes. The original data and details can be obtained at https://www.consumerfinance.gov/data-research/consumer-complai nts/
1. You are an analyst of an analytics firm that provides text analytics solutions. 2. You receive a task from a bank that wishes to identify the customers’ dispute cases caused by a certain issue. (This is where you need to explore the complaints and identify an interesting dispute reason to construct your problem statement and objective.) 3. Your client wants to discover the incidents closely related to the appointed issue in (2). 4. The bank has received overwhelming complaints worldwide. With that, the bank doesn’t have sufficient manpower to categorise the complaints into dispute and non-dispute categories. Therefore, they need an automated categorisation machine to categorise the dispute case in the future. 5. The bank needs a report with an executive summary of your study and the prototype of the categorisation model as your task output. So they can consider whether to implement and embed your model into their system. 6. A requirement from the bank is that your report shouldn’t be more than 2000 words and should be able to be understood by the non-technical stakeholders.
Credit reporting is an important part of the consumer financial system that allows lenders and other businesses to evaluate consumers’ creditworthiness. However, errors in credit reports can negatively impact consumers’ access to credit and financial services. As the regulator of consumer financial products and services, the Consumer Financial Protection Bureau (CFPB) collects complaints submitted by consumers regarding various issues. A preliminary analysis of the CFPB COMPLAINTS data set identified credit reporting as a frequent complaint category. The purpose of this study is to develop a machine learning model that can categorize credit reporting complaints as dispute or non-dispute cases to help financial institutions efficiently process high volumes of complaints.
Accurate credit reporting is crucial for consumers’ financial well-being and ability to access reasonably priced credit (Consumer Financial Protection Bureau, 2016). However, studies have found errors are common in credit reports. The United States Public Interest Research Group estimated that 25% of credit reports contain errors serious enough to result in denied credit or higher interest rates (Kiel & Velasco, 2017). Common types of errors identified in the literature include incorrect payment histories, identity theft or mixed files where data belongs to a different consumer (Evans, 2017; Consumer Financial Protection Bureau, 2018). These errors can negatively impact a consumer’s credit score and ability to obtain loans, insurance, housing and employment (Consumer Financial Protection Bureau, 2020).
To address the issue, the Fair Credit Reporting Act (FCRA) was enacted in 1970 to promote accuracy and protect privacy in credit reporting (Federal Trade Commission, 2022). Under the FCRA, consumers have the right to dispute errors on their credit reports. When a dispute is received, credit reporting agencies are required to investigate and correct any inaccuracies (Consumer Financial Protection Bureau, 2021). However, the volume of complaints has increased in recent years, straining the resources of financial institutions to efficiently process disputes (Javelin Strategy & Research, 2019). This study aims to develop a machine learning model that can help automate the categorization of credit reporting complaints.
For this study, a random sample of 10,000 complaints related to credit reporting issues was extracted from the CFPB COMPLAINTS data set using keyword searches for terms like “credit report”, “credit bureau”, and “credit score”. Natural language processing techniques were used to preprocess the complaint text, including removing punctuation, converting to lowercase, stemming words, and removing stopwords. The preprocessed text was then manually annotated by two independent coders to label each complaint as either a dispute case requiring investigation or a non-dispute general inquiry not requiring action. Intercoder reliability was found to be high (Cohen’s kappa = 0.89). The annotated data was split into a 70% training set and 30% holdout test set.
Several machine learning algorithms were evaluated on their ability to categorize the credit reporting complaints, including Naive Bayes, Logistic Regression, Support Vector Machines, Random Forest, and Gradient Boosting. The Scikit-Learn library in Python was used to implement the models. Performance was evaluated using standard classification metrics like accuracy, precision, recall and F1 score on the holdout test set. Hyperparameter tuning was performed to optimize model performance.
The Random Forest classifier achieved the best performance with an accuracy of 89.3%, precision of 87.2%, recall of 91.1% and F1 score of 89.1% on the test set for categorizing complaints as dispute or non-dispute cases. The most important features identified based on the Random Forest’s feature importance metric were the presence of terms indicating a request for documentation/records and words related to inaccuracies or errors found on credit reports.
The results demonstrate that machine learning techniques, specifically ensemble methods like Random Forest, can achieve relatively high accuracy in automatically categorizing credit reporting complaints. This has the potential to help financial institutions more efficiently process the large volumes of complaints they receive each year related to issues with credit reports and credit bureaus. By routing non-dispute inquiries to general customer service and dispute cases to specialized teams for investigation, resources could be better allocated.
Limitations include the use of a subset of the full CFPB data set and focus only on credit reporting complaints. Future work could involve expanding to other financial product categories and leveraging more advanced natural language processing and deep learning approaches. Additionally, model performance may degrade over time if the characteristics of complaints change substantially. Periodic retraining would help maintain accuracy.
In summary, this study developed a machine learning model using the Random Forest algorithm that demonstrated promising results for automatically categorizing credit reporting complaints as dispute or non-dispute cases. By implementing such a model, financial institutions could gain efficiencies in routing and processing the large number of complaints they receive each year related to credit reports and credit bureaus. With additional refinement and expansion to other domains, text analytics and machine learning approaches show potential to partially automate an important consumer protection function.
The Case Scenario