
Support Vector Classifier for Analysis of FBI Crime Data

September 24, 2024

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Introduction

Cyber events combine long-established kinds of crime with more recent ones. According to national crime data and polls, cybercrime events occur as separate criminal offences and are on the rise. The current classification method is not accurate enough at classifying cybercrimes and cyber-incidents. The biggest flaws in the dominant model are its overlapping findings and, consequently, the absence of a new classification method. To address these challenges, the study offers a method of categorising various crimes that relies on physical aspects such as time and date. It gives users an easy, accurate classification result through a support vector machine (SVM)-based cybercrime classifier. A model is built on a training set, with the dataset grouped by either decision trees or random forests, in order to get the most precise results. It is a simple and effective strategy for classifying cybercrimes, so that those affected can recognise the kind of incident and respond appropriately. It also evaluates the data and generates a variety of charts to represent the data accurately. The above model, which divides convicted offenders into groups with low, medium, and high recidivism risks, aids in reducing crime rates in society and guaranteeing the welfare and well-being of its residents.

At present, the number of pending criminal cases in India is rising rapidly along with the number of crimes committed. Resolving a case from the available data requires intensive internal investigation and analysis. Given the quantity of crime data that currently exists in India, analysing these cases and reaching decisions on them is too difficult for officials. Recognising this problem, this paper concentrates on creating a solution that supports decision making about crimes that have been committed. Machine learning is the branch of science in which computers make decisions without human intervention. In recent years it has been employed in various domains; self-driving cars are one example. Machine learning algorithms can predict certain results from the inputs given, and can therefore help with solving crime cases in India. The two common types of prediction technique are classification and regression. Crime data prediction is a domain where classification is applied. Classification is a supervised prediction technique and has been used in domains such as stock forecasting and medicine.

The main goal of this research is to consider several algorithms that can be used to analyse and forecast crime data, and to increase the accuracy of these models by processing the data so that they give better outcomes. The required model is trained on the training data set to make its predictions, which are then validated using the test data set. Here, decision trees, random forests, and logistic regression are the models used for classification.

The major contributions of the paper are as follows:

a. The data has been preprocessed, so that errors and anomalies in the crime dataset are effectively removed.

b. SVM classification was implemented on the publicly available dataset. The results show that the proposed SVM classification performs better than the other approaches.

The work is organized as follows:

The preprocessing phase leaves us with a clean, compatible dataset that is free of anomalies and can be used in the next stage. That stage is the machine learning implementation, where several supervised classification algorithms are used to divide criminal records into three groups: Low, Medium, and High. We use SVM-based classification techniques to train our models.

Data Cleaning

Removal of null values: the original dataset contained three tuples in which every attribute was null. These three tuples were deleted using the Python packages Pandas and NumPy. Apart from these three tuples, the dataset had a few null values, which were replaced by the mean of the column concerned. Removal of duplicates: the dataset also contained repeated records, which had to be removed in order to prevent overfitting. As a result, the dataset was reduced from 60000 records, including duplicates, to 18000 unique ones.
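The cleaning steps described above can be sketched with Pandas roughly as follows. This is a minimal illustration, not the authors' actual script: the function name `clean_records`, the column names, and the sample values are assumptions made for the example, and the order of the steps (fill nulls, then drop duplicates) is also assumed.

```python
import numpy as np
import pandas as pd


def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop all-null rows, fill numeric nulls with the column mean, drop duplicates."""
    # Remove tuples whose attributes are all null.
    df = df.dropna(how="all")
    # Replace remaining nulls in numeric columns with that column's mean.
    numeric_cols = df.select_dtypes(include="number").columns
    df = df.fillna(df[numeric_cols].mean())
    # Remove repeated records, keeping the first occurrence of each.
    return df.drop_duplicates().reset_index(drop=True)


# Made-up sample rows (hypothetical column names and values).
raw = pd.DataFrame(
    {
        "Age": [30.0, np.nan, 40.0, 30.0, np.nan],
        "LegalStatus": ["Pretrial", None, "Probation", "Pretrial", "Pretrial"],
    }
)
cleaned = clean_records(raw)
print(cleaned)
```

Here the second row is dropped because every attribute is null, the missing age in the last row is filled with the mean of the remaining ages, and the repeated record is removed.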

Several attributes in the dataset hold string values, which are not compatible with machine learning models, so they are label-encoded into a numeric format as described under Data Transformation. The dataset has 20 features, of which 16 important features are used for the models' predictions. The features mainly considered while training the model include Ethnic Code, Legal Status, Age, Sex Code, and position, among others. The target variable is Score_Text, with the values low, medium, and high.

The data description provides statistical metrics such as the mean, standard deviation, minimum value, and maximum value for the integer-type properties of the dataset. Take the Age attribute as an example: based on our observations, the average offender is between 34 and 35 years old, with a standard deviation of 12.20 years. The oldest offender in the dataset is 84 years old, while the youngest is 16 years old. Similar readings can be made for the other numeric features. As the graphic above shows, once the criminal data set has been gathered, a four-step procedure known as data processing takes place. Data processing is the process of looking through large existing databases to obtain new information. Data cleansing, also known as data cleaning, is the initial procedure, used to identify and fix erroneous or faulty records. Data preparation then transforms the corrected data into a format that is useful and effective.
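Summary statistics of this kind can be obtained directly from Pandas. The sketch below uses a handful of made-up ages purely for illustration; the values are not taken from the crime dataset and the variable names are assumptions.

```python
import pandas as pd

# Made-up ages for illustration only (not from the actual dataset).
ages = pd.Series([16, 22, 29, 35, 41, 84], name="Age")

# Mean, standard deviation, minimum and maximum of the column.
summary = ages.agg(["mean", "std", "min", "max"])
print(summary)
```

On a full DataFrame, `DataFrame.describe()` reports the same metrics for every numeric column at once.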

Data Transformation

The dataset contains various attributes with values in the string data type. The String data type is not compatible with the Machine Learning Models. To convert the data into a compatible format (Integer, Float), we perform the Label Encoding Technique. In the associated dataset, label encoding is done manually to reduce the biases in the dataset which is one of the most important factors that needs to be taken care of.
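Manual label encoding of the kind described above can be sketched with explicit mapping dictionaries. The mappings, the column values, and the helper name `encode_column` below are assumptions chosen for illustration; the actual codes used in the study are not given in the text.

```python
import pandas as pd

# Hypothetical hand-written mappings from string values to integer codes.
SEX_CODES = {"Female": 0, "Male": 1}
SCORE_CODES = {"Low": 0, "Medium": 1, "High": 2}


def encode_column(series: pd.Series, mapping: dict) -> pd.Series:
    """Map string values to integers, failing loudly on unmapped values."""
    # Nulls are expected to have been handled during cleaning; any value
    # without a code (including a null) raises an error here.
    unknown = set(series.unique()) - set(mapping)
    if unknown:
        raise ValueError(f"No code defined for: {sorted(map(str, unknown))}")
    return series.map(mapping).astype("int64")


sex = pd.Series(["Male", "Female", "Male"])
score = pd.Series(["Low", "High", "Medium"])

sex_encoded = encode_column(sex, SEX_CODES)
score_encoded = encode_column(score, SCORE_CODES)
print(sex_encoded.tolist(), score_encoded.tolist())
```

Writing the mapping by hand keeps the integer assigned to each category explicit and reviewable, rather than leaving it to the alphabetical order an automatic encoder would use.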

Next, feature extraction is the procedure in which the information is condensed into a more manageable set of features. Once the data has been gathered, a procedure known as processing is used to extract, alter, or categorise the data using computer processes. The Prophet model's methodology is shown in the flowchart below. First, the data is obtained by combining the information from trend, seasonality, and holidays. The main procedure is feature selection, in which the user chooses, manually or automatically, the features that have the greatest impact on the output or prediction variable. After the features are chosen, the information goes through a four-step procedure: modelling, forecast evaluation, surfacing issues, and visually evaluating the forecasts. Once this is finished, the data sets are analysed to highlight their key properties, frequently using visual approaches. The "ScoreText" property of the dataset is our target variable for the aforementioned methods. This attribute takes one of three values across the dataset: Low, Medium, or High. It can serve as a gauge of an offender's propensity for repeat offences. Since we want to categorise offenders by their likelihood of committing a criminal offence again, we have set "ScoreText" as the target attribute.
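Training an SVM classifier on a three-class target such as Low, Medium, and High can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_blobs`, not the crime dataset), and the RBF kernel, `C=1.0`, the feature scaling step, and the 75/25 split are choices made for the example, not settings reported by the study.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

LABELS = ["Low", "Medium", "High"]  # class 0, 1, 2 respectively

# Synthetic stand-in for the encoded features and the encoded target.
X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Hold out a test set to validate the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a support vector classifier.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
predicted_labels = [LABELS[i] for i in predictions]
print(f"Test accuracy: {accuracy:.2f}")
```

Scaling is included because SVMs are sensitive to the relative magnitude of features; with real data, the kernel and `C` would normally be tuned rather than fixed.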

1) The classification into Low, Medium, and High risk gives the authorities a clear perspective when processing a given offender for parole or bail. For example, an offender with a higher tendency to reoffend should not be granted bail or parole as readily as one with a lower risk. This serves the aim of checking and curbing criminal recidivism in society, thereby protecting citizens and averting a possible crime. The attributes selected as features for the algorithms are directly or indirectly associated with the recidivism tendency of the offenders. Nearly 17 attributes of the dataset were selected for training and testing our machine learning models. For a thorough comparative study among all 3 algorithms used, the features of the model remained identical.

Bias will be encountered in any machine learning model, and it may influence the result the model generates. Biases should be reduced so that the model gives an impartial outcome. A completely bias-free machine learning model cannot exist, because a certain amount of bias is required to model the data and to make predictions. The aim, however, is to reduce the biases occurring in our model. When training models for criminal recidivism, various biases can be encountered. For example, some biases may work against certain races, where people of a particular race are evaluated unfairly when gauging the recidivism score. Another such bias may concern the gender of an offender, where someone of a particular gender is more likely to be categorised as a recidivist.

2) Algorithmic Bias - This is the type of bias that is introduced by the algorithmic phase of the machine learning model and is not due to anomalies in the data samples. Data scientists strive to strike a balance between variance and bias. Here, in our model, the Random Forest Classifier introduces bias when training and testing on the given dataset. This is an inherent property of the algorithm.

3) Measurement Bias - It occurs when we select the features we wish to incorporate into the model. It may arise from the way these features/attributes are used in the machine learning phase. A striking example in the case of criminal recidivism is that any previously committed crimes, or crimes committed by relatives/friends, may taint the outcome of the model. Thus, attributes like "Agency Type", "Custody Status", "Legal Status", etc. may create a measurement bias when evaluating an offender's score text.

4) Prejudice Bias - This type of bias is mainly due to the influence of social stereotypes and orthodox opinions. It mainly occurs in training data, where prejudice against a particular culture, gender, ethnicity, or any such factor may make the model biased when generating its output. If the algorithm is exposed to a more even-handed data distribution, the statistical relationship between such potentially prejudiced attributes and the output can be avoided.
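One simple way to look for the kinds of bias listed above is to compare how often each group receives the High label. The sketch below is a hypothetical check, not part of the original study: the function name `high_risk_rate_by_group`, the column names, and the sample rows are all made up for illustration.

```python
import pandas as pd


def high_risk_rate_by_group(
    df: pd.DataFrame, group_col: str, label_col: str, high_label: str = "High"
) -> pd.Series:
    """Return, for each group, the share of records labelled as high risk."""
    is_high = df[label_col] == high_label
    return is_high.groupby(df[group_col]).mean()


# Made-up predictions for two hypothetical groups.
results = pd.DataFrame(
    {
        "Group": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "ScoreText": ["High", "Low", "Low", "Low", "High", "High", "Low", "Medium"],
    }
)
rates = high_risk_rate_by_group(results, "Group", "ScoreText")
print(rates)
```

A large gap between groups does not by itself prove prejudice, but it indicates where the training data and features deserve a closer look.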

Conclusion

The above model, which divides convicted offenders into groups with low, medium, and high recidivism risks, aids in reducing crime rates and guaranteeing the welfare and well-being of society's residents. In this way, machine learning can take on a crucial role in maintaining the security and safety of unsuspecting individuals who might otherwise become the targets of attacks or other traumatic events. By addressing the flaws in current law enforcement procedures, our machine learning model seeks to empower the authorities to make an informed, statistically and analytically sound decision when granting parole to offenders who may reoffend. To make the results more targeted and relevant to a given place, the research methodology described here should be extended to more specialised prisons and facilities. The study should also cover a longer time frame, which would yield more criminal records and improve accuracy. In the coming years, the suggested methodology should help to lower criminal recidivism. Using Natural Language Processing tools, the findings of this study could be combined with the spoken statements made by the defendant on the day of the trial in order to assess their legitimacy. To identify illuminating trends, in-depth study should first be done on a smaller scale and then more broadly. This method should help those in charge of granting parole to refrain from doing so for offenders with a medium to high level of risk.
