# B Tech in Data Science Interview Questions and Answers

• September 02, 2022

### Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at leading organizations such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

• ### 1. Why is linear algebra needed for Machine Learning algorithms?

Machine Learning models are applied to data that can be expressed in matrix form, a 2-dimensional arrangement in which rows typically hold records and columns hold features. Linear algebra is used throughout a Machine Learning implementation, in data preprocessing, data transformation, model evaluation, and so on, to extract meaningful information from raw business data.

• ### 2. What is correlation?

Correlation expresses the strength and direction of the linear relationship between 2 continuous variables. It can be inspected visually with a scatter plot of the 2 variables. The correlation coefficient (r) quantifies how strongly the variables are related: it ranges from -1 (perfect negative relationship) through 0 (no linear relationship) to +1 (perfect positive relationship).
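
As a minimal sketch of how the correlation coefficient can be computed, the snippet below implements the Pearson formula in plain Python. The function name `pearson_r` and the `hours`/`scores` data are made up for illustration and do not come from any real dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1: a strong positive linear relationship
```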

• ### 3. What is Covariance?

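Covariance measures how a pair of variables vary together. A positive covariance means the variables tend to increase together, and a negative covariance means one tends to decrease as the other increases. Unlike the correlation coefficient, its magnitude depends on the units of the variables, and it does not by itself show that one variable causes changes in the other.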
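
The following sketch computes the sample covariance by hand and shows how rescaling one variable changes the result. The helper name `sample_covariance` and the numbers are illustrative assumptions.

```python
def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1) of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
cov = sample_covariance(xs, ys)

# Rescaling ys (for example, metres to centimetres) rescales the covariance too.
cov_scaled = sample_covariance(xs, [y * 100 for y in ys])
print(cov, cov_scaled)
```

The sign is the same in both cases, but the magnitude is not comparable across units, which is why the correlation coefficient is usually preferred for judging strength.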

• ### 4. What is the concept behind the p-value?

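When you conduct a hypothesis test, the p-value helps you judge the strength of the evidence against the null hypothesis. It is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true, so it lies between 0 and 1. A small p-value (commonly below a chosen significance level such as 0.05) indicates strong evidence against the null hypothesis, while a large p-value indicates weak evidence.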
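
As one possible illustration, the sketch below computes a two-sided p-value for a z statistic using the standard normal distribution from Python's `statistics` module. The choice of a z-test and the function name `two_sided_p_value` are assumptions made for the example; other tests use other reference distributions.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    # Probability of a value at least as extreme as |z| in either tail.
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_sided_p_value(1.96)
print(round(p, 3))  # approximately 0.05, the conventional threshold
```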

• ### 5. What is Normal Distribution?

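The normal distribution is a continuous probability distribution whose values are spread symmetrically around the mean in the shape of a bell curve. It is fully described by its mean and standard deviation. Many statistical methods assume normally distributed data, so the normal curve is useful when analyzing variables and their relationships.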

• ### 6. What is bias?

Bias is the error in a model that arises when the algorithm is too simple to capture the underlying trends in the data. High bias leads to underfitting, which lowers accuracy on both training and test data.

• ### 7. What is the expected value?

The expected value is used when we want to know the long-run average of an uncertain quantity, such as a share market value. For a discrete variable, it is the sum of each possible value multiplied by its probability: E(X) = ∑ xᵢ · p(xᵢ).
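
A minimal sketch of the formula in Python follows. The outcomes and probabilities are hypothetical share returns invented for the example, and `expected_value` is an illustrative helper name.

```python
def expected_value(outcomes, probabilities):
    """Expected value of a discrete variable: sum of x_i * p(x_i)."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Hypothetical returns on a share (in percent) and their probabilities.
returns = [-10, 0, 20]
probs = [0.2, 0.5, 0.3]
ev = expected_value(returns, probs)
print(ev)  # -10*0.2 + 0*0.5 + 20*0.3
```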

• ### 8. Why do we use the Box-Cox transformation technique?

The Box-Cox technique transforms non-normally distributed data so that it more closely follows a normal distribution. Many statistical analyses assume normally distributed data, so applying the transformation lets us use those methods on data that would otherwise violate the assumption. The technique applies only to strictly positive values.
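
For reference, here is a sketch of the standard Box-Cox formula, (x^λ - 1)/λ for λ ≠ 0 and ln(x) for λ = 0, written in plain Python. The function name `box_cox` and the sample values are assumptions; in practice λ is usually estimated from the data, for example with `scipy.stats.boxcox`.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox transform with a given lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limit of (x**lam - 1) / lam as lam approaches 0 is ln(x).
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

skewed = [1, 2, 4, 8, 16, 32]      # hypothetical right-skewed data
log_scaled = box_cox(skewed, 0)    # lambda = 0 is the log transform
sqrt_like = box_cox(skewed, 0.5)   # lambda = 0.5 is a shifted, scaled square root
```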

• ### 9. What is ROC, and what is it used for?

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds, which shows how well a classifier separates the two classes. The TPR is calculated as TPR = TP / (TP + FN), and the FPR is calculated as FPR = FP / (FP + TN).
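
The two formulas can be sketched as follows. The labels, scores, and the helper name `tpr_fpr` are invented for the example, and the code assumes both classes are present in `y_true` so that neither denominator is zero.

```python
def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical true labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Each threshold gives one point on the ROC curve.
roc_points = [tpr_fpr(y_true, scores, t) for t in (1.1, 0.5, 0.0)]
```

Sweeping the threshold from high to low moves the point from (0, 0) towards (1, 1); plotting FPR on the x-axis and TPR on the y-axis gives the ROC curve.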

• ### 10. What is the importance of the normalization technique in data preprocessing?

Many machine learning models, especially distance-based and gradient-based ones, achieve better accuracy when all the variables are on a common scale. Normalization converts features measured on different scales to the same scale, so that no feature dominates simply because of its units.
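
One common way to do this is min-max normalization, sketched below. The helper name `min_max_scale` and the feature values are illustrative assumptions; other schemes, such as standardization to zero mean and unit variance, are equally valid.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40]                    # hypothetical feature in years
incomes = [25_000, 50_000, 125_000]    # hypothetical feature in currency units
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
```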

• ### 11. What is the concept behind the random forest technique?

Random forest is an ensemble technique that builds many decision trees on different subsets of the data in order to improve predictive accuracy. Instead of relying on one model, the random forest collects the prediction from every tree and, for classification, returns the class that receives the majority of the votes. A larger number of trees generally leads to better accuracy and reduces the risk of overfitting.
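
The voting step alone can be sketched in a few lines. The list of per-tree predictions is made up, and `majority_vote` is an illustrative helper rather than part of any library.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one record.
tree_predictions = ["spam", "ham", "spam", "spam", "ham"]
forest_prediction = majority_vote(tree_predictions)
```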

• ### 12. What is the primary difference between the 2 sampling techniques: probability and non-probability?

Probability sampling selects data points at random, so every data point has a known, non-zero chance of being included in the sample. This reduces selection bias and allows strong statistical inferences about the population. Non-probability sampling selects data points non-randomly, based on convenience or other criteria. It is therefore prone to bias and may not support strong statistical inferences about the population.

• ### 13. What is Entropy in a decision tree algorithm?

Entropy is a measure of the homogeneity of the sample data. If the entropy is zero, the sample is completely homogeneous. In contrast, for a two-class problem, an entropy of 1 means the sample is equally divided between the classes. A decision tree uses entropy to decide how to split the data: it prefers the splits that reduce entropy the most.
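
A short sketch of the entropy calculation, using base-2 logarithms, illustrates the two extreme cases. The function name `entropy` and the label lists are assumptions for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    # p * log2(1 / p) is the same as -p * log2(p).
    return sum(p * math.log2(1 / p) for p in proportions)

pure = entropy(["yes", "yes", "yes", "yes"])   # completely homogeneous sample
mixed = entropy(["yes", "yes", "no", "no"])    # equally divided sample
```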

• ### 14. What is the linear regression technique in data science?

Linear regression is a supervised machine learning algorithm used when the predicted value is continuous and its relationship with the input has a constant slope. Simple linear regression helps us understand the relationship between two variables by fitting a best-fit line of the form y = mx + c, where m is the slope and c is the intercept.
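
As a sketch, the slope m and intercept c of the best-fit line can be estimated with ordinary least squares. The function name `fit_line` and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical points that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
m, c = fit_line(xs, ys)
prediction = m * 4 + c  # predicted y for a new x = 4
```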

• ### 15. What are the assumptions in the linear regression model?

Four main assumptions justify the use of the linear regression model. The first is linearity: the expected value of the dependent variable is a straight-line function of the independent variables. The second is statistical independence of the errors. The third is homoscedasticity: the errors have constant variance. The final one is normality of the error distribution.

• ### 16. What is meant by the Cofactor matrix?

The cofactor matrix is formed from the cofactors of the elements of the given matrix. The cofactor of the element in row i and column j is the minor of that element multiplied by -1 raised to the power (i + j), that is, Cᵢⱼ = (-1)^(i+j) · Mᵢⱼ, where the minor Mᵢⱼ is the determinant of the matrix that remains after deleting row i and column j.
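
The definition can be sketched in plain Python as below. The helper names are illustrative, and the code uses 0-based indices, which give the same sign as 1-based indices because i + j keeps the same parity.

```python
def determinant(matrix):
    """Determinant by cofactor expansion along the first row."""
    if len(matrix) == 1:
        return matrix[0][0]
    return sum((-1) ** j * matrix[0][j] * minor(matrix, 0, j)
               for j in range(len(matrix)))

def minor(matrix, i, j):
    """Determinant of the matrix left after deleting row i and column j."""
    sub = [row[:j] + row[j + 1:] for r, row in enumerate(matrix) if r != i]
    return determinant(sub)

def cofactor_matrix(matrix):
    """Matrix whose (i, j) entry is (-1)**(i + j) times the (i, j) minor."""
    n = len(matrix)
    return [[(-1) ** (i + j) * minor(matrix, i, j) for j in range(n)]
            for i in range(n)]

a = [[1, 2, 3],
     [0, 4, 5],
     [1, 0, 6]]
cof = cofactor_matrix(a)
```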

• ### 17. What distinguishes random forest from bagging?

Bagging is an ensemble approach that fits numerous models to different bootstrap samples (random samples drawn with replacement) of a training dataset and then combines the results. Random forest is a kind of bagging applied to decision trees that, in addition, considers only a random subset of the features at each split.
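
The two sources of randomness can be sketched with the standard `random` module. The helper names, the toy rows, and the feature names are all hypothetical, and the snippet shows only the sampling step, not the tree fitting.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the sampling used in bagging)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features at random (the extra step in random forest)."""
    return rng.sample(feature_names, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for 10 training records
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(["age", "income", "tenure", "region"], 2, rng)
```

Because sampling is done with replacement, some records usually appear more than once in `sample` and others not at all.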

• ### 18. What do you mean by assuming no outliers for Multiple Logistic Regression?

The variables of interest must not contain outliers. Logistic Regression is sensitive to outliers, that is, data points with unusually large or small values. You can tell whether your variables have outliers by plotting them and checking whether any points lie far away from all the other points.

• ### 19. What do you mean by "trend" in forecasting analysis?

The trend is the consistent, long-term upward or downward movement in a time series. From the trend, we can tell whether the values are generally increasing or decreasing over time. The trend observed across the whole series is called the global trend. Example: analyzing past sales to forecast future sales.

• ### 20. What is variance in Data science?

Variance is a type of error that occurs when a model is too complex and keeps learning from the training data until it also fits the noise in it. The result is overfitting: high accuracy on the training data but poor accuracy on the test data.

• ### 21. What is a box plot and how do we interpret it?

A box plot is a graphical representation of the distribution of a dataset. It helps to identify the outliers and the spread of the data. It is a descriptive statistical method that summarizes 5 features: 1. Maximum value, 2. Third quartile (Q3), 3. Median, 4. First quartile (Q1), 5. Minimum value. A box plot uses the IQR (interquartile range, Q3 - Q1) method to detect exceptional values: points that lie more than 1.5 × IQR below Q1 or above Q3 are called outliers.
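
The five-number summary and the IQR rule can be sketched with the standard `statistics` module. The data values are invented, and the quartiles use the `inclusive` method; other quartile conventions give slightly different cut points.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def iqr_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical values with one extreme point
summary = five_number_summary(data)
outliers = iqr_outliers(data)
```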

• ### 22. On what basis does clustering work in machine learning algorithms?

Cluster analysis is also called data segmentation. A clustering algorithm identifies homogeneous groups of records, and it is an unsupervised learning technique because it needs no labeled output. Records are grouped by their similarity, which is measured from the distance matrix: records that are close to each other fall into the same cluster.

• ### 23. How is cosine similarity involved in text mining?

Cosine similarity is a metric used in text mining to measure the similarity between two sentences or documents. Each text is represented as a vector, and the similarity is the cosine of the angle theta between the two vectors.

If theta is equal to 0 degrees, the cosine is 1 and the 2 vectors point in the same direction, so the texts are similar. If theta is equal to 90 degrees, the cosine is 0 and the vectors are dissimilar.
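
A minimal sketch follows, using word counts as the vectors. The sentences, the vocabulary, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for the example; real text-mining pipelines often use TF-IDF weights instead of raw counts.

```python
import math
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocabulary = ["data", "science", "is", "fun", "cooking"]
v1 = bag_of_words("Data science is fun", vocabulary)
v2 = bag_of_words("Cooking is fun", vocabulary)
similarity = cosine_similarity(v1, v2)
```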

• ### 24. What do you mean by probability?

Probability is a measure of the chance or likelihood of an event, expressed as a number between 0 and 1. Events can't be predicted with certainty, but probability lets us express how likely they are to occur.

• ### 25. What is the second-moment business decision?

The second-moment business decision describes the spread of the data: on average, how far away is the data from its mean? Mathematically, it is measured by the variance, standard deviation, and range. A larger spread means more uncertainty in the data, while data with less spread is easier to analyze.
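
The three measures of spread can be computed with the standard `statistics` module, as in this sketch. The data values are hypothetical, and the population formulas (`pvariance`, `pstdev`) are used; the sample versions `variance` and `stdev` divide by n - 1 instead.

```python
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

center = mean(data)               # first moment: the mean
variance = pvariance(data)        # average squared distance from the mean
std_dev = pstdev(data)            # square root of the variance
value_range = max(data) - min(data)
```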

• ### 26. What are the objectives of SVM?

SVM stands for Support Vector Machine. The main goal of SVM is to create a flat boundary, called a hyperplane, that divides the feature space into homogeneous partitions on either side. SVM is primarily used as a classification technique. Among all the boundaries that separate the classes, it searches for the Maximum Margin Hyperplane, the one with the greatest distance to the nearest data points of each class.

• ### 27. What is the difference between univariate and bivariate analysis?

In univariate analysis, we analyze a single variable, typically with bar charts, box plots, and histograms. In bivariate analysis, we analyze two variables together. A scatter plot helps to identify the relationship between the two variables.

• ### 28. Where do we use the Matplotlib library in Python?

The Matplotlib library is used for plotting numerical values in 2D. We use Matplotlib for the visual representation of the data in a DataFrame. We can create bar charts, histograms, scatter plots, etc.

• ### 29. What is meant by ANN?

ANN stands for Artificial Neural Network. It is loosely inspired by the human brain: each node in the network plays the role of a neuron. An ANN consists of an input layer, one or more hidden layers, and an output layer. A common form of ANN is the multi-layer perceptron. In deep learning, ANNs with many hidden layers are used to model complex data sets.

• ### 30. How is the seaborn package useful for data analysis?

Seaborn is an advanced Python data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics, and it extends matplotlib with better defaults and plot types designed for data analysis. It can produce bar charts, scatter plots, box plots, etc.