Untapped 100 Data Science 2021 Interview Questions

  • September 19, 2022

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

With the emergence of new technology, almost everything is evolving. Using data to gain knowledge across many fields has become widespread, giving rise to new career paths and professional communities. Data Science is the buzzword making waves in this millennium.

If you are looking forward to starting your journey as a Data Scientist or you want to explore this terrain, it is vital to gain mastery of the numerous skills required. To be successful in your community and job applications, listed below are possible questions you can be asked during a Data Science interview.

  • What is the difference between Big Data, Data Analytics, and Data Science?

    Big Data: a field concerned with ways to extract, analyze, or otherwise handle data sets that are too large and complex for traditional data-processing software.

    Data Science: a field that covers everything related to data cleansing, preparation, and analysis.

    Data Analytics: the science of examining raw data to reach conclusions. It involves applying an algorithm or mechanical process to derive insights, for example running through several data sets to look for meaningful correlations.

  • What are the skills required to become a Data Analyst?

    • Machine Learning skills
    • Programming skills, e.g., knowledge of R and Python
    • Mathematical and Statistical skills
    • Communication and Data Visualization skills
    • Data wrangling skills: It is the ability to convert raw data to other convenient forms
  • In a Machine Learning model, what marks the difference between Overfitting and Underfitting?

    Overfitting is a modeling error that occurs when a function is closely fit to a limited set of data points. When a model is too complex and it has too many parameters relative to the number of observations, over-fitting will occur.

    Underfitting occurs when a Machine Learning algorithm cannot capture the underlying trend and so does not fit the data well. For example, underfitting is expected when fitting a linear model to non-linear data.

  • What is the difference between Correlation and Covariance?

    Both are used as a measure to check how two variables change with respect to each other.

    Correlation:
    is a standardized measure of how two variables change with respect to one another. It is unit-free, so a difference in scale does not affect the correlation value. It varies from -1 to +1.
    Covariance:
    is a measure of how two variables change with respect to each other. It is unit dependent and varies from -infinity to +infinity.
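
    The effect of units on the two measures can be illustrated with a minimal sketch. The helper functions and the height and weight figures below are hypothetical and written only for this illustration; they use the sample (n - 1) form of covariance.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]    # hypothetical data, in metres
weights_kg = [50.0, 58.0, 69.0, 77.0, 90.0]   # hypothetical data, in kilograms
heights_cm = [h * 100 for h in heights_m]     # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)   # 100 times larger than cov_m
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg) # unchanged by the unit switch
```

    Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation as it was.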
  • When should the Classification technique be used over the Regression Technique?

    Both classification and regression techniques are related to predictions and are supervised Machine Learning algorithms.

    Regression algorithms are used to predict the continuous values such as price, salary, age, etc and it involves predicting a response i.e. a value from a continuous set of data.

    Classification algorithms are used to predict or classify discrete values such as male or female, true or false, etc.

  • What is Data Cleansing?

    As the name suggests, cleansing means cleaning out. Data cleansing involves detecting and then correcting or removing inaccurate, incorrect, corrupt, or unneeded data from a database or set of records. After cleansing, data quality improves and the cleaned data set is consistent with the other data sets in your system.

  • What is a Decision Tree Algorithm?

    It is a supervised Machine Learning algorithm used for classification and regression. It predicts by applying a sequence of feature-based decision rules arranged as a tree.

  • How do you clean your data?

    Raw data with anomalies will always give the wrong results and analysis. There is no one way to highlight the exact procedures in cleaning your data since the techniques may differ based on the data type.

    The basic steps include:

    • Remove unwanted and duplicate observations in your data set
    • Fix the structural errors like typographical errors, mismatch of file names, etc
    • Filter observations that do not fit within the data being analyzed called 'outliers'
    • Missing data should not be left missing and should be adequately handled
    • Check the quality of your data and validate it
  • What is the difference between K-NN and K-means clustering?

    Both are commonly used machine learning algorithms.

    K-NN is a supervised learning algorithm used for classification, where "k" is a parameter: the number of neighbors consulted. K-nearest neighbor requires labeled data; given that data, it classifies a new unlabeled point by examining the k closest data points.

    K-means clustering is an unsupervised algorithm requiring unlabeled data. It can be used to understand social media trends, changes in strategic marketing, demographics among others.
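
    The K-NN side of the comparison can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the sample points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),   # Euclidean distance to the query
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled training data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction = knn_predict(points, labels, query=(2, 2), k=3)
```

    Note that the labels are required here, which is what makes K-NN supervised; K-means would receive only `points`.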

  • What do you understand by p-value?

    The p-value is the probability, assuming the null hypothesis is true, of obtaining results at least as extreme as those actually observed. A small p-value (commonly below 0.05) is evidence against the null hypothesis, so we reject it; a larger p-value means we fail to reject the null hypothesis.
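
    As a worked illustration, the sketch below computes an exact two-sided p-value for a coin-flip experiment. The null hypothesis (a fair coin), the counts, and the function name are assumptions chosen for this example, and 0.05 is only a conventional cut-off.

```python
from math import comb

def binomial_two_sided_p_value(heads, flips):
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Every outcome at least as far from the expected count as the observed one.
    extreme = [k for k in range(flips + 1) if abs(k - expected) >= observed_gap]
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

p_value = binomial_two_sided_p_value(heads=9, flips=10)   # 22 / 1024
reject_null = p_value < 0.05   # conventional threshold, not a fixed rule
```

    Nine heads in ten flips would be rare under a fair coin, so the small p-value counts as evidence against the null hypothesis.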

  • How is statistics important in Data Science?

    The role of statistics in Data Science cannot be overstated. As a field, statistics is the science of learning from data, as it reveals information about data required in decision-making processes. It provides the methods and tools required to understand data structures, and its roles include data acquisition, processing and mining, making predictions, modeling, validation, and visualization of the interpreted structures.

  • What is a Linear Regression?

    Linear regression models the relationship between two variables by fitting a linear equation to observed data. The equation of a linear regression line can be written as Y = a + bX, where a is the intercept, b is the slope of the line, X is the explanatory variable and Y is the dependent variable.
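
    A minimal sketch of fitting Y = a + bX by ordinary least squares follows. The function name `fit_line` and the data are hypothetical; in practice a library routine would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x   # the fitted line passes through the means
    return a, b

x_values = [1, 2, 3, 4, 5]    # hypothetical explanatory variable X
y_values = [3, 5, 7, 9, 11]   # hypothetical dependent variable Y

a, b = fit_line(x_values, y_values)
predicted = a + b * 6         # predict Y for a new X
```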

  • What is a Linear Regression used for?

    Linear Regression is used when the value of a variable is to be predicted based on the value of another variable. The dependent variable is the variable to be predicted.

  • How can you explain a Normal distribution?

    It is a probability distribution that describes how the values of a variable are distributed. Most of the values cluster around the central peak, and the probabilities taper off symmetrically as values move away from the mean.

    It is often called a bell curve because of the shape of the graph.

  • What is a Normal distribution used for?

    It is significant in statistics because it is often used in both social and natural sciences to represent random variables with real values whose distributions are not known.

  • How do you count different values in a pandas column?

    By using the value_counts() method on the column.
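
    A short sketch of the call, assuming pandas is installed; the DataFrame and its `city` column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Pune", "Chennai", "Pune", "Hyderabad", "Pune", "Chennai"]})

counts = df["city"].value_counts()   # frequency of each distinct value, most common first
distinct = df["city"].nunique()      # how many distinct values there are
```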

  • What is the meaning of Tensor in Tensorflow?

    A tensor is a mathematical object represented as arrays of higher dimensions

  • Python or R, which one would you prefer for text analysis?

    R is well suited to statistical modeling and Machine Learning, but Python is usually the more suitable option for text analysis. Its Pandas library provides easy-to-use data structures and high-performance data analysis tools, and Python generally performs well across most types of data analysis.

  • How can you determine the k for the k-means?

    In determining the optimal number of clusters in a data set, as in k-means clustering, two major groups of methods can be used:

    1. Direct methods: these optimize a criterion and include:
      • Elbow method
      • Silhouette method
    2. Statistical testing methods: these compare evidence against a null hypothesis.
      • Gap statistic
  • What are the common libraries used for plotting data in Python?

    • Matplotlib
    • Seaborn
    • ggplot
    • Bokeh
    • Plotly

    There are many more libraries that can be explored.

  • How can you differentiate between a Stochastic Gradient Descent and Batch Gradient Descent?

    Stochastic Gradient Descent computes the gradient using a single sample per update, so each update is cheap and it often makes progress faster, although the updates are noisy. Batch Gradient Descent computes the gradient using the whole data set for every update, so it typically takes longer to converge.
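
    The difference comes down to how many samples feed each update, which the following sketch makes explicit through a `batch_size` argument. The one-parameter model y = w * x, the learning rate, and the data are assumptions chosen to keep the example small.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y = w * x by minimizing mean squared error; returns (w, number of updates)."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) over the batch with respect to w.
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # hypothetical noise-free data with true w = 2

w_batch, n_batch = gradient_descent(xs, ys, batch_size=len(xs))  # one update per epoch
w_sgd, n_sgd = gradient_descent(xs, ys, batch_size=1)            # one update per sample
```

    Passing an intermediate value such as `batch_size=2` gives Mini-Batch Gradient Descent with the same code.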

  • What is memory management in Python?

    It is the process by which applications read and write data. A memory manager determines where to put an application's data. A Python memory manager helps to internally manage the heaps containing all Python objects and data structures.

  • What is a lambda function?

    Anonymous functions are introduced using the keyword 'lambda'. The functions are often arguments being passed to higher-order functions. A higher-order function is a function that takes one or more functions as an argument or returns a function as its result.
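
    A few small examples, using only built-in functions; the variable names are illustrative.

```python
# A lambda is an unnamed, single-expression function.
square = lambda x: x * x

# Lambdas passed as arguments to higher-order functions.
words = ["pandas", "R", "numpy", "sql"]
by_length = sorted(words, key=lambda w: len(w))
squares = list(map(lambda x: x * x, [1, 2, 3]))
evens = list(filter(lambda x: x % 2 == 0, range(6)))

def make_multiplier(factor):
    """A higher-order function that returns a function."""
    return lambda x: x * factor

double = make_multiplier(2)
```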

  • What are False Negatives?

    It is the opposite of False positives and it is also called 'Type II error'. It occurs when events are classified as non-events.

  • State the difference between Supervised Machine Learning and Unsupervised Machine Learning

    In Supervised Machine Learning, the machine is trained using data that is well labeled, so the labels act as an answer key the algorithm can use to evaluate its accuracy on the training data.

    On the other hand, Unsupervised learning is a Machine Learning technique that does not require labeled data and the model needs no supervision. The algorithm extracts features and patterns on its own.

  • What do you understand by Reinforcement learning?

    It is one of the basic Machine Learning techniques alongside supervised and unsupervised machine learning. Reinforcement learning is the training of Machine Learning models to make a sequence of decisions that include moving from a state to another.

  • During Sampling, what are the types of biases that can occur?

    The types of sampling bias include:

    • Self-selection bias
    • Non-response bias
    • Undercoverage bias
    • Survivorship bias
    • Advertising bias
  • What are Eigenvectors and Eigenvalues?

    An Eigenvector is a non-zero vector of a linear transformation such that when that linear transformation is applied to it, it changes by a scalar factor.

    The Eigenvalue is the factor by which the Eigenvector is stretched or scaled. The direction of an eigenvector is reversed when the Eigenvalue is negative.
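
    The defining property, that applying the transformation only scales the vector, can be checked by hand on a small example. The 2x2 matrix and helper `mat_vec` below are chosen purely for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v1, lambda1 = [1, 1], 3    # A applied to v1 gives 3 * v1
v2, lambda2 = [1, -1], 1   # A applied to v2 gives 1 * v2
not_eigen = [1, 0]         # A applied to [1, 0] gives [2, 1], not a multiple of [1, 0]

Av1 = mat_vec(A, v1)
Av2 = mat_vec(A, v2)
```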

  • Explain the difference between Systematic Sampling and Cluster Sampling

    Both are statistical sampling techniques used to study populations. However, systematic sampling selects members at a fixed interval from a larger population, while cluster sampling breaks the population into clusters and then randomly selects whole clusters to sample.
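
    A minimal sketch of the two schemes follows. It assumes the one-stage form of cluster sampling, in which whole clusters are drawn at random and every member of a chosen cluster is kept; the function names and data are hypothetical.

```python
import random

def systematic_sample(population, interval, start=0):
    """Take every `interval`-th element, beginning at index `start`."""
    return population[start::interval]

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

population = list(range(1, 21))                  # hypothetical IDs 1..20
every_fifth = systematic_sample(population, 5)   # IDs 1, 6, 11, 16

clusters = {                                     # hypothetical groups, e.g. by city
    "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12],
}
from_two_clusters = cluster_sample(clusters, n_clusters=2)
```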

  • How can you maintain a deployed model?

    Monitor all models, evaluate their metrics, compare models, and rebuild them when performance degrades. You can also add a low percentage of negative test data as part of the model result, and develop an auto-encoder model with training data before deploying the model.

  • What are False Positives?

    It is also called a Type I error and it occurs when a non-event is classified as an event.

  • Explain the treatment of outliers in a data set?

    An observation lying at an abnormal distance from others in a sample can be treated by :

    • Setting up a filter in your testing tool
    • Removing or changing outliers during post-test analysis
    • Considering the underlying distribution or the value of mild outliers
    • Changing the value of the outlier
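
    One common way to set up such a filter is the 1.5 x IQR rule. The source does not name a specific rule, so this is a swapped-in convention rather than the author's method; the function name and readings are hypothetical.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Separate values into (kept, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4)          # first, second and third quartiles
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers

readings = [10, 12, 11, 13, 12, 11, 95]   # hypothetical sample with one extreme value
kept, outliers = split_outliers(readings)
```

    Whether to drop, cap, or keep the flagged values still depends on the underlying distribution and on why the value is extreme.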
  • How can you differentiate between the “long” and “wide” format data?

    In the long format, each row represents an observation that belongs to a particular category.

    In the wide format, each subject occupies a single row and its repeated measurements or categories are spread across separate columns.

  • What is Star Schema?

    It is an approach used to develop data warehouses, consisting of one or more fact tables that reference any number of dimension tables.

  • What is a Boltzmann Machine?

    A Boltzmann machine is a stochastic neural network with a simple learning algorithm. It can be used to optimize the weights and quantities for a given problem.

  • Explain Ensemble Learning

    Ensemble learning involves the use of multiple learning algorithms to have a better predictive performance than from a constituent learning algorithm alone.

  • What are the types of ensemble learning?

    The common types of ensembles are:

    • Bayes Optimal classifier
    • Boosting
    • Bootstrap aggregating
    • Bayesian model averaging
    • Stacking
  • In your understanding, how would you describe a Logistic Regression?

    It is a statistical model that uses a logistic function to model a binary dependent variable in its basic form.

  • What are the types of Logistic Regression?

    • Ordinal logistic regression
    • Multinomial logistic regression
    • Binary logistic regression
  • What do you understand about the Recommender System?

    It is sometimes called a recommendation system and it is a filtering system that suggests relevant and important information to users out of a large volume of information.

  • What are the types of classification Algorithms?

    • K-Nearest Neighbor
    • Logistic Regression
    • Naïve Bayes
    • Decision tree
    • Random Forest
  • What do you understand by a Hyperparameter?

    A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is structured and trained.

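  • What is the relationship between “NumPy” and “SciPy”?

    Both are Python libraries used in numerical and mathematical analysis. SciPy is built on top of NumPy arrays and adds higher-level routines such as optimization, integration, and statistics.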

  • How would you explain a Model Cross-Validation Technique?

    Cross-Validation is a model validation technique used primarily to estimate how a predictive model can perform accurately when put to practice. The technique is used for assessing and giving insights on how a model will generalize to an independent data set.
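
    One widely used form is k-fold cross-validation, where each fold takes a turn as the held-out validation set. The sketch below only generates the index splits; the function name is hypothetical, and the model fitting and scoring are left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

    In practice the data is usually shuffled before splitting, and the k validation scores are averaged to estimate how the model generalizes.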

  • What is Selection bias?

    It is sometimes called a selection effect, such that when the selection of groups, data, or individuals are done for analysis, the samples obtained do not represent the population to be analyzed.

  • Explain Dropout?

    Dropout is a method of randomly dropping out hidden and visible units of a network during training to avoid overfitting.

  • What is Machine Learning?

    Machine Learning is used to devise complex models and algorithms that are used to make predictions. It explores the study and construction of algorithms that can make predictions on data.

  • What do you understand by Naïve Bayes?

    It is an algorithm based on the Bayes theorem which describes the probable occurrence of an event based on existing knowledge of conditions that may be related to that event.

  • What is Systematic Sampling?

    It is a technique that involves the selection of elements from an ordered sampling frame.

  • In project analytics, what are the basic steps involved?

    • Collection of Data
    • Cleansing of Data
    • Pre-processing of data
    • Setting up train and validation sets
    • Creation of Model
    • Deployment of Model
  • Mention the different types of clustering algorithm

    • Hierarchical Clustering
    • Fuzzy Clustering
    • K-Means Clustering
  • Which of the Native data structures in Python is Immutable?

    Tuples

  • Mention the various native data structure in Python?

    • Tuples
    • List
    • Dictionary
    • Sets
  • Explain what is the Confusion Matrix?

    It is a two by two [2x2] table consisting of four output results as given by the binary classifier.
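
    The four cells can be counted directly from actual and predicted labels. This is a minimal sketch that assumes 1 marks an event and 0 a non-event; the function name and sample labels are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count the four outcomes of a binary classifier (1 = event, 0 = non-event)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # event correctly flagged
        elif a == 0 and p == 1:
            counts["FP"] += 1   # false positive (Type I error)
        elif a == 1 and p == 0:
            counts["FN"] += 1   # false negative (Type II error)
        else:
            counts["TN"] += 1   # non-event correctly ignored
    return counts

actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical ground truth
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical model output
cm = confusion_matrix(actual, predicted)
```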

  • What does the Law of Large numbers state?

    The law of large numbers is a theorem stating that, as the sample size grows, the sample mean, standard deviation, and variance converge to the quantities they are estimating.
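
    A quick simulation illustrates the idea for the sample mean. The fair six-sided die, the seed, and the sample sizes are assumptions made for this sketch.

```python
import random

def mean_of_die_rolls(n_rolls, seed=0):
    """Return the sample mean of `n_rolls` simulated fair six-sided die rolls."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

expected_value = 3.5   # (1 + 2 + 3 + 4 + 5 + 6) / 6
means = {n: mean_of_die_rolls(n) for n in (10, 1_000, 100_000)}
```

    The mean of the largest sample should sit much closer to 3.5 than the mean of ten rolls typically does.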

  • Why do we do Resampling?

    To validate models by using random subsets, or to shuffle labels on data points when performing a significance test.

  • What are Auto-encoders?

    They are simple learning networks that seek to convert inputs to outputs without significant errors i.e. output should be as close as possible to the input.

  • What are the different Layers in a CNN?

    Convolutional Layer, ReLU Layer, Pooling Layer, and Fully Connected Layer.

  • What is Batch Normalization?

    It is a method for improving the performance and stability of neural networks by normalizing the inputs of each layer so that they have a mean output activation of 0 [zero] and a standard deviation of 1 [one].
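
    The normalization step can be sketched for a single unit across one batch. The learnable scale `gamma`, shift `beta`, and small constant `eps` follow the usual formulation of batch normalization and are not taken from the text above; the activations are hypothetical.

```python
from math import sqrt

def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch to mean 0 and roughly unit variance,
    then apply the scale `gamma` and shift `beta`."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero when the batch has no spread.
    return [gamma * (x - mean) / sqrt(variance + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations of one unit over a batch
normalized = batch_normalize(activations)
```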

  • When working on your model, how can you avoid Overfitting?

    • Keep the model simple
    • Use cross-validation techniques
    • Use regularization techniques
  • What is the difference between Univariate, bivariate, and Multivariate data?

    Univariate data has only a variable and patterns can be studied using statistical measures.

    Bivariate data involves two different variables and the relationship between them.

    Multivariate data involves three or more variables, and it may include more than one dependent variable.

  • What role does an Activation Function perform?

    An Activation Function is useful for introducing non-linearity into a neural network helping it to learn more complex functions.
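
    Two widely used activation functions, ReLU and the sigmoid, are sketched below as plain Python functions for illustration.

```python
from math import exp

def relu(x):
    """Rectified linear unit: passes positive inputs through, zeroes out the rest."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

relu_outputs = [relu(x) for x in (-2.0, 0.0, 3.0)]
sigmoid_outputs = [sigmoid(x) for x in (-2.0, 0.0, 3.0)]
```

    Without a non-linear function like these between layers, a stack of linear layers would collapse into a single linear transformation.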

  • How do you differentiate between Decision trees and Random forest?

    Decision trees are single structures while a Random forest makes up several decision trees collected together.

  • Why is dimensionality reduction beneficial?

    Because it helps in reducing space for storage and also in reducing the time for computation.

  • What are the variants of Back Propagation?

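    Strictly speaking, these are variants of gradient descent, the optimizer that uses the gradients back propagation computes: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.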

  • What is List Comprehension?

    It is a method of creating lists based on lists that are existing before. It is usually faster and more compact.
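
    For example, the loop and the comprehension below build the same list; the variable names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version: build the list step by step.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# List comprehension: expression, loop, and optional filter in one line.
squares_comp = [n * n for n in numbers if n % 2 == 0]
```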

  • State the difference between the extend() method and the append() method

    The append() method adds a single new element to the end of a list, while the extend() method adds every element of another list (or other iterable) to the end of the first list.
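
    The difference is easiest to see when the argument is itself a list:

```python
a = [1, 2]
a.append([3, 4])   # adds the list itself as ONE new element

b = [1, 2]
b.extend([3, 4])   # adds each element of the iterable separately
```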

  • What do you understand by the Mini-Batch Gradient Descent?

    It is a variant of the Stochastic Gradient Descent and also one of the most popular optimization algorithms.

  • A data with five dimensions would be best represented using?

    A NumPy array with five dimensions (an ndarray)

  • Explain Data Visualization?

    It involves the graphical representation of data and information using visual elements like tables, maps, etc.

  • What is the difference between a Skewed and Uniform Distribution?

    A uniform distribution occurs when the observations in a data set are spread equally across the range of the distribution. A skewed distribution is one where more observations fall on one side of the graph than the other.

  • What is Back Propagation?

    Back Propagation is a training algorithm for a multilayer neural network.

  • What do you understand by Root cause analysis?

    It is a problem-solving technique used in discovering the root causes of problems or faults. It is useful for businesses to understand the reasons for the results obtained.

  • Explain a Box-Cox Transformation?

    A Box-Cox transformation is a method of transforming non-normal dependent variables into a normal shape.
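
    The transformation itself is a one-line formula with a parameter, written here as `lam`: (x^lam - 1) / lam, or log(x) when lam is 0. The sketch below applies it for a fixed `lam`; choosing `lam` from the data is not shown, and the sample values are hypothetical.

```python
from math import log

def box_cox(values, lam):
    """Box-Cox transform for strictly positive values with parameter `lam`."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [log(v) for v in values]          # limiting case as lam approaches 0
    return [(v ** lam - 1) / lam for v in values]

skewed = [1, 2, 4, 8, 16, 32]           # hypothetical right-skewed data
log_scaled = box_cox(skewed, lam=0)     # lam = 0 is the log transform
sqrt_like = box_cox(skewed, lam=0.5)    # lam = 0.5 behaves like a square root
```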

  • Define Data Aggregation?

    It is the process of gathering data and presenting it in an organized and summarized form.

  • What are the three data structures in Pandas?

    Series and DataFrame. Older versions of Pandas also had Panel, which has since been removed.

  • Differentiate between Deep Learning and Machine Learning?

    Deep Learning is a subfield of Machine Learning that deals with algorithms inspired by the structure and function of the brain, called Artificial Neural Networks.

    Machine learning on the other hand is an aspect of computer science that allows computers to learn without explicitly being programmed.

  • Mention the common operation of data in Pandas?

    • Data Aggregation
    • Data Transformation
    • Data Cleaning
    • Data Preprocessing
    • Data normalization
    • Data Standardization
  • In Pandas, what is the function of GroupBy?

    It can be used to group large amounts of data and compute operations on the groups.

  • If you have a series A and B, how can you get the items in series A absent in series B?

    Use Series.isin() in Pandas and negate the result, e.g. A[~A.isin(B)].
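
    A short sketch, assuming pandas is installed; the two Series are hypothetical.

```python
import pandas as pd

A = pd.Series([1, 2, 3, 4, 5])
B = pd.Series([4, 5, 6, 7])

# isin() marks the items of A that also appear in B; ~ inverts that mask.
only_in_A = A[~A.isin(B)]
```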

  • Explain a hyperbolic tree?

    It is also called ‘hypertree' and it involves the visualization of data inspired by hyperbolic geometry.

  • List some disadvantages of Data Visualization?

    • Non-Interactive dashboards
    • It does not allow for in-depth analysis
  • Define Cluster Analysis?

    Cluster Analysis involves grouping a set of objects in a way such that similar objects are placed together in a group or clusters.

  • In R, there are several Data Mining packages. Mention a few?

    purrr for Data Wrangling, dplyr for Data Manipulation, ggplot2 for Data Visualization, Hmisc for Data Analysis

  • What is Data Mining?

    Data Mining is a subfield of computer science and it involves the extraction of information from a data set and transform it into an understandable form for reference purposes.

  • How will you differentiate between a heat map and a treemap?

    A treemap is a powerful visualization used for illustrating hierarchical data like tree structures and part to whole relationships.

    A heat map is great for comparing categories while using colors and sizes.

  • What do you understand by aggregation and disaggregation of data?

    Aggregation of data simply refers to the compilation and summarization of data while disaggregation involves the breaking down of data into smaller components.

  • What is an LSTM Network?

    LSTM stands for Long Short-Term Memory. It is a special kind of recurrent neural network capable of learning long-term dependencies and remembering information for a long time.

  • Explain how a Long Short-Term Memory Network works

    The LSTM network decides what to forget and what to remember. Then it selectively updates the cell state values and decides what part of the current state will make it to the output.

  • When is time-series data declared to be stationary?

    When the mean and variance of the series are constant with time.

  • What are the feature vectors?

    Feature vectors are used to represent numbers or symbols of an object in a mathematical way that is easier to analyze.

  • Do gradient descent methods always converge to similar points?

    No, not always.

  • When should you update an algorithm?

    When the underlying data source is changing, when you want the model to evolve as data streams through a system, and when the data is non-stationary.

  • Explain Survivorship Bias?

    Survivorship Bias occurs when attention is paid only to the people or things that survived some selection process, while those that did not are ignored because of their lack of visibility. It often leads to wrong conclusions.

  • What are Artificial Neural Networks?

    They are particular sets of algorithms that have revolutionized machine learning.

  • How can you combat Underfitting and Overfitting?

    By resampling the data to estimate the model accuracy and having a dataset to validate and evaluate the model.

  • What do you understand by a Confounding Variable?

    A Confounding variable is one that influences both the dependent variable and the independent variable, which can create a spurious association between them.

  • What is a ROC Curve?

    A ROC curve is a graphical representation that plots the true positive rate against the false positive rate at various threshold levels.
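
    The points of such a curve can be computed by sweeping a threshold over predicted scores. This is a minimal sketch; the function name, labels, scores, and thresholds are hypothetical, and it assumes both classes are present.

```python
def roc_points(labels, scores, thresholds):
    """Return (false_positive_rate, true_positive_rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

labels = [0, 0, 1, 1]             # hypothetical true classes
scores = [0.1, 0.4, 0.35, 0.8]    # hypothetical predicted probabilities
curve = roc_points(labels, scores, thresholds=[0.9, 0.5, 0.3, 0.0])
```

    Lowering the threshold moves along the curve from (0, 0), where nothing is flagged, to (1, 1), where everything is.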

  • Explain what you understand by Cluster sampling

    Cluster Sampling is an alternative that can be used to study a large population when simple random sampling cannot be effective. Samples are collected in clusters.

  • In Deep Learning, what is the difference between Epoch and Batch?

    Epoch refers to one iteration over an entire data set and Batch refers to splitting a dataset into batches since the data set cannot be passed into a neural network at once.

  • What is a Recurrent Neural Network?

    It is a type of Artificial Neural Network [ANN] that is designed to recognize patterns from data sequences like time series, etc.
