
IMDB Data Analysis using ANN

  • July 28, 2020

Meet the Author: Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 18 years of experience, and he has held prominent positions in IT majors such as HSBC, ITC Infotech, Infosys, and Deloitte. He is a prominent IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of training experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


Business Problem

Using the IMDB (Internet Movie Database) reviews dataset, we will build an artificial neural network that classifies a movie review as positive or negative.

Data Collection and Pre-processing

from keras.datasets import imdb # Importing the IMDB dataset bundled with Keras


Loading the Datasets

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000) # Keeping only the 10,000 most frequent words

train_data[0] # First review, encoded as a list of word indices
train_labels[0] # Label of the first review (1 = positive, 0 = negative)
max([max(sequence) for sequence in train_data]) # Highest word index across all reviews (9,999, since num_words=10000)
word_index = imdb.get_word_index() # Dictionary mapping words to integer indices

Reversing the Word Index to Decode Reviews

reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()]) # Mapping integer indices back to words
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]]) # Decoding the first review back into text
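The offset of 3 is needed because the Keras IMDB encoding reserves indices 0, 1, and 2 for "padding", "start of sequence", and "unknown"; any word outside the top 10,000 decodes to '?'. A quick check of the result (the printed text here is illustrative):

print(decoded_review[:60]) # e.g. "? this film was just brilliant casting location scenery ..."
print(train_labels[0]) # 1, i.e. a positive review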

An Example of enumerate

my_list = ['a', 'b', 'c', 'd']

for x, value in enumerate(my_list, 1): # Starting the counter at 1 instead of 0
    print(x, value)
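Because the counter starts at 1, the loop prints:

1 a
2 b
3 c
4 d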

import numpy as np # loading numpy

Vectorization: Converting Text into a Numerical Representation

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension)) # Matrix of zeros: one row per review, one column per word index
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1. # Setting the positions of the words present in the review to 1
    return results

x_train = vectorize_sequences(train_data) # Vectorizing the training data
x_test = vectorize_sequences(test_data) # Vectorizing the testing data
x_train[0] # Numerical (multi-hot) form of the first training review
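Each review is now a 10,000-dimensional multi-hot vector: 1.0 at every index whose word appears in the review, and 0.0 everywhere else. A quick sanity check:

x_train.shape # (25000, 10000): 25,000 reviews, one column per word index
x_train[0].sum() # Number of distinct word indices present in the first review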

Converting the Labels to Float Type

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Defining the Model

from keras import models # Importing the models module from keras
from keras import layers # Importing the layers module from keras
model = models.Sequential() # Defining the empty sequential model
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) # First hidden layer: 16 neurons, ReLU activation, expecting 10,000-dimensional input vectors
model.add(layers.Dense(16, activation='relu')) # Second hidden layer with 16 neurons and ReLU activation
model.add(layers.Dense(1, activation='sigmoid')) # Output layer: a single neuron with sigmoid activation for binary classification

from keras import optimizers # Importing optimizers from keras
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy']) # RMSprop optimizer, binary cross-entropy loss, and accuracy as the metric
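Before training, it is worth confirming the architecture; model.summary() prints each layer with its parameter count, which for this configuration works out to 160,016 + 272 + 17 = 160,305 trainable parameters:

model.summary() # Layer-by-layer overview of the network and its parameters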


Splitting the Data into Training and Validation

x_val = x_train[:10000] # First 10,000 rows held out for validation
partial_x_train = x_train[10000:] # Remaining 15,000 rows used for training
y_val = y_train[:10000] # Labels for the validation rows
partial_y_train = y_train[10000:] # Labels for the remaining training rows

history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val)) # Training on the training split while monitoring the validation split

history_dict = history.history # Dictionary of the metrics recorded at each epoch
history_dict.keys()
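For this compile configuration, the dictionary should contain four lists, one entry per epoch: dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy']).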

Plotting Validation Scores to Visualise the Performance of the Model

import matplotlib.pyplot as plt
acc = history_dict['accuracy'] # Training accuracy per epoch
val_acc = history_dict['val_accuracy'] # Validation accuracy per epoch
loss = history_dict['loss'] # Training loss per epoch
val_loss = history_dict['val_loss'] # Validation loss per epoch
epochs = range(1, len(acc)+1) # Number of epochs
plt.plot(epochs, loss, 'bo', label='Training loss') # Blue dots for training loss
plt.plot(epochs, val_loss, 'b', label='Validation loss') # Solid blue line for validation loss
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.clf() # Clearing the previous figure
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
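The curves typically show the validation loss bottoming out after roughly the fourth epoch while the training loss keeps falling, the classic signature of overfitting; this is why the model is retrained below for only 4 epochs.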

Fine Tuning the Model to Avoid Overfitting

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512) # Stopping at 4 epochs, before overfitting sets in
results = model.evaluate(x_test, y_test) # Loss and accuracy on the test data
predictions = model.predict(x_test) # Predicted probabilities for the test reviews (close to 1 = positive)
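Here the epoch count is simply capped at 4 based on the validation curves. Keras also offers an EarlyStopping callback that automates this decision; a minimal sketch (the monitor and patience values below are illustrative choices, not from the original code):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', # Watch the validation loss
                           patience=2, # Allow 2 epochs without improvement before stopping
                           restore_best_weights=True) # Roll back to the best weights seen
model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512,
          validation_data=(x_val, y_val), callbacks=[early_stop])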

