IMDB Data Analysis using ANN

  • July 28, 2020
Business Problem

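Using the IMDB (Internet Movie Database) review dataset, we want to determine whether a movie review expresses a positive or a negative sentiment. This is a binary classification problem, which we solve here with a small artificial neural network (ANN).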

Data Collection and Pre-processing

from keras.datasets import imdb

Loading the Datasets

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

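train_data[0] # First training review, encoded as a list of word indices
train_labels[0] # Its label: 0 = negative, 1 = positive
max([max(sequence) for sequence in train_data]) # 9999, because only the 10,000 most frequent words are kept
word_index = imdb.get_word_index() # Dictionary mapping each word to its integer index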

Reversing the Index to Word

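reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1 and 2 are reserved for "padding", "start of sequence" and "unknown"
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])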
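The offset of 3 in the decoding step can be checked without downloading the dataset. The following is a minimal sketch on a tiny made-up vocabulary: `toy_word_index`, `encoded_review` and `decode_review` are hypothetical names introduced for illustration, and the sketch assumes the Keras convention that indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words.

```python
# Hypothetical word index in the same format as imdb.get_word_index():
# word -> rank, where rank 1 is the most frequent word.
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}

# Invert the mapping: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode_review(encoded, reverse_index):
    # Stored indices are shifted by 3, so subtract 3 before the lookup.
    # Reserved or unknown indices fall back to '?'.
    return ' '.join(reverse_index.get(i - 3, '?') for i in encoded)

# 1 marks the start of the sequence; 4..7 are ranks 1..4 shifted by 3.
encoded_review = [1, 4, 5, 6, 7]
decoded = decode_review(encoded_review, toy_reverse_index)
print(decoded)  # ? the movie was great
```

The leading `?` is the start-of-sequence marker, which has no entry in the word index.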

Example for Enumeration

my_list = ['a','b','c','d']

for x, value in enumerate(my_list, 1): # Counting starts at 1 instead of the default 0
    print(x, value)

import numpy as np # loading numpy

Vectorization: Converting Text into a Numerical Representation

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension)) # One all-zero row per review
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1. # Set the columns of the words present in review i to 1
    return results
x_train = vectorize_sequences(train_data) # Converting the training data to numeric form
x_test = vectorize_sequences(test_data) # Converting the test data to numeric form
x_train[0] # Numerical form of the first training review
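To make the mechanics of this encoding explicit, here is a small sketch of the same idea in plain Python, without NumPy. The helper `multi_hot` and the input `toy_sequences` are hypothetical and only meant to illustrate what the indexed assignment does on a 5-word vocabulary.

```python
def multi_hot(sequences, dimension):
    # One row of zeros per sequence, one column per possible word index.
    rows = [[0.0] * dimension for _ in sequences]
    for i, sequence in enumerate(sequences):
        for index in sequence:
            # A repeated index sets the same cell again, so counts are lost.
            rows[i][index] = 1.0
    return rows

# Two made-up "reviews" over a vocabulary of 5 word indices.
toy_sequences = [[0, 3, 3], [1, 2]]
encoded = multi_hot(toy_sequences, dimension=5)
print(encoded[0])  # [1.0, 0.0, 0.0, 1.0, 0.0]
print(encoded[1])  # [0.0, 1.0, 1.0, 0.0, 0.0]
```

Note that this representation records only which words occur in a review; word order and word frequency are discarded.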

Converting the Labels to Float Type

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Defining the Model

from keras import models # Importing the models module from keras
from keras import layers # Importing the layers module from keras
model = models.Sequential() # Defining the empty sequential model
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) # Adding a dense layer with 16 neurons, the input shape and the activation function

model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
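As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand: a dense layer with `n_inputs` inputs and `n_units` neurons has one weight per input-neuron pair plus one bias per neuron. The helper `dense_params` below is a hypothetical name used only for this sketch.

```python
def dense_params(n_inputs, n_units):
    # Weight matrix (n_inputs x n_units) plus one bias per unit.
    return n_inputs * n_units + n_units

# Input size followed by the width of each Dense layer in the model.
layer_sizes = [10000, 16, 16, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)  # [160016, 272, 17]
print(total)      # 160305
```

Almost all of the parameters sit in the first layer, because it connects every one of the 10,000 input dimensions to each of its 16 neurons.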

from keras import optimizers # Importing optimizers from keras
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy']) # Setting the optimizer, the loss function and the metric to track
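The binary cross-entropy loss compares each predicted probability with its true 0/1 label and penalises confident wrong answers heavily. The following plain-Python sketch shows the formula on made-up values; the function below is a hand-written illustration, not the Keras implementation, and the clipping value `eps` as well as the sample lists are assumptions chosen for the example.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean of -(t*log(p) + (1-t)*log(1-p)) over all samples.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # Keep p away from 0 and 1 so log() is defined
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]  # Mostly right and confident
unsure = [0.5, 0.5, 0.5]     # Always undecided

loss_confident = binary_crossentropy(labels, confident)
loss_unsure = binary_crossentropy(labels, unsure)
print(round(loss_confident, 4))  # 0.1446
print(round(loss_unsure, 4))     # 0.6931
```

An undecided prediction of 0.5 for every sample gives a loss of ln 2, which is a useful baseline when reading the loss values reported during training.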

Splitting the Data into Training and Validation

x_val = x_train[:10000] # First 10,000 rows (0 to 9999) are held out for validation
partial_x_train = x_train[10000:] # Remaining rows (10000 onwards) are used for training
y_val = y_train[:10000] # Labels for rows 0 to 9999
partial_y_train = y_train[10000:] # Labels for rows 10000 onwards

history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val)) # Training on the training split and evaluating on the validation split after each epoch

history_dict = history.history # Dictionary of the loss and metric values recorded for each epoch

Plotting Validation Scores to Visualise the Performance of the Model

import matplotlib.pyplot as plt
acc = history_dict['accuracy'] # Training accuracy values
val_acc = history_dict['val_accuracy'] # Validation accuracy values
loss = history_dict['loss'] # Training loss
val_loss = history_dict['val_loss'] # Validation loss
epochs = range(1, len(acc)+1) # Epoch numbers, starting at 1
plt.plot(epochs, loss, 'bo', label='Training loss') # Blue dots
plt.plot(epochs, val_loss, 'b', label='Validation loss') # Solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.clf() # Clearing the figure before drawing the accuracy plot
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
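Besides reading the plots by eye, the epoch at which the validation loss is lowest can be picked out programmatically. The sketch below assumes a dictionary shaped like the one recorded during training; `example_history` holds made-up numbers for illustration only (they are not results from this model), and `best_epoch` is a hypothetical helper.

```python
# Hypothetical values for illustration only; not results from the model above.
example_history = {
    'loss': [0.60, 0.40, 0.30, 0.22, 0.17, 0.13],
    'val_loss': [0.45, 0.33, 0.29, 0.28, 0.30, 0.34],
}

def best_epoch(history_dict, key='val_loss'):
    values = history_dict[key]
    # Epochs are numbered from 1, list positions from 0.
    return min(range(len(values)), key=values.__getitem__) + 1

epoch = best_epoch(example_history)
print(epoch)  # 4
```

In this made-up example the training loss keeps falling while the validation loss starts rising after epoch 4, which is the typical signature of overfitting.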

Fine Tuning the Model to Avoid Overfitting

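model = models.Sequential() # Building a fresh model so that training starts from scratch
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy']) # A new model must be compiled before it can be trained
model.fit(x_train, y_train, epochs=4, batch_size=512) # Training for only 4 epochs acts as a manual form of early stopping
results = model.evaluate(x_test, y_test) # [test loss, test accuracy]
model.predict(x_test) # Predicted probabilities of a positive review on the test data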
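The sigmoid output is a probability between 0 and 1, so it has to be thresholded to obtain a positive or negative label. Here is a minimal sketch of that last step in plain Python. The helpers `to_labels` and `accuracy`, the 0.5 threshold and the sample values are assumptions for illustration; with real predictions the array returned by `model.predict` would first need to be flattened, for example with `.ravel()`.

```python
def to_labels(probabilities, threshold=0.5):
    # 1 = positive review, 0 = negative review.
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(predicted, actual):
    # Fraction of predictions that match the true labels.
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical probabilities and true labels, for illustration only.
example_probs = [0.02, 0.97, 0.51, 0.49]
example_labels = [0, 1, 0, 0]

predicted = to_labels(example_probs)
print(predicted)                            # [0, 1, 1, 0]
print(accuracy(predicted, example_labels))  # 0.75
```

Probabilities close to 0.5, such as the last two values here, are the cases where the model is least certain.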

