First experience with Machine Learning and a call for help

in Hive Coding, last year

I'm not a programmer, but I've been learning some Python to improve a few things at work. Now I'm trying to use Machine Learning to solve a problem we have, but I'm stuck and could use some help!

This is my first post on Hive Coding and I apologize if this is not really the place to ask for help regarding programming but I thought it was worth a shot!

I'd like to add that I don't have much experience with programming, and even less with Machine Learning, so if at any point it looks like I have no idea what I'm doing, that's probably the case.

With that being said, let me introduce my problem and the road I've walked so far and maybe see if someone can give me a hand going forward.

The problem

I work at an ed-tech company, and every semester we run a satisfaction survey across our customer base to better understand how we can improve our product. It's a simple survey where the user is invited to grade our product on a scale from 0 to 10, followed by an open question where they can write freely about what they love most about the product or what they think we should improve.

That open question is actually more important to us than the grade itself, because it's how we get valuable insight for our product teams. Since we have many product teams, each focused on specific aspects of our products, we manually classify all user answers so we can direct them to the appropriate team.

The problem is that our user base has grown a lot in the past two years, so this manual process has become really time-consuming.

My attempt to solve the issue

As I said before, I'm not an experienced programmer but I have started learning a bit of Python so I had the idea to try and use Machine Learning algorithms to automate the classification process.

Searching online I found an algorithm that uses a Naive Bayes model to classify text so I tried to implement it.

The issue here is that all the tutorials I found online do a very good job showing how to train a model but I couldn't find a single one that shows how to "feed" fresh data into a trained model so it does the actual work of classifying the input.

The data

I have two datasets that I'm calling "training_data" and "real_data".

"training_data" is a dataset that we manually classified containing the user answers from last year and it's what I used to train the model. It has two columns:

  • 'texto' contains the user answers, which are strings of variable length. Example: "I really love how organized the information is on the app!"
  • 'codigo' contains a numeric label that represents a particular feature or aspect of our products. Example: 1

"real_data" is a dataset that contains this year's user answers. It has a single column, 'texto', in the same format as in "training_data".
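To make the two layouts concrete, here is a minimal sketch of what the datasets might look like once loaded with pandas. The first training row reuses the example answer given above; every other row and label is invented for illustration, and the check variables (has_expected_columns, label_counts) are hypothetical names, not part of the original code.

```python
import pandas as pd

# Toy stand-in for training_data: the first row reuses the example from the post,
# the second row and its label are invented for illustration.
training_data = pd.DataFrame({
    "texto": [
        "I really love how organized the information is on the app!",
        "The app is slow to load.",
    ],
    "codigo": [1, 2],
})

# Toy stand-in for real_data: same 'texto' column, no labels yet (invented row).
real_data = pd.DataFrame({"texto": ["Reports take too long to open."]})

# Quick sanity checks before any training happens.
has_expected_columns = (
    list(training_data.columns) == ["texto", "codigo"]
    and list(real_data.columns) == ["texto"]
)
# How many answers carry each label; very rare labels are hard to learn.
label_counts = training_data["codigo"].value_counts()

print(has_expected_columns)
print(label_counts)
```

Checking the label counts early is worthwhile because a label with only a handful of examples gives the model very little to learn from.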

The code

This is the code I'm using so far. As I said before, I got as far as training the model using the dataset "training_data", but I have no idea how to feed the data in "real_data" to this model so it can classify the user answers. I even tried to repeat part of the process using the new dataset, but it did not work.

import re
import numpy as np
import pandas as pd
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# function to split the data into training and test (hold-out) sets
from sklearn.model_selection import train_test_split
# function for transforming documents into counts
from sklearn.feature_extraction.text import CountVectorizer
# function for encoding categories
from sklearn.preprocessing import LabelEncoder

# read training data
trainingdata = pd.read_csv('training_data.csv')

# function to normalize text
def normalize_text(s):
    s = s.lower()

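    # remove punctuation that is not word-internal (word-internal hyphens and apostrophes are kept)
    # raw strings keep Python from treating \s and \W as invalid string escapes
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)

    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)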

    return s

# normalizing training data
trainingdata['TEXT'] = [normalize_text(s) for s in trainingdata['texto']]

# pull the training data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(trainingdata['TEXT'])
encoder = LabelEncoder()
y = encoder.fit_transform(trainingdata['codigo'])

# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# train the model
nb = MultinomialNB()
nb.fit(x_train, y_train)

# predict labels for the held-out test set
y_predicted = nb.predict(x_test)
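The code above stops at computing y_predicted without comparing it to y_test. A minimal sketch of that comparison follows, assuming scikit-learn's accuracy_score is an acceptable metric here. The twelve toy rows and their labels are invented stand-ins for training_data.csv, and the fixed random_state and stratify arguments are additions for reproducibility that the original code does not use.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy rows standing in for training_data.csv (invented for illustration).
toy = pd.DataFrame({
    "texto": [
        "the information is well organized",
        "i love how organized the app is",
        "the layout is clean and organized",
        "menus are organized and easy to find",
        "everything is organized in a clear layout",
        "the organized layout saves me time",
        "the app is slow to load",
        "pages load slowly on my phone",
        "it takes too long to load reports",
        "loading the app is very slow",
        "slow loading makes the app hard to use",
        "the reports are slow to load",
    ],
    "codigo": [1] * 6 + [2] * 6,
})

# Same steps as the post: count words, then hold out part of the labeled data.
x = CountVectorizer().fit_transform(toy["texto"])
y = toy["codigo"]

# test_size=4 holds out four rows; stratify keeps both labels in each split.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, stratify=y, random_state=0
)

nb = MultinomialNB()
nb.fit(x_train, y_train)
y_predicted = nb.predict(x_test)

# Fraction of held-out answers whose predicted label matches the manual label.
accuracy = accuracy_score(y_test, y_predicted)
print(f"hold-out accuracy: {accuracy:.2f}")
```

On the real dataset, this number gives a rough idea of how far the automatic labels can be trusted before they are sent to the product teams.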

TL;DR

I'm trying to use Machine Learning to classify several text entries. I believe I found a way to train a Naive Bayes model to do so but I have no idea how to feed fresh data to the trained model and get it to classify the text entries based on the training data.

As I've said before, I don't really have a lot of coding experience so forgive me for the lack of details. Any help is much appreciated and if you have any follow-up questions please ask away and I'll do my best to answer them!

Thank you for contributing to Hive Coding! This is the perfect place to ask these questions.

You should be able to use the predict function on your new 'real_data' as well - and see how well your algorithm performs on unseen data.

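Just use y_predicted = nb.predict(x_test), but instead of the x_test data split off from the training set, pass in the real_data. One catch: real_data has to go through the same normalize_text step and the same fitted vectorizer first, using vectorizer.transform rather than fit_transform, so that its columns match the vocabulary the model was trained on.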
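Here is a minimal sketch of that suggestion, reusing the names from the post (normalize_text, vectorizer, encoder, nb). The toy rows stand in for training_data.csv and real_data, and are invented for illustration. The output column name 'codigo_predicted' is also an assumption. The important detail is that the new text is passed through vectorizer.transform, not fit_transform, so it is mapped onto the vocabulary learned during training.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder


def normalize_text(s):
    """Same normalization as in the post: lowercase, strip edge punctuation."""
    s = s.lower()
    s = re.sub(r"\s\W", " ", s)
    s = re.sub(r"\W\s", " ", s)
    return re.sub(r"\s+", " ", s)


# Hypothetical toy stand-ins for the two datasets (rows invented for illustration).
trainingdata = pd.DataFrame({
    "texto": [
        "I love how organized the information is!",
        "The layout is clean and organized.",
        "Everything is easy to find in the menus.",
        "The app is slow to load.",
        "Reports take too long to open.",
        "Pages load slowly on my phone.",
    ],
    "codigo": [1, 1, 1, 7, 7, 7],
})
realdata = pd.DataFrame({
    "texto": ["Very organized menus!", "Loading is too slow."],
})

# Train exactly once, on the labeled data.
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(trainingdata["texto"].map(normalize_text))
encoder = LabelEncoder()
y = encoder.fit_transform(trainingdata["codigo"])
nb = MultinomialNB().fit(x, y)

# Fresh data: same normalization, then transform (NOT fit_transform),
# so the columns line up with the vocabulary learned above.
x_real = vectorizer.transform(realdata["texto"].map(normalize_text))
predicted = nb.predict(x_real)

# Map the encoder's internal 0..n-1 labels back to the original 'codigo' values.
realdata["codigo_predicted"] = encoder.inverse_transform(predicted)
print(realdata)
```

Calling fit_transform on the new data instead would build a different vocabulary, which is a likely reason why repeating the training steps on the new dataset fails with a feature-count mismatch.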

Thanks! I'll give that a try!
