Counting unique values in a column

I am doing sentiment analysis and I want to count the unique values of my labels. When I use value_counts(), I get multiple entries for the same class of label:
labelled_df['sentiment'].value_counts()

The result of the command above is given below:
negative 476
neutral 277
positive 122
negative 24
neutral 5
Name: sentiment, dtype: int64

I don’t understand why I am getting two entries each for the negative and neutral classes.


  1. What is the type of the “sentiment” column?
  2. What happens when you use the unique() method?
  3. Did you try to groupby() this column and then count()?
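
The three checks above can be sketched on a small made-up frame. The data here is hypothetical (the real dataset was not shared), and it assumes the duplicates come from trailing spaces, which is only one possible cause.

```python
import pandas as pd

# Hypothetical stand-in for labelled_df; some labels carry a trailing space.
labelled_df = pd.DataFrame(
    {"sentiment": ["negative", "neutral", "positive",
                   "negative ", "neutral ", "negative"]}
)

# 1. Column dtype: plain strings usually show up as object
col_dtype = labelled_df["sentiment"].dtype
print(col_dtype)

# 2. repr() makes invisible characters such as trailing spaces visible
unique_labels = [repr(v) for v in labelled_df["sentiment"].unique()]
print(unique_labels)

# 3. groupby + size counts the rows for each distinct label
group_sizes = labelled_df.groupby("sentiment").size()
print(group_sizes)
```

Wrapping each value in repr() is the useful part: two labels that print identically can still differ by a space or another invisible character.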

The sentiment column is the label for the dataset. The labels are either negative, positive or neutral. When I apply the unique() and nunique() methods, I get negative, positive, neutral, negative, neutral and 5 respectively. When I use the groupby() and count() methods, I get the result shown below.
[image: groupby() output]

Still no idea what’s the type of the “sentiment” column.

Would be great to know what dataset it is.

What is the “Unnamed: 0” in your result using groupby()?

How does your dataframe look when you print()/display() it, by the way?

I am working on sentiment analysis of tweets. The tweets are labelled as negative, positive or neutral. “Unnamed: 0” is pandas’ way of naming a column whose header is empty. It contains a count of the rows. I have not tried the print()/display() methods.

I’m not asking what the “sentiment” column represents, or how it relates to anything in this dataset. I want to know its datatype.

Could you share a link to this dataset? There are a few such datasets on Kaggle, and probably even more if you dig deep enough. Having a look at it might help me understand what’s going on.

It’s an object data type, as shown below using the info() method:
[image: info() output]

Unfortunately, I can’t share the dataset because of the ethical clearance on it from my university. I can give you the result of any command you want me to run. For example, print() and display() give these errors
AttributeError: 'DataFrame' object has no attribute 'print'
AttributeError: 'DataFrame' object has no attribute 'display'
respectively.
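
Those errors arise because print() and display() are functions that take the DataFrame as an argument, not methods on the DataFrame itself. A minimal sketch, using a hypothetical frame in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real data
labelled_df = pd.DataFrame({"sentiment": ["negative", "neutral ", "positive"]})

# print() is a built-in function: pass the frame to it
print(labelled_df)

# to_string() is a DataFrame method that returns the full table as text
text = labelled_df.to_string()
print(text)

# In a Jupyter notebook, display(labelled_df) is likewise called as a function.
```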

I have been able to solve the problem by using find and replace in Google Sheets. This ensures that there’s no space before or after the labels. In other words, I checked whether replacing the text with numbers would eliminate the problem, and it did. For example, I used 1, 2 and 3 to replace positive, negative and neutral respectively. I tested value_counts() before reversing the operation. Then I replaced 1, 2 and 3 with positive, negative and neutral respectively.
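
The same cleanup can probably be done in pandas without a round trip through a spreadsheet. A minimal sketch, assuming the stray characters are leading or trailing whitespace (the thread does not confirm exactly what they were) and using a made-up frame:

```python
import pandas as pd

# Hypothetical data: two labels carry a stray space
labelled_df = pd.DataFrame(
    {"sentiment": ["negative", "neutral", "positive", " negative", "neutral "]}
)

# Strip leading and trailing whitespace from every label
labelled_df["sentiment"] = labelled_df["sentiment"].str.strip()

# Each class should now appear only once in the counts
counts = labelled_df["sentiment"].value_counts()
print(counts)
```

If the labels could also differ in letter case, .str.lower() can be chained after .str.strip().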


I suspected that some labels had an additional character that distinguished them from the others, but I had to be sure about the type (if it were pd.Categorical, my theory would fail).

object means, more or less, that this is just a string. Anyway, good to know that you’ve found a way around it (I suspect that replacing the labels with numbers changed the cell type to number and removed the spaces).