Hi, I'm Rayen. I'm an aspiring data scientist. I've worked on a couple of personal projects and this is my portfolio. You're welcome to look around.
The data science team at myanimelist.net wants to improve their site and create a recommendation system using machine learning techniques to recommend anime to users. Information is web-scraped to construct an anime database, and two recommendation systems are then built: one using content-based filtering and one using collaborative filtering.
A recommender using content-based filtering was constructed from the features of the anime, such as genre and description. The content-based recommendation system is proficient at detecting the sequels of an anime based on their plot lines; however, it led to a dry and repetitive user experience as the recommended content is too similar. Next, user information was gathered and a recommender using collaborative filtering was built. Five different algorithms were cross-validated, of which the SVD algorithm performed the best and was implemented into the system. The results were investigated using a sample user, and the predictions from this method were accurate.
Given the already robust database that myanimelist.net has, both approaches could be implemented without restriction, although the collaborative filtering approach seems to produce more accurate results and is the favoured approach. However, based on the results of this study, a hybrid approach may be needed to combine the strengths of each method.
The first section of this report will focus on building a content-based recommender system. A content filtering recommendation system works by analyzing the content of the items and using that information to identify other items that are similar in terms of their content. In this case, this could be features such as the plot of the series, similar genres, characters, etc.
The analysis for the content recommendation system proceeds as follows, starting with the required package imports.
import numpy as np
from bs4 import BeautifulSoup
import requests
import re
import lxml.html as lh
import pandas as pd
import pickle
import os
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None # default='warn'
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.decomposition import PCA
The purpose of this segment is to get familiar with the data. This is to see if the human eye can detect any outliers and manually remove them. Also, this is a good chance to do some feature engineering and see if any of the features are a strong indicator of score.
Some questions to explore include which features are strong indicators of score and whether there are obvious outliers that should be removed.
Another key process is feature engineering, which is the process of selecting the most relevant features, or variables, for a model. During this process, it is also worth investigating whether any of the features would be a key predictor of the public rating. The output printed below shows the percentage of missing values for each of the features in the anime details dataframe. To reduce the number of variables, features can be eliminated based on their lack of data and whether they duplicate information from another column. For instance, the column "aired_start", the start broadcast date, is reliable, while its counterpart "aired_end" can be removed as it has too many missing values. It can also be observed that combining the "genre" and "genres" columns into "genres_overall" had a significant effect, reducing the null values in that column to 7% (a sketch of this merge step is shown after the missing-value summary below). Although "genres_overall" has the majority of its records filled, "themes_overall" and "demographics_overall" still have many missing values. Since themes and demographics overlap somewhat with the information provided by genres, they will be removed due to the data being relatively unreliable.
top_anime_all_table = pd.read_csv('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/all_anime_directory.csv')
#top_anime_all_table.head()
# import clean frame
with open('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/all_anime_details_frame_no_nan_cvas.pkl', 'rb') as f:
all_anime_details_frame_no_nan_cvas = pickle.load(f)
with open('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/anime_info_frame_cleaned_main.pkl', 'rb') as f:
anime_info_frame_cleaned = pickle.load(f)
# percentage of missing values per column, use columns with less missing for more data reliability.
print("Percent Missing")
anime_info_frame_cleaned.isna().sum() * 100/ len(anime_info_frame_cleaned)
Percent Missing
English 42.77
Type 0.00
Episodes 0.00
Status 0.00
aired_start 1.99
aired_end 44.28
premiered 61.09
broadcast_weekday 74.20
broadcast_daytime 74.72
producers 35.44
studios 12.27
source 0.00
genres_multi 30.69
genres_singular 76.99
genres_overall 7.68
themes_singular 60.54
themes_multi 70.29
themes_overall 30.83
demographic_singular 64.40
demographics_multi 99.21
demographics_overall 63.61
duration_in_min 0.08
public_score_rating 0.00
popularity 0.00
Favorites 0.00
Completed 0.00
On-Hold 0.00
Dropped 0.00
Plan to Watch 0.00
Total 0.00
dtype: float64
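The merge itself happens upstream in the data-cleaning code, but a minimal sketch of the idea, using the column names from the summary above and assuming a simple fill-in of one column with the other, could look like this:
# Sketch (assumed approach): where the multi-valued genre column is missing,
# fall back on the single-valued one, so fewer rows end up with no genre at all.
anime_info_frame_cleaned['genres_overall'] = (
    anime_info_frame_cleaned['genres_multi']
    .combine_first(anime_info_frame_cleaned['genres_singular'])
)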
A key feature is "public_score_rating", the average score that users give a certain anime. This can be plotted on a histogram to see the rating distribution. The figure below shows the rating score distribution; as expected, the distribution is skewed to the right, with the average score being about 6.9. It is important to note that this data was scraped in order of user ranking, so the true mean across all anime is expected to be slightly lower than what is shown.
plt.subplots(figsize=(9,6))
sns.set_theme(style="whitegrid")
sns.histplot(data=anime_info_frame_cleaned, x="public_score_rating", bins = 20, kde = True)
plt.title('Public Score Rating Distribution')
anime_info_frame_cleaned.describe().round(1)
| Episodes | aired_start | aired_end | duration_in_min | public_score_rating | popularity | Favorites | Completed | On-Hold | Dropped | Plan to Watch | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 9801 | 5572 | 9992.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 14.4 | 2008-12-17 17:05:49.311294720 | 2008-12-19 11:16:03.962670336 | 29.0 | 6.9 | 6324.2 | 982.0 | 53621.1 | 2119.5 | 2567.1 | 19067.7 | 82313.8 |
| min | 1.0 | 1929-10-14 00:00:00 | 1962-02-25 00:00:00 | 0.2 | 5.8 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 13.0 | 180.0 |
| 25% | 1.0 | 2003-08-01 00:00:00 | 2003-09-25 00:00:00 | 13.0 | 6.3 | 2610.5 | 3.0 | 1054.8 | 55.0 | 92.0 | 878.0 | 2452.8 |
| 50% | 4.0 | 2012-04-07 00:00:00 | 2012-01-26 00:00:00 | 24.0 | 6.8 | 5624.5 | 17.0 | 5472.5 | 219.0 | 221.5 | 3588.0 | 10733.0 |
| 75% | 13.0 | 2017-10-06 00:00:00 | 2017-09-26 06:00:00 | 26.0 | 7.3 | 9680.2 | 148.0 | 32157.0 | 1110.2 | 1184.0 | 15501.8 | 55669.0 |
| max | 3057.0 | 2022-12-01 00:00:00 | 2023-01-08 00:00:00 | 168.0 | 9.1 | 18239.0 | 210883.0 | 3133727.0 | 168119.0 | 203501.0 | 586370.0 | 3580542.0 |
| std | 49.2 | NaN | NaN | 27.3 | 0.7 | 4332.2 | 6279.9 | 167132.2 | 6788.4 | 7588.8 | 40923.0 | 224824.1 |

Another feature is the air date, the date on which an anime started its broadcast. The first plot below shows the average anime score in each year. As shown in the graph, the average anime rating seems to increase over time, especially from 1970 onward. This is most likely because production quality advanced with available technology over the years. This, along with anime being more popular in recent times, may be a factor in the increase in score. This is supported by the second plot, where the number of anime produced every year increases.
# rating by year
info_by_year = anime_info_frame_cleaned.groupby(pd.Grouper(key='aired_start', axis=0, freq='Y'))['public_score_rating'].mean()
plt.subplots(figsize=(10,6))
sns.set_theme(style="whitegrid")
plt.title('Year vs Score')
sns.lineplot(data=info_by_year)
plt.show()
plt.subplots(figsize=(10,6))
plt.title('Anime produced over time')
sns.histplot(data=anime_info_frame_cleaned, x="aired_start", bins = 15)

<Axes: title={'center': 'Anime produced over time'}, xlabel='aired_start', ylabel='Count'>

# anime produced in 1929 seems to be an outlier, this is highlighted
anime_info_frame_cleaned[anime_info_frame_cleaned['aired_start'] == anime_info_frame_cleaned['aired_start'].min()]
| English | Type | Episodes | Status | aired_start | aired_end | premiered | broadcast_weekday | broadcast_daytime | producers | ... | demographics_overall | duration_in_min | public_score_rating | popularity | Favorites | Completed | On-Hold | Dropped | Plan to Watch | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5875 | The Stolen Lump | Movie | 1 | Finished Airing | 1929-10-14 | NaT | NaN | NaN | NaT | NaN | ... | NaN | 10.0 | 5.871 | 8952 | 2 | 2581 | 18 | 75 | 448 | 3177 |
1 rows × 30 columns
def plot_pie_value_counts(series, sep_num):
    val_counts = series.value_counts()  # .sort_values(ascending = False)
    first_sec = val_counts[:sep_num]
    other_sec = val_counts[sep_num:].sum()
    if other_sec > 0:
        first_sec['other'] = other_sec
    plt.subplots(figsize=(6, 5))
    colors = sns.color_palette('tab10')
    plt.pie(first_sec,
            labels=first_sec.index,
            labeldistance=1.15,
            wedgeprops={'linewidth': 2, 'edgecolor': 'white'},
            colors=colors)
    plt.title('value counts of ' + str(series.name))
    plt.show()
plot_pie_value_counts(anime_info_frame_cleaned['Type'], 7)

plot_pie_value_counts(anime_info_frame_cleaned['Episodes'], 9)

Another feature to investigate is the effect of broadcast time on the user score. Below is a boxplot that shows the average user score by weekday, while the following plot is a distribution of the number of shows broadcast at each hour of the day. It can be seen that the weekday does not make a significant difference to the score; however, there do seem to be popular times to broadcast anime. As expected, broadcast time correlates strongly with the school schedule, with most shows starting their broadcast when students come home from school or late into the night.
plt.subplots(figsize=(10,6))
plt.title('Score by broadcast weekday')
sns.boxplot(data = anime_info_frame_cleaned, x= 'broadcast_weekday', y="public_score_rating")
<Axes: title={'center': 'Score by broadcast weekday'}, xlabel='broadcast_weekday', ylabel='public_score_rating'>

hour_counts = anime_info_frame_cleaned['broadcast_daytime'].apply(lambda x: x.hour)
plt.subplots(figsize=(10,6))
plt.title('Anime broadcast start during the day')
sns.histplot(data=hour_counts, bins = 23)
<Axes: title={'center': 'Anime broadcast start during the day'}, xlabel='broadcast_daytime', ylabel='Count'>

# print(anime_info_frame_cleaned['studios'].dropna().apply(lambda y: len(y)).value_counts())
# print(anime_info_frame_cleaned['producers'].dropna().apply(lambda y: len(y)).value_counts())
# get list of anime with more than 3 studios
len_of_studios = anime_info_frame_cleaned['studios'].dropna().apply(lambda y: len(y))
len_greater_studio = list(len_of_studios[len_of_studios > 3].index)
anime_info_frame_cleaned[anime_info_frame_cleaned['studios'].index.isin(len_greater_studio)]
| English | Type | Episodes | Status | aired_start | aired_end | premiered | broadcast_weekday | broadcast_daytime | producers | ... | demographics_overall | duration_in_min | public_score_rating | popularity | Favorites | Completed | On-Hold | Dropped | Plan to Watch | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10178 | NaN | Special | 4 | Finished Airing | 2011-01-07 | 2013-03-24 | NaN | NaN | NaT | [NHK] | ... | [Josei] | 25.0 | 7.371 | 4292 | 51 | 7407 | 886 | 507 | 10460 | 20787 |
| 42161 | Pokétoon | ONA | 8 | Finished Airing | 2020-06-04 | 2021-12-28 | NaN | NaN | NaT | [Nintendo, Creatures Inc.] | ... | [Kids] | 8.0 | 7.351 | 7228 | 14 | 1732 | 682 | 247 | 1443 | 5779 |
| 28149 | Japan Anima(tor)'s Exhibition | ONA | 35 | Finished Airing | 2014-11-07 | 2015-10-09 | NaN | NaN | NaT | [Dwango] | ... | NaN | 8.0 | 7.341 | 3198 | 123 | 12289 | 4178 | 1963 | 15517 | 38424 |
| 49357 | Star Wars: Visions | ONA | 9 | Finished Airing | 2021-09-22 | NaT | NaN | NaN | NaT | [Twin Engine] | ... | NaN | 15.0 | 7.131 | 2122 | 379 | 46872 | 2900 | 2486 | 20000 | 78538 |
| 6867 | Halo Legends | ONA | 8 | Finished Airing | 2009-11-07 | 2010-02-16 | NaN | NaN | NaT | [Casio Entertainment] | ... | NaN | 14.0 | 6.991 | 3154 | 192 | 28256 | 699 | 895 | 8662 | 39534 |
| 4094 | Batman: Gotham Knight | OVA | 6 | Finished Airing | 2008-07-08 | NaT | NaN | NaN | NaT | [Cyclone Graphics] | ... | NaN | 12.0 | 6.941 | 3749 | 62 | 21344 | 250 | 325 | 5776 | 28174 |
| 2832 | NaN | Special | 15 | Finished Airing | 2007-06-07 | 2007-06-27 | NaN | NaN | NaT | NaN | ... | NaN | 1.0 | 6.721 | 4201 | 9 | 13809 | 446 | 277 | 6981 | 21996 |
| 38022 | Monster Strike the Animation | ONA | 63 | Finished Airing | 2018-07-08 | 2019-12-31 | NaN | NaN | NaT | [XFLAG] | ... | NaN | 10.0 | 6.681 | 7681 | 8 | 863 | 239 | 349 | 2844 | 5042 |
| 37290 | Animation × Paralympic: Who Is Your Hero? | Special | 15 | Finished Airing | 2017-11-10 | 2022-08-22 | NaN | NaN | NaT | [NHK, NHK Enterprises] | ... | NaN | 5.0 | 6.481 | 11999 | 2 | 146 | 100 | 128 | 392 | 1043 |
9 rows × 30 columns
themes_overall_s_1 = anime_info_frame_cleaned['themes_overall'].dropna()
themes_overall_df_1 = pd.DataFrame(item for item in themes_overall_s_1).set_index(themes_overall_s_1.index)
demographics_overall_s_1 = anime_info_frame_cleaned['demographics_overall'].dropna()
demographics_overall_df_1 = pd.DataFrame(item for item in demographics_overall_s_1).set_index(demographics_overall_s_1.index)
genres_overall_s_1 = anime_info_frame_cleaned['genres_overall'].dropna()
genres_overall_df_1 = pd.DataFrame(item for item in genres_overall_s_1).set_index(genres_overall_s_1.index)
print('\n*********** Themes value counts ***********\n')
print(pd.Series(themes_overall_df_1.values.flatten()).value_counts())
print('\n*********** demographics value counts ***********\n')
print(pd.Series(demographics_overall_df_1.values.flatten()).value_counts())
print('\n*********** Genres value counts ***********\n')
print(pd.Series(genres_overall_df_1.values.flatten()).value_counts())
*********** Themes value counts ***********
School 1410
Music 1024
Mecha 855
Historical 737
Military 489
Super Power 444
Mythology 387
Martial Arts 362
Space 355
Adult Cast 341
Parody 340
Harem 320
Psychological 297
Isekai 217
Detective 213
Team Sports 197
Mahou Shoujo 197
Idols (Female) 196
Gag Humor 183
Strategy Game 182
CGDCT 176
Iyashikei 173
Samurai 143
Gore 123
Vampire 120
Anthropomorphic 114
Workplace 105
Time Travel 102
Video Game 98
Idols (Male) 86
Racing 74
Performing Arts 72
Love Polygon 68
Combat Sports 67
Otaku Culture 67
Reincarnation 64
Visual Arts 62
Survival 57
Pets 55
Reverse Harem 55
Childcare 43
Organized Crime 40
Romantic Subtext 38
Educational 34
Medical 32
High Stakes Game 30
Delinquents 29
Showbiz 26
Crossdressing 25
Magical Sex Shift 22
Name: count, dtype: int64
*********** demographics value counts ***********
Shounen 1729
Seinen 704
Kids 650
Shoujo 550
Josei 85
Name: count, dtype: int64
*********** Genres value counts ***********
Comedy 3951
Action 3161
Fantasy 2436
Adventure 2246
Sci-Fi 1957
Drama 1917
Romance 1560
Supernatural 1090
Slice of Life 1008
Mystery 645
Ecchi 624
Sports 461
Horror 269
Suspense 144
Award Winning 136
Boys Love 90
Gourmet 79
Avant Garde 75
Girls Love 73
Name: count, dtype: int64
genre_val = pd.Series(genres_overall_df_1.values.flatten()).dropna()
plot_pie_value_counts(genre_val, 9)

# function to convert dataframe to long form for easier manipulation
def convert_to_long(overall_df_1):
    public_score_series = top_anime_all_table[['id', 'public_score']].set_index('id')
    df_stacked = overall_df_1.stack().to_frame()
    a = pd.DataFrame(df_stacked[0].values.tolist(), index=df_stacked.index)
    stacked_df = a.reset_index().drop(columns=['level_1']).rename(columns={"level_0": "id", 0: "category"})
    b = stacked_df.set_index('id').join(public_score_series)
    return b
# function to plot category
def plot_category(b):
    sns.set(rc={'figure.figsize': (16, 10)})
    sns.set_theme(style="whitegrid", palette="Set2")
    plt.xticks(rotation=90)
    # find the display order by mean public score
    my_order = b.groupby(by=["category"])["public_score"].mean().sort_values(ascending=False).index
    # plot boxplot
    sns.boxplot(data=b, x="category", y="public_score", order=my_order)
    plt.show()
genres_overall_long = convert_to_long(genres_overall_df_1)
demographics_overall_long = convert_to_long(demographics_overall_df_1)
themes_overall_long = convert_to_long(themes_overall_df_1)
plot_category(themes_overall_long)

plot_category(genres_overall_long)

The above is a boxplot showing the average score of a show separated by genre. As expected, there are minor differences in the average rating of shows in relation to genre. For instance, shows with the "Award Winning" genre score significantly better than those with the "Avant Garde" genre. However, for the genres in the middle, it is hard to tell a significant difference between the average scores. For example, a genre such as Mystery accounts for a small percentage of shows, while shows with Comedy as a genre can have an extremely wide range of ratings since people have different senses of humour. Due to the variance between the samples and the broad range of shows that can be labelled with a genre, it is hard to attribute genre as a determining factor for the score of an anime.
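One way to sanity-check this impression is a quick Kruskal-Wallis test on the per-genre score distributions; the sketch below assumes the long-form genres_overall_long frame built above, with its "category" and "public_score" columns.
# Hypothetical check: do the score distributions differ significantly across genres?
genre_groups = [grp['public_score'].dropna().values
                for _, grp in genres_overall_long.groupby('category')]
h_stat, p_value = stats.kruskal(*genre_groups)
print(f'Kruskal-Wallis H = {h_stat:.1f}, p = {p_value:.4f}')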
# get list of studio by score.
studios_s_1 = anime_info_frame_cleaned['studios'].dropna()
studios_df_1 = pd.DataFrame(item for item in studios_s_1).set_index(studios_s_1.index)
studios_overall_long = convert_to_long(studios_df_1)
studios_overall_long =studios_overall_long.rename(columns={"category": "studio"})
studio_mean_scores = studios_overall_long.groupby('studio').mean().sort_values(by = 'public_score', ascending = False)
studio_prod_counts = studios_overall_long['studio'].value_counts().rename('counts')#.head(50)
studio_avg_score_rank = studio_mean_scores.join(studio_prod_counts).reset_index()
print('------ Studios with highest average ranking anime ------\n ')
display(studio_avg_score_rank.head(10))
print('------ Studios with highest average anime score with more than 20 animes produced ------\n ')
studio_avg_score_rank[studio_avg_score_rank['counts'] > 20].head(20)
------ Studios with highest average ranking anime ------
| studio | public_score | counts | |
|---|---|---|---|
| 0 | K-Factory | 8.403333 | 3 |
| 1 | Studio Bind | 8.350000 | 3 |
| 2 | Egg Firm | 8.292500 | 4 |
| 3 | Nippon Ramayana Film Co. | 8.250000 | 1 |
| 4 | Studio Signpost | 8.060000 | 3 |
| 5 | AHA Entertainment | 7.920000 | 1 |
| 6 | Studio Chizu | 7.917500 | 4 |
| 7 | Samsara Animation Studio | 7.870000 | 1 |
| 8 | Frontier One | 7.850000 | 1 |
| 9 | Studio Massket | 7.800000 | 1 |
------ Studios with highest average anime score with more than 20 animes produced ------
| studio | public_score | counts | |
|---|---|---|---|
| 29 | Kyoto Animation | 7.433103 | 116 |
| 30 | Wit Studio | 7.425902 | 61 |
| 35 | CloverWorks | 7.383488 | 43 |
| 42 | Bones | 7.337361 | 144 |
| 43 | White Fox | 7.329070 | 43 |
| 45 | David Production | 7.322500 | 44 |
| 46 | MAPPA | 7.321475 | 61 |
| 55 | ufotable | 7.288906 | 64 |
| 57 | Bandai Namco Pictures | 7.287727 | 44 |
| 59 | Lerche | 7.265091 | 55 |
| 69 | Shaft | 7.235645 | 124 |
| 70 | Studio Ghibli | 7.234390 | 41 |
| 78 | Trigger | 7.203200 | 25 |
| 81 | A-1 Pictures | 7.199378 | 209 |
| 86 | GoHands | 7.186400 | 25 |
| 88 | Seven Arcs | 7.182857 | 28 |
| 92 | Production I.G | 7.177816 | 293 |
| 95 | Manglobe | 7.171935 | 31 |
| 98 | P.A. Works | 7.148367 | 49 |
| 100 | 8bit | 7.135472 | 53 |
An initial assumption was that certain studios can influence the rating of a show, hence an investigation was done to see whether the production studio can predict the quality of the work. At first glance, certain studios do have a much higher average rating; however, this is because those studios have worked on only a few anime which happened to score highly. Since many factors go into the enjoyment of a show, such as plot, voice acting, etc., it is hard to pinpoint the studio as the main factor in an anime's success. Thus, with a small sample size of shows produced, it is hard to say that these studios have an impact on the score. The second table above lists studios that have produced more than 20 anime, along with their overall rank and the average score of all the anime they produced. It shows that once the studios with 20 or fewer anime are removed, there is no major difference in the scores, with the first entry, "Kyoto Animation", already ranked 29th overall.
After the data has been cleaned and processed, the data can be used in a content recommendation system.
As mentioned previously, the concept of a content-filtering recommendation method is to find shows similar to the one a user is currently interested in. The general methodology is to convert the text to a vector format and then use cosine similarity to generate a similarity matrix. This matrix is simply ranked and then read to extract the top recommended shows.
Again, the most relevant features need to be extracted. As explored in the previous sections, the chosen features are description, genres, studios, voice actors, staff, and characters, as they are expected to have the highest impact when finding similar anime.
The next procedure is to eliminate the null values. Any record with a null value in any category will be removed. This reduced the sample size by about 30%, from the initial 10,000 anime records to 6,975 records. The reduction was mitigated because features with many missing values, such as "themes" or "demographics", had already been removed. The remaining sample size is reasonable considering that every value for every feature was retained. A sample of this table is shown below.
anime_info_frame_cleaned
anime_info_frame_cleaned.isna().sum()
anime_info_f_master_fil = anime_info_frame_cleaned[['Type', 'Episodes', 'Status',
'aired_start', 'studios', 'source',
'genres_overall', 'duration_in_min',
'public_score_rating','popularity']]
print('\n**** Filtered Frame NA counts ****\n')
print(anime_info_f_master_fil.isna().sum())
# drop na values
anime_info_f_master_fil_nona = anime_info_f_master_fil.dropna()
**** Filtered Frame NA counts ****
Type 0
Episodes 0
Status 0
aired_start 199
studios 1227
source 0
genres_overall 768
duration_in_min 8
public_score_rating 0
popularity 0
dtype: int64
### make dataframe columns with list to text and format for machine learning
# display(all_anime_details_frame_no_nan_cvas)
### join characters, staff and VA to info dataframe
anime_info_dataset_master1 = anime_info_f_master_fil_nona.join(all_anime_details_frame_no_nan_cvas , how = 'inner')
anime_info_dataset_master_lists = anime_info_dataset_master1[['studios', 'genres_overall',
'characters', 'voice_actors', 'staff']]
anime_info_dataset_master_lists2 = anime_info_dataset_master_lists.applymap(lambda xs: ', '.join(str(x) for x in xs))
anime_info_dataset_master1[['studios', 'genres_overall','characters', 'voice_actors', 'staff']] = anime_info_dataset_master_lists2[['studios', 'genres_overall', 'characters', 'voice_actors', 'staff']]
ds = anime_info_dataset_master1.reset_index()
ds
| index | Type | Episodes | Status | aired_start | studios | source | genres_overall | duration_in_min | public_score_rating | popularity | characters | voice_actors | staff | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5114 | TV | 64 | Finished Airing | 2009-04-05 | Bones | Manga | Action, Adventure, Drama, Fantasy | 24.0 | 9.111 | 3 | Edward_Elric, Alphonse_Elric, Roy_Mustang, Mae... | Romi_Park, Rie_Kugimiya, Shinichiro_Miki, Keij... | Justin_Cook, Noritomo_Yonai, Yasuhiro_Irie, Ma... | After a horrific alchemy experiment goes wrong... |
| 1 | 41467 | TV | 13 | Currently Airing | 2022-10-11 | Pierrot | Manga | Action, Adventure, Fantasy | 24.0 | 9.101 | 677 | Ichigo_Kurosaki, Rukia_Kuchiki, Renji_Abarai, ... | Masakazu_Morita, Fumiko_Orikasa, Kentarou_Itou... | Tomohisa_Taguchi, Yukio_Nagasaki, Hikaru_Murat... | Substitute Soul Reaper Ichigo Kurosaki spends ... |
| 2 | 43608 | TV | 13 | Finished Airing | 2022-04-09 | A-1 Pictures | Manga | Comedy, Romance | 23.0 | 9.091 | 239 | Kaguya_Shinomiya, Yuu_Ishigami, Chika_Fujiwara... | Aoi_Koga, Ryouta_Suzuki, Konomi_Kohara, Makoto... | Shinichi_Omata, Jin_Aketagawa, Masakazu_Obara,... | The elite members of Shuchiin Academy's studen... |
| 3 | 9253 | TV | 24 | Finished Airing | 2011-04-06 | White Fox | Visual novel | Drama, Sci-Fi, Suspense | 24.0 | 9.081 | 13 | Rintarou_Okabe, Kurisu_Makise, Mayuri_Shiina, ... | Mamoru_Miyano, Asami_Imai, Kana_Hanazawa, Tomo... | Justin_Cook, Gaku_Iwasa, Takeshi_Yasuda, Shins... | Eccentric scientist Rintarou Okabe has a never... |
| 4 | 28977 | TV | 51 | Finished Airing | 2015-04-08 | Bandai Namco Pictures | Manga | Action, Comedy, Sci-Fi | 24.0 | 9.071 | 337 | Gintoki_Sakata, Kagura, Shinpachi_Shimura, Kot... | Tomokazu_Sugita, Rie_Kugimiya, Daisuke_Sakaguc... | Youichi_Fujita, Chizuru_Miyawaki, Shinji_Takam... | Gintoki, Shinpachi, and Kagura return as the f... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6936 | 8132 | Movie | 1 | Finished Airing | 2010-02-27 | Sunrise | Original | Action | 13.0 | 5.841 | 9944 | Chouhi_Gundam, Kan-u_Gundam, Ryuubi_Gundam | Masayuki_Katou, Hiroki_Yasumoto, Yuuki_Kaji | Hiroyuki_Satou, Kenichi_Suzuki, Kunihiro_Mori,... | A Romance of the Three Kingdoms retelling usin... |
| 6937 | 2754 | OVA | 3 | Finished Airing | 1989-04-28 | J.C.Staff | Manga | Action, Adventure, Sci-Fi | 37.0 | 5.841 | 8980 | Cleopatra_Corns, Suen, Marianne, Dr_Randol, Na... | Maria_Kawamura, Hiromi_Tsuru, Yoshino_Takamori... | Kazuo_Katou, Naoyuki_Yoshinaga, Sukehiro_Tomit... | The Cleopatra Corns Group, aka Cleopatra DC, i... |
| 6938 | 31318 | TV | 12 | Finished Airing | 2015-10-04 | 8bit | Original | Action, Adventure, Fantasy | 23.0 | 5.841 | 1792 | Felia, Mo_Ritika_Tzetzes_Ura, Sougo_Amagi, Kao... | Ayaka_Ohashi, Inori_Minase, Yuusuke_Kobayashi,... | Youhei_Kisara, Yasuhito_Kikuchi, Atsushi_Nakay... | In the world of Gift, the bowels of the planet... |
| 6939 | 9862 | Movie | 1 | Finished Airing | 1992-08-08 | Production Reed, Asahi Production | Original | Adventure, Fantasy | 43.0 | 5.841 | 13444 | Mary_Bell, Yuuri | Chieko_Honda, Satomi_Koorogi | Shigeyuki_Yamamori | Mary Bell, Yuri, Ken, Ribbon, and Tambourine g... |
| 6940 | 34332 | TV | 12 | Finished Airing | 2019-01-11 | Fukushima Gaina | Original | Comedy, Slice of Life | 5.0 | 5.841 | 8489 | Miku, Nagisa, Mona, Moe, Shiina, Suzu, Fumi, M... | Miku_Itou, Sarah_Emi_Bridcutt, Suzuko_Mimori, ... | Yoshinori_Asao, Kisuke_Koizumi, AOP, Ryouzou_O... | The anime will center on a group of young girl... |
6941 rows × 15 columns
The selected columns of text were then converted to a vector format. For this, the TfidfVectorizer in the scikit-learn package in Python was used. TfidfVectorizer is a class in scikit-learn that converts a collection of raw documents into a matrix of Tf-idf features. Tf-idf stands for term frequency-inverse document frequency, and it is a measure of the importance of a word in a document relative to a collection of documents. The goal of using Tf-idf is to down-weight the importance of words that are common to many documents and up-weight the importance of words that are rare or specific to a particular document. In broad terms, TfidfVectorizer tokenizes the documents, counts term frequencies, weights them by inverse document frequency, and normalizes the resulting vectors.
In practical terms, cosine similarity is a measure of similarity between two vectors based on the angle between them. If the vectors are pointing in the same direction, their cosine similarity will be 1. If they are pointing in opposite directions, their cosine similarity will be -1. If the vectors are orthogonal to each other, their cosine similarity will be 0.
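To make these two steps concrete, here is a small self-contained sketch using made-up plot summaries rather than the project data; it shows how TfidfVectorizer and cosine_similarity fit together.
# Toy example: vectorize three made-up descriptions and compare them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_descriptions = [
    "a young alchemist searches for the philosopher's stone",
    "two alchemist brothers search for a legendary stone",
    "an idol group trains to win a national music contest",
]
tf_toy = TfidfVectorizer(stop_words='english')
toy_matrix = tf_toy.fit_transform(toy_descriptions)
# each row of the matrix is a tf-idf vector; the pairwise cosine similarities
# show the two alchemist plots scoring much closer to each other than to the idol show
print(cosine_similarity(toy_matrix).round(2))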
# return item name based on id
def item(id1):
    # display(top_anime_all_table[top_anime_all_table['id'] == id1])
    item_dis = top_anime_all_table.loc[top_anime_all_table['id'] == id1]
    return item_dis.loc[:, 'Title'].item()
A function was built to iterate through each record of the anime information table and find the most similar items computed by the cosine similarity function. The top 100 similar items were extracted and sorted, excluding the first entry as it would be the anime itself. These values were stored in a dictionary and then converted into a table. A sample of this table is shown below, where "watch_id" is the watched show and "id" is the recommended show's id, listed alongside its cosine similarity.
# convert words in a column to vector format and compute pairwise similarity
def find_similarity_cosine(ds, column='description'):
    tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
    tfidf_matrix = tf.fit_transform(ds[column])
    # find cosine similarity
    cosine_similarities = cosine_similarity(tfidf_matrix, tfidf_matrix)
    # iterate through the results and find similar items
    results = {}
    for idx, row in ds.iterrows():
        # sort and keep the top 100
        similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
        # get similarities
        similar_items = [(cosine_similarities[idx][i], ds['index'][i]) for i in similar_indices]
        results[row['index']] = similar_items[1:]
    # convert to dataframe
    content_rec_des = pd.DataFrame.from_dict(results, orient="index").stack().to_frame()
    content_rec_des = pd.DataFrame(content_rec_des[0].values.tolist(), index=content_rec_des.index)
    content_rec_des = content_rec_des.reset_index(level=[0, 1])
    content_rec_des = content_rec_des.rename(columns={'level_0': 'watch_id',
                                                      'level_1': 'rec_no',
                                                      0: 'cos_sim',
                                                      1: 'id'})
    content_rec_des['unique_key'] = content_rec_des['watch_id'].astype(str) + content_rec_des['id'].astype(str)
    return content_rec_des
def reccomend_content(results_df, a_id, number_of_rec):
    a_df = results_df[results_df['watch_id'] == a_id]
    a_df_sort = a_df.sort_values(by=['cos_sim'], ascending=False)
    anime_ids_list = list(a_df_sort['id'][:number_of_rec])
    # print('recommendations for ' + item(a_id) + ', id = ' + str(a_id))
    display(top_anime_all_table[top_anime_all_table['id'] == a_id])
    return top_anime_all_table[top_anime_all_table['id'].isin(anime_ids_list)]
# generate similar series based on these categories.
# plot / description seems to work the best.
results_des = find_similarity_cosine(ds,'description')
results_genres = find_similarity_cosine(ds,'genres_overall')
results_stu = find_similarity_cosine(ds,'studios')
results_va = find_similarity_cosine(ds,'voice_actors')
results_staff = find_similarity_cosine(ds,'staff')
results_char = find_similarity_cosine(ds,'characters')
results_des
| watch_id | rec_no | cos_sim | id | unique_key | |
|---|---|---|---|---|---|
| 0 | 5114 | 0 | 0.158597 | 121 | 5114121 |
| 1 | 5114 | 1 | 0.089567 | 9135 | 51149135 |
| 2 | 5114 | 2 | 0.077452 | 430 | 5114430 |
| 3 | 5114 | 3 | 0.040949 | 6421 | 51146421 |
| 4 | 5114 | 4 | 0.026087 | 1266 | 51141266 |
| ... | ... | ... | ... | ... | ... |
| 680213 | 34332 | 93 | 0.017632 | 30988 | 3433230988 |
| 680214 | 34332 | 94 | 0.017347 | 1175 | 343321175 |
| 680215 | 34332 | 95 | 0.017303 | 17409 | 3433217409 |
| 680216 | 34332 | 96 | 0.017286 | 1254 | 343321254 |
| 680217 | 34332 | 97 | 0.017284 | 32171 | 3433232171 |
680218 rows × 5 columns
### manual checking
# full metal alchemist id = 5114,
# code geass id = 1575
# gintama id = 9969
# kimi no na wa, move id = 32281
an_id = 32281
by_des = reccomend_content(results_char, an_id, 5)
by_des
| Rank | Title | link | id | public_score | prive_rating | watch_status | |
|---|---|---|---|---|---|---|---|
| 24 | 25 | Kimi no Na wa. | https://myanimelist.net/anime/32281/Kimi_no_Na_wa | 32281 | 8.85 | 10.0 | Completed |
| Rank | Title | link | id | public_score | prive_rating | watch_status | |
|---|---|---|---|---|---|---|---|
| 238 | 239 | Tenki no Ko | https://myanimelist.net/anime/38826/Tenki_no_Ko | 38826 | 8.30 | 7.0 | Completed |
| 699 | 700 | Kotonoha no Niwa | https://myanimelist.net/anime/16782/Kotonoha_n... | 16782 | 7.91 | NaN | Add to list |
| 2568 | 2569 | Blend S | https://myanimelist.net/anime/34618/Blend_S | 34618 | 7.29 | 7.0 | Completed |
| 2639 | 2640 | Slayers: The Motion Picture | https://myanimelist.net/anime/536/Slayers__The... | 536 | 7.28 | NaN | Add to list |
| 2648 | 2649 | Watashi ni Tenshi ga Maiorita! Special | https://myanimelist.net/anime/38999/Watashi_ni... | 38999 | 7.28 | NaN | Add to list |
The recommendations were also computed using characters, genres, studios, voice actors, and staff as the text feature. A breakdown is listed for each below.
Although content-based recommendation systems will always recommend similar items, which the user has a high chance of enjoying, the weakness is that this system leaves little room to discover new shows. For instance, most of the recommended series are sequels of the originals, and the system is not as good at recommending series outside this range. This would ultimately lead to a repetitive user experience, hence the need for collaborative filtering.
Collaborative filtering is a method of recommending items to users based on the preferences of similar users. It works by identifying users who have similar preferences and using those preferences to recommend items to the active user. There are two main types of collaborative filtering systems: memory-based and model-based.
This notebook will investigate model-based collaborative filtering. One of the advantages of a model-based collaborative filtering system is that it does not need to rely on understanding the item content. A major downside to this method is the cold start problem, where the system does not recommend new items well when there has been no user-item interaction with it. This implies that the sample size and database must be very robust in order to make an accurate prediction.
This notebook will also use user data scraped from myanimelist; the code for that scraping can be found in another notebook.
import numpy as np
from bs4 import BeautifulSoup
import requests
import re
import lxml.html as lh
import pandas as pd
import pickle
import os
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None # default='warn'
top_anime_all_table = pd.read_csv('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/all_anime_directory.csv')
top_anime_all_table.head()
| Rank | Title | link | id | public_score | prive_rating | watch_status | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Fullmetal Alchemist: Brotherhood | https://myanimelist.net/anime/5114/Fullmetal_A... | 5114 | 9.11 | 10.0 | Completed |
| 1 | 2 | Bleach: Sennen Kessen-hen | https://myanimelist.net/anime/41467/Bleach__Se... | 41467 | 9.10 | NaN | Add to list |
| 2 | 3 | Kaguya-sama wa Kokurasetai: Ultra Romantic | https://myanimelist.net/anime/43608/Kaguya-sam... | 43608 | 9.09 | 10.0 | Completed |
| 3 | 4 | Steins;Gate | https://myanimelist.net/anime/9253/Steins_Gate | 9253 | 9.08 | 10.0 | Completed |
| 4 | 5 | Gintama° | https://myanimelist.net/anime/28977/Gintama° | 28977 | 9.07 | NaN | Completed |
# import scraped user information.
all_user_rating_df1 = pd.read_csv('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/all_user_anime_ratings.csv')
all_user_rating_df1 = all_user_rating_df1.rename(columns = {'level_0' : 'username', 'level_1': 'sub_link', '0': 'user_rating'})
# import data scrape from the second file
all_user_rating_df2 = pd.read_csv('/Users/rayenfeng/Documents/code/anime_rec_project/data_sources_pickle/all_user_anime_ratings2.csv')
all_user_rating_df2 = all_user_rating_df2.rename(columns = {'level_0' : 'username', 'level_1': 'sub_link', '0': 'user_rating'})
# combine both dataframes.
all_user_rating_df = pd.concat([all_user_rating_df1, all_user_rating_df2], axis=0)
all_user_rating_df
| username | sub_link | user_rating | |
|---|---|---|---|
| 0 | tazillo | /anime/47/Akira | 5 |
| 1 | tazillo | /anime/6547/Angel_Beats | 3 |
| 2 | tazillo | /anime/9989/Ano_Hi_Mita_Hana_no_Namae_wo_Bokut... | - |
| 3 | tazillo | /anime/11111/Another | 2 |
| 4 | tazillo | /anime/477/Aria_the_Animation | - |
| ... | ... | ... | ... |
| 678738 | neongenesis92i | /anime/1251/Fushigi_no_Umi_no_Nadia | 9 |
| 678739 | sarah501689 | /anime/40421/Given_Movie | 10 |
| 678740 | aReallyBigFan | /anime/38088/Digimon_Adventure__Last_Evolution... | 10 |
| 678741 | MauvaiseHerbe | /anime/40911/Yuukoku_no_Moriarty | 3 |
| 678742 | Storm9265 | /anime/32005/Detective_Conan_Movie_20__The_Dar... | 1 |
1596003 rows × 3 columns
# extract anime ids into a separate column
all_user_rating_df['id'] = all_user_rating_df['sub_link'].apply(lambda x: int(x.split('/')[2]))
#print number of unique ids and animes
print('******* the number of unique users: ' + str(len(all_user_rating_df['username'].unique())) + ' *******\n')
print('******* the number of unique animes: ' + str(len(all_user_rating_df['id'].unique())) + ' *******\n')
### '-' means that user has not finished watching the show, can replace with nan.
all_user_rating_df['user_rating'] = all_user_rating_df['user_rating'].replace('-', np.nan)
all_user_rating_df_no_null = all_user_rating_df.dropna()
print('******* null values: *******\n')
print(all_user_rating_df_no_null.isna().sum())
all_user_rating_df_no_null_test = all_user_rating_df_no_null.copy()
all_user_rating_df_no_null_test.head()
******* the number of unique users: 4521 *******
******* the number of unique animes: 19056 *******
******* null values: *******
username 0
sub_link 0
user_rating 0
id 0
dtype: int64
| username | sub_link | user_rating | id | |
|---|---|---|---|---|
| 0 | tazillo | /anime/47/Akira | 5 | 47 |
| 1 | tazillo | /anime/6547/Angel_Beats | 3 | 6547 |
| 3 | tazillo | /anime/11111/Another | 2 | 11111 |
| 5 | tazillo | /anime/7817/B-gata_H-kei | 5 | 7817 |
| 6 | tazillo | /anime/5081/Bakemonogatari | 7 | 5081 |
# import my ratings.
my_ratings1 = top_anime_all_table.dropna(subset = ['prive_rating'])
my_ratings2 = my_ratings1[['link', 'id', 'prive_rating']]
my_ratings2['username'] = 'Destinyflame'
my_ratings2['sub_link'] = my_ratings2['link'].apply(lambda x: x.replace('https://myanimelist.net', ''))
my_ratings3 = my_ratings2.drop(columns=['link'])
my_ratings3 = my_ratings3[['username', 'sub_link', 'prive_rating', 'id']].rename(columns = {'prive_rating': 'user_rating'})
display(my_ratings3)
print('\n ******* null values: *******\n')
print(all_user_rating_df_no_null.isna().sum())
| username | sub_link | user_rating | id | |
|---|---|---|---|---|
| 0 | Destinyflame | /anime/5114/Fullmetal_Alchemist__Brotherhood | 10.0 | 5114 |
| 2 | Destinyflame | /anime/43608/Kaguya-sama_wa_Kokurasetai__Ultra... | 10.0 | 43608 |
| 3 | Destinyflame | /anime/9253/Steins_Gate | 10.0 | 9253 |
| 5 | Destinyflame | /anime/38524/Shingeki_no_Kyojin_Season_3_Part_2 | 10.0 | 38524 |
| 8 | Destinyflame | /anime/15417/Gintama__Enchousen | 10.0 | 15417 |
| ... | ... | ... | ... | ... |
| 5904 | Destinyflame | /anime/36407/Kenja_no_Mago | 6.0 | 36407 |
| 6529 | Destinyflame | /anime/34934/Koi_to_Uso | 6.0 | 34934 |
| 7001 | Destinyflame | /anime/38610/Tejina-senpai | 5.0 | 38610 |
| 7282 | Destinyflame | /anime/36511/Tokyo_Ghoul_re | 7.0 | 36511 |
| 7307 | Destinyflame | /anime/32901/Eromanga-sensei | 7.0 | 32901 |
298 rows × 4 columns
******* null values: *******
username 0
sub_link 0
user_rating 0
id 0
dtype: int64
# combine my ratings with all user ratings
combined_anime_data = pd.concat([all_user_rating_df_no_null_test, my_ratings3], axis=0)
# keep anime with more than 25 reviews
combined_anime_data['reviews'] = combined_anime_data.groupby(['id'])['user_rating'].transform('count')
# display(combined_anime_data)
combined_anime_data1 = combined_anime_data[combined_anime_data['reviews'] > 25]
combined_anime_data2 = combined_anime_data1[['username', 'id', 'user_rating']].astype({'user_rating': 'int64'})
combined_anime_data2
| username | id | user_rating | |
|---|---|---|---|
| 0 | tazillo | 47 | 5 |
| 1 | tazillo | 6547 | 3 |
| 3 | tazillo | 11111 | 2 |
| 5 | tazillo | 7817 | 5 |
| 6 | tazillo | 5081 | 7 |
| ... | ... | ... | ... |
| 5904 | Destinyflame | 36407 | 6 |
| 6529 | Destinyflame | 34934 | 6 |
| 7001 | Destinyflame | 38610 | 5 |
| 7282 | Destinyflame | 36511 | 7 |
| 7307 | Destinyflame | 32901 | 7 |
1262257 rows × 3 columns
from surprise import NMF, SVD, SVDpp, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering, NormalPredictor
from surprise.model_selection import cross_validate
from surprise import Reader, Dataset
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(combined_anime_data2, reader)
# get the list of the anime ids
unique_ids = combined_anime_data2['id'].unique()
# get the list of the ids that the username destinyflame has rated
ids_user_destinyflame = combined_anime_data2.loc[combined_anime_data2['username'] == 'Destinyflame', 'id']
# remove already-rated anime from the recommendation candidates
animes_to_predict = np.setdiff1d(unique_ids, ids_user_destinyflame)
# define function that predicts how the user would score each anime
# inputs are username as string and algorithm as a surprise algo, default is set to NMF.
def find_user_reccomednations(user_name, algorithm = NMF()):
    algo = algorithm
    algo.fit(data.build_full_trainset())
    my_recs = []
    for i_id in animes_to_predict:
        my_recs.append((i_id, algo.predict(uid = user_name, iid = i_id).est))
    my_recs_df = pd.DataFrame(my_recs, columns = ['anime_ids', 'predictions']).sort_values('predictions', ascending=False)
    return my_recs_df
# NMF, SVD, SVDpp, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering, NormalPredictor
display(find_user_reccomednations('Destinyflame', SVD()))
| anime_ids | predictions | |
|---|---|---|
| 3590 | 28977 | 9.787165 |
| 2397 | 9969 | 9.555680 |
| 5168 | 42938 | 9.422987 |
| 4821 | 39486 | 9.370195 |
| 4289 | 35247 | 9.368408 |
| ... | ... | ... |
| 2769 | 13405 | 3.064923 |
| 325 | 413 | 2.769959 |
| 4758 | 38853 | 2.754034 |
| 4048 | 33394 | 2.599562 |
| 1596 | 3287 | 2.403478 |
5476 rows × 2 columns
# Previewing results
top_anime_all_table[top_anime_all_table['id'].isin([50160, 28977])]
| Rank | Title | link | id | public_score | prive_rating | watch_status | |
|---|---|---|---|---|---|---|---|
| 4 | 5 | Gintama° | https://myanimelist.net/anime/28977/Gintama° | 28977 | 9.07 | NaN | Completed |
| 38 | 39 | Kingdom 4th Season | https://myanimelist.net/anime/50160/Kingdom_4t... | 50160 | 8.77 | NaN | Add to list |
It's a good thing that it thinks I'll like Gintama; I probably watched this series and didn't update the score. As seen in the table above, the status is Completed, but I forgot to score it. If I had, it would be a 9 or 10, because Gintama is one of my favourite shows.
With each prediction method, there is bound to be some error between the predicted and actual values. To quantify this, each recommendation algorithm can be evaluated by first splitting the data into two subsets: a training set and a test set. The training set is used to train the algorithm, which is then applied to the test set to predict ratings, and the accuracy against the actual scores in the test set is computed.
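As a sketch of that hold-out procedure (not the exact code used in this notebook), Surprise provides train_test_split and an accuracy module that can score a single split:
# Sketch: evaluate one algorithm on a single train/test split of the ratings.
from surprise import SVD, accuracy
from surprise.model_selection import train_test_split

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)
accuracy.mae(predictions)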
The Surprise package's default cross-validation routine splits the data into folds and runs the validation k times, then reports the RMSE, the root mean squared error, and the MAE, the mean absolute error. The lower the value, the better.
For this segment, a recommendation system will be built with the Surprise package in Python, a package for building and analyzing recommender systems that deal with explicit rating data. After importing the packages, a simple function was built that predicts the score, given a username and a certain algorithm. The recommendation system will then be tested using five different recommendation algorithms, and these algorithms will be evaluated based on their prediction accuracy. The focus is on the implementation of these algorithms; the mathematics behind them is complex and a deep dive is beyond the scope of this project, so only a brief description of each algorithm used is given.
Cross-validation is a resampling procedure used to evaluate a machine learning model on a limited data sample. It estimates how well a model predicts the target variable on unseen data and is also commonly used to tune a model's hyperparameters. The algorithms here will be validated with the k-fold cross-validation technique.
In k-fold cross-validation, the data sample is randomly partitioned into k smaller sets, or folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with a different fold being used as the test set each time. The performance measure is then averaged across all k iterations.
The average performance of the algorithms will be measured using MAE and RMSE. The computation time will also be considered during the evaluation but is not the main focus. MAE and RMSE are two different metrics that can be used to evaluate the performance of a machine-learning model during cross-validation.
MAE stands for mean absolute error. It is a measure of the average absolute difference between the predicted values and the true values. It is calculated as the sum of the absolute differences between the predicted and true values, divided by the total number of predictions. MAE is a simple and easy-to-understand metric, and it is less sensitive to outliers than RMSE because it does not square the differences before taking the mean. The formula for MAE is given below.
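Here $y_i$ is the true rating, $\hat{y}_i$ the predicted rating, and $n$ the number of predictions:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$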
RMSE stands for root mean squared error. It is a measure of the average squared difference between the predicted values and the true values. It is calculated as the square root of the mean squared error (MSE), which is the average of the squared differences between the predicted and true values. RMSE is more sensitive to outliers than MAE because it squares the differences before taking the mean. The formula for RMSE is given below.
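Using the same notation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$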
In general, a model with a lower MAE or RMSE is considered to be a better model because it is making predictions that are closer to the true values.
cross_validation_score = {}
# Iterate over all recommender system algorithms
for rec_system in [NMF(), SVD(), SVDpp(), KNNWithZScore(), CoClustering()]:
    # Perform cross validation
    cross_val_df = cross_validate(rec_system, data, cv=3)
    method_name = str(rec_system).split(' ')[0].split('.')[-1]
    cross_validation_score[method_name] = cross_val_df
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
cross_val_df_all_method = pd.DataFrame.from_dict(cross_validation_score, orient='index')
cross_val_df_all_method_means = cross_val_df_all_method.applymap(lambda x: np.array(x).mean())
cross_val_df_all_method_means.sort_values(by=['test_rmse'])
| test_rmse | test_mae | fit_time | test_time | |
|---|---|---|---|---|
| SVD | 1.285141 | 0.955953 | 9.375140 | 3.445283 |
| SVDpp | 1.328549 | 0.989146 | 422.367654 | 185.372463 |
| KNNWithZScore | 1.367502 | 1.026495 | 25.286533 | 170.895483 |
| CoClustering | 1.424752 | 1.070978 | 19.545526 | 3.579695 |
| NMF | 1.913623 | 1.613487 | 16.732938 | 3.112131 |
The best-performing algorithm is SVD, and it also has the lowest fit time. Therefore, the SVD algorithm will be run.
ranked_user_reccomendations_df = find_user_reccomednations('Destinyflame', SVD())
# display(ranked_user_reccomendations_df)
top_anime_all_table[top_anime_all_table['id'].isin(ranked_user_reccomendations_df['anime_ids'][:50])]
| Rank | Title | link | id | public_score | prive_rating | watch_status | |
|---|---|---|---|---|---|---|---|
| 4 | 5 | Gintama° | https://myanimelist.net/anime/28977/Gintama° | 28977 | 9.07 | NaN | Completed |
| 6 | 7 | Gintama' | https://myanimelist.net/anime/9969/Gintama | 9969 | 9.05 | NaN | Completed |
| 7 | 8 | Gintama: The Final | https://myanimelist.net/anime/39486/Gintama__T... | 39486 | 9.05 | NaN | Add to list |
| 11 | 12 | Fruits Basket: The Final | https://myanimelist.net/anime/42938/Fruits_Bas... | 42938 | 9.02 | NaN | Add to list |
| 13 | 14 | 3-gatsu no Lion 2nd Season | https://myanimelist.net/anime/35180/3-gatsu_no... | 35180 | 8.95 | NaN | Add to list |
| 22 | 23 | Owarimonogatari 2nd Season | https://myanimelist.net/anime/35247/Owarimonog... | 35247 | 8.89 | NaN | Add to list |
| 31 | 32 | Kizumonogatari III: Reiketsu-hen | https://myanimelist.net/anime/31758/Kizumonoga... | 31758 | 8.80 | NaN | Add to list |
| 32 | 33 | Bocchi the Rock! | https://myanimelist.net/anime/47917/Bocchi_the... | 47917 | 8.79 | NaN | Add to list |
| 40 | 41 | Hajime no Ippo | https://myanimelist.net/anime/263/Hajime_no_Ippo | 263 | 8.75 | NaN | Add to list |
| 41 | 42 | Mushishi Zoku Shou 2nd Season | https://myanimelist.net/anime/24701/Mushishi_Z... | 24701 | 8.74 | NaN | Add to list |
| 47 | 48 | Rurouni Kenshin: Meiji Kenkaku Romantan - Tsui... | https://myanimelist.net/anime/44/Rurouni_Kensh... | 44 | 8.71 | NaN | Add to list |
| 53 | 54 | Fate/stay night Movie: Heaven's Feel - III. Sp... | https://myanimelist.net/anime/33050/Fate_stay_... | 33050 | 8.69 | NaN | Add to list |
| 55 | 56 | One Piece | https://myanimelist.net/anime/21/One_Piece | 21 | 8.68 | NaN | Dropped |
| 61 | 62 | Hajime no Ippo: New Challenger | https://myanimelist.net/anime/5258/Hajime_no_I... | 5258 | 8.66 | NaN | Add to list |
| 66 | 67 | Mob Psycho 100 III | https://myanimelist.net/anime/50172/Mob_Psycho... | 50172 | 8.65 | NaN | Add to list |
| 72 | 73 | Tengen Toppa Gurren Lagann | https://myanimelist.net/anime/2001/Tengen_Topp... | 2001 | 8.63 | NaN | Add to list |
| 78 | 79 | Seishun Buta Yarou wa Yumemiru Shoujo no Yume ... | https://myanimelist.net/anime/38329/Seishun_Bu... | 38329 | 8.61 | NaN | Add to list |
| 81 | 82 | Hajime no Ippo: Rising | https://myanimelist.net/anime/19647/Hajime_no_... | 19647 | 8.59 | NaN | Add to list |
| 82 | 83 | JoJo no Kimyou na Bouken Part 5: Ougon no Kaze | https://myanimelist.net/anime/37991/JoJo_no_Ki... | 37991 | 8.58 | NaN | Add to list |
| 83 | 84 | Kizumonogatari II: Nekketsu-hen | https://myanimelist.net/anime/31757/Kizumonoga... | 31757 | 8.58 | NaN | Add to list |
| 90 | 91 | Spy x Family Part 2 | https://myanimelist.net/anime/50602/Spy_x_Fami... | 50602 | 8.57 | NaN | Add to list |
| 104 | 105 | Bakuman. 3rd Season | https://myanimelist.net/anime/12365/Bakuman_3r... | 12365 | 8.54 | NaN | Add to list |
| 119 | 120 | Fate/stay night Movie: Heaven's Feel - II. Los... | https://myanimelist.net/anime/33049/Fate_stay_... | 33049 | 8.51 | NaN | Add to list |
| 135 | 136 | Mahou Shoujo Madoka★Magica Movie 3: Hangyaku n... | https://myanimelist.net/anime/11981/Mahou_Shou... | 11981 | 8.47 | NaN | Add to list |
| 143 | 144 | Steins;Gate Movie: Fuka Ryouiki no Déjà vu | https://myanimelist.net/anime/11577/Steins_Gat... | 11577 | 8.46 | NaN | Add to list |
| 148 | 149 | Zoku Owarimonogatari | https://myanimelist.net/anime/36999/Zoku_Owari... | 36999 | 8.45 | NaN | Add to list |
| 153 | 154 | JoJo no Kimyou na Bouken Part 3: Stardust Crus... | https://myanimelist.net/anime/26055/JoJo_no_Ki... | 26055 | 8.44 | NaN | Add to list |
| 162 | 163 | Major S5 | https://myanimelist.net/anime/5028/Major_S5 | 5028 | 8.41 | NaN | Add to list |
| 172 | 173 | Kara no Kyoukai Movie 7: Satsujin Kousatsu (Go) | https://myanimelist.net/anime/5205/Kara_no_Kyo... | 5205 | 8.40 | NaN | Add to list |
| 188 | 189 | Kizumonogatari I: Tekketsu-hen | https://myanimelist.net/anime/9260/Kizumonogat... | 9260 | 8.37 | NaN | Add to list |
| 198 | 199 | Re:Zero kara Hajimeru Isekai Seikatsu 2nd Season | https://myanimelist.net/anime/39587/Re_Zero_ka... | 39587 | 8.35 | NaN | Add to list |
| 199 | 200 | Bakuman. 2nd Season | https://myanimelist.net/anime/10030/Bakuman_2n... | 10030 | 8.35 | NaN | Add to list |
| 203 | 204 | Gotcha! | https://myanimelist.net/anime/42984/Gotcha | 42984 | 8.34 | NaN | Add to list |
| 220 | 221 | Katanagatari | https://myanimelist.net/anime/6594/Katanagatari | 6594 | 8.32 | NaN | Add to list |
| 222 | 223 | Kemono no Souja Erin | https://myanimelist.net/anime/5420/Kemono_no_S... | 5420 | 8.32 | NaN | Add to list |
| 239 | 240 | World Trigger 3rd Season | https://myanimelist.net/anime/44940/World_Trig... | 44940 | 8.30 | NaN | Add to list |
| 271 | 272 | Stranger: Mukou Hadan | https://myanimelist.net/anime/2418/Stranger__M... | 2418 | 8.27 | NaN | Add to list |
| 279 | 280 | JoJo no Kimyou na Bouken Part 6: Stone Ocean | https://myanimelist.net/anime/48661/JoJo_no_Ki... | 48661 | 8.26 | NaN | Add to list |
| 292 | 293 | Gyakkyou Burai Kaiji: Hakairoku-hen | https://myanimelist.net/anime/10271/Gyakkyou_B... | 10271 | 8.25 | NaN | Add to list |
| 295 | 296 | Detective Conan: Episode One - The Great Detec... | https://myanimelist.net/anime/34036/Detective_... | 34036 | 8.24 | NaN | Add to list |
| 417 | 418 | Boku no Hero Academia 2nd Season | https://myanimelist.net/anime/33486/Boku_no_He... | 33486 | 8.13 | NaN | Watching |
| 420 | 421 | Mahou Shoujo Lyrical Nanoha: The Movie 2nd A's | https://myanimelist.net/anime/10153/Mahou_Shou... | 10153 | 8.13 | NaN | Add to list |
| 428 | 429 | Chuunibyou demo Koi ga Shitai! Movie: Take On Me | https://myanimelist.net/anime/35608/Chuunibyou... | 35608 | 8.12 | NaN | Add to list |
| 450 | 451 | One Piece Film: Strong World | https://myanimelist.net/anime/4155/One_Piece_F... | 4155 | 8.10 | NaN | Add to list |
| 454 | 455 | Tsukimonogatari | https://myanimelist.net/anime/28025/Tsukimonog... | 28025 | 8.10 | NaN | Add to list |
| 541 | 542 | Kara no Kyoukai Movie 3: Tsuukaku Zanryuu | https://myanimelist.net/anime/3783/Kara_no_Kyo... | 3783 | 8.03 | NaN | Add to list |
| 625 | 626 | One Piece Film: Red | https://myanimelist.net/anime/50410/One_Piece_... | 50410 | 7.96 | NaN | Add to list |
| 658 | 659 | Clannad: Mou Hitotsu no Sekai, Tomoyo-hen | https://myanimelist.net/anime/4059/Clannad__Mo... | 4059 | 7.94 | NaN | Add to list |
| 686 | 687 | Code Geass: Fukkatsu no Lelouch | https://myanimelist.net/anime/34437/Code_Geass... | 34437 | 7.92 | NaN | Add to list |
| 2906 | 2907 | Darling in the FranXX | https://myanimelist.net/anime/35849/Darling_in... | 35849 | 7.22 | NaN | Add to list |
Well, I hope all this work was worth it, just to find some anime to watch. It’ll be annoying to update this dataset each time, so I hope I can find a way to integrate this with something else.
As explored, collaborative filtering and content-based filtering are two different techniques used to make recommendations in a recommendation system.
Content-based filtering is a method of making recommendations based on the characteristics of the items being recommended. In order for it to do so, content-based filtering algorithms require information about the items being recommended, such as their features, descriptions, or genres.
Collaborative filtering is a method of making recommendations based on the preferences of similar users: it identifies users who have similar tastes and preferences and uses those preferences to make recommendations to the current user. A strength of collaborative filtering algorithms is that they do not require any information about the items being recommended. In this case, only the username, anime id, and score were needed to produce accurate recommendations.
Both collaborative filtering and content-based filtering have their own strengths and weaknesses. Collaborative filtering can make personalized recommendations based on the preferences of similar users, but it may struggle to make recommendations for users who are new to the system or who have few ratings. Content-based filtering can make recommendations based on the characteristics of the items, but it may struggle to capture complex relationships between items or to make recommendations that are outside the user’s usual interests.
In this case, given the already robust database that myanimelist.net has, both approaches could be implemented without restriction, although the collaborative filtering approach seems to produce more accurate results and is the favoured approach to use. However, based on the results of this study, a hybrid approach may be needed to combine the strengths of each method.
This report compared content-based and collaborative recommendation systems. For the content-based system, a model-based approach was used. The database contained 10,000 anime and their data, such as genres, characters, and descriptions. During the data exploration, outliers were found, and it could be seen that no single feature is a definite predictor of score. This process also identified the key features (description, genres, studios, characters, staff, and voice actors) expected to relate similar items. The cosine similarity between anime was computed for each of these features, and similar items were found per feature. It was found that the content-based recommendation method is very proficient at finding sequels of shows; however, as it searches by keywords and term frequency, it may ultimately lead to a repetitive user experience.
For the collaborative filtering method, information from over 4,000 unique users, containing 1.5 million anime ratings, was gathered. Five different algorithms were cross-validated; the SVD algorithm performed the best and was implemented into the system. The results were investigated using a sample user, and the predictions from this method were accurate.
The following outlines further development of the anime recommendation system. In this study, a user-user based method was used for collaborative filtering and a model-based method was used for content filtering. In the future, it would be good to strengthen the system by exploring an item-item based collaborative approach and a memory-based approach to content filtering. Additionally, packages such as scikit-learn and Surprise were used, but a different approach, such as deep learning algorithms in the TensorFlow package, might also provide good insight. Although it was outside the scope of the project, a good addition would be to make the system interactive and web-based. A web-based system could synchronize with the database, which would increase the amount of user and item information, and this increase in information could ultimately be used to make better recommendations through a hybrid method that combines the two systems and their strengths.
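As one illustration of that item-item direction, and only as a sketch reusing the Surprise dataset built earlier, the package's KNN algorithms can be switched to item-based similarity through sim_options:
# Sketch: item-item collaborative filtering with Surprise.
# user_based=False makes the KNN compute similarities between anime rather than users.
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

sim_options = {'name': 'cosine', 'user_based': False}
item_knn = KNNWithMeans(k=40, sim_options=sim_options)
cross_validate(item_knn, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)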
https://heartbeat.comet.ml/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831
https://towardsdatascience.com/hands-on-content-based-recommender-system-using-python-1d643bf314e4
https://predictivehacks.com/how-to-run-recommender-systems-in-python/
Surprise Python documentation:
https://surprise.readthedocs.io/en/stable/index.html
For information on prediction algorithms package:
https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html
MAE and RMSE documentation:
https://towardsdatascience.com/what-are-rmse-and-mae-e405ce230383#:~:text=Technically%2C%20RMSE%20is%20the%20Root,actual%20values%20of%20a%20variable.