Analyzing YouTube Connections using Support Vector Machines

Kanav Arora
Apr 21, 2023


Introduction:

YouTube has become one of the most popular social media platforms in recent years, with millions of users and a constant stream of new videos uploaded every day. As a result, content creators, marketers, and YouTube analytics professionals increasingly need insight into the network structure of the platform in order to identify opportunities for collaborations and marketing strategies. The connections between videos and users on YouTube can be represented as a graph, where nodes represent videos or users and edges represent the connections between them. This project uses a Support Vector Classification (SVC) model to analyze this graph and predict the number of connections for YouTube channels.

Dataset Description and Analysis:

The dataset used in this project was collected from snap.stanford.edu, and it includes information about the relationships between different channels and users on YouTube. The graph consists of 1,134,890 nodes and 2,987,624 edges. Nodes represent either videos or users, and edges represent the connections between them. The graph has been analyzed using various network statistics, including average clustering coefficient, number of triangles, fraction of closed triangles, diameter, and effective diameter. It has also been divided into 8,385 communities with an average size of 13.5 nodes.
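To make two of the statistics above concrete, here is a minimal sketch that counts triangles and computes the average clustering coefficient. The toy edge list and the function names are made up for illustration; this is not the tooling used in the project, and a graph of this size would need a far more efficient implementation.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def triangles_per_node(adj):
    """For each node, count the triangles it participates in."""
    counts = {}
    for node, nbrs in adj.items():
        # A triangle exists for every pair of neighbours that are themselves linked.
        counts[node] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return counts

def average_clustering(adj):
    """Mean local clustering coefficient (nodes of degree < 2 contribute 0)."""
    tri = triangles_per_node(adj)
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            total += 2.0 * tri[node] / (k * (k - 1))
    return total / len(adj) if adj else 0.0

# Toy graph (illustrative only): a triangle 1-2-3 plus a pendant node 4.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
toy_adj = build_adjacency(toy_edges)
# Each triangle is seen once from each of its three corners.
toy_triangles = sum(triangles_per_node(toy_adj).values()) // 3
toy_avg_cc = average_clustering(toy_adj)
```

On the toy graph, nodes 1 and 2 have a coefficient of 1, node 3 has 1/3, and node 4 has 0, so the average is 7/12.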

Description of Platform/Technologies:

The graph database management system Neo4j was used to work with the data. Neo4j processed the graph data, produced several subgraphs, and filtered nodes and edges according to different criteria. The machine learning algorithm, a Support Vector Classifier, was used to train a model on the graph data to predict the connections of nodes.

Experimentation, Discussion, and Results:

The project aimed to perform four tasks, which are as follows:

  1. Extracting the 10 most important nodes in the graph and a subgraph of 1000 nodes with some node probability p.

The degree of nodes was used to determine the ten most important nodes in the graph. The nodes with the highest degree are more connected to other nodes in the graph, making them more important. The ten most important nodes and their degrees were as follows:

  • “363” — 14,507
  • “106” — 10,430
  • “480” — 4,157
  • “384” — 2,971
  • “517” — 2,775
  • “4” — 2,751
  • “104” — 2,710
  • “311” — 2,609
  • “210” — 2,599
  • “341” — 2,391
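The degree ranking described above can be sketched in a few lines. This is a minimal illustration on a made-up edge list, assuming an undirected graph given as (u, v) pairs; the function name is hypothetical, and the project itself used Neo4j for this step.

```python
from collections import Counter

def top_k_by_degree(edges, k=10):
    """Return the k nodes with the highest degree as (node, degree) pairs."""
    degree = Counter()
    for u, v in edges:
        # Each undirected edge adds one to the degree of both endpoints.
        degree[u] += 1
        degree[v] += 1
    return degree.most_common(k)

# Toy edge list (illustrative only, not the YouTube data).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
toy_top = top_k_by_degree(toy_edges, k=2)
```

Here node "a" comes first with degree 3, followed by one of the degree-2 nodes.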

A subgraph was also created with 1,000 nodes, where each node had a probability of 0.5 of being included in the subgraph.
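One way to read this sampling step is sketched below. It assumes that nodes are visited in sorted order, each is kept independently with probability p, sampling stops once the target size is reached, and only edges between kept nodes are retained. The exact procedure used in the project is not specified, so the function and its parameters are hypothetical.

```python
import random

def sample_induced_subgraph(edges, p=0.5, max_nodes=1000, seed=0):
    """Keep each node with probability p (up to max_nodes), then induce the edges."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    nodes = sorted({n for edge in edges for n in edge})
    kept = set()
    for n in nodes:
        if len(kept) >= max_nodes:
            break
        if rng.random() < p:
            kept.add(n)
    # Induced subgraph: an edge survives only if both endpoints were kept.
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges

# Toy input (illustrative only): a ring of 5,000 nodes.
ring_edges = [(i, (i + 1) % 5000) for i in range(5000)]
kept_nodes, kept_edges = sample_induced_subgraph(ring_edges, p=0.5, max_nodes=1000)
```

Stopping at the first 1,000 accepted nodes biases the sample toward low node IDs; shuffling the node order first would avoid that, at the cost of one more pass over the data.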

  2. Using Hadoop to implement a recent clustering algorithm, finding labels, and rendering the grouping of communities in the graph.

Hadoop was used to run the K-means clustering algorithm on the graph in order to find labels and render the grouping of communities. Each node was assigned a label based on the cluster it fell into. The graph was then rendered with the nodes colored according to their labels, highlighting the communities within the graph.
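K-means fits the MapReduce model naturally: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster into a new centroid. The sketch below imitates that structure in plain Python on a single machine. It assumes each node has already been turned into a numeric feature vector (the post does not say which features were used), and all names and data are made up for illustration.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )

def map_phase(points, centroids):
    """Mapper: emit (cluster_index, point) for every input point."""
    for p in points:
        yield nearest(p, centroids), p

def reduce_phase(pairs, centroids):
    """Reducer: average the points of each cluster into a new centroid."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new = list(centroids)  # a cluster that received no points keeps its old centroid
    for idx, pts in groups.items():
        new[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new

def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical 2-D node features (illustrative only).
toy_points = [(1.0, 1.0), (1.0, 2.0), (10.0, 10.0), (10.0, 11.0)]
toy_centroids, toy_labels = kmeans(toy_points, [(0.0, 0.0), (12.0, 12.0)])
```

In a real Hadoop job, each iteration would be one MapReduce pass, with the current centroids distributed to every mapper.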

  3. Performing online analytical processing using any appropriate distributed clustering package for Neo4j.

Online analytical processing (OLAP) was performed using a distributed clustering package for Neo4j. The aim was to analyze the data in real time and provide insights into the network structure of the YouTube platform. The package was used to analyze the graph and identify the most connected nodes and communities.
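A typical OLAP-style question here is a roll-up: for each community, how many nodes does it contain, how connected are they, and which node is the most connected? The sketch below shows that aggregation in plain Python rather than in Neo4j, so it only illustrates the idea. The community labels, function name, and data are hypothetical.

```python
from collections import defaultdict

def rollup_by_community(edges, community_of):
    """Aggregate node count, total degree, and top node for each community."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    summary = defaultdict(lambda: {"nodes": 0, "total_degree": 0, "top_node": None})
    for node, label in community_of.items():
        row = summary[label]
        row["nodes"] += 1
        row["total_degree"] += degree[node]
        # Track the highest-degree node seen so far in this community.
        if row["top_node"] is None or degree[node] > degree[row["top_node"]]:
            row["top_node"] = node
    return dict(summary)

# Toy graph and made-up community labels (illustrative only).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
toy_labels = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
toy_summary = rollup_by_community(toy_edges, toy_labels)
```

On the toy data, community "A" has three nodes led by node 3, and community "B" has two nodes led by node 4.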

  4. Applying Support Vector Machine on a distributed environment for link prediction on the stored graph.

Finally, an SVC model was applied to the graph to predict the number of connections for YouTube channels. The dataset was split into training and testing sets, and the SVC model was fitted to the training data. The model's performance was evaluated using several metrics, including accuracy, precision, recall, and F1-score. The model achieved an accuracy of 93.4%, indicating that it was successful in predicting the number of connections for YouTube channels.
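For reference, the four metrics named above can be computed from true and predicted labels as in the sketch below. It assumes a binary labelling (for example, whether a link exists or not), which the post does not spell out. The label vectors are invented for illustration and are not the project's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels (illustrative only).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Accuracy alone can be misleading on link-prediction data, where non-links vastly outnumber links, which is why precision, recall, and F1 are worth reporting alongside it.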

Conclusion:

This project has demonstrated the usefulness of machine learning in analyzing complex social networks and providing insights into the structure and dynamics of the YouTube community. The analysis of the YouTube graph has identified the most important nodes in the network and provided a deeper understanding of the network structure. The results of this project can be beneficial for content creators, marketers, and YouTube analytics professionals as they can use the insights to identify potential opportunities for collaborations and marketing strategies.
