
LINE: Large-scale Information Network Embedding

Jian Tang^1, Meng Qu^2*, Mingzhe Wang^2, Ming Zhang^2, Jun Yan^1, Qiaozhu Mei^3

^1 Microsoft Research Asia, {jiatang,junyan}@microsoft.com
^2 School of EECS, Peking University, {mnqu,wangmingzhe,mzhang_cs}@pku.edu.cn
^3 School of Information, University of Michigan, qmei@umich.edu

ABSTRACT

This paper studies the problem of embedding very large information networks into low-dimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. Most existing graph embedding methods do not scale for real-world information networks, which usually contain millions of nodes. In this paper, we propose a novel network embedding method called the "LINE," which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edge-sampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference. Empirical experiments prove the effectiveness of the LINE on a variety of real-world information networks, including language networks, social networks, and citation networks. The algorithm is very efficient: it is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. The source code of the LINE is available online.^1

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning

General Terms

Algorithms, Experimentation

Keywords

information network embedding; scalability; feature learning; dimension reduction

* This work was done when the second author was an intern at Microsoft Research Asia.
^1 https://github.com/tangjianpku/LINE

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05.
http://dx.doi.org/10.1145/2736277.2741093

1. INTRODUCTION

Information networks are ubiquitous in the real world, with examples such as airline networks, publication networks, social and communication networks, and the World Wide Web. The size of these information networks ranges from hundreds of nodes to millions and billions of nodes. Analyzing large information networks has been attracting increasing attention in both academia and industry. This paper studies the problem of embedding information networks into low-dimensional spaces, in which every vertex is represented as a low-dimensional vector. Such a low-dimensional embedding is very useful in a variety of applications such as visualization [21], node classification [3], link prediction [10], and recommendation [23].

Various methods of graph embedding have been proposed in the machine learning literature (e.g., [4, 20, 2]). They generally perform well on smaller networks. The problem becomes much more challenging when a real-world information network is concerned, which typically contains millions of nodes and billions of edges. For example, the Twitter followee-follower network contained 175 million active users and around twenty billion edges in 2012 [14]. Most existing graph embedding algorithms do not scale for networks of this size. For example, the time complexity of classical graph embedding algorithms such as MDS [4], IsoMap [20], and Laplacian eigenmap [2] is at least quadratic in the number of vertices, which is too expensive for networks with millions of nodes. Although a few very recent studies approach the embedding of large-scale networks, these methods either use an indirect approach that is not designed for networks (e.g., [1]) or lack a clear objective function tailored for network embedding (e.g., [16]). We anticipate that a new model with a carefully designed objective function that preserves properties of the graph and an efficient optimization technique should effectively find the embedding of millions of nodes.

In this paper, we propose such a network embedding model called the "LINE," which is able to scale to very large, arbitrary types of networks: undirected, directed, and/or weighted. The model optimizes an objective which preserves both the local and global network structures. Naturally, the local structures are represented by the observed links in the networks, which capture the first-order proximity between the vertices. Most existing graph embedding algorithms are designed to preserve this first-order proximity, e.g., IsoMap [20] and Laplacian eigenmap [2], even if they do not scale. We observe that in a real-world network many (if not the majority of) legitimate links are actually not observed. In other words, the observed first-order proximity in the real-world data is not sufficient for preserving the global network structures. As a complement, we explore the second-order proximity between the vertices, which is not determined through the observed tie strength but through the shared neighborhood structures of the vertices. The general notion of the second-order proximity can be interpreted as nodes with shared neighbors being likely to be similar. Such an intuition can be found in the theories of sociology and linguistics. For example, "the degree of overlap of two people's friendship networks correlates with the strength of ties between them" in a social network [6]; and "You shall know a word by the company it keeps" (Firth, J. R. 1957:11) in text corpora [5]. Indeed, people who share many common friends are likely to share the same interest and become friends, and words that are used together with many similar words are likely to have similar meanings.


Figure 1: A toy example of an information network. Edges can be undirected, directed, and/or weighted. Vertices 6 and 7 should be placed closely in the low-dimensional space as they are connected through a strong tie. Vertices 5 and 6 should also be placed closely as they share similar neighbors.

Fig. 1 presents an illustrative example. As the weight of the edge between vertices 6 and 7 is large, i.e., 6 and 7 have a high first-order proximity, they should be represented closely to each other in the embedded space. On the other hand, though there is no link between vertices 5 and 6, they share many common neighbors, i.e., they have a high second-order proximity and therefore should also be represented closely to each other. We expect that the consideration of the second-order proximity effectively complements the sparsity of the first-order proximity and better preserves the global structure of the network. In this paper, we will present carefully designed objectives that preserve the first-order and the second-order proximities.

Even if a sound objective is found, optimizing it for a very large network is challenging. One approach that has attracted attention in recent years is using stochastic gradient descent for the optimization. However, we show that directly deploying stochastic gradient descent is problematic for real-world information networks. This is because in many networks edges are weighted and the weights usually present a high variance. Consider a word co-occurrence network, in which the weights (co-occurrences) of word pairs may range from one to hundreds of thousands. These weights of the edges will be multiplied into the gradients, resulting in the explosion of the gradients and thus compromising the performance. To address this, we propose a novel edge-sampling method, which improves both the effectiveness and efficiency of the inference. We sample the edges with probabilities proportional to their weights, and then treat the sampled edges as binary edges for model updating. With this sampling process, the objective function remains the same and the weights of the edges no longer affect the gradients.

The LINE is very general, and works well for directed or undirected, weighted or unweighted graphs. We evaluate the performance of the LINE with various real-world information networks, including language networks, social networks, and citation networks. The effectiveness of the learned embeddings is evaluated within multiple data mining tasks, including word analogy, text classification, and node classification. The results suggest that the LINE model outperforms other competitive baselines in terms of both effectiveness and efficiency. It is able to learn the embedding of a network with millions of nodes and billions of edges in a few hours on a single machine.

To summarize, we make the following contributions:

• We propose a novel network embedding model called the "LINE," which suits arbitrary types of information networks and easily scales to millions of nodes. It has a carefully designed objective function that preserves both the first-order and second-order proximities.

• We propose an edge-sampling algorithm for optimizing the objective. The algorithm tackles the limitation of the classical stochastic gradient descent and improves the effectiveness and efficiency of the inference.

• We conduct extensive experiments on real-world information networks. Experimental results prove the effectiveness and efficiency of the proposed LINE model.

Organization. The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 formally defines the problem of large-scale information network embedding. Section 4 introduces the LINE model in detail. Section 5 presents the experimental results. Finally, we conclude in Section 6.

2. RELATED WORK

Our work is related to classical methods of graph embedding or dimension reduction in general, such as multidimensional scaling (MDS) [4], IsoMap [20], LLE [18] and Laplacian eigenmap [2]. These approaches typically first construct an affinity graph using the feature vectors of the data points, e.g., the K-nearest-neighbor graph of the data, and then embed the affinity graph [22] into a low-dimensional space. However, these algorithms usually rely on solving for the leading eigenvectors of the affinity matrices, the complexity of which is at least quadratic in the number of nodes, making them inefficient for handling large-scale networks.

Among the most recent literature is a technique called graph factorization [1]. It finds the low-dimensional embedding of a large graph through matrix factorization, which is optimized using stochastic gradient descent. This is possible because a graph can be represented as an affinity matrix. However, the objective of matrix factorization is not designed for networks and therefore does not necessarily preserve the global network structure. Intuitively, graph factorization expects nodes with higher first-order proximity to be represented closely. Instead, the LINE model uses an objective that is particularly designed for networks, which preserves both the first-order and the second-order proximities. Practically, the graph factorization method only applies to undirected graphs, while the proposed model is applicable to both undirected and directed graphs.

The most recent work related to ours is DeepWalk [16], which deploys a truncated random walk for social network embedding. Although empirically effective, DeepWalk does not provide a clear objective that articulates what network properties are preserved. Intuitively, DeepWalk expects nodes with higher second-order proximity to yield similar low-dimensional representations, while the LINE preserves both first-order and second-order proximities. DeepWalk uses random walks to expand the neighborhood of a vertex, which is analogous to a depth-first search. We use a breadth-first search strategy, which is a more reasonable approach to the second-order proximity. Practically, DeepWalk only applies to unweighted networks, while our model is applicable to networks with both weighted and unweighted edges.

In Section 5, we empirically compare the proposed model with these methods using various real-world networks.

3. PROBLEM DEFINITION

We formally define the problem of large-scale information network embedding using first-order and second-order proximities. We first define an information network as follows:

Definition 1. (Information Network) An information network is defined as G = (V, E), where V is the set of vertices, each representing a data object, and E is the set of edges between the vertices, each representing a relationship between two data objects. Each edge e ∈ E is an ordered pair e = (u, v) and is associated with a weight w_{uv} > 0, which indicates the strength of the relation. If G is undirected, we have (u, v) ≡ (v, u) and w_{uv} ≡ w_{vu}; if G is directed, we have (u, v) ≢ (v, u) and w_{uv} ≢ w_{vu}.

In practice, information networks can be either directed (e.g., citation networks) or undirected (e.g., the social network of users in Facebook). The weights of the edges can be either binary or take any real value. Note that while negative edge weights are possible, in this study we only consider non-negative weights. For example, in citation networks and social networks, w_{uv} takes binary values; in co-occurrence networks between different objects, w_{uv} can take any non-negative value. The weights of the edges in some networks may diverge as some objects co-occur many times while others may just co-occur a few times.

Embedding an information network into a low-dimensional space is useful in a variety of applications. To conduct the embedding, the network structures must be preserved. The first intuition is that the local network structure, i.e., the local pairwise proximity between the vertices, must be preserved. We define the local network structures as the first-order proximity between the vertices:

Definition 2. (First-order Proximity) The first-order proximity in a network is the local pairwise proximity between two vertices. For each pair of vertices linked by an edge (u, v), the weight on that edge, w_{uv}, indicates the first-order proximity between u and v. If no edge is observed between u and v, their first-order proximity is 0.

The first-order proximity usually implies the similarity of two nodes in a real-world network. For example, people who are friends with each other in a social network tend to share similar interests; pages linking to each other in the World Wide Web tend to talk about similar topics. Because of this importance, many existing graph embedding algorithms such as IsoMap, LLE, Laplacian eigenmap, and graph factorization have the objective to preserve the first-order proximity. However, in a real-world information network, the links observed are only a small proportion, with many others missing [10]. A pair of nodes on a missing link has a zero first-order proximity, even though they are intrinsically very similar to each other. Therefore, first-order proximity alone is not sufficient for preserving the network structures, and it is important to seek an alternative notion of proximity that addresses the problem of sparsity. A natural intuition is that vertices that share similar neighbors tend to be similar to each other. For example, in social networks, people who share similar friends tend to have similar interests and thus become friends; in word co-occurrence networks, words that always co-occur with the same set of words tend to have similar meanings. We therefore define the second-order proximity, which complements the first-order proximity and preserves the network structure.

Definition 3. (Second-order Proximity) The second-order proximity between a pair of vertices (u, v) in a network is the similarity between their neighborhood network structures. Mathematically, let p_u = (w_{u,1}, ..., w_{u,|V|}) denote the first-order proximity of u with all the other vertices; then the second-order proximity between u and v is determined by the similarity between p_u and p_v. If no vertex is linked from/to both u and v, the second-order proximity between u and v is 0.
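To make Definitions 2 and 3 concrete, here is a minimal sketch that computes both proximities for a small weighted graph; the toy graph, the use of cosine similarity to compare the neighborhood vectors p_u and p_v, and the function names are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

# Toy undirected weighted graph as an edge list (u, v, weight); illustrative only.
edges = [(0, 1, 3.0), (0, 2, 1.0), (1, 2, 2.0), (2, 3, 1.0)]
n = 4

# Dense weight matrix; W[u, v] is the first-order proximity of Definition 2.
W = np.zeros((n, n))
for u, v, w in edges:
    W[u, v] = W[v, u] = w

def first_order(u, v):
    # Edge weight w_uv, or 0 if no edge is observed.
    return W[u, v]

def second_order(u, v):
    # Similarity between the neighborhood vectors p_u and p_v (Definition 3);
    # cosine similarity is one reasonable choice of the similarity measure.
    pu, pv = W[u], W[v]
    denom = np.linalg.norm(pu) * np.linalg.norm(pv)
    return float(pu @ pv) / denom if denom > 0 else 0.0

print(first_order(0, 1))   # 3.0
print(second_order(0, 1))  # > 0 because vertices 0 and 1 share neighbor 2
```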

We investigate both first-order and second-order proximity for network embedding, which is defined as follows.

Definition 4. (Large-scale Information Network Embedding) Given a large network G = (V, E), the problem of Large-scale Information Network Embedding aims to represent each vertex v ∈ V in a low-dimensional space R^d, i.e., learning a function f_G: V → R^d, where d ≪ |V|. In the space R^d, both the first-order proximity and the second-order proximity between the vertices are preserved.

Next, we introduce a large-scale network embedding model that preserves both first- and second-order proximities.

4. LINE: LARGE-SCALE INFORMATION NETWORK EMBEDDING

A desirable embedding model for real-world information networks must satisfy several requirements: first, it must be able to preserve both the first-order proximity and the second-order proximity between the vertices; second, it must scale for very large networks, say millions of vertices and billions of edges; third, it must be able to deal with networks with arbitrary types of edges: directed, undirected and/or weighted. In this section, we present a novel network embedding model called the "LINE," which satisfies all three requirements.

4.1 Model Description

We describe the LINE model that preserves the first-order proximity and the second-order proximity separately, and then introduce a simple way to combine the two proximities.

4.1.1 LINE with First-order Proximity

The first-order proximity refers to the local pairwise proximity between the vertices in the network. To model the first-order proximity, for each undirected edge (i, j), we define the joint probability between vertex v_i and v_j as follows:

p_1(v_i, v_j) = \frac{1}{1 + \exp(-u_i^T \cdot u_j)},    (1)

where u_i ∈ R^d is the low-dimensional vector representation of vertex v_i. Eqn. (1) defines a distribution p(·,·) over the space V × V, and its empirical probability can be defined as \hat{p}_1(i, j) = \frac{w_{ij}}{W}, where W = \sum_{(i,j) \in E} w_{ij}. To preserve the first-order proximity, a straightforward way is to minimize the following objective function:

O_1 = d(\hat{p}_1(·,·), p_1(·,·)),    (2)

where d(·,·) is the distance between two distributions. We choose to minimize the KL-divergence of the two probability distributions. Replacing d(·,·) with KL-divergence and omitting some constants, we have:

O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j).    (3)

Note that the first-order proximity is only applicable for undirected graphs, not for directed graphs. By finding the {u_i}_{i=1..|V|} that minimize the objective in Eqn. (3), we can represent every vertex in the d-dimensional space.
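As a quick illustration of Eqns. (1)–(3), the following sketch evaluates the first-order objective O_1 for a toy undirected graph; the random initialization, the toy edges, and the function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 3.0), (1, 2, 1.0), (0, 2, 2.0)]   # undirected edges (i, j, w_ij)
n, d = 3, 4
U = rng.normal(scale=0.1, size=(n, d))            # one d-dimensional vector per vertex

def p1(i, j):
    # Eqn. (1): joint probability from the inner product of the two embeddings.
    return 1.0 / (1.0 + np.exp(-(U[i] @ U[j])))

def O1():
    # Eqn. (3): weighted negative log-likelihood over the observed edges.
    return -sum(w * np.log(p1(i, j)) for i, j, w in edges)

print(O1())   # decreases as the embeddings of linked vertices are moved closer
```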

4.1.2 LINE with Second-order Proximity

The second-order proximity is applicable for both directed and undirected graphs. Given a network, without loss of generality, we assume it is directed (an undirected edge can be considered as two directed edges with opposite directions and equal weights). The second-order proximity assumes that vertices sharing many connections to other vertices are similar to each other. In this case, each vertex is also treated as a specific "context," and vertices with similar distributions over the "contexts" are assumed to be similar. Therefore, each vertex plays two roles: the vertex itself and a specific "context" of other vertices. We introduce two vectors u_i and u'_i, where u_i is the representation of v_i when it is treated as a vertex while u'_i is the representation of v_i when it is treated as a specific "context." For each directed edge (i, j), we first define the probability of "context" v_j generated by vertex v_i as:

p_2(v_j | v_i) = \frac{\exp(u'^T_j \cdot u_i)}{\sum_{k=1}^{|V|} \exp(u'^T_k \cdot u_i)},    (4)

where |V| is the number of vertices or "contexts." For each vertex v_i, Eqn. (4) actually defines a conditional distribution p_2(·|v_i) over the contexts, i.e., the entire set of vertices in the network. As mentioned above, the second-order proximity assumes that vertices with similar distributions over the contexts are similar to each other. To preserve the second-order proximity, we should make the conditional distribution of the contexts p_2(·|v_i) specified by the low-dimensional representation be close to the empirical distribution \hat{p}_2(·|v_i). Therefore, we minimize the following objective function:

O_2 = \sum_{i \in V} \lambda_i d(\hat{p}_2(·|v_i), p_2(·|v_i)),    (5)

where d(·,·) is the distance between two distributions. As the importance of the vertices in the network may be different, we introduce λ_i in the objective function to represent the prestige of vertex i in the network, which can be measured by the degree or estimated through algorithms such as PageRank [15]. The empirical distribution \hat{p}_2(·|v_i) is defined as \hat{p}_2(v_j | v_i) = \frac{w_{ij}}{d_i}, where w_{ij} is the weight of the edge (i, j) and d_i is the out-degree of vertex i, i.e., d_i = \sum_{k \in N(i)} w_{ik}, where N(i) is the set of out-neighbors of v_i. In this paper, for simplicity we set λ_i as the degree of vertex i, i.e., λ_i = d_i, and here we also adopt KL-divergence as the distance function. Replacing d(·,·) with KL-divergence, setting λ_i = d_i and omitting some constants, we have:

O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j | v_i).    (6)

By learning {u_i}_{i=1..|V|} and {u'_i}_{i=1..|V|} that minimize this objective, we are able to represent every vertex v_i with a d-dimensional vector u_i.
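The sketch below makes Eqns. (4)–(6) concrete by evaluating the full-softmax conditional distribution and the objective O_2 on a toy directed graph (the optimization in Section 4.2 replaces this exact softmax with negative sampling); the toy data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 2.0), (0, 2, 1.0), (1, 2, 3.0)]   # directed edges (i, j, w_ij)
n, d = 3, 4
U  = rng.normal(scale=0.1, size=(n, d))   # vertex representations  u_i
Uc = rng.normal(scale=0.1, size=(n, d))   # context representations u'_i

def p2_given(i):
    # Eqn. (4): softmax over all vertices acting as "contexts" of source vertex i.
    scores = Uc @ U[i]
    e = np.exp(scores - scores.max())     # subtract the max for numerical stability
    return e / e.sum()

def O2():
    # Eqn. (6): weighted negative log-likelihood of the observed directed edges.
    return -sum(w * np.log(p2_given(i)[j]) for i, j, w in edges)

print(O2())
```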

4.1.3 Combining First-order and Second-order Proximities

To embed the networks by preserving both the first-order and second-order proximity, a simple and effective way we find in practice is to train the LINE model that preserves the first-order proximity and the LINE model that preserves the second-order proximity separately, and then concatenate the embeddings trained by the two methods for each vertex. A more principled way to combine the two proximities is to jointly train the objective functions (3) and (6), which we leave as future work.
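A minimal sketch of this concatenation step, assuming the two learned embeddings of a vertex are available as NumPy vectors; the re-weighting parameter is shown only as a placeholder for whatever weighting a supervised task would learn.

```python
import numpy as np

def combine(u_first, u_second, w2=1.0):
    # Concatenate the LINE(1st) and LINE(2nd) vectors of one vertex; w2 re-weights
    # the second-order half (in a supervised task it could be tuned on training data).
    return np.concatenate([np.asarray(u_first), w2 * np.asarray(u_second)])
```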

4.2 Model Optimization

Optimizing objective (6) is computationally expensive, which requires the summation over the entire set of vertices when calculating the conditional probability p_2(·|v_i). To address this problem, we adopt the approach of negative sampling proposed in [13], which samples multiple negative edges according to some noisy distribution for each edge (i, j). More specifically, it specifies the following objective function for each edge (i, j):

\log \sigma(u'^T_j \cdot u_i) + \sum_{i=1}^{K} E_{v_n \sim P_n(v)}[\log \sigma(-u'^T_n \cdot u_i)],    (7)

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. The first term models the observed edges, the second term models the negative edges drawn from the noise distribution, and K is the number of negative edges. We set P_n(v) ∝ d_v^{3/4} as proposed in [13], where d_v is the out-degree of vertex v.

For the objective function (3), there exists a trivial solution: u_{ik} = ∞, for i = 1, ..., |V| and k = 1, ..., d. To avoid the trivial solution, we can still utilize the negative sampling approach (7) by just changing u'^T_j to u^T_j.

We adopt the asynchronous stochastic gradient algorithm (ASGD) [17] for optimizing Eqn. (7). In each step, the ASGD algorithm samples a mini-batch of edges and then updates the model parameters. If an edge (i, j) is sampled, the gradient w.r.t. the embedding vector u_i of vertex i will be calculated as:

\frac{\partial O_2}{\partial u_i} = w_{ij} \cdot \frac{\partial \log p_2(v_j | v_i)}{\partial u_i}.    (8)

Note that the gradient will be multiplied by the weight of the edge. This will become problematic when the weights of edges have a high variance. For example, in a word co-occurrence network, some words co-occur many times (e.g., tens of thousands) while some words co-occur only a few times. In such networks, the scales of the gradients diverge and it is very hard to find a good learning rate. If we select a large learning rate according to the edges with small weights, the gradients on edges with large weights will explode, while the gradients will become too small if we select the learning rate according to the edges with large weights.
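For concreteness, here is a sketch of a single stochastic update on the negative-sampling objective of Eqn. (7) for one sampled edge (i, j); with the edge-sampling treatment introduced in the next subsection, the sampled edge is treated as binary, so no edge weight multiplies this gradient. The array layout, function names, and default hyperparameters are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_edge(U, Uc, i, j, neg_sampler, K=5, rho=0.025):
    """One stochastic ascent step on Eqn. (7) for a sampled edge (i, j).

    U and Uc are |V| x d arrays of vertex and context embeddings; neg_sampler()
    returns a vertex id drawn from the noise distribution P_n(v) proportional
    to d_v^(3/4). The sampled edge is treated as binary, so no edge weight
    enters the gradient.
    """
    grad_i = np.zeros_like(U[i])
    targets = [(j, 1.0)] + [(neg_sampler(), 0.0) for _ in range(K)]
    for t, label in targets:
        g = rho * (label - sigmoid(Uc[t] @ U[i]))
        grad_i += g * Uc[t]
        Uc[t] += g * U[i]      # update the "context" vector of the target
    U[i] += grad_i             # update the vertex vector of the source
```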

4.2.1 Optimization via Edge Sampling

The intuition in solving the above problem is that if the weights of all the edges are equal (e.g., a network with binary edges), then there will be no problem of choosing an appropriate learning rate. A simple treatment is thus to unfold a weighted edge into multiple binary edges, e.g., an edge with weight w is unfolded into w binary edges. This will solve the problem but will significantly increase the memory requirement, especially when the weights of the edges are very large. To resolve this, one can sample from the original edges and treat the sampled edges as binary edges, with the sampling probabilities proportional to the original edge weights. With this edge-sampling treatment, the overall objective function remains the same. The problem boils down to how to sample the edges according to their weights.

Let W = (w_1, w_2, ..., w_{|E|}) denote the sequence of the weights of the edges. One can simply calculate the sum of the weights w_{sum} = \sum_{i=1}^{|E|} w_i first, and then sample a random value uniformly within the range [0, w_{sum}] to see which interval [\sum_{j=0}^{i-1} w_j, \sum_{j=0}^{i} w_j) the random value falls into. This approach takes O(|E|) time to draw a sample, which is costly when the number of edges |E| is large. We use the alias table method [9] to draw a sample according to the weights of the edges, which takes only O(1) time when repeatedly drawing samples from the same discrete distribution.
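Below is a compact, Vose-style sketch of the alias table construction and the O(1) draw it enables; it illustrates the standard alias method cited as [9] rather than reproducing the LINE source code.

```python
import random

def build_alias_table(weights):
    # Vose's alias method: O(|E|) construction, O(1) per draw afterwards.
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:    # leftovers are exactly 1 up to rounding error
        prob[i] = 1.0
    return prob, alias

def sample_index(prob, alias):
    # Draw an edge index with probability proportional to its weight.
    k = random.randrange(len(prob))
    return k if random.random() < prob[k] else alias[k]
```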

Sampling an edge from the alias table takes constant time, O(1), and optimization with negative sampling takes O(d(K+1)) time, where K is the number of negative samples. Therefore, overall each step takes O(dK) time. In practice, we find that the number of steps used for optimization is usually proportional to the number of edges, O(|E|). Therefore, the overall time complexity of the LINE is O(dK|E|), which is linear in the number of edges |E| and does not depend on the number of vertices |V|. The edge-sampling treatment improves the effectiveness of the stochastic gradient descent without compromising the efficiency.

4.3 Discussion

We discuss several practical issues of the LINE model.

Low-degree vertices. One practical issue is how to accurately embed vertices with small degrees. As the number of neighbors of such a node is very small, it is very hard to accurately infer its representation, especially with the second-order-proximity-based methods which heavily rely on the number of "contexts." An intuitive solution to this is expanding the neighbors of those vertices by adding higher-order neighbors, such as neighbors of neighbors. In this paper, we only consider adding second-order neighbors, i.e., neighbors of neighbors, to each vertex. The weight between vertex i and its second-order neighbor j is measured as

w_{ij} = \sum_{k \in N(i)} w_{ik} \frac{w_{kj}}{d_k}.    (9)

In practice, one can only add a subset of vertices {j} which have the largest proximity w_{ij} with the low-degree vertex i.

New vertices. Another practical issue is how to find the representation of newly arrived vertices. For a new vertex i, if its connections to the existing vertices are known, we can obtain the empirical distributions \hat{p}_1(·, v_i) and \hat{p}_2(·|v_i) over existing vertices. To obtain the embedding of the new vertex, according to the objective function Eqn. (3) or Eqn. (6), a straightforward way is to minimize either one of the following objective functions

-\sum_{j \in N(i)} w_{ji} \log p_1(v_j, v_i),  or  -\sum_{j \in N(i)} w_{ji} \log p_2(v_j | v_i),    (10)

by updating the embedding of the new vertex and keeping the embeddings of existing vertices. If no connections between the new vertex and existing vertices are observed, we must resort to other information, such as the textual information of the vertices, and we leave it as our future work.
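The following sketch applies the neighborhood-expansion rule of Eqn. (9), which is also what the reconstructed networks in Section 5 rely on; the adjacency representation, the cutoff on how many second-order neighbors to keep, and the function name are illustrative assumptions.

```python
from collections import defaultdict

def second_order_neighbors(i, adj, max_added=1000):
    """Weights from vertex i to its second-order neighbors, following Eqn. (9).

    adj maps a vertex to a dict {neighbor: weight}; d_k is the weighted
    (out-)degree of k.  Only the top `max_added` candidates by w_ij are kept,
    mirroring the reconstruction used in Section 5 (the cutoff is illustrative).
    """
    w2 = defaultdict(float)
    for k, w_ik in adj[i].items():
        d_k = float(sum(adj[k].values()))
        for j, w_kj in adj[k].items():
            if j != i and j not in adj[i]:   # only genuinely new neighbors are added
                w2[j] += w_ik * w_kj / d_k
    ranked = sorted(w2.items(), key=lambda kv: -kv[1])
    return dict(ranked[:max_added])
```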

5. EXPERIMENTS

We empirically evaluated the effectiveness and efficiency of the LINE. We applied the method to several large-scale real-world networks of different types, including a language network, two social networks, and two citation networks.

5.1 Experiment Setup

Data Sets.

(1) Language network. We constructed a word co-occurrence network from the entire set of English Wikipedia pages. Words within every 5-word sliding window are considered to be co-occurring with each other. Words with frequency smaller than 5 are filtered out. (2) Social networks. We use two social networks: Flickr and Youtube (available at http://socialnetworks.mpi-sws.org/data-imc2007.html). The Flickr network is denser than the Youtube network (the same network as used in DeepWalk [16]). (3) Citation networks. Two types of citation networks are used: an author citation network and a paper citation network. We use the DBLP data set [19] (available at http://arnetminer.org/citation) to construct the citation networks between authors and between papers. The author citation network records the number of papers written by one author and cited by another author. The detailed statistics of these networks are summarized in Table 1. They represent a variety of information networks: directed and undirected, binary and weighted. Each network contains at least half a million nodes and millions of edges, with the largest network containing around two million nodes and a billion edges.

Table 1: Statistics of the real-world information networks.

             Language Network      Social Network                           Citation Network
Name         Wikipedia             Flickr              Youtube              DBLP(AuthorCitation)  DBLP(PaperCitation)
Type         undirected, weighted  undirected, binary  undirected, binary   directed, weighted    directed, binary
|V|          1,985,098             1,715,256           1,138,499            524,061               781,109
|E|          1,000,924,086         22,613,981          2,990,443            20,580,238            4,191,677
Avg. degree  504.22                26.37               5.25                 78.54                 10.73
#Labels      7                     5                   47                   7                     7
#train       70,000                75,958              31,703               20,684                10,398

Compared Algorithms.

We compare the LINE model with several existing graph embedding methods that are able to scale up to very large networks. We do not compare with some classical graph embedding algorithms such as MDS, IsoMap, and Laplacian eigenmap, as they cannot handle networks of this scale.

• Graph factorization (GF) [1]. We compare with the matrix factorization technique for graph factorization. An information network can be represented as an affinity matrix, and graph factorization represents each vertex with a low-dimensional vector through matrix factorization. Graph factorization is optimized through stochastic gradient descent and is able to handle large networks. It only applies to undirected networks.

• DeepWalk [16]. DeepWalk is an approach recently proposed for social network embedding, which is only applicable for networks with binary edges. For each vertex, truncated random walks starting from the vertex are used to obtain the contextual information, and therefore only second-order proximity is utilized.

• LINE-SGD. This is the LINE model introduced in Section 4.1 that optimizes the objective Eqn. (3) or Eqn. (6) directly with stochastic gradient descent. With this approach, the weights of the edges are directly multiplied into the gradients when the edges are sampled for model updating. There are two variants of this approach: LINE-SGD(1st) and LINE-SGD(2nd), which use first- and second-order proximity respectively.

• LINE. This is the LINE model optimized through the edge-sampling treatment introduced in Section 4.2. In each stochastic gradient step, an edge is sampled with the probability proportional to its weight and then treated as binary for model updating. There are also two variants: LINE(1st) and LINE(2nd). Like graph factorization, both LINE(1st) and LINE-SGD(1st) only apply to undirected graphs. LINE(2nd) and LINE-SGD(2nd) apply to both undirected and directed graphs.

• LINE(1st+2nd). To utilize both first-order and second-order proximity, a simple and effective way is to concatenate the vector representations learned by LINE(1st) and LINE(2nd) into a longer vector. After concatenation, the dimensions should be re-weighted to balance the two representations. In a supervised learning task, the weighting of dimensions can be automatically found based on the training data. In an unsupervised task, however, it is more difficult to set the weights. Therefore we only apply LINE(1st+2nd) to the scenario of supervised tasks.

Parameter Settings.

The mini-batch size of the stochastic gradient descent is set as 1 for all the methods. Similar to [13], the learning rate is set with the starting value ρ_0 = 0.025 and ρ_t = ρ_0(1 − t/T), where T is the total number of mini-batches or edge samples. For fair comparisons, the dimensionality of the embeddings of the language network is set to 200, as used in word embedding [13]. For other networks, the dimension is set as 128 by default, as used in [16]. Other default settings include: the number of negative samples K = 5 for LINE and LINE-SGD; the total number of samples T = 10 billion for LINE(1st) and LINE(2nd) and T = 20 billion for GF; window size win = 10, walk length t = 40, and walks per vertex γ = 40 for DeepWalk. All the embedding vectors are finally normalized by setting ||w||_2 = 1.
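For reference, a small sketch of the learning-rate schedule ρ_t = ρ_0(1 − t/T) and the final L2 normalization described above; the helper names are ours and the defaults simply mirror the values quoted in this subsection.

```python
import numpy as np

def learning_rate(t, T, rho0=0.025):
    # rho_t = rho_0 * (1 - t / T), decayed over the total number of edge samples T.
    return rho0 * (1.0 - t / float(T))

def normalize_rows(U):
    # Final post-processing: set ||w||_2 = 1 for every embedding vector.
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```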

5.2 Quantitative Results

5.2.1 Language Network

We start with the results on the language network, which contains two million nodes and a billion edges. Two applications are used to evaluate the effectiveness of the learned embeddings: word analogy [12] and document classification.

Table 2: Results of word analogy on Wikipedia data.

Algorithm       Semantic (%)  Syntactic (%)  Overall (%)  Running time
GF              61.38         44.08          51.93        2.96h
DeepWalk        50.79         37.70          43.65        16.64h
SkipGram        69.14         57.94          63.02        2.82h
LINE-SGD(1st)   9.72          7.48           8.50         3.83h
LINE-SGD(2nd)   20.42         9.56           14.49        3.94h
LINE(1st)       58.08         49.42          53.35        2.44h
LINE(2nd)       73.79         59.72          66.10        2.55h

Word Analogy. This task is introduced by Mikolov et al. [12]. Given a word pair (a, b) and a word c, the task aims to find a word d, such that the relation between c and d is similar to the relation between a and b, denoted as a : b → c : ?. For instance, given the word pair ("China", "Beijing") and the word "France," the right answer should be "Paris," because "Beijing" is the capital of "China" just as "Paris" is the capital of "France." Given the word embeddings, this task is solved by finding the word d* whose embedding is closest to the vector u_b − u_a + u_c in terms of cosine proximity, i.e., d* = argmax_d cos(u_b − u_a + u_c, u_d). Two categories of word analogy are used in this task: semantic and syntactic.

Table 2 reports the results of word analogy using the embeddings of words learned on the Wikipedia corpora (SkipGram) or the Wikipedia word network (all other methods). For graph factorization, the weight between each pair of words is defined as the logarithm of the number of co-occurrences, which leads to better performance than the original value of co-occurrences. For DeepWalk, different cutoff thresholds are tried to convert the language network into a binary network, and the best performance is achieved when all the edges are kept in the network. We also compare with the state-of-the-art word embedding model SkipGram [12], which learns the word embeddings directly from the original Wikipedia pages and is also implicitly a matrix factorization approach [8]. The window size is set as 5, the same as used for constructing the language network.
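A minimal sketch of the analogy evaluation d* = argmax_d cos(u_b − u_a + u_c, u_d), assuming a word-to-row vocabulary and an L2-normalized embedding matrix; excluding the three query words from the candidates is a common convention in this benchmark, not something stated in the paper.

```python
import numpy as np

def analogy(word_a, word_b, word_c, vocab, E):
    # Return the word d* maximizing cos(u_b - u_a + u_c, u_d); vocab maps a word
    # to its row in the embedding matrix E, whose rows are assumed L2-normalized.
    q = E[vocab[word_b]] - E[vocab[word_a]] + E[vocab[word_c]]
    q /= np.linalg.norm(q)
    scores = E @ q
    for w in (word_a, word_b, word_c):     # exclude the query words themselves
        scores[vocab[w]] = -np.inf
    best = int(np.argmax(scores))
    return next(w for w, idx in vocab.items() if idx == best)
```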

We can see that LINE(2nd) outperforms all other methods, including the graph embedding methods and SkipGram. This indicates that the second-order proximity better captures the word semantics compared to the first-order proximity. This is not surprising, as a high second-order proximity implies that two words can be replaced in the same context, which is a stronger indicator of similar semantics than first-order co-occurrences. It is intriguing that the LINE(2nd) outperforms the state-of-the-art word embedding model trained on the original corpus. The reason may be that a language network better captures the global structure of word co-occurrences than the original word sequences. Among the other methods, both graph factorization and LINE(1st) significantly outperform DeepWalk even though DeepWalk explores second-order proximity. This is because DeepWalk has to ignore the weights (i.e., co-occurrences) of the edges, which are very important in a language network. The performance of the LINE models directly optimized with SGD is much worse, because the weights of the edges in the language network diverge, ranging from a single digit to tens of thousands, making the learning process suffer. The LINE optimized by the edge-sampling treatment effectively addresses this problem, and performs very well using either first-order or second-order proximity.

Table 3: Results of Wikipedia page classification on the Wikipedia data set.

Micro-F1
Algorithm       10%      20%      30%      40%      50%      60%      70%      80%      90%
GF              79.63    80.51    80.94    81.18    81.38    81.54    81.63    81.71    81.78
DeepWalk        78.89    79.92    80.41    80.69    80.92    81.08    81.21    81.35    81.42
SkipGram        79.84    80.82    81.28    81.57    81.71    81.87    81.98    82.05    82.09
LINE-SGD(1st)   76.03    77.05    77.57    77.85    78.08    78.25    78.39    78.44    78.49
LINE-SGD(2nd)   74.68    76.53    77.54    78.18    78.63    78.96    79.19    79.40    79.57
LINE(1st)       79.67    80.55    80.94    81.24    81.40    81.52    81.61    81.69    81.67
LINE(2nd)       79.93    80.90    81.31    81.63    81.80    81.91    82.00    82.11    82.17
LINE(1st+2nd)   81.04**  82.08**  82.58**  82.93**  83.16**  83.37**  83.52**  83.63**  83.74**

Macro-F1
Algorithm       10%      20%      30%      40%      50%      60%      70%      80%      90%
GF              79.49    80.39    80.82    81.08    81.26    81.40    81.52    81.61    81.68
DeepWalk        78.78    79.78    80.30    80.56    80.82    80.97    81.11    81.24    81.32
SkipGram        79.74    80.71    81.15    81.46    81.63    81.78    81.88    81.98    82.01
LINE-SGD(1st)   75.85    76.90    77.40    77.71    77.94    78.12    78.24    78.29    78.36
LINE-SGD(2nd)   74.70    76.45    77.43    78.09    78.53    78.83    79.08    79.29    79.46
LINE(1st)       79.54    80.44    80.82    81.13    81.29    81.43    81.51    81.60    81.59
LINE(2nd)       79.82    80.81    81.22    81.52    81.71    81.82    81.92    82.00    82.07
LINE(1st+2nd)   80.94**  81.99**  82.49**  82.83**  83.07**  83.29**  83.42**  83.55**  83.66**

Significantly outperforms GF at the: ** 0.01 and * 0.05 level, paired t-test.

All the models are run on a single machine with 1TB of memory and 40 CPU cores at 2.0GHz, using 16 threads. Both LINE(1st) and LINE(2nd) are quite efficient, taking less than 3 hours to process such a network with 2 million nodes and a billion edges. Both are at least 10% faster than graph factorization, and much more efficient than DeepWalk (which is five times slower). The reason that the LINE-SGDs are slightly slower is that a threshold-cutting technique has to be applied to prevent the gradients from exploding.

Document Classification. Another way to evaluate the quality of the word embeddings is to use the word vectors to compute a document representation, which can be evaluated with document classification tasks. To obtain document vectors, we choose a very simple approach: taking the average of the word vector representations in that document. This is because we aim to compare the word embeddings obtained with different approaches instead of finding the best method for document embedding. The readers can find advanced document embedding approaches in [7]. We download the abstracts of Wikipedia pages from http://downloads.dbpedia.org/3.9/en/long_abstracts_en.nq.bz2 and the categories of these pages from http://downloads.dbpedia.org/3.9/en/article_categories_en.nq.bz2. We choose 7 diverse categories for classification, including "Arts," "History," "Human," "Mathematics," "Nature," "Technology," and "Sports." For each category, we randomly select 10,000 articles, and articles belonging to multiple categories are discarded. We randomly sample different percentages of the labeled documents for training and use the rest for evaluation. All document vectors are used to train a one-vs-rest logistic regression classifier using the LibLinear package (http://www.csie.ntu.edu.tw/~cjlin/liblinear/). We report the classification metrics Micro-F1 and Macro-F1 [11]. The results are averaged over 10 different runs by sampling different training data.
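The document-classification pipeline described above can be sketched as follows, with scikit-learn's one-vs-rest logistic regression standing in for the LibLinear package used in the paper; the vocabulary/embedding-matrix layout and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def doc_vector(tokens, vocab, E):
    # Average the embeddings of the in-vocabulary words of one document.
    rows = [E[vocab[t]] for t in tokens if t in vocab]
    return np.mean(rows, axis=0) if rows else np.zeros(E.shape[1])

def train_classifier(docs, labels, vocab, E):
    # One-vs-rest logistic regression over averaged word vectors.
    X = np.vstack([doc_vector(d, vocab, E) for d in docs])
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    return clf.fit(X, labels)
```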

Table 3 reports the results of Wikipedia page classification. A similar conclusion can be made as in the word analogy task. Graph factorization outperforms DeepWalk, as DeepWalk ignores the weights of the edges. The LINE-SGDs perform worse due to the divergence of the weights of the edges. The LINE optimized by the edge-sampling treatment performs much better than directly deploying SGD. The LINE(2nd) outperforms LINE(1st) and is slightly better than graph factorization. Note that with the supervised task, it is feasible to concatenate the embeddings learned with LINE(1st) and LINE(2nd). As a result, the LINE(1st+2nd) method performs significantly better than all other methods. This indicates that the first-order and second-order proximities are complementary to each other.

To provide the readers more insight into the first-order and second-order proximities, Table 4 compares the most similar words to a given word using first-order and second-order proximity. We can see that by using the contextual proximity, the most similar words returned by the second-order proximity are all semantically related words. The most similar words returned by the first-order proximity are a mixture of syntactically and semantically related words.

Table 4: Comparison of most similar words using 1st-order and 2nd-order proximity.

Word          Similarity   Top similar words
good          1st          luck bad faith assume nice
              2nd          decent bad excellent lousy reasonable
information   1st          provide provides detailed facts verifiable
              2nd          infomation informaiton informations nonspammy animecons
graph         1st          graphs algebraic finite symmetric topology
              2nd          graphs subgraph matroid hypergraph undirected
learn         1st          teach learned inform educate how
              2nd          learned teach relearn learnt understand

5.2.2 Social Network

Compared with the language networks, the social networks are much sparser, especially the Youtube network. We evaluate the vertex embeddings through a multi-label classification task that assigns every node to one or more communities. Different percentages of the vertices are randomly sampled for training and the rest are used for evaluation. The results are averaged over 10 different runs.

Table 5: Results of multi-label classification on the Flickr network.

Micro-F1
Algorithm          10%      20%      30%      40%      50%      60%      70%      80%      90%
GF                 53.23    53.68    53.98    54.14    54.32    54.38    54.43    54.50    54.48
DeepWalk           60.38    60.77    60.90    61.05    61.13    61.18    61.19    61.29    61.22
DeepWalk(256dim)   60.41    61.09    61.35    61.52    61.69    61.76    61.80    61.91    61.83
LINE(1st)          63.27    63.69    63.82    63.92    63.96    64.03    64.06    64.17    64.10
LINE(2nd)          62.83    63.24    63.34    63.44    63.55    63.55    63.59    63.66    63.69
LINE(1st+2nd)      63.20**  63.97**  64.25**  64.39**  64.53**  64.55**  64.61**  64.75**  64.74**

Macro-F1
Algorithm          10%      20%      30%      40%      50%      60%      70%      80%      90%
GF                 48.66    48.73    48.84    48.91    49.03    49.03    49.07    49.08    49.02
DeepWalk           58.60    58.93    59.04    59.18    59.26    59.29    59.28    59.39    59.30
DeepWalk(256dim)   59.00    59.59    59.80    59.94    60.09    60.17    60.18    60.27    60.18
LINE(1st)          62.14    62.53    62.64    62.74    62.78    62.82    62.86    62.96    62.89
LINE(2nd)          61.46    61.82    61.92    62.02    62.13    62.12    62.17    62.23    62.25
LINE(1st+2nd)      62.23**  62.95**  63.20**  63.35**  63.48**  63.48**  63.55**  63.69**  63.68**

Significantly outperforms DeepWalk at the: ** 0.01 and * 0.05 level, paired t-test.

Flickr Network. Let us first take a look at the results on the Flickr network. We choose the most popular 5 communities as the categories of the vertices for multi-label classification. Table 5 reports the results. Again, LINE(1st+2nd) significantly outperforms all other methods. LINE(1st) is slightly better than LINE(2nd), which is opposite to the results on the language network. The reasons are twofold: (1) first-order proximity is still more important than second-order proximity in a social network, as it indicates strong ties; (2) when the network is too sparse and the average number of neighbors of a node is too small, the second-order proximity may become inaccurate. We will further investigate this issue in Section 5.4. LINE(1st) outperforms graph factorization, indicating a better capability of modeling the first-order proximity. LINE(2nd) outperforms DeepWalk, indicating a better capability of modeling the second-order proximity. By concatenating the representations learned by LINE(1st) and LINE(2nd), the performance further improves, confirming that the two proximities are complementary to each other.

Youtube Network. Table 6 reports the results on the Youtube network, which is extremely sparse, with an average degree as low as 5. In most cases with different percentages of training data, LINE(1st) outperforms LINE(2nd), consistent with the results on the Flickr network. Due to the extreme sparsity, the performance of LINE(2nd) is even inferior to DeepWalk. By combining the representations learned by the LINE with both the first- and second-order proximity, the performance of LINE outperforms DeepWalk with either 128 or 256 dimensions, showing that the two proximities are complementary to each other and able to address the problem of network sparsity.

It is interesting to observe how DeepWalk tackles the network sparsity through truncated random walks, which enrich the neighbors or contexts of each vertex. The random walk approach acts like a depth-first search. Such an approach may quickly alleviate the sparsity of the neighborhood of nodes by bringing in indirect neighbors, but it may also introduce nodes that are a long range away. A more reasonable way is to expand the neighborhood of each vertex using a breadth-first search strategy, i.e., recursively adding neighbors of neighbors. To verify this, we expand the neighborhood of the vertices whose degrees are less than 1,000 by adding the neighbors of neighbors until the size of the extended neighborhood reaches 1,000 nodes. We find that adding more than 1,000 vertices does not further increase the performance.

The results in the brackets in Table 6 are obtained on this reconstructed network. The performance of GF, LINE(1st) and LINE(2nd) all improves, especially LINE(2nd). In the reconstructed network, the LINE(2nd) outperforms DeepWalk in most cases. We can also see that the performance of LINE(1st+2nd) on the reconstructed network does not improve much compared with that on the original network. This implies that the combination of first-order and second-order proximity on the original network has already captured most of the information, and the LINE(1st+2nd) approach is a quite effective and efficient way for network embedding, suitable for both dense and sparse networks.

5.2.3 Citation Network

We present the results on two citation networks, both of which are directed networks. Both the GF and LINE methods which use first-order proximity are not applicable for directed networks, and hence we only compare DeepWalk and LINE(2nd). We also evaluate the vertex embeddings through a multi-label classification task. We choose 7 popular conferences, including AAAI, CIKM, ICML, KDD, NIPS, SIGIR, and WWW, as the classification categories. Authors publishing in the conferences or papers published in the conferences are assumed to belong to the categories corresponding to the conferences.

DBLP(AuthorCitation) Network. Table 7 reports the results on the DBLP(AuthorCitation) network. As this network is also very sparse, DeepWalk outperforms LINE(2nd). However, by reconstructing the network through recursively adding neighbors of neighbors for vertices with small degrees (smaller than 500), the performance of LINE(2nd) significantly increases and outperforms DeepWalk. The LINE model directly optimized by stochastic gradient descent, LINE-SGD(2nd), does not perform well, as expected.

DBLP(PaperCitation) Network. Table 8 reports the results on the DBLP(PaperCitation) network. The LINE(2nd) significantly outperforms DeepWalk. This is because the random walk on the paper citation network can only reach papers along the citing path (i.e., older papers) and cannot reach other references. Instead, the LINE(2nd) represents each paper with its references, which is obviously more reasonable. The performance of LINE(2nd) is further improved when the network is reconstructed by enriching the neighbors of vertices with small degrees (smaller than 200).


Table 6: Results of multi-label classification on the Youtube network. The results in the brackets are on the reconstructed network, which adds second-order neighbors (i.e., neighbors of neighbors) as neighbors for vertices with a low degree.

Micro-F1
Algorithm          1%        2%        3%        4%        5%        6%        7%        8%        9%        10%
GF                 25.43     26.16     26.60     26.91     27.32     27.61     27.88     28.13     28.30     28.51
                   (24.97)   (26.48)   (27.25)   (27.87)   (28.31)   (28.68)   (29.01)   (29.21)   (29.36)   (29.63)
DeepWalk           39.68     41.78     42.78     43.55     43.96     44.31     44.61     44.89     45.06     45.23
DeepWalk(256dim)   39.94     42.17     43.19     44.05     44.47     44.84     45.17     45.43     45.65     45.81
LINE(1st)          35.43     38.08     39.33     40.21     40.77     41.24     41.53     41.89     42.07     42.21
                   (36.47)   (38.87)   (40.01)   (40.85)   (41.33)   (41.73)   (42.05)   (42.34)   (42.57)   (42.73)
LINE(2nd)          32.98     36.70     38.93     40.26     41.08     41.79     42.28     42.70     43.04     43.34
                   (36.78)   (40.37)   (42.10)   (43.25)   (43.90)   (44.44)   (44.83)   (45.18)   (45.50)   (45.67)
LINE(1st+2nd)      39.01*    41.89     43.14     44.04     44.62     45.06     45.34     45.69**   45.91**   46.08**
                   (40.20)   (42.70)   (43.94**) (44.71**) (45.19**) (45.55**) (45.87**) (46.15**) (46.33**) (46.43**)

Macro-F1
Algorithm          1%        2%        3%        4%        5%        6%        7%        8%        9%        10%
GF                 7.38      8.44      9.35      9.80      10.38     10.79     11.21     11.55     11.81     12.08
                   (11.01)   (13.55)   (14.93)   (15.90)   (16.45)   (16.93)   (17.38)   (17.64)   (17.80)   (18.09)
DeepWalk           28.39     30.96     32.28     33.43     33.92     34.32     34.83     35.27     35.54     35.86
DeepWalk(256dim)   28.95     31.79     33.16     34.42     34.93     35.44     35.99     36.41     36.78     37.11
LINE(1st)          28.74     31.24     32.26     33.05     33.30     33.60     33.86     34.18     34.33     34.44
                   (29.40)   (31.75)   (32.74)   (33.41)   (33.70)   (33.99)   (34.26)   (34.52)   (34.77)   (34.92)
LINE(2nd)          17.06     21.73     25.28     27.36     28.50     29.59     30.43     31.14     31.81     32.32
                   (22.18)   (27.25)   (29.87)   (31.88)   (32.86)   (33.73)   (34.50)   (35.15)   (35.76)   (36.19)
LINE(1st+2nd)      29.85     31.93     33.96     35.46**   36.25**   36.90**   37.48**   38.10**   38.46**   38.82**
                   (29.24)   (33.16**) (35.08**) (36.45**) (37.14**) (37.69**) (38.30**) (38.80**) (39.15**) (39.40**)

Table 7: Results of multi-label classification on the DBLP(AuthorCitation) network.

Micro-F1
Algorithm        10%       20%       30%       40%       50%       60%       70%       80%       90%
DeepWalk         63.98     64.51     64.75     64.81     64.92     64.99     64.99     65.00     64.90
LINE-SGD(2nd)    56.64     58.95     59.89     60.20     60.44     60.61     60.58     60.73     60.59
LINE(2nd)        62.49     63.30     63.63     63.77     63.84     63.94     63.96     64.00     63.77
                 (64.69*)  (65.47**) (65.85**) (66.04**) (66.19**) (66.25**) (66.30**) (66.12**) (66.05**)

Macro-F1
Algorithm        10%       20%       30%       40%       50%       60%       70%       80%       90%
DeepWalk         63.02     63.60     63.84     63.90     63.98     64.06     64.09     64.11     64.05
LINE-SGD(2nd)    55.24     57.63     58.56     58.82     59.11     59.27     59.28     59.46     59.37
LINE(2nd)        61.43     62.38     62.73     62.87     62.93     63.05     63.07     63.13     62.95
                 (63.49*)  (64.42**) (64.84**) (65.05**) (65.19**) (65.26**) (65.29**) (65.14**) (65.14**)

Significantly outperforms DeepWalk at the: ** 0.01 and * 0.05 level, paired t-test.

Table 8: Results of multi-label classification on the DBLP(PaperCitation) network.

Micro-F1
Algorithm     10%       20%       30%       40%       50%       60%       70%       80%       90%
DeepWalk      52.83     53.80     54.24     54.75     55.07     55.13     55.48     55.42     55.90
LINE(2nd)     58.42     59.58     60.29     60.78     60.94     61.20     61.39     61.39     61.79
              (60.10**) (61.06**) (61.46**) (61.73**) (61.85**) (62.10**) (62.21**) (62.25**) (62.80**)

Macro-F1
Algorithm     10%       20%       30%       40%       50%       60%       70%       80%       90%
DeepWalk      43.74     44.85     45.34     45.85     46.20     46.25     46.51     46.36     46.73
LINE(2nd)     48.74     50.10     50.84     51.31     51.61     51.77     51.94     51.89     52.16
              (50.22**) (51.41**) (51.92**) (52.20**) (52.40**) (52.59**) (52.78**) (52.70**) (53.02**)

Significantly outperforms DeepWalk at the: ** 0.01 and * 0.05 level, paired t-test.

5.3 Network Layouts

An important application of network embedding is to create meaningful visualizations that lay out a network on a two-dimensional space. We visualize a co-author network extracted from the DBLP data. We select some conferences from three different research fields: WWW and KDD from "data mining," NIPS and ICML from "machine learning," and CVPR and ICCV from "computer vision." The co-author network is built from the papers published in these conferences. Authors with degree less than 3 are filtered out, and finally the network contains 18,561 authors and more than 207,000 edges. Laying out this co-author network is very challenging as the three research fields are very close to each other. We first map the co-author network into a low-dimensional space with different embedding approaches and then further map the low-dimensional vectors of the vertices to a 2-D space with the t-SNE package [21]. Fig. 2 compares the visualization results with different embedding approaches. The visualization using graph factorization is not very meaningful, in which the authors belonging to the same communities are not clustered together. The result of DeepWalk is much better. However, many authors belonging to different communities are clustered tightly into the center area, most of which are high-degree vertices. This is because DeepWalk uses a random-walk-based approach to enrich the neighbors of the vertices, which brings in a lot of noise due to the randomness, especially for vertices with higher degrees. The LINE(2nd) performs quite well and generates a meaningful layout of the network (nodes with the same colors are distributed closer).

Figure 2: Visualization of the co-author network. The authors are mapped to the 2-D space using the t-SNE package with learned embeddings as input. Color of a node indicates the community of the author. Red: "data mining," blue: "machine learning," green: "computer vision." (a) GF; (b) DeepWalk; (c) LINE(2nd).

5.4 Performance w.r.t. Network Sparsity

Figure 3: Performance w.r.t. network sparsity. (a) Sparsity: Micro-F1 vs. percentage of links for LINE(1st) and LINE(2nd). (b) Degree of vertex: Micro-F1 per vertex-degree group for LINE(Original+1st), LINE(Dense+1st), LINE(Original+2nd), LINE(Dense+2nd), LINE(Dense+1st+2nd), and DeepWalk.

In this subsection, we formally analyze the performance of the above models w.r.t. the sparsity of networks. We use the social networks as examples. We first investigate how the sparsity of the networks affects the LINE(1st) and LINE(2nd). Fig. 3(a) shows the results w.r.t. the percentage of links on the Flickr network. We choose the Flickr network as it is much denser than the Youtube network. We randomly select different percentages of links from the original network to construct networks with different levels of sparsity. We can see that in the beginning, when the network is very sparse, the LINE(1st) outperforms LINE(2nd). As we gradually increase the percentage of links, the LINE(2nd) begins to outperform the LINE(1st). This shows that the second-order proximity suffers when the network is extremely sparse, and that it outperforms the first-order proximity when there are sufficient nodes in the neighborhood of a node.

Fig. 3(b) shows the performance w.r.t. the degrees of the vertices on both the original and reconstructed Youtube networks. We categorize the vertices into different groups according to their degrees, including (0,1], [2,3], [4,6], [7,12], [13,30], [31,+∞), and then evaluate the performance of vertices in different groups. Overall, the performance of different models increases when the degrees of the vertices increase. In the original network, the LINE(2nd) outperforms LINE(1st) except for the first group, which confirms that the second-order proximity does not work well for nodes with a low degree. In the reconstructed dense network, the performance of the LINE(1st) or LINE(2nd) improves, especially the LINE(2nd), which preserves the second-order proximity. We can also see that the LINE(2nd) model on the reconstructed network outperforms DeepWalk in all the groups.

5.5 Parameter Sensitivity

Next, we investigate the performance w.r.t. the parameter dimension d and the converging performance of different models w.r.t. the number of samples on the reconstructed Youtube network. Fig. 4(a) reports the performance of the LINE model w.r.t. the dimension d. We can see that the performance of the LINE(1st) or LINE(2nd) drops when the dimension becomes too large. Fig. 4(b) shows the results of the LINE and DeepWalk w.r.t. the number of samples during the optimization. The LINE(2nd) consistently outperforms LINE(1st) and DeepWalk, and both the LINE(1st) and LINE(2nd) converge much faster than DeepWalk.

Figure 4: Sensitivity w.r.t. dimension and samples. (a) #Dimension: Micro-F1 vs. embedding dimension for LINE(1st) and LINE(2nd). (b) #Samples: Micro-F1 vs. number of samples (in millions) for LINE(1st), LINE(2nd), and DeepWalk.

5.6 Scalability

Figure 5: Performance w.r.t. #threads. (a) Speed-up vs. #threads for LINE(1st) and LINE(2nd). (b) Micro-F1 vs. #threads for LINE(1st) and LINE(2nd).

Finally, we investigate the scalability of the LINE model optimized by the edge-sampling treatment and asynchronous stochastic gradient descent, which deploys multiple threads for optimization. Fig. 5(a) shows the speed-up w.r.t. the number of threads on the Youtube data set. The speed-up is quite close to linear. Fig. 5(b) shows that the classification performance remains stable when using multiple threads for model updating. The two figures together show that the inference algorithm of the LINE model is quite scalable.

6. CONCLUSION

This paper presented a novel network embedding model called the "LINE," which can easily scale up to networks with millions of vertices and billions of edges. It has carefully designed objective functions that preserve both the first-order and second-order proximities, which are complementary to each other. An efficient and effective edge-sampling method is proposed for model inference, which solves the limitation of stochastic gradient descent on weighted edges without compromising the efficiency. Experimental results on various real-world networks prove the efficiency and effectiveness of LINE. In the future, we plan to investigate higher-order proximity beyond the first-order and second-order proximities in the network. Besides, we also plan to investigate the embedding of heterogeneous information networks, e.g., networks with vertices of multiple types.

Acknowledgments

The authors thank the three anonymous reviewers for the helpful comments. The co-author Ming Zhang is supported by the National Natural Science Foundation of China (NSFC Grant No. 61472006); Qiaozhu Mei is supported by the National Science Foundation under grant numbers IIS-1054199 and CCF-1048168.

7. REFERENCES

[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, pages 37–48. International World Wide Web Conferences Steering Committee, 2013.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.
[3] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.
[4] T. F. Cox and M. A. Cox. Multidimensional scaling. CRC Press, 2000.
[5] J. R. Firth. A synopsis of linguistic theory, 1930–1955. In J. R. Firth (Ed.), Studies in Linguistic Analysis, pages 1–32.
[6] M. S. Granovetter. The strength of weak ties. American Journal of Sociology, pages 1360–1380, 1973.
[7] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pages 1188–1196, 2014.
[8] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
[9] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 891–900. ACM, 2014.
[10] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[11] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[14] S. A. Myers, A. Sharma, P. Gupta, and J. Lin. Information network or social network?: The structure of the Twitter follow graph. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[16] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
[17] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[18] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[19] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 990–998. ACM, 2008.
[20] J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[21] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85, 2008.
[22] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51, 2007.
[23] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 283–292. ACM, 2014.
