Tuesday, October 30, 2007

Jon Kleinberg CS lecture at U of T

Right now is the lecture by Jon Kleinberg, Department of Computer Science, Cornell University which is on Challenges in Mining Social Network Data: Processes, Privacy and Paradoxes. He has generated seminal results in social networks and document retrieval. I've read his research work on the HITS algorithm which uses hubs and authorities in order to classify search, and from which Google's PageRank is somewhat related to. I've never heard Jon speak so I'm very glad to hear him speak.

He will also speak tonight to kick off the Social Networking Week at U of T. What can computer science contribute to social networks? Today there is a convergence of social and technological networks, computing and information systems with intrinsic social structure. Social network data is a very active area in sociology, social psychology and anthropology. So what can the different fields learn from each other (sociology, social psychology, anthropology from computer science)? This is the research area which I am also part of as well, and it's an exciting research area in my opinion with the emergence of social networking sites like Facebook and MySpace. Mining social networks has a long history in social sciences eg. with Wayne Zachary's PhD work on the university karate club, observing social ties and rivalries. Split in the network could be explained by the minimum cut in the social network.

Social network data spans many orders of magnitude. For example there were 240 million nodes of all IM communication over one month on Microsoft Instant Messenger (Leskovec-Horvitz '07), 4.4 million nodes of declared friendships on blogging community LiveJournal (Liben-Nowell et al., 2005). How can we find the point where the lines of research in large scale and small scale networks converge? In social networks, we can find behaviours of diffusion in social networks that cascade from node to node like an epidemic, which is identified by radial structures in the graph. There have been empirical studies of diffusion in the social sciences like the spread of new agricultural and medical practices (Coleman et al., 1966). The diffusion curves are based on the probability of adopting new behaviour which depends on number of friends who have adopted (Bass 1969, Granovetter 1978 and Schelling 1978). All of the diffusion curves seem to have diminishing returns property, for example in editing a Wikipedia article (Cosley et al., 2007) and joining a LiveJournal community (Backstrom et al., 2006).

These results can then be used for general prediction. Given a network and v's position in it at t1, estimate the probability v will join a given group by t2. Kleinberg has formulated this as a probability estimation problem (Backstrom-Huttenlocher-Kleinberg-Lan 2006). Do disconnected friends or connected friends make joining more likely? Disconnected friends provide an informational advantage but connected friends provide safety/trust advantages. For example in LiveJournal, joining probability increases significantly with more connections among friends in the group (in otherwise friends that are within a clique than not).

If connectedness among friends promotes joining, do highly "clustered" groups grow more quickly? Kleinberg defines clustering to be # of triangles / # of open triads and you can determine community by examining the growth from t1 to t2 as a function of clustering. Leskovec, McGlohon, Faloutsos, Glance and Hurst (2007) have looked into the diffusion of topics in networks of news media and bloggers which shows cascading behaviour. Leskovec, Adamic and Huberman (2006) describe how incentives can be used to propagate interesting recommendations along social network links. How to push questions to people within the social network? (Kleinberg, Raghavan, 2005)

One of the most important questions in mining social network data is how to protect privacy in the dataset. There has been some research where anonymizing data actually caused problems from using on-line pseudonyms and using search engine query logs. If you are part of a small network and based on connectivity, you may be able to find yourself, so anonymization doesn't help. An attacker can attack an anonymized network by being part of the system. Kleinberg has done some work on this by creating a template (Backstrom, Dwork, Kleinberg, 2007). The idea is an attacker creates a small network of nodes through creating accounts called subgraph H and attach them to targeted nodes in the original network. From Ramsey theory, in a random n-node graph, H is unique.

Take home message: how do we build deeper models of the processes at work inside large-scale social networks? How do we make data available without compromising privacy?

It was great to finally meet and talk with Jon Kleinberg!

On Technorati: ,

No comments: