Tuesday, July 24, 2007

Cimple Project for Community Information Management talk

Anhai is giving his talk now; the project deals with community information management. There are numerous online communities, such as the movie fan community, with each community having many data sources and many members. Members often want to find out what is new in the community, the connections between members, and the topics being discussed. The whole idea is to create structured data portals via extraction, integration and collaboration across information sources. Extraction pulls entities out of the data sources, and connections between those entities are then inferred to build a graph.
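Just to make the idea concrete for myself, here is a minimal sketch (in Python, with my own toy names, not anything from the actual Cimple/DBLife code) of how such an entity-relationship graph could be represented: entities as nodes, inferred connections as labeled edges.

    from collections import defaultdict

    class ERGraph:
        """Toy entity-relationship graph: entities are nodes and
        inferred connections between them are labeled edges."""

        def __init__(self):
            self.entities = set()
            self.edges = defaultdict(list)  # entity -> list of (other entity, relation label)

        def add_relation(self, a, b, label):
            # Register both entities and store the edge in both directions.
            self.entities.update([a, b])
            self.edges[a].append((b, label))
            self.edges[b].append((a, label))

    # Hypothetical example: two researcher mentions extracted from a web page.
    g = ERGraph()
    g.add_relation("Alvin Chin", "AnHai Doan", "co-occurs")
    print(g.edges["Alvin Chin"])  # [('AnHai Doan', 'co-occurs')]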

Building a structured portal semi-automatically, like Citeseer, is not new. Prior work collects a large number of data sources and then applies machine learning techniques. Their approach instead starts from a small set of data sources and works compositionally and incrementally. To populate the portal, they choose the top 20% of data sources, which account for roughly 80% of the community's content. They created a prototype of their system called DBLife for the database research community. The top 20% of data sources become a seed for building the portal. A plan is then created to generate a daily ER graph: they first find entities and then find relations. How do they find these entities? They first find entity mentions within the data sources and then match similar ones together. For example, my name is "Alvin Chin", but it could also appear as "Chin Alvin", so the two mentions are matched to the same entity. They also generate variations of names. This technique works well for the majority of cases. Of course there are cases where this doesn't work, for example Asian names; in those cases they apply a stricter matching approach tailored to them.
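As a rough illustration of the kind of name-variant matching he described (this is my own guess at a simplistic version, not their actual matcher), something like the following already handles the "Alvin Chin" / "Chin Alvin" case:

    def name_variants(name):
        """Generate simple variants of a person's name: reversed token
        order and a first-initial form. (A deliberately naive heuristic.)"""
        tokens = name.split()
        variants = {name.lower(), " ".join(reversed(tokens)).lower()}
        if len(tokens) >= 2:
            variants.add((tokens[0][0] + ". " + tokens[-1]).lower())
        return variants

    def same_entity(name_a, name_b):
        """Match two mentions to the same entity if their variant sets overlap."""
        return bool(name_variants(name_a) & name_variants(name_b))

    print(same_entity("Alvin Chin", "Chin Alvin"))  # True
    print(same_entity("Alvin Chin", "A. Chin"))     # True
    print(same_entity("Alvin Chin", "AnHai Doan"))  # False

The stricter matching he mentioned for Asian names would presumably swap in a more conservative test than this variant-overlap check for those cases.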

The next step is to determine the co-occurrence relations between entities. They also create a plan to find and label relations. To expand coverage, they look at the member nodes within the community and crawl those to expand the graph. They then enlist the users who visit the community portal, allowing them to edit information in a wiki-style format. Right now they don't incorporate those changes back into the structured database, but that is planned for future research.
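Again just to sketch the idea (purely my own simplification, with made-up document strings, not their actual pipeline), inferring a co-occurrence relation could be as crude as counting how often two matched entities show up in the same document:

    from collections import Counter
    from itertools import combinations

    def cooccurrence_relations(documents, entities, threshold=2):
        """Propose a 'related' edge between two entities when they appear
        together in at least `threshold` documents."""
        counts = Counter()
        for doc in documents:
            present = sorted({e for e in entities if e.lower() in doc.lower()})
            for a, b in combinations(present, 2):
                counts[(a, b)] += 1
        return [(a, b, n) for (a, b), n in counts.items() if n >= threshold]

    docs = [
        "Alvin Chin and AnHai Doan attend the community information management talk",
        "AnHai Doan talks about DBLife while Alvin Chin takes notes",
    ]
    print(cooccurrence_relations(docs, ["Alvin Chin", "AnHai Doan"]))
    # [('Alvin Chin', 'AnHai Doan', 2)]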

It's interesting that he said the decisions and research his group has made have worked very well. But I suspect this is because they've been able to select the right data sources, so the data is clean, and the database research community is already well defined on the web, which is why their technique works well. One of the things they haven't addressed (one of many) is the capture and extraction of social interactions. This is where my PhD research can help.

All in all, I felt it was a good talk, and shows the potential of research in web communities.

