Detecting a Multi-Level Content Similarity from Microblogs based on Community Structures and Named Entities
Swit Phuvipadawat and Tsuyoshi Murata
Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Japan
Abstract—This paper presents a method for finding content similarity in microblogs. In particular, we process data from Twitter for a breaking news detection and tracking application. The goal is to find collections of similar messages. The method produces collections at two levels. In the first level, similarity is defined by TF-IDF. Because microblog messages are short, we emphasize specific terms called named entities. Message groups are obtained in the first level. In the second level, we construct a network from the message groups and named entities and perform community detection on it. We evaluate and visualize the community results obtained with several community detection algorithms. We demonstrate that this method can be used to explore similar messages in both a tightly coupled and a loosely coupled manner.
Index Terms—twitter, topic detection and tracking, information retrieval, network analysis
Cite: Swit Phuvipadawat and Tsuyoshi Murata, "Detecting a Multi-Level Content Similarity from Microblogs based on Community Structures and Named Entities," Journal of Emerging Technologies in Web Intelligence, Vol. 3, No. 1, pp. 11-19, February 2011. doi:10.4304/jetwi.3.1.11-19
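The first-level similarity described in the abstract can be illustrated with a minimal sketch. It assumes a smoothed TF-IDF weighting and a simple multiplicative boost for named-entity terms; the abstract does not give the exact weighting scheme, so the `boost` factor, the smoothing, and the function names `tfidf_vectors` and `cosine` are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs, entities, boost=2.0):
    """Build sparse TF-IDF vectors for tokenized messages.

    docs:     list of token lists, one per message
    entities: set of tokens treated as named entities (assumed to be given)
    boost:    hypothetical multiplier applied to named-entity weights
    """
    n = len(docs)
    # Document frequency: number of messages that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {}
        for tok, count in tf.items():
            # Smoothed IDF (an assumption, not taken from the paper).
            idf = math.log((1 + n) / (1 + df[tok])) + 1.0
            weight = (count / len(doc)) * idf
            if tok in entities:
                weight *= boost  # emphasize named entities
            vec[tok] = weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


messages = [
    ["quake", "hits", "tokyo"],
    ["tokyo", "quake", "report"],
    ["new", "phone", "launch"],
]
named_entities = {"tokyo"}

boosted = tfidf_vectors(messages, named_entities, boost=2.0)
plain = tfidf_vectors(messages, named_entities, boost=1.0)
```

In this toy input the first two messages share the named entity `tokyo`, so boosting it raises their cosine similarity relative to plain TF-IDF, while the third message shares no terms with the first and scores zero. Grouping messages by thresholding such scores would then yield the first-level message groups; the threshold itself is not specified in the abstract.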