Weighted Combinations of Ontological Features and Keywords for Document Clustering
Corresponding author's email: tapchikhgkdt@hcmute.edu.vn
Keywords: named entity, latent semantics, clustering quality
Abstract
Keyword-based information processing is limited by its simplistic treatment of words. In this paper, we introduce named entities as objectives in document clustering, since they are key elements defining document semantics and are in many cases what users are concerned with. First, the traditional keyword-based vector space model is adapted so that vectors are defined over spaces of entity names, types, name-type pairs, and identifiers instead of keywords. Then, hierarchical document clustering is performed using a similarity measure defined as a distance between the vectors representing documents. Experimental results are presented and discussed. Clustering documents by named-entity information could be useful for managing web-based learning materials with respect to their related objects.
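The adapted vector space model described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the sample annotations in `DOCS`, raw term frequencies as vector components, cosine similarity as the per-space measure, equal `WEIGHTS` for the four spaces, and average linkage for the hierarchical step.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical annotations: each document is a list of
# (entity name, entity type, entity identifier) triples.
DOCS = {
    "d1": [("Paris", "City", "city#1"), ("France", "Country", "country#1")],
    "d2": [("Paris", "City", "city#1"), ("Seine", "River", "river#1")],
    "d3": [("Paris", "Person", "person#7"), ("Hilton", "Company", "org#3")],
}

# Assumed equal weights for the four entity spaces.
WEIGHTS = {"name": 0.25, "type": 0.25, "name_type": 0.25, "id": 0.25}


def feature_vectors(entities):
    """Build one term-frequency vector per entity space."""
    return {
        "name": Counter(n for n, _, _ in entities),
        "type": Counter(t for _, t, _ in entities),
        "name_type": Counter((n, t) for n, t, _ in entities),
        "id": Counter(i for _, _, i in entities),
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(a, b, weights=WEIGHTS):
    """Weighted combination of the per-space cosine similarities."""
    return sum(w * cosine(a[space], b[space]) for space, w in weights.items())


def cluster(docs, k, weights=WEIGHTS):
    """Agglomerative clustering with average linkage, stopping at k clusters."""
    vecs = {d: feature_vectors(e) for d, e in docs.items()}
    clusters = [[d] for d in docs]

    def linkage(c1, c2):
        total = sum(similarity(vecs[a], vecs[b], weights) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]  # i < j, so index i is unaffected
    return clusters


if __name__ == "__main__":
    print(cluster(DOCS, 2))
```

In this toy data, "Paris" the city and "Paris" the person share a name but differ in type and identifier, so the name space alone would pull `d3` toward the other two documents while the type, name-type, and identifier spaces keep it apart. Changing `WEIGHTS` shifts that balance.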
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Copyright © JTE.


