Weighted Combinations of Ontological Features and Keywords for Document Clustering

Authors

  • Van T.T. Duong, Ton Duc Thang University, Ho Chi Minh City, Viet Nam

Corresponding author's email:

tapchikhgkdt@hcmute.edu.vn

Keywords:

named entity, latent semantics, clustering quality

Abstract

Keyword-based information processing is limited by its simple treatment of words. In this paper, we introduce named entities as objects of document clustering; they are the key elements defining document semantics and are in many cases what users are concerned with. First, the traditional keyword-based vector space model is adapted so that vectors are defined over spaces of entity names, types, name-type pairs, and identifiers instead of keywords. Then, hierarchical document clustering is performed using a similarity measure defined as a distance between the vectors representing documents. Experimental results are presented and discussed. Clustering documents by named-entity information could be useful for managing web-based learning materials with respect to the objects they relate to.
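The two steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy documents, the raw-count weighting, the cosine distance, and the average-linkage merge rule are all assumptions chosen for illustration, and the names `DOCS`, `entity_vector`, `cosine_distance`, and `agglomerate` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical annotated documents: each entity is (name, type, identifier).
DOCS = {
    "d1": [("Paris", "City", "id:paris_fr"), ("France", "Country", "id:france")],
    "d2": [("Paris", "City", "id:paris_fr"), ("Seine", "River", "id:seine")],
    "d3": [("Tokyo", "City", "id:tokyo"), ("Japan", "Country", "id:japan")],
}

# One feature extractor per entity space named in the abstract.
SPACES = {
    "name": lambda e: e[0],
    "type": lambda e: e[1],
    "pair": lambda e: (e[0], e[1]),
    "id": lambda e: e[2],
}

def entity_vector(entities, space):
    """Sparse vector of raw feature counts (assumed weighting) in one space."""
    return Counter(SPACES[space](e) for e in entities)

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / norm if norm else 1.0

def agglomerate(vectors, n_clusters):
    """Bottom-up hierarchical clustering with average linkage (assumed)."""
    clusters = [[doc_id] for doc_id in sorted(vectors)]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the two closest clusters.
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def cluster_by(space, n_clusters=2):
    vectors = {d: entity_vector(ents, space) for d, ents in DOCS.items()}
    return agglomerate(vectors, n_clusters)

by_id = cluster_by("id")
by_type = cluster_by("type")
print(by_id, by_type)
```

On this toy data, the identifier space groups the two documents that mention the same Paris, while the type space groups the two documents that each contain a city and a country. The choice of entity space therefore changes which documents count as similar.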




Published

28-12-2009

How to Cite

[1]
Van T.T. Duong, “Weighted Combinations of Ontological Features and Keywords for Document Clustering”, JTE, vol. 4, no. 3, pp. 21–30, Dec. 2009.

Section

Research Article
