Weighted Combinations of Ontological Features and Keywords for Document Clustering

Van T.T. Duong

Authors

Van T.T. Duong, Ton Duc Thang University, Ho Chi Minh City, Viet Nam

Corresponding author's email:

tapchikhgkdt@hcmute.edu.vn

Keywords:

named entity, latent semantics, clustering quality

Abstract

Keyword-based information processing is limited by its simple treatment of words. In this paper, we introduce named entities as objectives in document clustering; they are key elements defining document semantics and, in many cases, are of direct concern to users. First, the traditional keyword-based vector space model is adapted so that vectors are defined over spaces of entity names, types, name-type pairs, and identifiers instead of keywords. Hierarchical document clustering is then performed using a similarity measure defined as a distance between the vectors representing documents. Experimental results are presented and discussed. Clustering documents by named-entity information could be useful for managing web-based learning materials with respect to related objects.
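The adapted vector space model described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the sample documents, the raw term-frequency weighting, the cosine measure, the single-link merging rule, and the stopping threshold are all assumptions chosen for illustration, and the names `DOCS`, `entity_vectors`, `similarity`, and `cluster` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical annotated documents (assumption): each named entity is
# a (name, type, identifier) triple produced by some annotation step.
DOCS = {
    "d1": [("Paris", "City", "id:paris_fr"), ("France", "Country", "id:france")],
    "d2": [("Paris", "City", "id:paris_fr"), ("Seine", "River", "id:seine")],
    "d3": [("Paris", "Person", "id:paris_hilton")],
}


def entity_vectors(entities):
    """Build one raw term-frequency vector per entity space."""
    return {
        "name": Counter(n for n, _, _ in entities),
        "type": Counter(t for _, t, _ in entities),
        "name_type": Counter((n, t) for n, t, _ in entities),
        "id": Counter(i for _, _, i in entities),
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def similarity(a, b, space):
    """Similarity of two documents within one entity space."""
    return cosine(entity_vectors(a)[space], entity_vectors(b)[space])


def cluster(docs, space, threshold):
    """Single-link agglomerative clustering (assumed variant).

    Repeatedly merges the most similar pair of clusters and stops when
    no pair reaches the threshold.
    """
    clusters = [[d] for d in docs]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(
                    similarity(docs[a], docs[b], space)
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)


if __name__ == "__main__":
    print(cluster(DOCS, "name", 0.4))
    print(cluster(DOCS, "id", 0.4))
```

In this toy data, the name space treats every mention of "Paris" as the same term, while the identifier space distinguishes the city from the person, so the two spaces can group the same documents differently.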
