Integrating Element and Term Semantics for Similarity-Based XML Document Clustering

Jianwu Yang, William K. Cheung, Xiaoou Chen. DOI: 10.1109/WI.2005.80

Citations: 4
The structured link vector model (SLVM) is a recently proposed document representation that takes into account both structural and semantic information for measuring XML document similarity. Its formulation includes an element similarity matrix for capturing the semantic similarity between XML elements, the structural components of XML documents. In this paper, instead of applying heuristics to define the similarity matrix, we proposed to learn the matrix from pair-wise similar training data in an iterative manner. In addition, we extended SLVM to SLVM-LSI by incorporating term semantics into SLVM using latent semantic indexing, while preserving the element-similarity-related properties of the original SLVM. For performance evaluation, we applied SLVM-LSI to similarity-based clustering of two XML datasets, and the proposed SLVM-LSI was found to significantly outperform the conventional vector space model and the edit-distance-based methods. The similarity matrix, obtained as a by-product of the learning, can provide higher-level knowledge about the semantic relationships between the XML elements.
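The abstract does not state the SLVM similarity formula, so the following is only a minimal sketch of one plausible reading: each document is assumed to be a matrix with one term-weight vector per XML element, and the similarity is assumed to be the sum, over all element pairs (i, j), of the element similarity matrix entry times the dot product of the corresponding term vectors. The function names, the two-element example (`title`, `heading`), and all weights are hypothetical and are not taken from the paper.

```python
from math import sqrt


def slvm_similarity(doc_x, doc_y, elem_sim):
    """Assumed SLVM-style kernel: sum over element pairs (i, j) of
    elem_sim[i][j] * dot(doc_x[i], doc_y[j])."""
    total = 0.0
    for i, terms_x in enumerate(doc_x):
        for j, terms_y in enumerate(doc_y):
            dot = sum(a * b for a, b in zip(terms_x, terms_y))
            total += elem_sim[i][j] * dot
    return total


def normalized_similarity(doc_x, doc_y, elem_sim):
    """Cosine-style normalization of the kernel above (an assumption)."""
    denom = sqrt(slvm_similarity(doc_x, doc_x, elem_sim)
                 * slvm_similarity(doc_y, doc_y, elem_sim))
    return slvm_similarity(doc_x, doc_y, elem_sim) / denom if denom else 0.0


# Hypothetical data: rows are elements (title, heading), columns are terms.
doc_a = [[1.0, 0.0, 0.0],   # term 0 occurs under <title>
         [0.0, 0.0, 0.0]]
doc_b = [[0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0]]   # the same term occurs under <heading>

identity = [[1.0, 0.0], [0.0, 1.0]]   # elements treated as unrelated
related = [[1.0, 0.5], [0.5, 1.0]]    # title and heading partly similar

sim_identity = normalized_similarity(doc_a, doc_b, identity)
sim_related = normalized_similarity(doc_a, doc_b, related)
```

With the identity matrix the two documents share no credit, because the common term sits under different elements. A non-zero off-diagonal entry lets the match count partially, which is the role the abstract attributes to the element similarity matrix.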
Conference: Web Intelligence - WI, pp. 222-228, 2005
    • ...Yang et al. [101], for example, calculate the similarity between documents by their representations in the Structured Linked Vector Model Latent Semantic Indexing (SLVM-LSI), which is their extension to the Structured Linked Vector Model (SLVM)...
    • ...Milano et al. [71] String similarity No Ma and Chbeir [67] String similarity No Al techniques Yang et al. [101] SLVM-LSI No...
    • ...Recent work, such as [90,101,102], have been considering both structure and data in the clustering process...
    • ...The proposal in [101] analyzes how data contained inside XML documents are structured and how they relate to other close data...
    • ...In [101], a similarity model is proposed in order to measure similarities scores between XML documents, which combines structure and contents of the XML documents...
    • ...Yang et al. [101] XML document Clustering Large XML Luis et al. [66] Duplicate detection Data cleaning Small XML Park et al. [77] XML document Similarity queries Medium XML Carvalho et al. [18] Object identification Data cleaning Medium Semi-structured Weis and Naumann [95] Duplicate detection Data cleaning Large XML...
    • ...Ma and Chbeir [67] Common and uncommon No NA Similarity Yang et al. [101] Common and uncommon No Learning Similarity Luis et al. [66] Common No Learning Similarity Park et al. [77] NA No NA Similarity Carvalho et al. [18] Common and uncommon No Comparison Similarity...

    Carina Friedrich Dorneles et al. Approximate data instance matching: a survey

    • ...To obtain an optimal Me for a specific type of XML data, we proposed in [21] to learn the matrix using pair-wise similar training data in an iterative manner...

    Jianwu Yang et al. XML Document Classification Using Extended VSM

    • ...As for the main differences, we can observe that: (i) the approach of [16] returns very accurate results but needs a deep analysis of the structural properties of an XML document; (ii) it does not consider semantic similarities among the concepts of involved sources; on the contrary, in our approach, this information plays an important role; (iii) the approach of [16] is extensional whereas ours is intensional. Approach of [44]...
    • ...In [44] a framework exploiting matrix algebra for clustering XML documents is presented...
    • ...We can recognize some similarities between our approach and that of [44]...
    • ...As for the main differences between them, we observe that: (i) the approach of [44] considers only synonymies whereas our approach handles a wide range of interschema properties; (ii) the approach of [44] is quite sophisticated and precise, since it computes various statistics on the terms occurring in an XML source (e.g., the frequency of a term in a document); this allows accurate results to be obtained but requires a significant ...

    Pasquale De Meo et al. Semantics-Guided Clustering of Heterogeneous XML Schemas
