A Comparison of Fast Blocking Methods for Record Linkage

Rohan Baxter, Peter Christen, Tim Churches

Citations: 102
Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms. We compare two new blocking methods, bigram indexing and canopy clustering with TFIDF (Term Frequency/Inverse Document Frequency), with two older methods of standard traditional blocking and sorted neighbourhood blocking. The results show that recent blocking methods such as bigram indexing and canopy clustering provide scalable blocking methods while maintaining or improving upon record linkage accuracy. There is a potential for large performance speed-ups and better accuracy to be achieved by these new blocking methods.
Published in 2003.
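
The abstract's central point, that blocking reduces the number of candidate record comparison pairs, can be illustrated with a minimal Python sketch of standard blocking. This is not the authors' implementation: the function names, the toy records, and the blocking key (first letter of the surname) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def standard_blocking_pairs(records, blocking_key):
    """Candidate pairs limited to records that share a blocking key value."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def first_letter(rec):
    """Hypothetical blocking key: first letter of the surname."""
    return rec["surname"][0]


# Hypothetical toy records, not data from the paper.
records = {
    1: {"surname": "smith"},
    2: {"surname": "smyth"},
    3: {"surname": "jones"},
    4: {"surname": "johns"},
    5: {"surname": "brown"},
}

candidates = standard_blocking_pairs(records, first_letter)
print(len(all_pairs(records)), "pairs without blocking")
print(len(candidates), "pairs with blocking:", sorted(candidates))
```

On these five toy records, the comparison space shrinks from 10 pairs to 2. The trade-off is that a true match whose blocking key differs (for example, a typo in the first letter) would never be compared.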
    • ...This problem has been addressed in the literature using techniques that rely on blocking [20, 15]...
    • ...There is a vast amount of literature on blocking techniques [20, 15], whose objective is to efficiently group together entities that are similar...
    • ...Total covering is a natural extension of the notion of blocking, used extensively in entity matching [20, 15]...

    Vibhor Rastogi et al. Large-Scale Collective Entity Matching

    • ...Baxter et al. [5] show that two new blocking methods, which are used in record linkage systems to reduce the number of candidate record comparison pairs, can maintain or improve the record linkage accuracy...

    Debabrata Dey et al. Efficient Techniques for Online Record Linkage

    • ...On the other hand, BI and CC capture pairwise similarities better than SN, but they both have computational complexities of O(n^2) [12]...
    • ...In BI [12][10], if two records share k bigrams, they are assigned to the same block and each record can be assigned to multiple blocks, where k = b ∗ t, b is the average number of bigrams in the records and t ∈ (0, 1] is a tuning parameter...
    • ...Blocking has been studied extensively in the literature (See [12] for a blocking review), and various blocking techniques have been proposed, including sorted neighborhood (SN) [9], bigram indexing (BI) [10][34], and canopy clustering (CC) [11]...

    Liangcai Shu et al. Efficient SPectrAl Neighborhood blocking for entity resolution
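
The bigram indexing rule quoted in the excerpt above can be sketched naively in Python. The sketch follows only the quoted description, not the implementation in Baxter et al. It assumes that "share" means "share at least", that the threshold (called k here) is the average bigram count multiplied by the tuning parameter (called t here) and rounded up, and that pairs are found by direct pairwise comparison. All names and the toy values are hypothetical.

```python
import math
from itertools import combinations


def bigrams(value):
    """Set of adjacent character pairs in a string."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def bigram_candidate_pairs(values, t):
    """Pairs of record ids whose values share at least k bigrams.

    k = ceil(t * average bigram count), with 0 < t <= 1.
    Rounding up and the "at least k" reading are assumptions.
    """
    if not 0 < t <= 1:
        raise ValueError("t must lie in (0, 1]")
    if not values:
        return set()
    grams = {rec_id: bigrams(v) for rec_id, v in values.items()}
    avg = sum(len(g) for g in grams.values()) / len(grams)
    k = max(1, math.ceil(t * avg))
    # Naive pairwise check over all records, quadratic in their number.
    return {
        (a, b)
        for a, b in combinations(sorted(grams), 2)
        if len(grams[a] & grams[b]) >= k
    }


# Hypothetical toy values, not data from any of the cited papers.
values = {1: "smith", 2: "smyth", 3: "smithe", 4: "jones"}
strict = bigram_candidate_pairs(values, t=0.5)
loose = bigram_candidate_pairs(values, t=0.4)
print(sorted(strict))
print(sorted(loose))
```

Lowering t lowers the threshold, so more pairs become candidates: the looser setting admits the "smith"/"smyth" pair that the stricter one rejects. Because a record can pair with several others, it effectively belongs to multiple blocks, as the excerpt notes.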

    • ...So-called blocking techniques [3] thus become necessary to reduce the number of entity comparisons whilst maintaining match quality...

    Lars Kolb et al. Multi-pass sorted neighborhood blocking with MapReduce

    • ...There have been numerous works [7-16] on citation matching problems, but there has been no research as to what effects a systematic and computational record field selection process can bring about in citation matching performance...
    • ...Blocking methods are proposed to efficiently process multi-dimensional citation records [8-11]...

    Hee-Kwan Koo et al. Effects of Unpopular Citation Fields in Citation Matching Performance
