Microdata fusion: a statistical matching application for the integration of the EWCS and QPS
Microdata are databases in which the micro-unit (an individual, family or company) is the central element of analysis. Such data are traditionally collected through surveys, censuses or administrative records and allow researchers to analyze a wide range of topics and relationships between subpopulations. The characteristics of a dataset are generally determined by the purpose for which it was collected. As a result, a single source rarely covers all dimensions of analysis in depth, which creates the need to gather further information through new and expensive surveys or other data collection methods.
In response to this problem, various methods have emerged that make use of existing information scattered across different databases by integrating those databases through a common set of variables. This document reviews the methods commonly used to integrate information dispersed across microdata sources, with particular emphasis on assessing the feasibility of integrating the Staff Survey (QPS) and the European Working Conditions Survey (EWCS). The techniques considered here fall into three categories: (1) parametric; (2) non-parametric; and (3) mixed.
The results of this analysis suggest that the EWCS and QPS can be successfully integrated using statistical matching methods. As expected, integration comes at a cost, which is reflected in the probability distributions of the new synthetic database. Integrating the two sources requires an extensive harmonization procedure, including the aggregation of some of the continuous variables, which implies a loss of the specificity of the information contained in the original databases. Finally, because of the computational requirements involved, it was not possible to fully optimize the matching process. Ideally, the match would be obtained with an algorithm that solves the assignment problem. Instead, a heuristic approach was used, which minimizes the distances between individuals in the two databases through sequential iteration.
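To make the difference between a sequential heuristic and a solution of the assignment problem concrete, the following is a minimal sketch on toy data. It is not the procedure used in this study: the records, the coding of the common variables, the Manhattan distance, the `wage_band` variable and all function names are assumptions chosen for illustration. Each recipient record is paired, in order, with its nearest donor that has not yet been used, and the result is compared with a brute-force optimum, which is only feasible on very small inputs.

```python
from itertools import permutations


def manhattan(a, b):
    """Distance between two records on the harmonized common variables."""
    return sum(abs(x - y) for x, y in zip(a, b))


def greedy_match(recipients, donors, distance=manhattan):
    """Pair each recipient, in order, with its nearest still-unused donor."""
    if len(donors) < len(recipients):
        raise ValueError("need at least as many donors as recipients")
    unused = set(range(len(donors)))
    pairs = []
    for r_idx, rec in enumerate(recipients):
        # Ties are broken by the lowest donor index to keep results reproducible.
        best = min(unused, key=lambda d: (distance(rec, donors[d]), d))
        unused.remove(best)
        pairs.append((r_idx, best))
    return pairs


def total_cost(recipients, donors, pairs, distance=manhattan):
    """Sum of distances over all matched pairs."""
    return sum(distance(recipients[r], donors[d]) for r, d in pairs)


def brute_force_match(recipients, donors, distance=manhattan):
    """Exact solution of the assignment problem by enumeration (toy sizes only)."""
    best_pairs, best_cost = None, None
    for perm in permutations(range(len(donors)), len(recipients)):
        pairs = list(enumerate(perm))
        cost = total_cost(recipients, donors, pairs, distance)
        if best_cost is None or cost < best_cost:
            best_pairs, best_cost = pairs, cost
    return best_pairs


# Hypothetical harmonized common variables: (age class, sex, sector class).
recipients = [(1, 0, 2), (3, 1, 1), (2, 0, 2)]
donors = [(2, 0, 2), (3, 1, 2), (1, 0, 1), (3, 0, 1)]
# Hypothetical variable observed only in the donor file.
wage_band = ["B", "C", "A", "C"]

greedy_pairs = greedy_match(recipients, donors)
optimal_pairs = brute_force_match(recipients, donors)

# Synthetic (fused) records: recipient variables plus the donated variable.
fused = [recipients[r] + (wage_band[d],) for r, d in greedy_pairs]

greedy_distance = total_cost(recipients, donors, greedy_pairs)
optimal_distance = total_cost(recipients, donors, optimal_pairs)
print(greedy_distance, optimal_distance)
```

On this toy input the sequential heuristic yields a total distance of 4, while the exact assignment yields 2, because the first recipient takes a donor that would have been a perfect match for the third. Enumeration grows factorially with the number of records, which is why an exact solution on full survey files calls for a dedicated assignment algorithm rather than this kind of search.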
Authors: Luís Manso, Jena Santi