The Data Linkage Services team at the Department of Health facilitates access to linked data for approved purposes, such as service evaluation, planning and research. You can find out more about the specialty services we currently offer by clicking on each of the tabs below.
Data linkage is a technique for connecting information from different data sources that are thought to relate to the same person, family, place or event. Information is created when a person comes into contact with certain services, for example, when they visit an emergency department, stay in a hospital or register the birth of their child.
The Department of Health links data in Western Australia. Data linkage techniques in WA have been developed to ensure the best possible matching while at the same time protecting personal privacy. There are two parts to the records that are linked:
- Demographic data – this is identifiable information such as a person’s name or address.
- Content data – this is the information about what happened to the person, such as diagnosis and treatment in hospital.
Privacy is protected by separating these data before they are provided for linkage. This practice is known as the ‘separation principle’. Highly specialised computer programs do most of the matching, but for some of the more difficult matches, Data Engineers will look at the records and decide whether they are a true match.
The Data Linkage Team matches only the demographic information, and then makes a special unique ID, called a “linkage key”, for each group of records that belongs to one person. These keys can then be used for approved requests, to join up the content data of the records, without releasing the person’s name or other identifying information.
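The separation principle and the role of the linkage key can be sketched roughly as follows. This is a minimal illustration only: the field names, the record layout and the exact-match rule are all assumptions, and none of it is taken from the WADLS itself.

```python
from itertools import count

# Hypothetical source records; field names are illustrative only.
records = [
    {"record_id": "H1", "name": "JANE CITIZEN", "dob": "19820812", "diagnosis": "J45"},
    {"record_id": "E7", "name": "JANE CITIZEN", "dob": "19820812", "diagnosis": "S52"},
    {"record_id": "H2", "name": "JOHN SMITH", "dob": "19750101", "diagnosis": "I10"},
]

# Separation: the linkage side sees only demographic fields,
# the content side sees only what happened to the person.
demographic = [{k: r[k] for k in ("record_id", "name", "dob")} for r in records]
content = [{k: r[k] for k in ("record_id", "diagnosis")} for r in records]


def assign_linkage_keys(demographic_rows):
    """Give records with identical name and date of birth the same key.

    Exact matching is a simplification made for this sketch.
    """
    next_number = count(1)
    key_by_person = {}
    key_by_record = {}
    for row in demographic_rows:
        person = (row["name"], row["dob"])
        if person not in key_by_person:
            key_by_person[person] = f"K{next(next_number):04d}"
        key_by_record[row["record_id"]] = key_by_person[person]
    return key_by_record


keys = assign_linkage_keys(demographic)

# Content data is joined by linkage key, without name or date of birth.
linked_content = [
    {"linkage_key": keys[row["record_id"]], "diagnosis": row["diagnosis"]}
    for row in content
]
```

In this sketch the two records for the same person share one key, so their content data can be joined without the name or date of birth appearing in the output.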
The WA Data Linkage System (WADLS) stores the linkage keys created by the Data Linkage Team. To create and maintain the WADLS, the Department of Health’s Linkage and Systems Teams have developed a bespoke linkage system in house, termed ‘DLS3’. The system is highly versatile and completely integrates and streamlines all aspects of the “end to end” linkage process.
For more information on the utility of linked data, please refer to Limitations and Suitable Use of Linked Data (PDF, 443 KB).
The linkage process can be split into the following stages:
1. Obtain Demographic Data; Clean and Standardise
Raw data is provided for linkage. All or some of the following demographic fields are included:
- Name (first name, second name, family name, aliases)
- Date of Birth
- Address (house number, street name, suburb, postcode)
- Other unique identifiers (e.g. Hospital Unit Medical Record Number)
The data fields are cleaned and put into a standard format that can be used for linkage. Customised identifiers are assigned. For example:
- MC DONALD > MCDONALD
- 12th August 1982 > 19820812
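The two standardisation examples above could be reproduced with a short sketch like the one below. The function names and the accepted date layout (day, month name, year) are assumptions for illustration; the actual cleaning rules are not described here.

```python
import re
from datetime import datetime


def standardise_surname(raw):
    """Upper-case a surname and drop spaces and punctuation."""
    return "".join(ch for ch in raw.upper() if ch.isalpha())


def standardise_date(raw):
    """Convert a date such as '12th August 1982' to 'YYYYMMDD'."""
    # Drop ordinal suffixes (1st, 2nd, 3rd, 12th) before parsing.
    without_suffix = re.sub(
        r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip(), flags=re.IGNORECASE
    )
    return datetime.strptime(without_suffix, "%d %B %Y").strftime("%Y%m%d")


surname = standardise_surname("MC DONALD")
birth_date = standardise_date("12th August 1982")
```

Parsing month names with `%B` relies on an English locale, which is a further assumption of this sketch.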
2. Load Demographic Details
The demographic details are loaded into tables in a relational database. There are different tables for different datasets because not all datasets have the same variables.
3. Run Linkage Engine and Load Links
The linkage program runs comparisons between two datasets. Linkage strategies are customised according to the individual characteristics of each dataset. Some links pass as automatic matches, some are automatic rejections, and some fall into a “grey area” in between, where links are manually checked by Data Engineers for validity.
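The three outcomes described above (automatic match, automatic rejection and a grey area for manual review) can be illustrated with a two-threshold rule. The similarity measure, field names and threshold values below are assumptions chosen for illustration and do not describe the linkage engine itself.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real values would be tuned for each dataset.
AUTO_MATCH = 0.95
AUTO_REJECT = 0.60
FIELDS = ("name", "dob", "address")


def similarity(a, b):
    """Average string similarity across the compared fields (0.0 to 1.0)."""
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]
    return sum(scores) / len(scores)


def classify(a, b):
    """Return 'match', 'reject' or 'review' for a pair of records."""
    score = similarity(a, b)
    if score >= AUTO_MATCH:
        return "match"
    if score < AUTO_REJECT:
        return "reject"
    return "review"  # grey area: left for a person to check


rec_1 = {"name": "JANESMITH", "dob": "19820812", "address": "1 HAY ST PERTH"}
rec_2 = {"name": "JANESMITH", "dob": "19820812", "address": "99 OCEAN RD ALBANY"}
rec_3 = {"name": "ROBERTBROWN", "dob": "19500101", "address": "99 OCEAN RD ALBANY"}
```

Here `classify(rec_1, rec_2)` falls into the grey area because the name and date of birth agree but the address does not.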
With more than 1.2 million records, on average, being linked every week, and a dynamic and constantly changing system, it is important to ensure that the links we make between records and chains are of the highest quality.
There are many ways to assess the quality of both existing and proposed links and our Data Engineers employ a variety of strategies and tools to ensure that our linkage system contains the highest quality links. Linkage strategies are also regularly revisited to ensure that the system of links is continually refined and improved.
4. Extract Linkage Keys
Customised project-specific linkage keys are extracted by encrypting the “linkage key” for each chain of records. These are the keys that have service data attached by the various data collections.
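One way to picture project-specific keys is shown below. The sketch uses a keyed hash (HMAC-SHA256) as a stand-in for the encryption step, which is an assumption. The identifier format, the secrets and the key length are also hypothetical.

```python
import hashlib
import hmac


def project_key(root_id, project_secret):
    """Derive a project-specific key from an internal chain identifier.

    A keyed hash stands in for encryption in this sketch.
    """
    digest = hmac.new(project_secret, root_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


key_for_project_a = project_key("LP000123", b"hypothetical-secret-A")
key_for_project_b = project_key("LP000123", b"hypothetical-secret-B")
```

The same chain receives the same key every time within a project, but a different key in another project, so extracts from separate projects cannot be joined on the key alone.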
Changes to Linkage Keys
Please be aware that the WA Data Linkage System (WADLS) is a dynamic system where data and links are created, modified and deleted on a regular basis. Although linkage keys are considered to be “person identifiers”, they are not unchanging. When linkage keys are extracted, each one is designated by selecting the “master” record ID (also known as the “ROOT LPNO”) from the given chain of records. The algorithm that performs this task is designed to maximise the stability and consistency of the keys; however, some degree of variation is unavoidable.
Changes may occur when:
- New data is loaded and linked
- Previously unlinked records are belatedly linked into a chain
- Two or more chains are found to belong to the same person and are merged together
- A single chain is found to belong to multiple people and is split apart
- A record is deleted from the WADLS (e.g., at the direction of the Data Provider)
- A dataset is reloaded in a new format (e.g., after a database migration at the source)
Please note that this limitation may affect requests for data updates, as well as any project that requires multiple iterative data extractions. See the Amendments and Data Updates page for more information.
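The effect of a merge on a linkage key can be shown with a small sketch. It assumes, purely for illustration, that the root of a chain is its smallest record ID; the actual rule for choosing the master record is not described here.

```python
class Chains:
    """Tracks which records belong together.

    Assumption for this sketch: the smallest record ID in a chain is its root.
    """

    def __init__(self):
        self.parent = {}

    def add(self, record_id):
        self.parent.setdefault(record_id, record_id)

    def root(self, record_id):
        # Follow parent pointers until reaching a record that is its own parent.
        while self.parent[record_id] != record_id:
            record_id = self.parent[record_id]
        return record_id

    def merge(self, a, b):
        low, high = sorted((self.root(a), self.root(b)))
        # The smaller ID stays as root, so the other chain's key changes.
        self.parent[high] = low


chains = Chains()
for record_id in (101, 205, 310):
    chains.add(record_id)

chains.merge(205, 310)
key_before = chains.root(310)

chains.merge(101, 205)  # two chains are found to belong to the same person
key_after = chains.root(310)
```

Record 310 is keyed by 205 in the first extraction and by 101 after the merge. This is why a later data update may return a different key for the same person.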
The process of extracting data for projects involves multiple teams at the Department of Health, as well as the assistance of Data Collections from the Department of Health and other external agencies. Once necessary approvals have been granted (e.g., by relevant Ethics Committees or stakeholders), the Data Outputs team extracts requested data according to the following stages:
1. Identify Study Population
First, the study population is selected. This can be done via a variety of methods, such as a new linkage, where the Applicant already has the study population chosen, or via selection from one or more Data Collections. For example:
- All participants of a research project (a new linkage)
- All people who went to hospital for a colonoscopy (from the Hospital Morbidity Data Collection)
- All people with colorectal cancer (from WA Cancer Registry)
- People in all of these groups (by combining all of the records from the three sources above)
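Combining selections like those listed above amounts to set operations on linkage keys. The keys below are made up, and whether a project needs people from any source or only people found in every source depends on the approved request.

```python
# Hypothetical linkage keys for three selection sources.
research_participants = {"K01", "K02", "K03"}
colonoscopy_patients = {"K02", "K03", "K04"}
colorectal_cancer_cases = {"K03", "K05"}

# Everyone who appears in at least one source.
in_any_group = research_participants | colonoscopy_patients | colorectal_cancer_cases

# Only people who appear in every source.
in_every_group = research_participants & colonoscopy_patients & colorectal_cancer_cases
```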
Other associated study groups may also be selected, for example control or comparison group selections (e.g., random sample of people from the WA Electoral Roll who are the same age and gender as the cases) and Family Connections groups (e.g., children of the cases).
2. Extract Linkage Keys
Once the study population is defined, the Linkage Team extracts the encrypted linkage keys for each requested dataset. The Project Manager then distributes these lists of keys to the relevant data collections for the service data to be attached.
3. Attach Content Data
The Data Custodians arrange for the requested content data from their collection to be attached to the linkage keys. For some Data Collections, the Data Outputs team can perform this process using the Custodian Administered Research Extract Server (CARES), which is part of the WA Health Enterprise Linked Data Warehouse.
For many data collections, the files are sent to the ISPD Client Services Team to coordinate quality checking. For some datasets, the content data may be released directly to the Applicant, at the discretion of the Data Custodian.
4. Quality Assurance
A Quality Assurance service for content data is offered prior to release to Data Applicants. Data files are checked for:
- Compliance with cohort selection and extraction filtering criteria
- Compliance with approved variable lists
- No inadvertent disclosure of identifying information
- Internal consistency of linkage keys
- Standardised data format where possible
Supporting information is also prepared during the Quality Assurance process, including the collation of data dictionaries, reference documentation about how the data was prepared and any notes that may impact on data analysis.
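Some of the checks listed above lend themselves to simple automation. The sketch below covers three of them (approved variables, identifying fields and presence of a linkage key). The function name, field names and rules are assumptions rather than a description of the checks actually run.

```python
def check_extract(rows, approved_variables, identifying_fields):
    """Return a list of problems found in an extract (empty if none found)."""
    problems = []
    for number, row in enumerate(rows, start=1):
        unapproved = set(row) - set(approved_variables)
        if unapproved:
            problems.append(f"row {number}: unapproved variables {sorted(unapproved)}")
        leaked = set(row) & set(identifying_fields)
        if leaked:
            problems.append(f"row {number}: identifying fields {sorted(leaked)}")
        if not row.get("linkage_key"):
            problems.append(f"row {number}: missing linkage key")
    return problems


approved = ["linkage_key", "admission_date", "diagnosis"]
identifying = ["name", "address"]

clean_rows = [{"linkage_key": "K1", "admission_date": "20200101", "diagnosis": "J45"}]
faulty_rows = [{"linkage_key": "", "name": "JANE", "diagnosis": "J45"}]

clean_report = check_extract(clean_rows, approved, identifying)
faulty_report = check_extract(faulty_rows, approved, identifying)
```

The faulty row is reported three times: once for the unapproved variable, once for the identifying field and once for the empty linkage key.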
Data is encrypted and released to approved personnel via secure online file transfer.
5. Data Release
The ISPD Client Services Team prepares the data for release by encrypting it and applying password protection. The data is then released to the Applicant via secure online transfer.
Linked Data Preview
Linked data files are usually provided as tab-delimited text files. They will be encrypted and password protected using WinZip or 7-Zip. On receipt of the data, Applicants will also be given supporting reference documentation.
The following is an example of a file a researcher might receive.
Please refer to the Dataset Information page for data dictionaries.
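Once unzipped, a tab-delimited file can be read with standard tools. The sketch below uses made-up column names and values; the real layout of each file is described in the relevant data dictionary.

```python
import csv
import io

# Hypothetical file content; column names and values are illustrative only.
sample = (
    "linkage_key\tadmission_date\tdiagnosis\n"
    "A1B2C3\t20200101\tJ45\n"
    "A1B2C3\t20210315\tS52\n"
)


def read_extract(handle):
    """Read a tab-delimited extract into a list of dictionaries keyed by column."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_extract(io.StringIO(sample))
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="")`, where `path` is the location of the unzipped file.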
Project Facilitation and Advice
ISPD Client Services offers a centralised service to assist Data Applicants. As part of this service, the ISPD Client Services team coordinates the Application for Data process, incorporating:
- Stage 1: Pre-application
- Stage 2: Preliminary review
- Stage 3: Data Custodian Feasibility Assessment
- Stage 4: Ethical Approval
- Stage 5: Data Custodian Formal Review
- Stage 6: Research Governance Approvals
- Stage 7: Data Linkage
- Stage 8: Data Extraction
- Stage 9: Data Delivery
- Stage 10: Invoicing
- Stage 11: Project Closure
If you would like to request advice on available datasets, project design, data governance, logistics for data delivery, or cost estimates, please get in touch with ISPD Client Services using our contact form.
For any enquiries related to ethics or ethical approval, please contact HREC@health.wa.gov.au.
For enquiries related to research governance, including site authorisation, please contact DoH.RGO@health.wa.gov.au.
Derived Aboriginal and Torres Strait Islander Status Flag
Based on the work of the Getting Our Story Right project, the Data Linkage Services team can generate a Derived Aboriginal and Torres Strait Islander Status Flag.
A validated algorithm is used to create this flag for any individual with at least one record in a number of WA government administrative data sets where Aboriginal and Torres Strait Islander status is recorded.
The algorithm uses the information from several records in an individual’s chain to produce an overall derived Aboriginal and Torres Strait Islander status of “Yes”, “No” or “Missing”, for that individual.
Datasets used to derive Aboriginal and Torres Strait Islander status include:
- WA Birth Registrations
- WA Death Registrations
- Midwives Notifications
- Hospital Morbidity Data Collection Records
- Emergency Department Data Collection Records
The number of records used to assign this information varies depending on the number and type of datasets and records in the chain. Data recipients must note that their project could receive data for an individual where the data sets provided report an Aboriginal and Torres Strait Islander status of “No” but the Aboriginal and Torres Strait Islander Status Flag is “Yes”.
The Aboriginal and Torres Strait Islander Status Flag reflects the status indicated across all available collections and records for a person, and may therefore differ from what is reported in a specific record or collection.
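The sketch below shows why a chain-level flag can differ from the value on a single record. It uses a deliberately simple proportion rule with a hypothetical threshold. This is an assumption for illustration only and is not the validated algorithm referred to above.

```python
def derive_status(chain_statuses, threshold=0.5):
    """Derive one status for a chain from the statuses on its records.

    Illustrative rule only (NOT the validated algorithm): 'Yes' when at
    least `threshold` of the records with a recorded status say 'Yes'.
    """
    recorded = [s for s in chain_statuses if s in ("Yes", "No")]
    if not recorded:
        return "Missing"
    share_yes = recorded.count("Yes") / len(recorded)
    return "Yes" if share_yes >= threshold else "No"


# Hypothetical chain: one record says "No", yet the derived flag is "Yes".
chain = ["Yes", "Yes", "No", None]
flag = derive_status(chain)
```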
For more information on the Getting Our Story Right project, a peer reviewed publication is available:
- Christensen D et al. Evidence for the use of an algorithm in resolving inconsistent and missing Indigenous status in administrative data collections. Australian Journal of Social Issues. Vol.49. No 4. 2014 pp 423-443. doi.org/10.1002/j.1839-4655.2014.tb00322.x
The WA Family Connections System contains links between individuals who are related, created using information recorded on original Birth Registrations and Midwives’ Notifications. These relationships are usually (but not always) biological. No information is known about adoptions, including step, local or overseas adoptions.
Currently, the genealogy held by the Department of Health includes parents and siblings of people born in WA since 1945. Extended family members (including grandparents, grandchildren, cousins, aunts and uncles) can also be identified.
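Given parent links of the kind recorded on birth records, siblings and grandparents can be derived by following those links. The data structure and identifiers below are hypothetical and do not reflect how the Family Connections System stores relationships.

```python
# Hypothetical parent links keyed by child; identifiers are illustrative only.
parents = {
    "C1": {"mother": "M1", "father": "F1"},
    "C2": {"mother": "M1", "father": "F1"},
    "C3": {"mother": "M1", "father": "F2"},
    "M1": {"mother": "G1", "father": "G2"},
}


def recorded_parents(person):
    """Parents recorded for a person, ignoring missing values."""
    return {p for p in parents.get(person, {}).values() if p}


def siblings(person):
    """People who share at least one recorded parent (includes half-siblings)."""
    own = recorded_parents(person)
    return {
        other for other in parents
        if other != person and own & recorded_parents(other)
    }


def grandparents(person):
    """Recorded parents of the person's recorded parents."""
    found = set()
    for parent in recorded_parents(person):
        found |= recorded_parents(parent)
    return found
```

In this example `siblings("C1")` includes "C3", who shares only a mother, and `grandparents("C1")` is found through the mother's own birth record.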
The availability of Family Connections information arose from the WA Family Connections Project, which was started in 2003. This project involved creating genealogical links from information recorded on Registrations that are available electronically. Initially focusing on Birth Registrations dating back to 1974, a later phase incorporated newly digitised records from as early as 1945.
Population-based genealogies are rare due to the challenges of developing and maintaining such a resource on a large scale. The combination of genealogy and health data for the WA population represents a unique opportunity to investigate the inheritance of human disease.
Data may be used to assess the degree of relatedness of individuals within study samples, locate common ancestors, estimate genetic risk and describe the familial burden of comorbid conditions.
For more information on the WA Family Connections Project, a peer reviewed article is available:
- EJ Glasson et al. Cohort Profile: The Western Australian Family Connections Genealogical Project. International Journal of Epidemiology, Volume 37, Issue 1, Feb 2008, pp 30–35. https://doi.org/10.1093/ije/dym136
Geocoding is a process that involves converting an address into a latitude/longitude, using a set of reference data. This map point can then be placed within spatial boundaries such as the Statistical Area Levels 1 and 2 (SA1 and SA2 respectively) and Local Government Area (LGA).
Data Linkage Services assigns the boundaries and derives the indices using mapping and concordance tables created by the Australian Bureau of Statistics (ABS). More information can be found on the ABS website in the Census Reference area.
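Placing a geocoded point within a boundary can be sketched with a standard ray-casting test. The rectangular boundaries and area codes below are made up; real boundaries are far more detailed and come from ABS reference data.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point inside the polygon of (lon, lat) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            crossing = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < crossing:
                inside = not inside
    return inside


# Hypothetical rectangular boundaries, for illustration only.
boundaries = {
    "AREA_A": [(115.80, -32.00), (115.90, -32.00), (115.90, -31.90), (115.80, -31.90)],
    "AREA_B": [(115.90, -32.00), (116.00, -32.00), (116.00, -31.90), (115.90, -31.90)],
}


def assign_area(lon, lat):
    """Return the code of the first boundary containing the point, or None."""
    for code, polygon in boundaries.items():
        if point_in_polygon(lon, lat, polygon):
            return code
    return None
```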
The Department of Health currently has geocoded data for all census years from 1996 until 2016.
Routinely geocoded data includes:
- Midwives Notifications
- Hospital Morbidity Data Collection Data
- Emergency Department Data Collections
- Death Registrations
- Mental Health Information System
Privacy Preserving Record Linkage
Privacy Preserving Record Linkage (PPRL) is a technique that enables data linkage without using personal identifying information. Fields such as name, address and date of birth are irreversibly ‘hashed’ using specialised software to derive a new string that still enables record linkage but does not identify an individual.
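The idea of hashing identifying fields can be illustrated as below, using a keyed hash (HMAC-SHA256) with a secret shared between the parties. This is an assumption made for the sketch and supports exact matching only. Specialised PPRL software may use encodings that tolerate spelling variation, which this sketch does not attempt.

```python
import hashlib
import hmac


def hash_field(value, shared_secret):
    """Irreversibly hash one identifying field after basic normalisation."""
    normalised = "".join(value.upper().split())
    return hmac.new(
        shared_secret, normalised.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def hash_record(record, shared_secret):
    """Hash every field of a record of identifying values."""
    return {field: hash_field(value, shared_secret) for field, value in record.items()}


secret = b"hypothetical-shared-secret"
record_a = hash_record({"name": "Jane Citizen", "dob": "19820812"}, secret)
record_b = hash_record({"name": "JANE  CITIZEN", "dob": "19820812"}, secret)
```

The two hashed records agree on every field, so they can be linked even though neither contains a readable name or date of birth.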
The Department of Health can provide hashed data for record linkage at third party linkage agencies.
For further information, please view the Privacy Preserving Record Linkage (PPRL) Guide.
Matched Comparison Group Selection
The Department of Health’s Data Linkage Services team has the capability to select comparison or control populations for study cohorts to facilitate case-control studies.
Controls can be selected to meet several criteria, including:
- Demographic characteristics (e.g., year of birth, sex)
- Location (postcodes, SEIFA or other geographical features)
- Clinical outcomes or characteristics (e.g., Admitted to hospital within the same year and month)
Controls are usually frequency matched; however, if required, they can be individually matched to cases.
For adult cases, controls are usually selected from the Electoral Roll, such that each control was a current elector during the year in which a corresponding case had their “index event”. This does not guarantee they were resident in WA at the time, but makes it more probable.
For child cases, controls can be selected from Birth Registrations or from Midwives Notifications. Postcode matching is not very reliable in these cases: the address on the Midwives Notification of Case Attended Form is not necessarily the mother’s usual address and not all birth registrations have the parents’ address. The Department of Health has no means of ensuring that children selected as controls were still resident in WA on the case’s index date.
Controls may be selected from other datasets, if appropriate for the given study.
Cases are most often excluded from being controls. Stillborn children will be excluded from controls selected from Midwives/Births unless specifically requested otherwise. It is also possible to exclude known relatives via the Family Connections System.
The Department of Health can also exclude people based on information in other datasets, e.g., exclude all women who are known to have had a hysterectomy, or exclude men who had lung cancer in a certain time period.
Control to Case Ratio
Data Applicants should advise ISPD Client Services of their required ratio, noting that a high number of controls (e.g., 10:1 or above) may require justification. In some cases, the matching criteria may need to be relaxed to find a suitable control (e.g., where a large number are requested or the criteria are highly specific).
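Frequency matching with a control to case ratio can be sketched as follows. The matching variables (year of birth and sex), the field names and the random sampling approach are assumptions for illustration, not a description of the selection procedure actually used.

```python
import random
from collections import defaultdict


def select_controls(cases, candidates, ratio, seed=0):
    """Frequency-match controls to cases on year of birth and sex.

    For each (year_of_birth, sex) stratum, sample `ratio` controls per case
    from candidates who are not themselves cases.
    """
    rng = random.Random(seed)
    case_ids = {c["id"] for c in cases}

    cases_per_stratum = defaultdict(int)
    for c in cases:
        cases_per_stratum[(c["year_of_birth"], c["sex"])] += 1

    pool = defaultdict(list)
    for p in candidates:
        if p["id"] not in case_ids:  # cases are excluded from being controls
            pool[(p["year_of_birth"], p["sex"])].append(p)

    controls = []
    for stratum, n_cases in cases_per_stratum.items():
        wanted = n_cases * ratio
        available = pool[stratum]
        # If the pool is too small, take everyone; criteria may need relaxing.
        controls.extend(rng.sample(available, min(wanted, len(available))))
    return controls
```

When a stratum has too few candidates, the sketch returns fewer controls than requested rather than relaxing the criteria automatically, leaving that decision to the people running the selection.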
For some requests, the Department of Health is able to provide a sample of identifying information selected from the Western Australian Electoral Roll.
These samples can be selected based on a variety of characteristics, and can be used for various purposes, including study invitations, pursuant to the appropriate approvals being granted.