Kickstart with Amazon HealthLake



Over the past decade, the healthcare landscape has undergone a digital transformation. Digitization has led to an immense volume of data being collected in the form of clinical notes, lab reports, insurance claims, and more.

AWS HealthLake aims to ingest this structured and unstructured data from disparate on-premises systems and normalize it using NLP algorithms. The service identifies each piece of clinical information, tags and indexes events in a timeline view with standardized labels so they can be easily searched, and structures all of the data into the Fast Healthcare Interoperability Resources (FHIR) industry-standard format for a complete view of the health of individual patients and entire populations. As a result, the data lake makes it easier for customers to query the newly normalized data, perform analytics on it, and run machine learning models to derive meaningful value from it.


How does AWS HealthLake work?

Manually tagging and indexing this unstructured data can take weeks or months (8 hours per patient on average). Furthermore, the cost and complexity of this work are prohibitive for most healthcare organizations. AWS HealthLake supports copying this data as well as structuring and tagging it using the industry-standard FHIR structure. All the data is indexed to make it easy to query, search, and analyze. The service performs the following activities to achieve this:

Import: The service imports data from disparate legacy systems into a centralized AWS data lake. It uses FHIR structuring to ensure uniformity when importing data across separate systems. The imported data can include health data, clinical notes, insurance claims, and more.

Store: The imported data is stored on the AWS cloud in a secure, compliant, and auditable way. The service can handle data at petabyte scale.

Transform: The unstructured data is normalized using NLP techniques and tagged so that it can be queried and searched easily. The data is tagged using keys such as diagnoses, medical indications, medicines, and many more. Adding these tags to the existing data makes it more useful for a wide variety of use cases.

Query and Search: Because the data is now indexed and tagged with domain-specific keys, it is easier to query and search, and it exposes more data points to train models on.

Analyze: The indexing makes analysis more efficient and helps identify trends and make predictions with integrated analytics and ML capabilities.
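
To make "FHIR structuring" a little more concrete, here is a minimal sketch of what two linked FHIR R4 resources can look like, written as plain Python dictionaries. All identifiers and values are made up for illustration, and the helper function references_patient is hypothetical; it is not part of HealthLake.

```python
import json

# Illustrative FHIR R4 resources; every identifier and value here is made up.
patient = {
    "resourceType": "Patient",
    "id": "example-patient-1",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1975-04-12",
}

condition = {
    "resourceType": "Condition",
    "id": "example-condition-1",
    # FHIR links resources with "ResourceType/id" references.
    "subject": {"reference": "Patient/example-patient-1"},
    "code": {"text": "Hypertension"},
}


def references_patient(resource: dict, patient: dict) -> bool:
    """Return True if the resource's subject points at the given patient."""
    expected = f"{patient['resourceType']}/{patient['id']}"
    return resource.get("subject", {}).get("reference") == expected


print(json.dumps(condition, indent=2))
print(references_patient(condition, patient))
```

Because every record carries a resourceType and references other records in the same way, data from separate source systems can be joined and queried uniformly once it is in this shape.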

Leveraging the digital transformation in Healthcare

In the current landscape, medical data is usually unstructured and messy. Clinical models built on this data can extract 30 data points at best, because they train on only the simplest features. The data is stored in legacy systems and is often sequential, with no tagging or classification. When data from these separate systems is put together with context, it can help create a complete view of a patient’s health information.

Manually labeling and tagging this data is not practical. Using the HealthLake service, we can normalize this data and make it easy to analyze. The service extracts data points from physician notes, lab reports, and insurance claims. If tagged and indexed efficiently, the same data can yield 200,000 to 300,000 data points on average, which can be leveraged to build more advanced models.

How does AWS HealthLake extract medical data?

Extracting medical data and making sense of it in its medical context is not an easy task. In a medical data store, all entities are interconnected. Even if one entity is missing, we can leverage the relationships between the other entities to still extract a good number of data points. A medical data store should therefore be able to reach this data and understand the relationships between entities to be useful for analysis and model training. The service uses this interconnected nature of the data to make sense of it.

Identifying medical terminology and context: A generic medical data source has data entities collected from various systems, be it clinical notes, lab results, etc. When a single entity is missing, the other entities and the relationships between them play a critical role in understanding the complete story. AWS HealthLake helps structure this data from various systems, then tags the different entities with labels such as medical conditions, indications, diagnoses, and medicines prescribed. Apart from this, the service identifies relationships between entities, such as the dosage for a medication or the tests and test results that led to a diagnosis. It also adds concepts like anatomical locations and the time at which the specific data was taken.


As in the image above, we have a few lines from a sample physician’s note. The service extracts data entities such as the patient’s age, the diagnoses, and the time of the diagnosis or of the symptom observation that led to it.

Similarly, the service extracts medical terms and abbreviations such as BP. This tagging is converted to JSON, which is then appended to the original text, giving the user a detailed view of the actual information. Having this level of granularity improves the efficacy of analysis across a wide variety of use cases.
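
The idea of tags being stored as JSON alongside the original text can be sketched as follows. This is purely an illustration: the note text, the Entity class, the category labels, and the output layout are all hypothetical and do not reproduce HealthLake's actual output schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Entity:
    text: str        # span copied verbatim from the note
    category: str    # hypothetical label, not an official HealthLake category
    attributes: dict = field(default_factory=dict)  # related details, e.g. dosage


def annotate(note: str, entities: list) -> dict:
    """Attach character offsets to each entity and bundle them with the note."""
    annotated = []
    for entity in entities:
        begin = note.find(entity.text)
        if begin < 0:
            raise ValueError(f"entity {entity.text!r} not found in note")
        annotated.append({**asdict(entity), "begin": begin, "end": begin + len(entity.text)})
    return {"text": note, "entities": annotated}


# A made-up physician's note and hand-written tags for it.
note = "56 y/o male with elevated BP since 2019, started on lisinopril 10 mg daily."
entities = [
    Entity("56", "AGE"),
    Entity("elevated BP", "MEDICAL_CONDITION", {"onset": "2019"}),
    Entity("lisinopril", "MEDICATION", {"dosage": "10 mg", "frequency": "daily"}),
]

result = annotate(note, entities)
print(json.dumps(result, indent=2))
```

Keeping the offsets and the related attributes (onset, dosage) next to the original text is what lets a downstream query ask for, say, all patients on a given medication at a given dose without re-reading the free text.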


How to get started?

1. Create a data store and import your health data

  • Create a datastore using the AWS console.
  • The service lets you import your data from an on-premises system or test out the service using sample data from Synthea (synthetic patient data).
  • To import your own data you can choose the S3 location and set the IAM role permissions so that HealthLake can import your data.
  • The datastore is scalable so it can scale up and down depending upon the amount of data or queries you require.
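
The same step can also be scripted instead of using the console. The sketch below only builds the request parameters for the boto3 "healthlake" client calls create_fhir_datastore and start_fhir_import_job; the parameter names reflect my understanding of that client and should be checked against the current boto3 documentation. All bucket names, ARNs, and IDs are placeholders.

```python
import uuid


def datastore_request(name: str, preload_synthea: bool = False) -> dict:
    """Build keyword arguments for healthlake.create_fhir_datastore."""
    params = {"DatastoreName": name, "DatastoreTypeVersion": "R4"}
    if preload_synthea:
        # Ask the service to preload the Synthea sample data set.
        params["PreloadDataConfig"] = {"PreloadDataType": "SYNTHEA"}
    return params


def import_request(datastore_id: str, input_s3_uri: str, output_s3_uri: str,
                   kms_key_id: str, role_arn: str,
                   job_name: str = "initial-import") -> dict:
    """Build keyword arguments for healthlake.start_fhir_import_job."""
    return {
        "JobName": job_name,
        "DatastoreId": datastore_id,
        "InputDataConfig": {"S3Uri": input_s3_uri},
        # Location where the job writes its output/log files.
        "JobOutputDataConfig": {
            "S3Configuration": {"S3Uri": output_s3_uri, "KmsKeyId": kms_key_id}
        },
        # IAM role that lets HealthLake read from the input bucket.
        "DataAccessRoleArn": role_arn,
        "ClientToken": str(uuid.uuid4()),  # idempotency token
    }


create_params = datastore_request("demo-datastore", preload_synthea=True)
import_params = import_request(
    datastore_id="EXAMPLE_DATASTORE_ID",
    input_s3_uri="s3://example-input-bucket/fhir/",
    output_s3_uri="s3://example-output-bucket/import-logs/",
    kms_key_id="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
    role_arn="arn:aws:iam::111122223333:role/ExampleHealthLakeImportRole",
)

# Usage with boto3 (needs AWS credentials, so it is left commented out):
#   import boto3
#   client = boto3.client("healthlake")
#   client.create_fhir_datastore(**create_params)
#   client.start_fhir_import_job(**import_params)
```

Note that data store creation is asynchronous, so in practice an import would only be started once the data store reports an active status.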

2. Create, update, or delete your health data

  • Once your data is in the system you can manage your data using FHIR query operations.
  • All CRUD operations are supported by the query service.
  • Note that the delete operation only hides the data from analysis and results; the data is not completely deleted from the service.
  • You can also run a search query on your entire data resource at once. Since this is a FHIR resource, the service automatically suggests a set of parameters to filter the data and help you get to the specific record you are searching for.
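
FHIR query operations are plain REST calls, so the CRUD and search steps above can be sketched as URL construction. The endpoint below follows the pattern I understand HealthLake data stores to use, but the region and data store ID are placeholders, and the helper names are hypothetical. Real requests also need to be signed with AWS Signature Version 4, which this sketch leaves out.

```python
from urllib.parse import urlencode

# Standard FHIR RESTful mapping of CRUD operations to HTTP methods.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def fhir_resource_url(endpoint: str, resource_type: str, resource_id: str = "") -> str:
    """URL for a resource type, or for one resource when an id is given."""
    url = endpoint.rstrip("/") + "/" + resource_type
    return f"{url}/{resource_id}" if resource_id else url


def fhir_search_url(endpoint: str, resource_type: str, **params: str) -> str:
    """URL for a FHIR search such as Patient?family=Doe."""
    base = fhir_resource_url(endpoint, resource_type)
    return f"{base}?{urlencode(params)}" if params else base


# Placeholder endpoint; substitute your own region and data store ID.
endpoint = "https://healthlake.us-east-1.amazonaws.com/datastore/EXAMPLE_DATASTORE_ID/r4/"

read_url = fhir_resource_url(endpoint, "Patient", "example-patient-1")
search_url = fhir_search_url(endpoint, "Patient", family="Doe", gender="female")

print(CRUD_TO_HTTP["read"], read_url)
print("GET", search_url)
```

Search parameters such as family and gender come from the FHIR specification for the Patient resource, which is why the service can suggest them for each resource type.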

3. Create dashboards or build ML models on all your health data

  • Once the service has normalized the data and appended the JSON tags to it, you can share this data with external services through APIs for analysis or for building models on the normalized data.
  • Similar to import, the service lets you export your data back to an S3 location. Thus, the user will have the normalized data ready to be used within their S3 bucket.
  • The user can easily access this data using service APIs or directly from the S3 location.
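
Once the data has been exported to S3, a first look at it can be as simple as counting resources by type. The sketch below assumes the exported files are newline-delimited JSON with one FHIR resource per line; that format is an assumption here, and the sample records are made up. The function name count_resource_types is hypothetical.

```python
import json
from collections import Counter


def count_resource_types(ndjson_text: str) -> Counter:
    """Count FHIR resources by resourceType in newline-delimited JSON text."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        resource = json.loads(line)
        counts[resource.get("resourceType", "Unknown")] += 1
    return counts


# Made-up sample standing in for the contents of one exported file.
sample = "\n".join([
    json.dumps({"resourceType": "Patient", "id": "p1"}),
    json.dumps({"resourceType": "Condition", "id": "c1"}),
    json.dumps({"resourceType": "Condition", "id": "c2"}),
])

counts = count_resource_types(sample)
print(dict(counts))
```

The same loop is a natural starting point for feeding the normalized records into a dashboard or a feature pipeline for model training.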


Potential Use Cases:

  • Matching clinical trial patient population.
  • Analyzing population health trends.
  • Optimizing availability of health care resources.
  • Helping organizations to analyze healthcare data and improve decision making.
