Data Hub Overview

The proliferation of sensor technologies and advances in data collection methods have enabled the accumulation of very large amounts of data, and these datasets are increasingly being considered for scientific research. However, designing a system architecture that achieves high performance in parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), while also making scientific research easier to reproduce, remains a major challenge. This is especially true for health sciences research, where systems must be: i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of programming language kernel, iii) scalable, and iv) compliant with HIPAA privacy requirements.

To meet this challenge, RCHE research scientist Mohammad Adibuzzaman, RCHE faculty affiliate Ananth Grama, and RCHE-funded computer science Ph.D. student Fatemeh Rouzbeh developed and implemented a novel architecture for a software-hardware-data ecosystem over the past year, using open-source technologies in a distributed environment. The platform consists of four layers: storage, computation, operation, and application. The storage layer handles the data and indexes. The computation layer is responsible for distributed computations. The operation layer provides a programming language interface for developing reusable components that analyze and process the data. Finally, the application layer offers users multiple ways to interact with the system. The system supports several types of data sources, including images, structured data (such as electronic health records and claims), waveform data (such as from ECGs or smartwatches), and clinical notes.
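To make the division of responsibilities among the four layers concrete, the following is a minimal, purely illustrative sketch in Python. It is not Data Hub's actual implementation: every class and method name here is hypothetical, the in-memory dictionaries stand in for real storage and indexes, and a local thread pool stands in for a distributed computing cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class StorageLayer:
    """Hypothetical storage layer: holds records plus an index by data type."""
    records: Dict[int, dict] = field(default_factory=dict)
    index: Dict[str, List[int]] = field(default_factory=dict)

    def put(self, record_id: int, data_type: str, payload: dict) -> None:
        self.records[record_id] = payload
        self.index.setdefault(data_type, []).append(record_id)

    def by_type(self, data_type: str) -> List[dict]:
        # Use the index to fetch only the records of the requested type.
        return [self.records[i] for i in self.index.get(data_type, [])]


class ComputationLayer:
    """Hypothetical computation layer: threads stand in for a cluster."""

    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def map(self, fn: Callable[[dict], Any], items: List[dict]) -> List[Any]:
        # Executor.map returns results in the same order as the inputs.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))


class OperationLayer:
    """Hypothetical operation layer: a registry of reusable components."""

    def __init__(self) -> None:
        self.components: Dict[str, Callable[[dict], Any]] = {}

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        self.components[name] = fn


class ApplicationLayer:
    """Hypothetical application layer: the entry point a user interacts with."""

    def __init__(self, storage: StorageLayer, compute: ComputationLayer,
                 ops: OperationLayer) -> None:
        self.storage, self.compute, self.ops = storage, compute, ops

    def run(self, component: str, data_type: str) -> List[Any]:
        fn = self.ops.components[component]
        return self.compute.map(fn, self.storage.by_type(data_type))


# Made-up sample data, for illustration only.
storage = StorageLayer()
storage.put(1, "waveform", {"samples": [60, 62, 61]})
storage.put(2, "waveform", {"samples": [80, 82]})
storage.put(3, "ehr", {"age": 54})

ops = OperationLayer()
ops.register("mean_rate", lambda r: sum(r["samples"]) / len(r["samples"]))

app = ApplicationLayer(storage, ComputationLayer(), ops)
result = app.run("mean_rate", "waveform")
print(result)
```

In this sketch a reusable component is registered once in the operation layer and can then be applied to any data type through the application layer, while the storage and computation layers remain interchangeable behind their interfaces.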

Leveraging Purdue’s advanced engineering and computer science expertise, Data Hub serves as a collaborative hub for integrating life science data with enhanced analytics, providing a functional backbone and logistical support for both the Purdue and broader research communities. By bringing the various available sources of data (e.g., electronic health records, medical claims, genomics, and wearables) onto one platform, Data Hub gives investigators access to more robust and diverse datasets for solving critical life science research problems. The cyberinfrastructure of Data Hub is a cluster computing infrastructure architected by the Regenstrief Center in collaboration with the Department of Computer Science. Hardware, security, and HIPAA-aligned server maintenance are supported by the Rosen Center for Advanced Computing at Purdue. The framework is built with open-source technologies to support scalability and the aggregation of data from heterogeneous sources. Currently, the system has a storage capacity of 300 terabytes and hosts several large EHR and claims data sources, including Cerner Health Facts (69M patients), Indiana Medicaid claims data, and Purdue employee claims data, among others.