Medical & Healthcare Datasets

Physician Dictation Text, CT Scan, MRI, and X-ray Image Data, POS-Tagged and NER-Annotated Datasets

Off-the-Shelf Datasets

We offer pre-built, ready-to-use datasets packaged according to your specific business and research objectives.
By leveraging our existing data catalog, you can acquire exactly the data you need with rapid turnaround, significantly reducing time-to-delivery and accelerating project execution.

Deep Medical Domain Expertise

Our datasets span more than 30 medical specialties, including neurology, cardiovascular and circulatory diseases, cardiology, family medicine, oncology, and orthopedics. This breadth provides the domain depth required for high-quality medical AI development.

Privacy & Regulatory Compliance

All datasets are fully processed in accordance with the HIPAA Safe Harbor Guidelines, with rigorous removal of personally identifiable information. This ensures secure, privacy-compliant data suitable for enterprise and research use in regulated healthcare environments.

CT Scan Image Dataset

This machine learning dataset consists of high-resolution CT scan images collected from real patients and is designed for use by medical professionals and researchers.

  • Over 15,000 CT scan images
  • Covering the head, chest, abdomen, cervical spine, blood vessels, brain, and pelvis
  • Regional data coverage, including Asia, Europe, and India

MRI Image Dataset

This dataset consists of MRI images collected from real patients and can be used for machine learning applications related to a wide range of medical conditions, including neurological disorders and cardiovascular diseases.

  • Over 15,000 MRI images
  • Covering the head, abdomen, chest, hips, prostate, brain, and spine
  • Regional data coverage, including Asia, Europe, and India

X‑ray Image Dataset

This dataset consists of X‑ray images collected from real patients. It is a high-resolution image package suitable for machine learning in medical AI research and development.

  • Over 3,000 X-ray images
  • Covering the ankle, chest, pelvis, upper limbs, and lower limbs
  • Regional data coverage, including Asia, Europe, and India

Echocardiogram Dataset

This dataset contains DICOM images from real patients’ echocardiographic (cardiac ultrasound) examinations.
It is used for developing AI-based diagnostic imaging solutions for a wide range of cardiovascular diseases.

  • Over 60,000 echocardiogram images

  • Used for examinations and AI-assisted analysis of cardiovascular diseases, including
    valvular heart disease, myocardial infarction, cardiomyopathy, and congenital heart disease

  • Regional data coverage, including Brazil and other regions

Mammography Dataset

This dataset includes mammography DICOM images collected from real patients.
It is suitable for machine learning applications such as AI-based breast cancer screening and diagnostic imaging.

  • Over 1,000 DICOM images

  • Supporting AI-assisted radiology diagnostics for breast cancer detection, including masses, microcalcifications, and architectural distortion

  • Regional data coverage, including Brazil and other regions

Scintigraphy Examination Dataset

This dataset includes DICOM data from scintigraphy (nuclear medicine) examinations collected from real patients.
It provides high-resolution images suitable for machine learning tasks such as segmentation of structures and tissues in nuclear medicine scans.

  • Over 6,000 DICOM images
  • Suitable for AI model development to evaluate the effectiveness of treatments such as chemotherapy and radiation therapy
  • Regional data coverage, including Brazil and other regions

Physician Dictation Dataset

This dataset includes audio recordings of physicians dictating clinical information, such as patient symptoms and treatments, in real clinical settings.
All personally identifiable information has been removed. The dataset covers dictations across 31 medical specialties.

  • Over 200,000 hours of physician dictation audio

  • Spanning more than 30 medical specialties, including cardiology, family medicine, oncology, and orthopedics

  • Recording devices include telephones, electronic devices, and smartphones

  • Personally identifiable information removed in compliance with the HIPAA Safe Harbor Guidelines

Medical Records Dataset

A text-based dataset of medical records that can be used for tasks such as mapping patient medical histories and supporting treatment recommendations.

  • Over 200,000 hours of clinical text
  • Including treatment reports, discharge summaries, and emergency department (ED) notes
  • Personally identifiable information removed in compliance with the HIPAA Safe Harbor Guidelines

Electronic Health Record (EHR) Dataset

An Electronic Health Record (EHR) is a digital collection of medical records that may include patient medical histories, diagnoses, prescriptions, treatment plans, immunization dates, allergies, radiology images (CT scans, MRI, X‑rays), and clinical test results.

  • Over 5 million medical dictation audio files across more than 30 clinical specialties
  • De-identified electronic health records (EHRs) and medical metadata, including admission and discharge records, AMLOS, GMLOS, and hospital information
  • Demographic metadata, such as age groups and gender
  • Training data for medical NLP and Document AI models

NER & Entity Linking Data

A dataset for Named Entity Recognition (NER) and entity linking, focused on extracting medical entities such as symptoms, procedures, medications, and anatomical locations from medical documents.

  • Annotations applied to over 10,000 medical documents
  • Extraction of a wide range of medical entities, including Problems, Diagnoses, Procedures, and Medications
  • JSON Format
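
As a rough illustration of how NER annotations delivered as JSON might be consumed, the sketch below reads one record and recovers each entity's surface text from character offsets. The catalog states only that the data is in JSON format, so the record layout and every field name here (doc_id, text, entities, start, end, label) are assumptions made for the example, not the actual delivery schema.

```python
import json

# Hypothetical record layout: the catalog only says "JSON Format",
# so all field names below are assumptions for illustration.
SAMPLE = """
{
  "doc_id": "example-0001",
  "text": "Patient reports chest pain. Started aspirin 81 mg daily.",
  "entities": [
    {"start": 16, "end": 26, "label": "Problem"},
    {"start": 36, "end": 43, "label": "Medication"}
  ]
}
"""


def extract_entities(record):
    """Return (label, surface text) pairs, validating offsets against the text."""
    text = record["text"]
    pairs = []
    for ent in record["entities"]:
        start, end = ent["start"], ent["end"]
        # Reject spans that fall outside the document or are empty.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {ent}")
        pairs.append((ent["label"], text[start:end]))
    return pairs


record = json.loads(SAMPLE)
entities = extract_entities(record)
for label, surface in entities:
    print(f"{label}: {surface}")
```

Checking offsets against the source text on load is a cheap way to catch misaligned annotations before they reach model training.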

ICD-10-CM & CPT Data

A medical NLP dataset annotated with ICD-10-CM and CPT codes for medical documents, designed to support automated medical coding and clinical text analysis.

  • Annotations applied to over 10,000 medical documents
  • Designed for model development integrated with major medical terminologies, including ICD‑10‑CM, CPT, SNOMED, UMLS, and RxNorm
  • JSON Format

POS Tagging Data

A dataset for NLP development featuring part‑of‑speech (POS) tagging applied to medical text, designed to support medical language processing and Document AI applications.

  • Annotations applied to more than 20,000 medical text samples
  • Includes detailed part-of-speech (POS) annotation data
  • JSON Format
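
A POS-tagged sample in JSON could be inspected along the following lines. This is a minimal sketch under stated assumptions: the field names (sample_id, tokens, text, pos) are hypothetical, and the tags use the Universal POS tag set purely for illustration, since the catalog does not specify which tag set the dataset uses.

```python
import json
from collections import Counter

# Hypothetical sample layout and tag set (Universal POS tags assumed);
# the real delivery schema is not described in the catalog.
SAMPLE = """
{
  "sample_id": "example-0001",
  "tokens": [
    {"text": "Patient", "pos": "NOUN"},
    {"text": "denies", "pos": "VERB"},
    {"text": "chest", "pos": "NOUN"},
    {"text": "pain", "pos": "NOUN"},
    {"text": ".", "pos": "PUNCT"}
  ]
}
"""


def tag_counts(sample):
    """Count how often each POS tag occurs in one sample."""
    return Counter(tok["pos"] for tok in sample["tokens"])


def words_with_tag(sample, tag):
    """Return the token texts carrying the given POS tag, in order."""
    return [tok["text"] for tok in sample["tokens"] if tok["pos"] == tag]


sample = json.loads(SAMPLE)
counts = tag_counts(sample)
nouns = words_with_tag(sample, "NOUN")
print(counts.most_common())
print(nouns)
```

Tag frequency summaries like this are a quick sanity check on tag distribution before using the samples to train or evaluate a tagger.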

Medical & Healthcare Data Catalog

What Are Machine Learning Datasets for Medical and Healthcare Applications?

Medical and healthcare datasets are collections of medical data used to train and evaluate machine learning models. They may include physician dictation audio, clinical history data, CT scan images, MRI images, POS tags, and NER annotations. Because medical data contains sensitive information, advanced data processing and strict data management practices are essential to ensure privacy and compliance.

We provide ready-to-use datasets that let you purchase only what you need from our existing data catalog, without launching a project from scratch. This approach lets you acquire high-quality medical AI training data quickly and cost-effectively, supporting efficient research and development.

Requirements Definition

We propose the most suitable solution based on your project objectives and budget.

Data Extraction Begins

Based on your requirements, such as language, duration, and number of speakers, we extract the appropriate dataset. For Off-the-Shelf data, extraction is typically completed within one to several days.

Delivery

The extracted data will be delivered via your specified platform or through our secure data transfer system.