1 Motivation
Last week, after realizing that I couldn’t show any of the projects I’ve been involved with for confidentiality reasons, I decided that it would be a great idea to build a quick project to put on display while I assemble a more solid portfolio. As a constraint (and also because, frankly, I don’t have a lot of free time these days), I chose to give myself only one weekend to work on it, from getting and cleaning the data to analyzing it and producing some results.
As for the data, I wanted to choose something relevant to the field I’m interested in (the intersection between Biostatistics and Data Science) and that would offer some challenge, instead of downloading a random dataset. After thinking hard about it, I settled on the following process: I researched Novartis’s main therapeutic areas of focus and downloaded information about hundreds of thousands of clinical trials related to these areas.
Accordingly, I’ve divided the remainder of the article into two parts:
- Data engineering, where I explain the strategy I’ve used to select, download and process the data.
- Data visualization, where I show different ways and methods of representing the relationship between different variables to help us better understand the data, discover insights and maybe even point to new questions to explore in the future.
If you’re not terribly interested in how the data was gathered or don’t have the time to read the whole thing, the data visualization section is the most noteworthy one!
The code used to build this website is available as a single Quarto document on GitHub. Additionally (and unrelated to this project), I’ve also made public a Shiny app I developed to deploy a machine learning model to predict if a patient has a specific type of cancer.
2 Data engineering
Novartis lists its main therapeutic areas on multiple websites. Although it’s difficult to summarize such a large offering of treatments, pharmaceuticals and fields of research, I think the following list is a good compromise for the purposes of this analysis:
- Cardio Metabolic
- Ophthalmology
- Respiratory
- Neuroscience
- Immunology and Dermatology
- Oncology
- Cell and Gene Therapy
- Tropical Diseases
Now that we have narrowed down the fields we’re interested in, we only need to get all the possible clinical trials related to them. To do so, we will use ClinicalTrials.gov, which in addition to being the largest clinical trials database in the world also has a useful and comprehensive API.
ClinicalTrials.gov’s API allows us not only to make a very refined search but also to select which fields from the clinical trial record to get information from.
The table below lists the search terms used to retrieve clinical trial records for each therapeutic area:
| Novartis’s therapeutic area | Search term used on ClinicalTrials.gov |
|---|---|
| Cardio Metabolic | Vascular Diseases OR Heart Diseases OR Kidney Diseases OR Liver Diseases |
| Ophthalmology | ophthalmology OR ophthalmic |
| Respiratory | Respiratory Tract Diseases OR copd OR chronic obstructive pulmonary disease OR severe asthma |
| Neuroscience | alzheimer OR parkinson OR multiple sclerosis OR epilepsy OR adhd |
| Immunology and Dermatology | Immune System Diseases OR Autoimmune Diseases OR Skin Diseases |
| Oncology | oncology |
| Cell and Gene Therapy | cell therapy OR gene therapy |
| Tropical Diseases | tropical AND (disease OR diseases) |
To further restrict the search to get relevant results, the following parts were added to the query:
- AREA[StudyType]Interventional: to get only clinical trials.
- AREA[StartDate]RANGE[01/01/2010, 31/12/2021]: to get clinical trials started between 2010 and 2021.
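To make the query construction concrete, here is a minimal Python sketch (the analysis itself was done in R) of how a search term and the two filters could be combined into one expression. The function name `build_expression` is hypothetical, and joining the parts with `AND` is an assumption; the date strings are copied as written in this article.

```python
def build_expression(search_term: str,
                     start: str = "01/01/2010",
                     end: str = "31/12/2021") -> str:
    """Combine a free-text search term with the two AREA filters.

    Joining the three parts with AND is an assumption made for this sketch.
    """
    return (
        f"({search_term}) "
        f"AND AREA[StudyType]Interventional "
        f"AND AREA[StartDate]RANGE[{start}, {end}]"
    )


# Search term taken from the table above.
expr = build_expression("cell therapy OR gene therapy")
print(expr)
```

Wrapping the search term in parentheses keeps its internal `OR`s from mixing with the added filters.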
ClinicalTrials.gov only returns information for up to 1000 studies per query, so I had to build a custom function on top of the API to iteratively get the maximum number of studies for each area. To avoid overwhelming their servers or hitting any request limits, I also built in a delay of 5 seconds per query. After waiting ~15 min, I finally had the raw data:
List of the clinical trial record’s fields queried:
- NCTId
- OfficialTitle
- BriefSummary
- Condition
- StudyType
- Phase
- Gender
- MinimumAge
- MaximumAge
- LeadSponsorName
- EnrollmentCount
- DesignInterventionModel
- DesignPrimaryPurpose
- ArmGroupInterventionName
- HealthyVolunteers
- LocationCountry
- StartDate
- CompletionDate
- WhyStopped
- RetractionPMID
| Therapeutic area | # of clinical trials |
|---|---|
| Cardio Metabolic | 36408 |
| Ophthalmology | 11625 |
| Respiratory | 20760 |
| Neuroscience | 7728 |
| Immunology and Dermatology | 28671 |
| Oncology | 54792 |
| Cell and Gene Therapy | 24960 |
| Tropical Diseases | 883 |
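The chunked download described above can be sketched roughly as follows, in Python rather than the R used for this article. This is not the actual function: `fetch_page` is a hypothetical callable standing in for a single API request, and the idea of requesting studies by 1-based rank windows is an assumption. Only the 1000-study limit, the 5-second delay and the counts in the table come from the article.

```python
import math
import time
from typing import Callable, Iterator, List, Tuple

PAGE_SIZE = 1000     # maximum studies returned per query (from the article)
DELAY_SECONDS = 5    # pause between queries (from the article)


def rank_windows(total: int, page_size: int = PAGE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first, last) ranks covering `total` studies."""
    for first in range(1, total + 1, page_size):
        yield first, min(first + page_size - 1, total)


def fetch_all(total: int,
              fetch_page: Callable[[int, int], List],
              delay: float = DELAY_SECONDS,
              sleep: Callable[[float], None] = time.sleep) -> List:
    """Call the (hypothetical) `fetch_page` once per window, pausing in between."""
    records: List = []
    for i, (first, last) in enumerate(rank_windows(total)):
        if i > 0:
            sleep(delay)  # be polite to the server between requests
        records.extend(fetch_page(first, last))
    return records


# Counts taken from the table above.
counts = {
    "Cardio Metabolic": 36408, "Ophthalmology": 11625, "Respiratory": 20760,
    "Neuroscience": 7728, "Immunology and Dermatology": 28671,
    "Oncology": 54792, "Cell and Gene Therapy": 24960, "Tropical Diseases": 883,
}
n_queries = sum(math.ceil(n / PAGE_SIZE) for n in counts.values())
print(n_queries)                                   # 188
print(round(n_queries * DELAY_SECONDS / 60, 1))    # 15.7 (minutes of waiting)
```

With these counts, the 188 queries at 5 seconds each add up to roughly 15.7 minutes, which is in line with the ~15 minutes mentioned above.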
It’s important to consider, however, that in some cases two different searches return the same clinical trial record (for example, a clinical trial about the effect of COVID-19 on the heart would be repeated two times, as it belongs to two different therapeutic areas). The number of unique studies retrieved is 127,712, which represents a remarkable 29.8% of ClinicalTrials.gov’s total of 428,103 registered studies.
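One way to handle these repeated records, sketched here in Python with made-up identifiers, is to key every record by its NCTId and remember which searches returned it. The function name and the toy records are hypothetical; only the NCTId field and the totals quoted above come from the article.

```python
from collections import defaultdict


def merge_by_nct_id(records_by_area):
    """Map each NCTId to the set of therapeutic areas whose search returned it."""
    areas_by_id = defaultdict(set)
    for area, records in records_by_area.items():
        for record in records:
            areas_by_id[record["NCTId"]].add(area)
    return dict(areas_by_id)


# Toy data with invented IDs: one study is returned by two searches.
toy = {
    "Cardio Metabolic": [{"NCTId": "NCT00000001"}, {"NCTId": "NCT00000002"}],
    "Respiratory": [{"NCTId": "NCT00000002"}, {"NCTId": "NCT00000003"}],
}
merged = merge_by_nct_id(toy)
print(len(merged))                     # 3 unique studies out of 4 records
print(sorted(merged["NCT00000002"]))   # ['Cardio Metabolic', 'Respiratory']

# Share of the registry, using the two totals quoted in the article.
print(round(100 * 127_712 / 428_103, 1))   # 29.8
```

Keeping the set of areas per study, rather than dropping duplicates outright, means a study can still be counted in every therapeutic area it belongs to.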
3 Data visualization
In this section I will explore different ways of summarizing and visualizing the information obtained in the previous section. The main tool used is ggplot2, the excellent R package for creating highly-customizable publication-ready graphics, but other, more interactive methods are also touched upon.
3.1 Number of clinical trials for each therapeutic area over time
The best way to get a good grasp on the kind of data that we have is to represent how the number of clinical trials evolves over time. This is a situation where using an interactive plot makes a lot of sense, because we may want to just zoom in on a particular time range, or represent the Y axis on the log scale to better appreciate the differences between the number of trials started in each therapeutic area.
The cool thing about the recently released Quarto publishing system is that we can use a lot of different data science-oriented programming languages and frameworks in the same document, like R, Python, D3.js or, in this case, Observable[1]:
We can already observe a trend in this plot that will be a constant throughout the analysis: the incredible effect of COVID-19 on the number of clinical trials in the Respiratory therapeutic area. No less remarkable is the dip experienced by almost every other therapeutic area, as a result of the trials that had to be paused or outright stopped during the pandemic.
3.1.1 Getting more detail with heatmaps
Heatmaps are also a great way of detecting trends for time-dependent variables at the month level. Thanks to the plotly graphing library we can also make them interactive:
However, depending on our needs we can create a non-interactive version of the same plot with customized aesthetics, ready to be downloaded in high quality:


Once again, we can make some tentative observations about these data:
- In 2020, the number of clinical trials for respiratory diseases follows the previous trend until April, when the number of newly started studies surges past any previous point since 2010. This higher level of activity then continues until the end of the year.
- Conversely, the overall number of clinical trials across all therapeutic areas falls at the start of the year, for the reasons stated above. Interestingly, this trend reverses during the summer, probably as a result of more clinical trials being launched to treat the effects of COVID-19 on organ systems other than the respiratory system.
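The month-level counts behind a heatmap like this can be sketched as follows, in Python rather than the R used for this article. The helper names are hypothetical, and the two date formats handled ("April 2020" and "April 3, 2020") are an assumption about how StartDate is written in the downloaded records.

```python
from collections import Counter
from datetime import datetime


def parse_start(text):
    """Return (year, month) from a start date such as 'April 2020' or 'April 3, 2020'."""
    for fmt in ("%B %d, %Y", "%B %Y"):   # assumed formats
        try:
            parsed = datetime.strptime(text, fmt)
            return parsed.year, parsed.month
        except ValueError:
            continue
    raise ValueError(f"Unrecognized start date: {text!r}")


def monthly_counts(start_dates):
    """Count studies per (year, month) cell."""
    return Counter(parse_start(text) for text in start_dates)


def year_row(counts, year):
    """One heatmap row: the 12 monthly counts for `year`, zero where empty."""
    return [counts.get((year, month), 0) for month in range(1, 13)]


# Invented dates, only to show the shape of the output.
toy_dates = ["March 2020", "April 3, 2020", "April 2020", "May 2020"]
row_2020 = year_row(monthly_counts(toy_dates), 2020)
print(row_2020)   # [0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0]
```

Filling empty months with zeros matters here: a heatmap with missing cells would hide exactly the kind of dip discussed above.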
3.2 Top 10 lead sponsors by therapeutic area
ClinicalTrials.gov also makes available the lead sponsor of each clinical trial, which lets us plot the top 10 sponsors by therapeutic area. Novartis consistently ranks in the top 5 sponsors[2], with the exception of Tropical Diseases.

3.3 Clinical trials around the world
ClinicalTrials.gov also provides a list of all the countries participating in each clinical trial. Unfortunately, we only get the names of the countries, not their coordinates. After some data wrangling in R, we are able to show an interactive choropleth map of all the countries which have participated in at least one clinical trial since 2010.
| Flag | Country | # of CTs |
|---|---|---|
| 🇺🇸 | United States | 49885 |
| 🇨🇳 | China | 13100 |
| 🇫🇷 | France | 11275 |
| 🇨🇦 | Canada | 9469 |
| 🇩🇪 | Germany | 8602 |
Unsurprisingly, the United States and China lead the list of countries with the highest number of clinical trials, followed by France, which has a strong pharmaceutical industry.
Click on each country to get a summary of its number of clinical trials.
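Part of the wrangling behind per-country totals like these can be sketched in Python (the article itself used R). The sketch assumes that a trial's country list may name the same country several times, for instance once per site, so each country is counted only once per trial. The function name and the toy data are hypothetical.

```python
from collections import Counter


def trials_per_country(trials):
    """Count, for each country, the number of trials it takes part in.

    `trials` is an iterable of country-name lists, one list per trial.
    Repeated names within a trial are collapsed before counting.
    """
    counts = Counter()
    for countries in trials:
        counts.update(set(countries))
    return counts


# Invented example: the first trial has two sites in the United States.
toy_trials = [
    ["United States", "United States", "Canada"],
    ["France"],
    ["United States", "France"],
]
country_counts = trials_per_country(toy_trials)
print(country_counts["United States"])   # 2
print(country_counts["Canada"])          # 1
```

Counting sites instead of trials would inflate the totals of countries that host many sites per study, which is why the names are collapsed into a set first.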
3.4 Minimum and maximum ages to participate in a study
There are several ways to visualize the relationship between the minimum and maximum ages to participate in a clinical trial. First, we can just plot the minimum vs. the maximum age, obtaining a useless but nonetheless intriguing graph:

The dots that fall on the dashed blue line correspond to studies where the minimum and maximum age are the same. This makes sense in studies with newborns and babies where the age is 0, but there are actually a handful of clinical trials for adults with the same minimum and maximum ages.
There is a shocking number of studies where the maximum age is higher than 120[3].
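Spotting these cases requires the ages as numbers. The Python sketch below (not the R code behind the plots) assumes that MinimumAge and MaximumAge arrive as text such as "18 Years" or "6 Months", and that a missing limit is empty or "N/A". The helper names and the 120-year threshold parameter are illustrative choices.

```python
# Number of each unit in one year; a year is approximated as 365.25 days.
UNITS_PER_YEAR = {
    "year": 1,
    "month": 12,
    "week": 365.25 / 7,
    "day": 365.25,
    "hour": 365.25 * 24,
    "minute": 365.25 * 24 * 60,
}


def age_in_years(text):
    """Convert text like '6 Months' to years; return None when no limit is given."""
    if not text or text.strip() in ("", "N/A"):
        return None
    value, unit = text.split()
    return float(value) / UNITS_PER_YEAR[unit.lower().rstrip("s")]


def is_implausible(max_age_text, limit_years=120):
    """Flag maximum ages above `limit_years`."""
    age = age_in_years(max_age_text)
    return age is not None and age > limit_years


print(age_in_years("18 Years"))     # 18.0
print(age_in_years("6 Months"))     # 0.5
print(is_implausible("130 Years"))  # True
```

Converting everything to years also puts infant studies, whose limits are often given in days or months, on the same axis as adult studies.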
On a more serious note, the ridgelines plot is the perfect way to visualize such a huge amount of data while preserving interpretability:

Looking at the minimum age distribution, we can appreciate two distinct peaks across all therapeutic areas: one corresponding to studies probably involving infants (or with a large gap between minimum and maximum ages) and the other to the standard 18 years of age, which is the age of majority/legal age in most countries. It’s also worth noting how the Neuroscience trials have a unique distribution of minimum ages starting at 50 years old, in accordance with how these diseases tend to develop and show later in life.
With respect to the maximum age, we can see more clearly which studies focus on infants. Again, the Neuroscience distribution paints a clear picture of the nature of these diseases: in general, they appear either early on (the 5-20 years crest) or later in life (the two peaks at the end).
3.5 Conditions/diseases
Another interesting variable is Condition, which is defined by ClinicalTrials.gov as:
The disease, disorder, syndrome, illness, or injury that is being studied. On ClinicalTrials.gov, conditions may also include other health-related issues, such as lifespan, quality of life, and health risks.
Each study record needs to list at least one condition, with some studies having up to 145 keywords for the variable. The first thing we can do is draw a graph similar to the lead sponsors’ one, where we plot the top 10 conditions in each therapeutic area. It is important to note that extensive data cleaning needs to be performed, because one condition can be written in more than one way.
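A very small version of that cleaning step, sketched in Python rather than the R used for this article, could look like the following. The alias table is hypothetical and far smaller than what real data needs; it only illustrates mapping several spellings to one canonical name before counting the most frequent conditions.

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical name.
ALIASES = {
    "diabetes mellitus, type 2": "type 2 diabetes",
    "t2dm": "type 2 diabetes",
}


def clean_condition(raw):
    """Collapse whitespace, ignore case, then apply the alias table."""
    key = " ".join(raw.split()).casefold()
    return ALIASES.get(key, key)


def top_conditions(studies, n=10):
    """Return the `n` most common conditions, counting each once per study."""
    counts = Counter()
    for conditions in studies:
        counts.update({clean_condition(c) for c in conditions})
    return counts.most_common(n)


# Invented studies: the same condition appears under four spellings.
toy_studies = [
    ["Type 2 Diabetes", "Obesity"],
    ["Diabetes Mellitus, Type 2"],
    ["T2DM", "type 2  diabetes"],
    ["Obesity"],
]
top_two = top_conditions(toy_studies, n=2)
print(top_two)   # [('type 2 diabetes', 3), ('obesity', 2)]
```

Using a set per study prevents a trial that lists two spellings of the same condition from being counted twice.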
3.5.1 Top 10 conditions in each therapeutic area

3.5.2 Enrollment distribution for the top 25 conditions by number of clinical trials
We can also study the enrollment numbers for the top 25 conditions, for example, by plotting all the data in the following way, creating a striking chart:

3.6 Phases of clinical trials and early stoppings
By plotting the percentage of studies in each phase we can get a rough estimate of the most difficult areas in which to develop new and effective treatments. Oncology and Cell and Gene Therapy have the lowest percentages of clinical trials that reach the final phase 4, which is actually in line with more complex estimations.
From ClinicalTrials.gov’s glossary:
There are five phases: Early Phase 1 (formerly listed as Phase 0), Phase 1, Phase 2, Phase 3, and Phase 4. Not Applicable is used to describe trials without FDA-defined phases, including trials of devices or behavioral interventions.

At the same time, these two therapeutic areas also have the highest percentages of early stoppings:
| Therapeutic area | % clinical trials stopped |
|---|---|
| Cardio Metabolic | 10.0% |
| Ophthalmology | 7.7% |
| Respiratory | 10.7% |
| Neuroscience | 8.4% |
| Immunology and Dermatology | 10.3% |
| Oncology | 11.6% |
| Cell and Gene Therapy | 13.0% |
| Tropical Diseases | 5.1% |
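Percentages like these could be computed along the following lines. This Python sketch assumes that a trial counts as stopped whenever its WhyStopped field is non-empty; the article does not state which rule was actually used, so treat that as an assumption. The function name and the toy values are hypothetical.

```python
def percent_stopped(why_stopped_values):
    """Percentage of trials with a non-empty WhyStopped entry (assumed rule)."""
    values = list(why_stopped_values)
    if not values:
        return 0.0
    stopped = sum(1 for value in values if value and value.strip())
    return 100 * stopped / len(values)


# Invented example: 2 of 10 trials give a reason for stopping.
toy_reasons = ["", "Slow accrual", None, "", "", "Sponsor decision", "", "", "", ""]
print(percent_stopped(toy_reasons))   # 20.0
```

Applied per therapeutic area, the same function would yield one row of the table above.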
4 Conclusion
Here’s a list, in no particular order, of the main takeaways I learned from this quick project:
- It has helped me develop a new understanding of how to access clinical trials data programmatically, which I will be able to reuse in future projects.
- The time constraint has been extremely useful, making me focus on the most effective way to summarize the variables that caught my interest. On the downside, I haven’t had enough time to explain each plot in as much detail as I would have liked, maybe by incorporating additional information from relevant papers.
- It has been difficult to strike the right balance between explaining the process and not being too verbose. My initial idea was to expand on how the code was developed and what “tricks” I used to get around some obstacles that arose along the way, but I feel like it would have made the article too long, to the detriment of the main objective: showing the data visualizations.
- Although I already knew about ridgeline plots, I had never actually used them before and was surprised at how effectively they convey the minimum and maximum ages in an intuitive way, and how easy they were to implement in ggplot2.
Footnotes
[1] Which is also developed by the creator of D3.js, Michael Bostock.
[2] Although it’s important to note that this variable requires extensive data cleaning, as the same sponsor can be written in different ways.
[3] For reference, the oldest person ever to have lived died at the age of 122.