Cells read the DNA within them to make useful products, such as proteins, through a process called gene expression. Gene expression datasets typically contain few patient samples but many thousands of genes per sample, and this imbalance makes it difficult to find or prioritize the changes in gene expression that differentiate cancer cells from healthy cells. Scientists refer to this challenge as the curse of dimensionality.
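To see why this imbalance causes trouble, here is a minimal sketch (not from the study, using randomly generated numbers and scikit-learn): with 20,000 "genes" but only 50 "patients", a model can fit even meaningless random labels almost perfectly.

```python
# Toy illustration of the curse of dimensionality (synthetic data, not the
# study's): with far more features than samples, a model can "perfectly"
# fit labels that carry no real signal at all.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_genes = 50, 20_000             # few patients, many genes
X = rng.normal(size=(n_samples, n_genes))   # random "expression" values
y = rng.integers(0, 2, size=n_samples)      # random cancer/healthy labels

model = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", model.score(X, y))  # near 1.0 despite pure noise
```

The model looks flawless on its own training data precisely because, with so many dimensions, it can always find some combination of features that separates any labeling.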
Machine learning techniques can model patterns within these large datasets and then classify samples as cancerous or not, but they introduce another barrier: clinicians hesitate to trust the results because they don’t understand how a machine learning model reached its conclusions. They call this the black box problem. Researchers therefore aim to develop methods that explain how a machine learning model makes its decisions.
A research team based across several African institutions focused on explaining breast cancer model predictions. They downloaded publicly available gene expression data from The Cancer Genome Atlas, a global public database, which contained almost 20,000 genes across 1,208 breast cancer samples. Their goal was to identify the few genes out of those 20,000 that could predict whether a tissue was cancerous.
First, the researchers reduced the data to 3,602 genes that showed differential expression between breast cancer and healthy cells. From there, they used an algorithm to test multiple gene combinations and selected the smallest group of genes that consistently produced good results. It’s like conducting thousands of small races with different runners to figure out which ones always come in first, even if they all eventually reach the finish line.
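The "many small races" idea can be sketched in code. This is a hedged illustration, not the team's actual algorithm (the summary doesn't name it): a model is trained on many random subsamples of synthetic data, and only the features that rank highly in almost every round are kept.

```python
# Hedged sketch of repeated "races" for feature selection; the paper's exact
# algorithm is not specified here. Synthetic data stands in for expression values.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=100, n_informative=5,
                           random_state=0)   # 100 "genes", only 5 truly informative

counts = np.zeros(X.shape[1])
rng = np.random.default_rng(0)
for _ in range(30):                                    # 30 "races"
    idx = rng.choice(len(y), size=150, replace=False)  # random subsample
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[idx], y[idx])
    winners = np.argsort(rf.feature_importances_)[-10:]  # this race's top 10
    counts[winners] += 1

consistent = np.flatnonzero(counts >= 20)  # features that win nearly every race
print("consistently selected features:", consistent)
```

Features that carry real signal keep "winning" regardless of which subsample they race on, while noise features place highly only by chance.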
Then, they trained and tuned multiple models using different machine learning techniques based on the expression data of the genes that the algorithm selected. They reported that all models performed well, correctly predicting cancer status at least 98% of the time. Next, they asked: “Which genes make the models work?” and “How do these genes influence the predictions?”
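A rough sketch of this multi-model step, using scikit-learn's small built-in breast cancer dataset and three common classifiers as stand-ins for whichever models the team actually trained:

```python
# Illustrative multi-model comparison (stand-in data and models, not the
# study's pipeline): train several model types and compare cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(random_state=0),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
}
accuracies = {}
for name, model in models.items():
    accuracies[name] = cross_val_score(model, X, y, cv=5).mean()  # 5-fold accuracy
    print(f"{name}: {accuracies[name]:.3f}")
```

Cross-validation, unlike the training-accuracy trap shown earlier, scores each model on samples it never saw during fitting.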
They employed 4 different statistical interpretation methods, known as feature importance techniques, to identify the genes that contributed most to the models’ predictions. The first showed how each model’s prediction changes with the expression level of a specific selected gene, and the second showed how multiple genes interact to drive the models’ decisions. The third quantified the overall influence of each gene on the models’ decisions, thereby providing a ranked importance, and the last assessed how well a single gene could predict breast cancer on its own.
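Two of these ideas, the overall ranked influence and the single-gene predictive power, can be sketched with generic tools from scikit-learn; this illustrates the general techniques, not the team's exact methods.

```python
# Illustration of two feature-importance ideas (stand-in data and methods):
# permutation importance (overall ranked influence) and single-feature AUC
# (how well one feature alone separates the classes).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Overall influence: shuffle one feature at a time and measure the accuracy drop
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranked = np.argsort(perm.importances_mean)[::-1]
print("most influential feature index:", ranked[0])

# Single-feature predictivity: use one feature's raw values as the "score"
auc = roc_auc_score(y_te, X_te[:, ranked[0]])
auc = max(auc, 1 - auc)   # the direction of the feature doesn't matter
print("single-feature AUC:", round(auc, 3))
```

The other two methods the team describes, per-gene effect curves and gene interactions, follow the same pattern of probing a trained model rather than retraining it.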
The researchers identified 7 genes that consistently appeared across all trained models and feature importance measures. They confirmed that all these genes had relevant biological functions that can influence cancer growth, like repairing damaged tissues, controlling the movement of materials in and out of the cell, and regulating how cells defend themselves.
The team noted that while different models tended to agree on the most important genes, the exact rankings and influence scores sometimes varied. They explained that with biological data, models often see different slices of the same reality, and therefore, better results come from combining viewpoints from multiple machine learning models rather than relying on a single one.
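Combining viewpoints can be as simple as averaging each gene's rank across models. A toy sketch with made-up rankings (the study's actual aggregation scheme isn't described here):

```python
# Toy consensus ranking: average each gene's rank across several models.
import numpy as np

# Hypothetical ranks: rankings[m][g] = rank model m gives gene g (0 = best)
rankings = np.array([
    [0, 1, 2, 3, 4],   # model A
    [1, 0, 3, 2, 4],   # model B
    [0, 2, 1, 4, 3],   # model C
])
consensus = rankings.mean(axis=0)   # average rank per gene across models
order = np.argsort(consensus)       # genes from most to least important
print("consensus order:", order)    # gene 0 ranks first overall
```

A gene that every model places near the top stays near the top of the consensus, even when no two models agree on the exact order.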
The researchers highlighted some limitations. The gene selection algorithm took nearly 6 hours on a powerful laptop, longer than they anticipated, so it may not scale efficiently to datasets larger than theirs. They also acknowledged that the algorithm might have omitted some important genes during its selection. And despite its large size, their dataset didn’t fully capture the diversity of breast cancer worldwide, so their models might not perform as well across all populations. The researchers concluded that pairing machine learning models with transparent, explainable techniques is the way forward for cancer prediction, because that transparency is what will let clinicians trust machine learning recommendations.
