What I Learned Applying ML to Real Neuroscience Data
When you learn machine learning in the classroom, you're usually given clean data. These datasets are often tailored to specific models, so you see tidy distributions and well-separated clusters, and the end-to-end process is as satisfying as fitting together the pieces of a puzzle. I wish that had been the case for the Motorola-funded research project at DePaul University. Instead of neatly fitting puzzle pieces, I was handed a box of pieces from mismatched sets. It was a complete slap in the face to everything I'd been taught.
The project's goal was to map how 13 brainstem nuclei in the reticular formation relate to orofacial motor behaviors: the movements behind vital functions like chewing, breathing, and swallowing. The current gap in our knowledge of this brain region limits our understanding of neurodevelopmental and neurodegenerative disorders such as dysphagia and Sudden Infant Death Syndrome (SIDS).
We used spatial gene expression data from the Allen Mouse Brain Atlas (AMBA). With this data, we applied various spatial clustering methods and dimensionality reduction to find patterns across these structures.
Here’s what I actually learned from this project.
Data retrieval is MESSY #
Now don't get me wrong, AMBA is a remarkable resource. It's a comprehensive atlas that houses gene expression data for mouse brains at several ages, with spatial data per volumetric pixel (voxel). The data is also general purpose, which means it most likely doesn't suit your needs out of the box.
Researchers and developers know that turning raw data into something humans can understand is one of the hardest parts of a project. In our case, AMBA's API returns data one gene at a time. That's fine if you're running models on an isolated gene, but we're talking about the entire brainstem. There are undoubtedly thousands of genes influencing the brainstem, and consequently, orofacial motor behaviors.
I never would've thought that the bottleneck in data retrieval would be API performance, but it makes perfect sense looking back. AMBA's API accepts a handful of parameters: which gene, the expression measurement (density, intensity, energy), the mouse's age, and so on. In my case, I needed to retrieve density measurements for the P56 (56-day-old) mouse. The data returned is a vector of N doubles representing gene density values normalized from 0 to 1, where N is the spatial resolution of the given mouse brain.
X = width (num. voxels)
Y = height (num. voxels)
Z = depth (num. voxels)
Spatial resolution (N) = X * Y * Z voxels

In our case, we were looking at a spatial resolution of roughly 160,000 voxels in total. That's 160,000 data points for a single gene. Now imagine 5,000 genes. I felt a cold shiver down my spine just thinking about it. That's 800 million data points.
Assuming one API call takes approximately 20 seconds (and it usually did), it would take me over 24 hours to retrieve data for all genes. From a research perspective, that doesn't sound like a lot of time. However, I was a college student at a university with little infrastructure for research. We didn't have powerful servers to offload expensive jobs like this. I did this on my MacBook.
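The back-of-the-envelope math above can be sketched in a few lines of Python. The numbers are the rough figures from this section, not exact measurements:

```python
# Rough figures from this project (illustrative, not measured precisely).
voxels_per_gene = 160_000      # approximate spatial resolution of the P56 grid
num_genes = 5_000              # rough count of genes of interest
seconds_per_request = 20       # typical observed API latency

total_points = voxels_per_gene * num_genes
total_hours = num_genes * seconds_per_request / 3600

print(f"{total_points:,} data points")  # 800,000,000 data points
print(f"~{total_hours:.1f} hours")      # ~27.8 hours
```

So "over 24 hours" is actually closer to 28 hours of wall-clock time if every request behaves.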
You may be thinking: "You could've spun up multiple threads to retrieve the data faster!" That's true, and I actually implemented a multi-threaded solution, until I hit a major setback the first time I ran it: RATE LIMITING.
You're probably thinking I came up with an elegant solution. Unfortunately, I didn't. I just let it run one request at a time. For 24 hours, my MacBook sat there like an oven, fans spinning for their lives.
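In hindsight, even the naive one-request-at-a-time loop could have handled rate limiting gracefully with a simple retry. Here's a minimal sketch of what that might look like. `fetch_fn` stands in for the real HTTP call to the atlas API, and `RateLimitError` is a hypothetical exception your HTTP layer would raise on a 429 response; neither is part of AMBA's actual client:

```python
import time

class RateLimitError(Exception):
    """Hypothetical: raised by fetch_fn when the API says to slow down."""

def fetch_gene_density(gene_id, fetch_fn, max_retries=3, backoff_s=30.0):
    """Fetch one gene's density vector, backing off when rate-limited."""
    for attempt in range(max_retries):
        try:
            return fetch_fn(gene_id)
        except RateLimitError:
            # Wait longer after each consecutive rate-limit response.
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError(f"gene {gene_id}: still rate-limited after {max_retries} tries")
```

Running requests sequentially with backoff like this is slow, but it keeps you under the server's limits without any coordination logic between threads.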