Services
I’m a former organic chemist turned data scientist, specialising in cheminformatics and AI-driven drug discovery. I provide consulting services to academic groups and drug discovery companies seeking to integrate AI and machine learning into their research and development. While I specialise in cheminformatics, I also provide data science consulting for other fields, particularly health and medical technology.

AI and ML solutions for drug discovery
QSAR modelling and ADMET profiling using shallow and deep learning models, de novo drug design, and methods for searching ultra-large chemical libraries.

Python programming and cheminformatics
Bespoke Python packages to suit your needs and automate workflows.

Deployment infrastructure and data engineering
Custom data storage solutions and deployment of predictive and generative models using cloud computing services.

Data science and statistical analysis
Data cleaning, visualisation, and predictive modelling to provide insights and support data-driven decision-making.

Training and development
Specialised training programmes to upskill chemists and laboratory researchers in data science and cheminformatics.

Hi, I’m Jon,
At Cambridge Cheminformatics Consulting, I specialise in applying data science and cheminformatics to tackle complex challenges in drug discovery, chemical research, and beyond. Whether it’s developing machine learning models, building automated data analysis pipelines, or designing and deploying scalable cloud-based solutions, I’m here to help you make the most of your data. With a decade of experience across industry and academia, I bring a deep understanding of chemical research combined with cutting-edge data science techniques. My aim is to provide practical, actionable solutions that streamline your workflows, drive innovation, and deliver measurable results for your organisation.
Blog
Working with large virtual chemical libraries: Part 1 – Active learning
Active learning is a machine learning method for searching large libraries when your scoring function is too computationally expensive to apply to every compound in the library.
Working with large virtual chemical libraries: Part 2 – Genetic algorithms
Genetic algorithms are inspired by natural selection and can be used to search large libraries quickly with a very simple algorithm.
TabPFN for chemical datasets
TabPFN (Tabular Prior-data Fitted Network) is a transformer-based foundation model for tabular data, pre-trained on millions of synthetic datasets to solve supervised learning tasks, and it reports state-of-the-art performance on benchmarks. But does it work for cheminformatics?