Cheminformatics and data science consulting services

Dr Jonathan Swain, Data Scientist and Cheminformatician

Services

I’m a former organic chemist turned data scientist, specialising in cheminformatics and AI-driven drug discovery. I provide consulting services to academic groups and drug discovery companies seeking to integrate AI and machine learning into their research and development. While I specialise in cheminformatics, I also provide data science consulting for other fields, particularly health and medical technology.

AI and ML solutions for drug discovery

QSAR modelling and ADMET profiling using shallow and deep learning models, de novo drug design, and methods for searching ultra-large chemical libraries.

Python programming and cheminformatics

Bespoke Python packages to suit your needs and automate workflows.

Deployment infrastructure and data engineering

Custom data storage solutions and deployment of predictive and generative models using cloud computing services.

Data science and statistical analysis

Data cleaning, visualisation, and predictive modelling to provide insights and support data-driven decision-making.

Training and development

Specialised training programmes to upskill chemists and laboratory researchers in data science and cheminformatics.

Get in Touch

Hi, I’m Jon,

At Cambridge Cheminformatics Consulting, I specialise in applying data science and cheminformatics to tackle complex challenges in drug discovery, chemical research, and beyond. Whether it’s developing machine learning models, building automated data analysis pipelines, or designing and deploying scalable cloud-based solutions, I’m here to help you make the most of your data. With a decade of experience across industry and academia, I bring a deep understanding of chemical research combined with cutting-edge data science techniques. My aim is to provide practical, actionable solutions that streamline your workflows, drive innovation, and deliver measurable results for your organisation.

About

Blog

Working with large virtual chemical libraries: Part 1 – Active learning

Active learning is a machine learning method for searching large libraries when the scoring function is too computationally expensive to evaluate on every compound in the library.
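As a rough illustration of the idea (a minimal sketch, not the method from the blog post), the loop below scores a small random sample, fits a cheap surrogate to those scores, and then spends the remaining scoring budget on the compounds the surrogate ranks highest. Everything here is a hypothetical stand-in: `expensive_score` is a toy function in place of something like docking, the "library" is a set of random 2D points rather than real molecules, and the surrogate is a simple nearest-neighbour average.

```python
import random


def expensive_score(x):
    """Toy stand-in for a costly scoring function (hypothetical); higher is better."""
    return -((x[0] - 0.7) ** 2) - (x[1] - 0.2) ** 2


def knn_predict(x, train, k=3):
    """Cheap surrogate: mean score of the k nearest already-scored points."""
    def sq_dist(item):
        point, _ = item
        return (point[0] - x[0]) ** 2 + (point[1] - x[1]) ** 2

    nearest = sorted(train, key=sq_dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)


def active_learning(library, n_init=20, batch_size=20, n_rounds=5, seed=0):
    rng = random.Random(seed)
    unlabelled = set(range(len(library)))
    labelled = {}  # library index -> true (expensive) score

    def label(indices):
        for i in indices:
            labelled[i] = expensive_score(library[i])
            unlabelled.discard(i)

    # Start from a small random sample of the library.
    label(rng.sample(sorted(unlabelled), n_init))

    for _ in range(n_rounds):
        train = [(library[i], s) for i, s in labelled.items()]
        # Greedy acquisition: score the compounds the surrogate likes best.
        ranked = sorted(
            unlabelled,
            key=lambda i: knn_predict(library[i], train),
            reverse=True,
        )
        label(ranked[:batch_size])
    return labelled


data_rng = random.Random(42)
library = [(data_rng.random(), data_rng.random()) for _ in range(2000)]
labelled = active_learning(library)
best_index = max(labelled, key=labelled.get)
print(f"Scored {len(labelled)} of {len(library)} compounds")
print("Best found:", library[best_index], labelled[best_index])
```

With the default settings, only 120 of the 2,000 entries are ever passed to the expensive function. In practice the surrogate would be a proper machine learning model trained on molecular features, and the acquisition step might also account for model uncertainty rather than being purely greedy.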

Working with large virtual chemical libraries: Part 2 – Genetic algorithms

Genetic algorithms are inspired by natural selection and can be used to search large libraries quickly using a very simple algorithm.
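To give a feel for how simple the core loop can be, here is a minimal sketch of a genetic algorithm (an illustration under toy assumptions, not the implementation from the blog post). Candidates are plain bit strings, and the hypothetical `fitness` function just counts matches against a hidden target; in a chemical setting the candidates would be molecules and the fitness a real scoring function. The selection, crossover, and mutation settings are arbitrary choices for the example.

```python
import random


def fitness(bits, target):
    """Toy score (hypothetical): number of positions that match the target."""
    return sum(b == t for b, t in zip(bits, target))


def genetic_search(target, pop_size=50, n_generations=40,
                   mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    n = len(target)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []  # best fitness seen at each generation

    for _ in range(n_generations):
        population.sort(key=lambda ind: fitness(ind, target), reverse=True)
        history.append(fitness(population[0], target))

        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [population[0][:]]          # elitism: carry the best over unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children

    best = max(population, key=lambda ind: fitness(ind, target))
    history.append(fitness(best, target))
    return best, history


target_rng = random.Random(1)
target = [target_rng.randint(0, 1) for _ in range(64)]
best, history = genetic_search(target)
print(f"Best fitness: {history[-1]} / {len(target)}")
```

Because the best individual is always copied into the next generation unchanged, the best score can never get worse from one generation to the next.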

TabPFN for chemical datasets

TabPFN (Tabular Prior-data Fitted Network) is a transformer-based foundation model for tabular data, pre-trained on millions of synthetic datasets to solve supervised learning tasks, with state-of-the-art performance on benchmarks. But does it work for cheminformatics?