AI-assisted data analysis: How to make AI wrangle data for you

Altuna Akalin
7 min readAug 2, 2024

--

Most people doing data analysis enjoy the challenge of uncovering insights. However, We often find ourselves bogged down in the tedious task of cleaning and merging messy spreadsheets and tables instead. Fingers ache, eyes blur, and each small error feels like a mountain to climb.

I often ask if is there a quicker and less painful way to do these things, a way that would let us focus on meaningful analysis quicker…

Why data wrangling is needed

Data wrangling, also known as data munging, is a crucial step in the data analysis process. The primary goal is to transform raw data into a more useful and structured format for further analysis. Raw data often comes from various sources and is typically unstructured, incomplete, or inconsistent. This data can include errors, duplicates, and irrelevant information that hinder effective analysis. By cleaning and transforming the data, you ensure that it is accurate, complete, and consistent. This is essential for generating reliable insights from downstream analysis. It is also somewhat boring if you have many such tasks and you have to write similar code from scratch every time.

What kind of tasks are included in data wrangling

Data wrangling encompasses a variety of tasks aimed at transforming raw data into a usable format. One of the primary tasks is data cleaning, which involves identifying and correcting errors, removing duplicates, filtering data based on statistical criteria and dealing with missing values.

Another critical task is data transformation, which involves converting data into a format suitable for analysis. This can include normalizing data, aggregating information, and converting data types.

Data integration is also a vital part of data wrangling, where data from different sources is combined into a single cohesive dataset. This task often requires matching records based on common fields and resolving discrepancies between different data sources.

Lastly, data wrangling could include data exporting. This might mean saving the cleaned and transformed data to a specific format or system for further analysis or sharing.

How can AI help?

Data wrangling can be tedious and repetitive. We can utilize LLM-driven frameworks models to quickly provide task adequate code and execution on your data. You simply need to type what you want to achieve on your dataset, rather than coding. We will be using the “mergenstudio” R package to demonstrate this concept. This package provides an LLM-driven framework on your Rstudio interface.

Data filtering

Let’s say you have some data that you want to filter based on some statistical measure. Writing the code for your analysis, performing the analysis and visualizing the results can be boring! Luckily, AI can help in both writing the analysis pipeline and visualization. Let’s look at an example of data filtering in gene expression data. We have some gene expression data from 6 different samples, across over 20,000 genes. We want to filter our data so that only the top 1000 most variable genes across samples are retained. With the help of mergenstudio this task becomes ridiculously easy! For this post, we assume you are already familiar with setting up your API information. If this is not the case, take a look at the GitHub help page (https://github.com/BIMSBbioinfo/mergenstudio). First, we select the directory where our data is at by clicking the button shown below:

After selecting the directory, we can start typing out our question. Important here is that we describe a bit what our datafile looks like, as to help the analysis along. After we give a brief description of our file, we ask can ask the LLM to perform the analysis for us. Here is the question we asked:

I have a file called leukemiaExp.txt . It holds gene expression data for different samples. Samples are on the columns, and genes are on the rows. The first column holds gene names. Could you write code to filter the data so that only the top 1000 most variable genes across samples are retained?

So the answer we got back is the following:

As we can see, the LLM wrote code to help us perform the task at hand, and even wrote some optional code to save the filtered data to a file. So just by describing our file, and the type of data filtering we want to do, we get back a full data filtering pipeline which we can now run!

Data transformation

Data filtering is not the only type of wrangling that we can do. Other forms of data transformation can involve processes such as normalization, scaling, or converting data into a different format. One common example is transforming raw gene expression data from RNA sequencing (RNA-seq) into a normalized form that can be used for downstream analyses. Below is an example question in which we ask the LLMs help with data transformation and filtering afterwards:

I have a file called “RNAseq_coloncancer.txt” which holds RNA-seq data from different patients with different types of colon cancer. The rows hold different genes, and the columns represent different samples. The file is seperated by ‘,’ . I first want to log transform the RNA expression data. Since I have so many genes, I want to remove genes which have no real variance across samples. I want to retain the rows when there is 85% of distinct values out of the total number of samples. How do I do this?

As we can see, we can even execute the response, and see what the output of our analysis looks like!

Merging tables

Let’s face it. Dealing with data which is spread across multiple files can be pretty annoying. Luckily, AI can come to the rescue once more! Let’s take our raw gene expression data from RNA sequencing (RNA-seq) data. Oftentimes, we have some extra files holding additional information about the different samples or genes. So what if we want an easy way to merge these tables, but don’t know how to go about this? We can ask something like the following:

I have a file called “RNAseq_coloncancer.txt” which holds RNA-seq data from different patients with different types of colon cancer. The rows hold different genes, and the columns represent different samples. The file is seperated by ‘,’. I have another file called “patient_subtype.txt” which holds different information about phenotype in the column “subtype”. Could you merge these tables?”

Basic plots after wrangling

Data visualization is a crucial aspect of data analysis. Transforming your data sets into visual formats such as charts, graphs, and maps allows for easier interpretation, and helps to uncover patterns, trends, and insights that might not be apparent in raw data. Through compelling visuals, data becomes more accessible and engaging, driving informed actions and strategies. However, designing visually appealing plots can be laborious. mergenstudio can help with this! Imagine this case:

I have a file called “cancer_expr_set.txt” which holds the gene expression information of patients with different subtypes of blood cancer. Rows represent different genes, and columns represent the different patient samples. The file is seperated by “,”. The first 3 letters of the column names indicate which type of cancer the patient has. I want to plot a heatmap, and annotate this heatmap with the different cancer subtypes to see if they cluster together. How do I do this?”

We can get the desired heatmap by simply asking this question into the chat, and executing the response as shown below:

Data output after filtering:

Sometimes, after performing data wrangling, we would like to export some files holding the results from our analysis. Luckily, this is something we can also ask our LLM. Let’s take our first data wrangling example and modify it a bit:

I have a file called leukemiaExp.txt . It holds gene expression data for different samples. Samples are on the columns, and genes are on the rows. The first column holds gene names. Could you write code to filter the data so that only the top 1000 most variable genes across samples are retained? After filtering I would like to save the information of the remaining genes in a txt file called leukemiaExp_filtered.txt.

If we now run this code and look into the selected directory, we will find the file called “leukemiaExp_filtered.txt” holding the results from the data filtering.

Try it and let us know how it went!!

Disclaimer: Although mergenstudio works with almost any LLM API (OpenAI, Anthropic, Mistral, and also local LLM APIs via ollama), for the examples above we used our API which is more performant than others for the data analysis tasks we care about. The performance you get from non-specialized LLM APIs might be different. If you want free access to our API, you can register your interest here.

--

--

Altuna Akalin

Bioinformatics Scientist, writing about data analysis, genomics, bioinformatics, and science. https://al2na.co