Analyzing SNP Data in Non-coding Regions

Published Dec 28, 2024
Updated Nov 15, 2025
5 minute read

Note

This is an old post, translated from the original by AI.

## Unexpected Uses of SNPs

I used to think SNP (Single Nucleotide Polymorphism) analysis was for things like investigating disease correlations in GWAS, but I recently learned through work that it is also very useful for QC of genome-edited cells.

For example, SNP data can be used to:

  • Pick natural variants as candidate changes when you want to modify a sequence
  • Detect that what was a single-cell clone has become mosaic, i.e., a heterogeneous population

In this way, looking at SNPs rather than just the GRCh38 (hg38) reference sequence reveals a lot. (Admittedly, a colleague taught me this...)

With some extra time at the end of the year, I looked into how to use the database information I had been curious about.

## Database Selection

There are several databases like the following, but dbSNP integrates data from all of them:

| Database | Data Type | Data Volume | Update Frequency | Access Method | Main Uses |
| --- | --- | --- | --- | --- | --- |
| dbSNP | SNPs, indels, microsatellites, etc. | 3.3 billion SNP records, 1.1 billion Reference SNP records | About every 3-4 months | Entrez SNP search tool, batch queries | Genome-wide association analysis, population genetics, evolutionary biology |
| ClinVar | Relationships between genetic variants and diseases | Over 825,000 records, over 500,000 variants | Weekly, with comprehensive monthly releases | E-utilities, Entrez Direct | Interpreting the clinical significance of variants, medical genetics |
| gnomAD | Exomes, genomes | 730,947 exome sequences, 76,215 whole-genome sequences (v4.1) | Each gnomAD release | gnomAD browser, AWS, Microsoft Azure, Google Cloud | Rare variant analysis, gene function research |
| 1000 Genomes Project | Genomes | Genome data from 2,504 individuals (Phase 3) | No updates | IGSR data portal, AWS, Ensembl, UCSC Genome Browser | Population genetics, human genetic diversity research |

### What is dbSNP?

dbSNP is a free public database that aggregates information about genetic variation in various species, developed and operated jointly by NCBI (National Center for Biotechnology Information) and NHGRI (National Human Genome Research Institute). Established in 1998 to accommodate the many common variants discovered by the Human Genome Project, it covers diverse molecular polymorphisms such as single nucleotide substitutions, short insertion/deletion polymorphisms, microsatellite markers, and retrotransposons.

Having this many entries is great, but a high false-positive rate has been pointed out. The criticism section of the Wikipedia article summarizes the published discussion nicely, so please refer to it.

## Using dbSNP to Get All SNP Data in a Specific Region

In dbSNP, you can search by specifying chromosome, start position, end position, etc.

*(Screenshot: the dbSNP search interface)*

Available special query fields include the following. Use them by placing the search term before the bracketed field name, as in 6[Chromosome]; a combined example is shown after the list.

Note that specifying a Gene Name targets only exons and introns, so for promoter sequences and regions further upstream you need to query by chromosome position.

  • [Chromosome]
  • 1000[Base Position] : 2000[Base Position]
  • [Gene Name]
  • [Validation Status]
  • [Function Class]
  • [SNP ID]
  • [SNP Class]
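
For example, fields can be combined with AND to pull all SNPs in a window; the coordinates below are made up for illustration, and snv restricts results to single-nucleotide variants:

```
6[Chromosome] AND 100000[Base Position] : 200000[Base Position] AND snv[SNP Class]
```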
*(Screenshot: the save options on the search results page)*

You can also export search results as TSV from the top right of the search screen. The export takes quite a while, so be patient. ChatGPT sometimes claims the same data can be downloaded via an API, but I haven't verified this.
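
For reference, if the bracketed field syntax carries over to NCBI E-utilities, a programmatic search might look like the sketch below, using Biopython's Entrez wrapper. Per the caveat above this is unverified; the email address and coordinates are placeholders.

```python
from Bio import Entrez  # pip install biopython

Entrez.email = "you@example.com"  # NCBI asks for a contact address

# Search the snp database, reusing the bracketed field syntax from the web UI
handle = Entrez.esearch(
    db="snp",
    term="6[Chromosome] AND 100000[Base Position] : 200000[Base Position]",
    retmax=100,
)
record = Entrez.read(handle)
handle.close()

print(record["Count"])       # total number of hits
print(record["IdList"][:5])  # rs IDs, returned without the "rs" prefix
```

Fetching the per-SNP records (e.g., with Entrez.efetch) would be a separate step; the TSV export described above remains the route I actually used.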

## Data Analysis Example

The output data looked like this. Each |-separated entry in the frequency column has the form allele:frequency:sample_count:dataset:

```
#chr	pos	variation	variant_type	snp_id	clinical_significance	validation_status	function_class	gene	frequency
6	187649	T>A,C,G	snv	1184		by-frequency;by-alfa;by-cluster	intron_variant;non_coding_transcript_variant	LOC285766;LOC105374869	T:0.066677:427:1000Genomes|T:0.078473:148:HapMap|T:0.035495:104:KOREAN|T:0.071429:4:Siberian|T:0.040873:57:TOMMO|T:0.088145:23331:TOPMED|T:0.109392:3359:ALFA
6	102893	A>G	snv	382895		by-frequency;by-alfa			G:0.:0:ALFA
6	103136	C>A	snv	462661		by-cluster
6	103505	T>C,G	snv	467420		by-cluster
6	165632	A>C,G,T	snv	719065		by-frequency;by-alfa;by-cluster			A:0.349781:2240:1000Genomes|A:0.293186:611:HGDP_Stanford|A:0.413136:780:HapMap|G:0.430034:1260:KOREAN|G:0.411572:754:Korea1K|G:0.323529:22:Qatari|A:0.:0:SGDP_PRJ|A:0.211538:11:Siberian|G:0.453075:12803:TOMMO|A:0.272368:72093:TOPMED|A:0.19265:6060:ALFA
```

Filtering for frequency values above a certain threshold would mitigate the false-positive problem mentioned earlier.

### Loading Data

The following is o1's code; let's use it to tidy the data.

```python
import pandas as pd
import numpy as np

# Load TSV
df = pd.read_csv("snp_result.txt", sep="\t", dtype=str)
# Exclude rows where the frequency column is NaN
df = df[~df["frequency"].isna()]

# Define a function to exclude frequency strings that are empty or contain only "0."
def has_valid_freq(freq_str):
    # Return False if freq_str is empty
    if not freq_str.strip():
        return False
    # True if at least one pipe-separated entry has a nonzero numeric frequency
    items = freq_str.split("|")
    for item in items:
        # Example: "T:0.066677:427:1000Genomes"
        parts = item.split(":")
        if len(parts) == 4:
            allele, freq_val, count, dataset = parts
            # Consider valid if freq_val is numeric and greater than 0
            try:
                val = float(freq_val)
                if val > 0.0:
                    return True
            except ValueError:
                pass
    return False

df = df[df["frequency"].apply(has_valid_freq)]

# Expand the frequency column into tidy (long) format
rows = []
for idx, row in df.iterrows():
    freq_entries = row["frequency"].split("|")
    for entry in freq_entries:
        parts = entry.split(":")
        if len(parts) == 4:
            allele, freq_val, count, dataset = parts
            # Skip if freq_val can't be converted to a number
            try:
                val = float(freq_val)
            except ValueError:
                continue
            # Skip if 0 or less
            if val <= 0.0:
                continue
            row_dict = {
                "#chr": row["#chr"],
                "pos": row["pos"],
                "variation": row["variation"],
                "variant_type": row["variant_type"],
                "snp_id": row["snp_id"],
                "clinical_significance": row["clinical_significance"],
                "validation_status": row["validation_status"],
                "function_class": row["function_class"],
                "gene": row["gene"],
                "allele": allele,
                "frequency": val,
                "sample_count": count,
                "dataset": dataset,
            }
            rows.append(row_dict)

tidy_df = pd.DataFrame(rows)

tidy_df
```

Now we have tidy, usable data:

 
```
  #chr     pos variation variant_type  snp_id  clinical_significance  \
0    6  187649   T>A,C,G          snv    1184                    NaN
1    6  187649   T>A,C,G          snv    1184                    NaN
2    6  187649   T>A,C,G          snv    1184                    NaN
3    6  187649   T>A,C,G          snv    1184                    NaN
4    6  187649   T>A,C,G          snv    1184                    NaN
5    6  187649   T>A,C,G          snv    1184                    NaN
6    6  187649   T>A,C,G          snv    1184                    NaN
7    6  165632   A>C,G,T          snv  719065                    NaN
8    6  165632   A>C,G,T          snv  719065                    NaN
9    6  165632   A>C,G,T          snv  719065                    NaN

                 validation_status  \
0  by-frequency;by-alfa;by-cluster
1  by-frequency;by-alfa;by-cluster
2  by-frequency;by-alfa;by-cluster
3  by-frequency;by-alfa;by-cluster
4  by-frequency;by-alfa;by-cluster
5  by-frequency;by-alfa;by-cluster
6  by-frequency;by-alfa;by-cluster
7  by-frequency;by-alfa;by-cluster
8  by-frequency;by-alfa;by-cluster
9  by-frequency;by-alfa;by-cluster

                                 function_class                    gene  \
...
6      T   0.109392         3359           ALFA
7      A   0.349781         2240    1000Genomes
8      A   0.293186          611  HGDP_Stanford
9      A   0.413136          780         HapMap
```
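
To apply the frequency cutoff mentioned earlier, one line is enough. The 0.01 threshold here is an arbitrary value for illustration:

```python
# Keep only entries whose allele frequency reaches the (arbitrary) cutoff
common_df = tidy_df[tidy_df["frequency"] >= 0.01]
```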

### Types of Datasets

It seems that, for a given SNP ID, measurement results from multiple datasets are stored. Let's see which datasets appear.

```python
tidy_df['dataset'].unique()
```

```
array(['1000Genomes', 'HapMap', 'KOREAN', 'Siberian', 'TOMMO', 'TOPMED',
       'ALFA', 'HGDP_Stanford', 'Korea1K', 'Qatari', 'Daghestan',
       'GnomAD', 'PAGE_STUDY', 'GENOME_DK', 'SGDP_PRJ', 'PRJEB36033',
       'NorthernSweden', 'GoNL', 'ALSPAC'], dtype=object)
```
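
As a quick check, you can also pull all per-dataset entries for a single SNP, for example rs1184 from the sample above (note that snp_id is stored as a string):

```python
# Frequency entries for one SNP across all datasets that measured it
tidy_df[tidy_df["snp_id"] == "1184"]
```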

In my experience, this kind of data often includes datasets with many missing values, some of which can be omitted. Let me check which datasets are sparse.

```python
import polars as pl

# Convert to a polars DataFrame
df_tidy = pl.DataFrame(tidy_df)

# Pivot to wide format - this operation is easier in polars
df_freq = df_tidy.select(
    [
        # '#chr',
        # 'pos',
        # 'variation',
        # 'variant_type',
        'snp_id',
        # 'function_class',
        'frequency',
        # 'sample_count',
        'dataset',
    ]
).pivot('dataset', index='snp_id', aggregate_function='first').fill_null(0.0).filter(
    (pl.col('TOMMO') > 0.0) | (pl.col('1000Genomes') > 0.0)
).head(20)

df_heat = df_freq.to_pandas().set_index('snp_id')
df_heat
```
 
 
```python
# Plot with PyComplexHeatmap
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['font.family'] = 'sans-serif'  # remove this line if the font is not installed
plt.rcParams['font.sans-serif'] = 'Arial'   # remove this line if Arial is not installed
plt.rcParams['pdf.fonttype'] = 42
import PyComplexHeatmap as pch

plt.figure(figsize=(5, 5))
pch.ClusterMapPlotter(data=df_heat, col_cluster=False, row_cluster=False,
                      show_colnames=True, show_rownames=True)
```
*(Figure: heatmap of allele frequencies per dataset for the selected SNPs)*
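
Rather than only eyeballing the heatmap, the same question can be answered numerically by counting nonzero entries per dataset column (zero stands in for "missing" here, since nulls were filled with 0.0 above):

```python
# Number of the selected SNPs that each dataset actually covers
coverage = (df_heat > 0).sum().sort_values(ascending=False)
print(coverage)
```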

My rough (and subjective) takeaways were as follows. This example used only a tiny portion of chromosome 6's SNPs, so a wider region would likely surface other datasets as well; it seems best to pick datasets according to the purpose.

Has lots of data, but frequency trends differ from other datasets:

  • 1000Genomes
  • HapMap
  • HapMap

Has solid data:

  • KOREAN
  • Siberian
  • TOMMO
  • ALFA
  • Korea1K

Slightly lacking data:

  • HGDP_Stanford
  • Qatari
  • GnomAD

Insufficient data for screening purposes:

  • Daghestan
  • PAGE_STUDY
  • GENOME_DK
  • SGDP_PRJ
  • PRJEB36033
  • NorthernSweden
  • GoNL
  • ALSPAC

Footnotes

  1. See the dbSNP article on Wikipedia.