Using pysradb as a Python API

Published Mar 14, 2025
Updated Nov 15, 2025
Note

This old post is translated by AI.

The other day I tried pysradb, a Python package that has been gaining attention as an efficient way to work with NCBI and GEO data. I experimented with a chain of accession conversions: obtaining the SRP that corresponds to a GSE ID, then extracting the SRR for each experiment (SRX) in that project. pysradb is published on GitHub and was originally designed as a command-line tool. Since I could find few examples of using it as a Python API, I looked closely at how to do that, then wrote and verified the code myself.

※ That said, pysradb is honestly quite slow, so if you need to process large-scale data quickly, you may want to consider other software.


## 1. Overview of pysradb

pysradb is a package for retrieving and converting metadata from public databases such as the NCBI Sequence Read Archive (SRA) and GEO. It can be used as a command-line tool, but it can also be called directly as an API from Python scripts, as in this post. In particular, converting a GSE (GEO Series) to an SRP (SRA project) and then obtaining the SRR (run) for each experiment in that SRP takes only a few calls to the functions and classes pysradb provides.
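To keep the four accession types straight, here is a minimal sketch that labels an ID by its prefix. The helper name `accession_kind` and the lookup table are my own illustration and are not part of pysradb; the sample IDs are placeholders, and the pattern deliberately covers only the prefixes used in this post.

```python
import re

# Prefix -> meaning for the accession types used in this post.
# This table and the function below are illustrative, not part of pysradb.
ACCESSION_KINDS = {
    "GSE": "GEO Series",
    "SRP": "SRA project (study)",
    "SRX": "SRA experiment",
    "SRR": "SRA run",
}

_ACCESSION_RE = re.compile(r"^(GSE|SRP|SRX|SRR)\d+$")


def accession_kind(accession: str) -> str:
    """Return a human-readable label for a GSE/SRP/SRX/SRR accession."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized accession: {accession!r}")
    return ACCESSION_KINDS[match.group(1)]


if __name__ == "__main__":
    # Placeholder IDs, chosen only to show the shape of each accession.
    for acc in ["GSE12345", "SRP000001", "SRX000002", "SRR000003"]:
        print(acc, "->", accession_kind(acc))
```

A check like this also catches a placeholder such as "GSEXXXXX" that was left in by mistake, before any slow network request is made.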


## 2. Conversion Process from GSE to SRP, and SRP to SRR

Below is example code that performs the conversion. It takes a GSE ID (e.g., "GSEXXXXX") as input to obtain the SRP information, then walks through each experiment (SRX) in the SRP and extracts the SRR information for it.

```python
from pysradb.sraweb import SRAweb

# Create the SRAweb object (pass an API key as an argument if you have one)
sra = SRAweb()

# ① Convert GSE to SRP
# gse_to_srp returns the SRP information as a DataFrame
# (detailed=True and expand_sample_attributes=True expand the detailed information)
srp_df = sra.gse_to_srp("GSEXXXXX", detailed=True, expand_sample_attributes=True)
print("【GSE -> SRP Results】")
print(srp_df)

# ② Get the metadata of each experiment from the SRP, then the SRR information for each SRX
for _, row in srp_df.iterrows():
    srp_id = row["study_accession"]
    print(f"\n【SRP: {srp_id}】Retrieving metadata...")

    # Get detailed experiment metadata from the SRP ID
    meta_df = sra.sra_metadata(srp_id, detailed=True, expand_sample_attributes=True)
    if meta_df is None or meta_df.empty:
        print("No matching metadata found.")
        continue

    # For each experiment (SRX), get the SRR information
    for exp in meta_df.to_dict("records"):
        srx = exp.get("experiment_accession")
        if not srx:
            continue
        srr_df = sra.srx_to_srr(srx, detailed=True, expand_sample_attributes=True)
        print(f"\nExperiment {srx} SRR information:")
        print(srr_df)

# Finally, release resources (SRAweb's close() is a dummy, but it is called anyway)
sra.close()
```

## 3. Code Explanation

### Obtaining SRP from GSE

  • gse_to_srp: Retrieves the SRP information for the specified GSE ID. Passing detailed=True and expand_sample_attributes=True also expands the more detailed sample attribute information, which is useful for downstream analysis.
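Because every SRP in the result triggers further (slow) requests, it can help to reduce the returned rows to distinct SRP IDs first. The following is a minimal sketch that works on plain records, as produced by `srp_df.to_dict("records")`. The helper name `unique_study_accessions` is hypothetical, and the sample rows are placeholders rather than real query results.

```python
from typing import Iterable, List, Mapping


def unique_study_accessions(records: Iterable[Mapping[str, object]]) -> List[str]:
    """Return distinct, non-empty study_accession values in first-seen order."""
    seen = set()
    ordered = []
    for record in records:
        srp = record.get("study_accession")
        # Skip missing values (None, NaN coming from pandas, empty strings).
        if not isinstance(srp, str) or not srp:
            continue
        if srp not in seen:
            seen.add(srp)
            ordered.append(srp)
    return ordered


# Hypothetical rows shaped like srp_df.to_dict("records"); the IDs are placeholders.
rows = [
    {"study_alias": "GSE00001", "study_accession": "SRP000001"},
    {"study_alias": "GSE00001", "study_accession": "SRP000001"},
    {"study_alias": "GSE00001", "study_accession": None},
]
print(unique_study_accessions(rows))  # ['SRP000001']
```

With the real client, the input would be `sra.gse_to_srp(...).to_dict("records")`, and the loop over SRPs would then run once per distinct project.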

### Obtaining Metadata from SRP

  • sra_metadata: Takes the SRP ID obtained above and retrieves the metadata of the related experiments (SRX) and runs (SRR). The resulting data frame contains experiment accessions, sample accessions, and detailed attribute information.
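Since this frame is described as covering both experiments and runs, the SRX to SRR mapping can probably be read from it directly, without one extra request per experiment. The sketch below assumes the records carry a `run_accession` column next to `experiment_accession`; the latter appears in the code above, while the former is an assumption worth confirming with `meta_df.columns`. The helper name `runs_by_experiment` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def runs_by_experiment(records: Iterable[Mapping[str, object]]) -> Dict[str, List[str]]:
    """Group run accessions (SRR) under their experiment accession (SRX)."""
    grouped = defaultdict(list)
    for record in records:
        srx = record.get("experiment_accession")
        # "run_accession" is an assumed column name; check meta_df.columns.
        srr = record.get("run_accession")
        # Ignore rows where either accession is missing or not a string (e.g. NaN).
        if not isinstance(srx, str) or not isinstance(srr, str):
            continue
        if srr not in grouped[srx]:
            grouped[srx].append(srr)
    return dict(grouped)


# Usage with the metadata frame from the post:
#   mapping = runs_by_experiment(meta_df.to_dict("records"))
```

If the assumption holds, this trades the per-SRX network calls for a single pass over data that has already been downloaded, which matters given how slow each request is.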

### Obtaining SRR from SRX

  • srx_to_srr: Uses each experiment (SRX) accession as a key to extract the corresponding SRR information. This shows which actual runs (SRR) belong to each experiment.
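The per-experiment lookup can be wrapped in a small function that skips repeated SRX IDs, so each experiment costs at most one request. This is only a sketch: `collect_runs` is a hypothetical helper, the client is passed in as an argument, and `run_accession` is assumed to be the column that holds the SRR IDs in the frame returned by `srx_to_srr`.

```python
from typing import Dict, Iterable, List


def collect_runs(sra, srx_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Call sra.srx_to_srr once per distinct SRX and collect the run accessions."""
    result: Dict[str, List[str]] = {}
    for srx in srx_ids:
        if not srx or srx in result:
            continue  # skip blanks and experiments already looked up
        srr_df = sra.srx_to_srr(srx)
        if srr_df is None or srr_df.empty:
            result[srx] = []
            continue
        # "run_accession" is an assumed column name; verify with srr_df.columns.
        rows = srr_df.to_dict("records")
        result[srx] = [
            row["run_accession"]
            for row in rows
            if isinstance(row.get("run_accession"), str)
        ]
    return result


# Usage with the real client (requires pysradb and network access):
#   from pysradb.sraweb import SRAweb
#   runs = collect_runs(SRAweb(), meta_df["experiment_accession"])
```

Taking the client as a parameter also makes the function easy to exercise with a stand-in object, without touching the network.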

## 4. Conclusion

I plan to publish the tool I built for this on GitHub eventually.

pysradb ran slower than I expected, so I will also try R packages.