Using pysradb through its Python API

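I recently tried out pysradb, a Python package that has been getting attention as an efficient way to work with NCBI and GEO data. The task was a chain of conversions: get the SRP that corresponds to a GSE ID, then pull the SRRs out of each experiment (via SRX).
pysradb is published in its GitHub repository and was designed primarily as a command-line tool. I had not seen many examples of using it as a Python API, so I looked into how to do that and verified it by writing and running actual code.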

Note: it is honestly quite slow, so if you need to process large amounts of data quickly, it may be worth considering other software as well.

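##1. Overview of pysradb

pysradb is a package for retrieving and converting metadata from public databases such as the NCBI Sequence Read Archive (SRA) and GEO.
It can be used as a command-line tool, but it can also be called directly as an API from a Python script, as I do here.
In particular, converting a GSE (GEO Series) to an SRP (SRA Project) and then getting the SRRs (Runs) of each experiment in that SRP takes only a few calls once the functions and classes provided by pysradb are combined.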
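
To make the accession types concrete, here is a minimal sketch that tells the four kinds of ID used in this post apart by their prefix. The helper name `accession_kind` is made up for illustration and is not part of pysradb, and it only covers the GSE/SRP/SRX/SRR prefixes discussed here.

```python
import re

# Prefix -> meaning, limited to the accession types used in this post.
_KINDS = {
    "GSE": "GEO Series",
    "SRP": "SRA Project",
    "SRX": "SRA Experiment",
    "SRR": "SRA Run",
}

# A known prefix followed by digits only.
_PATTERN = re.compile(r"^(GSE|SRP|SRX|SRR)\d+$")


def accession_kind(accession: str) -> str:
    """Return a label for the accession, or "unknown" if it is not recognised."""
    match = _PATTERN.match(accession.strip().upper())
    return _KINDS[match.group(1)] if match else "unknown"


if __name__ == "__main__":
    for acc in ["GSE12345", "SRP000001", "SRX000002", "SRR000003", "GSEXXXXX"]:
        print(acc, "->", accession_kind(acc))
```

A check like this is handy for catching a placeholder such as "GSEXXXXX" before it is sent off as a query.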

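##2. Converting GSE to SRP, and SRP to SRR

Below is a code example that actually performs the conversion.
It takes a GSE ID (e.g. "GSEXXXXX") as input and gets the SRP information, then walks through each experiment (SRX) in the SRP and extracts the SRR information for each one.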

from pysradb.sraweb import SRAweb

# Create the SRAweb object (pass an API key as an argument if you have one)
sra = SRAweb()

# (1) Convert the GSE to an SRP
# gse_to_srp returns the SRP information as a DataFrame
# (detailed=True and expand_sample_attributes=True ask for the expanded details)
srp_df = sra.gse_to_srp("GSEXXXXX", detailed=True, expand_sample_attributes=True)
print("[GSE -> SRP result]")
print(srp_df)

# (2) Fetch the metadata of each SRP, then get the SRR information for each SRX
for _, row in srp_df.iterrows():
    srp_id = row["study_accession"]
    print(f"\n[SRP: {srp_id}] fetching metadata...")

    # Fetch the detailed experiment metadata for the SRP ID
    meta_df = sra.sra_metadata(srp_id, detailed=True, expand_sample_attributes=True)
    if meta_df is None or meta_df.empty:
        print("No matching metadata was found.")
        continue

    # For each experiment (SRX), get the SRR information.
    # The same SRX can appear on several rows, so query each one only once.
    seen = set()
    for exp in meta_df.to_dict("records"):
        srx = exp.get("experiment_accession")
        if not srx or srx in seen:
            continue
        seen.add(srx)
        srr_df = sra.srx_to_srr(srx, detailed=True, expand_sample_attributes=True)
        print(f"\nSRR information for experiment {srx}:")
        print(srr_df)

# Finally, release resources (SRAweb's close() is a dummy, but call it anyway)
sra.close()
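
The same flow can also be wrapped in a function that returns rows instead of printing them. The sketch below is hypothetical: `collect_runs` is a made-up helper, and it assumes that the DataFrame returned by `sra_metadata` has a `run_accession` column next to `experiment_accession`, which is worth checking against your pysradb version. Because the runs are read straight from the SRP metadata, it does not need a separate lookup per SRX.

```python
def collect_runs(client, gse_id):
    """Return a list of (gse, srp, srx, srr) tuples for one GSE ID.

    `client` is anything that offers gse_to_srp() and sra_metadata(),
    for example an SRAweb instance.
    """
    rows = []
    srp_df = client.gse_to_srp(gse_id)
    if srp_df is None or srp_df.empty:
        return rows

    for srp in srp_df.to_dict("records"):
        srp_id = srp.get("study_accession")
        if not srp_id:
            continue
        meta_df = client.sra_metadata(srp_id)
        if meta_df is None or meta_df.empty:
            continue
        for rec in meta_df.to_dict("records"):
            srx = rec.get("experiment_accession")
            # Assumption: the run accession column is named "run_accession".
            srr = rec.get("run_accession")
            if srx and srr:
                rows.append((gse_id, srp_id, srx, srr))
    return rows
```

With pysradb this would be called as `collect_runs(SRAweb(), "GSEXXXXX")`. Passing the client in as an argument also makes it easy to try the function with a stand-in object that needs no network access.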

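##3. Code walkthrough

###Getting the SRP from a GSE

gse_to_srp
Retrieves the SRP information that corresponds to the given GSE ID.
Passing detailed=True and expand_sample_attributes=True also expands the more detailed sample attribute information, which is useful for later analysis.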

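###Getting metadata from the SRP

sra_metadata
Uses the SRP ID obtained above to retrieve the metadata of the related experiments (SRX) and runs (SRR).
The resulting DataFrame contains experiment accessions, sample accessions, and detailed attribute information.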
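
If the DataFrame has one row per run, an experiment with several runs shows up on several rows. As a rough sketch, the records can be grouped into a mapping from each SRX to its SRRs. `group_runs_by_experiment` is a hypothetical helper, the sample accessions are invented, and the `run_accession` key is an assumption about the column name, so adjust it if your output differs.

```python
from collections import defaultdict


def group_runs_by_experiment(records):
    """Map each experiment accession (SRX) to the list of its run accessions (SRR)."""
    grouped = defaultdict(list)
    for rec in records:
        srx = rec.get("experiment_accession")
        srr = rec.get("run_accession")  # assumed column name
        if srx and srr and srr not in grouped[srx]:
            grouped[srx].append(srr)
    return dict(grouped)


# Hand-written records standing in for meta_df.to_dict("records")
sample = [
    {"experiment_accession": "SRX0000001", "run_accession": "SRR0000001"},
    {"experiment_accession": "SRX0000001", "run_accession": "SRR0000002"},
    {"experiment_accession": "SRX0000002", "run_accession": "SRR0000003"},
]
print(group_runs_by_experiment(sample))
```

Working on plain records keeps the helper independent of pandas, so it can be tried without fetching anything.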

###Getting SRRs from an SRX

srx_to_srr
Uses each experiment (SRX) accession as the key to extract the corresponding SRR information.
This shows which runs (SRR) actually belong to each experiment.
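
Because every lookup is a separate query and the tool is slow, it can help to make sure the same SRX is never queried twice. One way to sketch that is a small cache around whatever lookup function is used. `cached_lookup` is a made-up helper that relies only on the standard library, and the fake lookup below merely stands in for a call such as `sra.srx_to_srr`.

```python
from functools import lru_cache


def cached_lookup(lookup):
    """Wrap a one-argument lookup so repeated accessions reuse the first result."""

    @lru_cache(maxsize=None)
    def wrapper(accession):
        return lookup(accession)

    return wrapper


if __name__ == "__main__":
    calls = []

    def fake_srx_to_srr(srx):
        # Stand-in for a slow network call; records each real execution.
        calls.append(srx)
        return f"runs of {srx}"

    get_runs = cached_lookup(fake_srx_to_srr)
    for srx in ["SRX0000001", "SRX0000002", "SRX0000001"]:
        get_runs(srx)
    # The repeated SRX0000001 is served from the cache.
    print(f"real lookups: {len(calls)}")
```

With pysradb the wrapper could be created as `cached_lookup(lambda srx: sra.srx_to_srr(srx, detailed=True))`. The cached object is shared between callers, so copy a returned DataFrame before modifying it.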

##4. Conclusion

I plan to publish the tool I built for this on GitHub at some point.

It turned out slower than I expected, so I will try an R package as well.

