
DiffuCpG

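Keywords: Probabilistic logic, Generalization, Autoregressive model, Computer science, Decoding methods, Diffusion, Algorithm, Langevin equation, Applied mathematics, Artificial intelligence, Mathematics, Statistical physics, Statistics, Mathematical analysis, Physics, Thermodynamics
Authors
Jonathan Ho, Ajay N. Jain, Pieter Abbeel
Source
Journal: Cornell University - arXiv  Cited by: 5488
Identifier
DOI: 10.5281/zenodo.14622374
Abstract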

DiffuCpG

1. Introduction

In this study, we used a generative AI diffusion model to address missing methylation data. We trained the model with Whole-Genome Bisulfite Sequencing data from 26 acute myeloid leukemia samples and validated it with Reduced Representation Bisulfite Sequencing data from 93 myelodysplastic syndrome and 13 normal samples. Additional testing included data from the Illumina 450k methylation array and Single-Cell Reduced Representation Bisulfite Sequencing on HepG2 cells. Our model, DiffuCpG, outperformed previous methods by integrating a broader range of genomic features, utilizing both short- and long-range interactions without increasing input complexity. It demonstrated superior accuracy, scalability, and versatility across various tissues, diseases, and technologies, providing predictions in both binary and continuous methylation states.

In this repository, we deposit the code used to build the diffusion models along with the example datasets needed to train and test a diffusion model for methylation imputation.

Docker Usage

Install Docker

Install Docker using the following link: https://docs.docker.com/engine/install/
Recommended system specs: Debian 12 (bookworm) with 16 GB of RAM or more.
Make sure you have the latest Nvidia GPU driver installed and that Docker can access your Nvidia GPU.

Run Docker images with Tissue-specific Models

    docker pull yay135/diffucpg_tss

Use our example to generate input samples with Hi-C matrix and CIS (Confidence Interval Cross Sample) data:

    docker run -it yay135/diffucpg_tss

then

    python generate_train_test_samples.py

The tissue-specific models (PyTorch) are for CD34+ cells, GBM, and BRCA. They are stored in folders named "model*" in the image.
Run the tissue-specific models:

    docker run -it yay135/diffucpg_tss

then

    python batch_run.py

Run Docker images with Example Models

    docker pull yay135/diffucpg

If you do not have a GPU-enabled system, pull the CPU-only image:

    docker pull yay135/diffucpg_cpu

Prepare your input data directory. Use the following command to print an example input data directory:

    docker run --rm yay135/diffucpg -e true

Assume your data directory is named "input_data".

In Windows:

    docker run --gpus all -v .\input_data\:/data --rm yay135/diffucpg

In Unix or Linux:

    docker run --gpus all -v ./input_data:/data --rm yay135/diffucpg

Other docker options

-d or --device : select which CUDA device to run with, default is 0
-m or --mingcpg : scan your methylation array and impute only windows with at least m non-missing methylation values, default is m=10
-o or --overlap : set the number of imputation epochs; window locations are shifted between epochs and the imputed values are averaged for each CpG location, default is 2

Example:

    docker run --gpus all -v ./input_data:/data --rm yay135/diffucpg -d 1 -m 5 -o 3

This uses CUDA device 1, requires at least 5 non-missing methylation values in a window, and runs 3 overlap epochs.

The following tutorials are for non-Docker usage.

2. Data and Models

Example datasets are available for download using "gdown.sh". The example datasets only contain WGBS methylation data. The model is a DDPM diffusion model; the repository contains a complete implementation for 1-dimensional input. Please refer to https://arxiv.org/abs/2006.11239 and https://huggingface.co/blog/annotated-diffusion for more details.

3. How to use

3.1 System Requirements

The number of steps in the diffusion process is set to 2000. Imputing a sample requires 2000 steps. GPU acceleration is preferred. 16 GB of RAM is required.
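To give a feel for what a 2000-step diffusion process involves, the sketch below builds a noise schedule and applies the closed-form DDPM forward (noising) step to a toy 1-dimensional signal. It is an illustration only and not the repository's code: the linear schedule and its endpoints (1e-4 and 0.02) follow the DDPM paper's convention and are assumptions here, as are the function names.

```python
import math
import random

T = 2000  # number of diffusion steps stated above


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (endpoints are assumed, not from the repository)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) up to each step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


betas = linear_beta_schedule(T)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

x0 = [0.0, 1.0, 0.5, 1.0]                      # toy methylation values
x_early = q_sample(x0, 0, alpha_bars, rng)     # almost identical to x0
x_late = q_sample(x0, T - 1, alpha_bars, rng)  # close to pure Gaussian noise
```

Imputation runs this process in reverse, one denoising step per diffusion step, which is why each sample costs 2000 model evaluations and why a GPU helps.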
The code is fully tested and operational on the following platform:

    Distributor ID: Debian
    Description: Debian GNU/Linux 12 (bookworm)
    Release: 12
    Codename: bookworm

3.2 Clone the Current Project

Run the following command to clone the project.

    git clone https://github.com/yay135/DiffuCpG.git

3.3 Configure Environment

Make sure you have the following software installed in your system:

Python 3.9+
Pytorch 2.0.1+

3.4 Run Training and Testing

    python run.py

The script will download necessary data and install dependencies automatically.

4. Data and Script Details

4.1 Raw Data

The downloaded methylation arrays are in the folder "raw"; each file is one methylation array. The first 2 columns are "chromosome" and "location". The assembly used for mapping in our project is the "GRCh37 primary assembly". It is also downloaded automatically. The remaining columns in each file are methylation levels (required) and other biological data (optional) that you wish to incorporate to enhance the model. The files in the "raw" folder are the initial inputs for the pipeline. If you wish to use your own data, it must be formatted this way before running the pipeline.

4.2 Generate Sample

Use the script "generate_samples.py" to generate samples for training and testing.

The model cannot directly read and impute a methylation array file. Instead, each methylation array is divided into windows, each 1 kb (1000 base pairs) in length, and each training or testing sample is generated from one window. Each sample contains at least 5 channels. The first 4 are the one-hot encoding of the sequence, and the 5th is the methylation data. If a base pair location is not a CpG location, its methylation value is "-1". If a CpG's methylation data is missing or waiting for imputation, its value is also "-1". Other biological data can be added as extra channels.
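The channel layout described above can be sketched as follows. This is a minimal illustration, not the code in "generate_samples.py": the function name, the input format (a sequence string plus a dictionary of measured CpG levels), and the A, C, G, T channel order are all assumptions.

```python
BASES = "ACGT"   # assumed channel order for the one-hot encoding
MISSING = -1.0   # value used for non-CpG positions and for missing CpGs


def encode_window(sequence, methylation):
    """Build a 5 x len(sequence) sample as a list of channels.

    sequence    : DNA string for the window (hypothetical input format)
    methylation : dict mapping 0-based offsets of CpG cytosines to a level
                  in [0, 1]; CpGs absent from the dict count as missing
    """
    seq = sequence.upper()
    # Channels 0-3: one-hot encoding (all zeros for N or other symbols).
    channels = [[1.0 if base == b else 0.0 for base in seq] for b in BASES]
    # Channel 4: methylation level where measured, -1 everywhere else.
    meth = [MISSING] * len(seq)
    for offset, level in methylation.items():
        if seq[offset:offset + 2] != "CG":
            raise ValueError(f"offset {offset} is not a CpG site")
        meth[offset] = float(level)
    channels.append(meth)
    return channels


# Toy 8 bp window with CpGs at offsets 1 and 5; only the first is measured.
sample = encode_window("ACGTTCGA", {1: 0.8})
```

In this toy window the unmeasured CpG at offset 5 and every non-CpG position share the value -1 in the fifth channel, as described above. Extra biological tracks would be appended as further channels of the same length.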
Check out the example raw files in the folder "raw" to form your own datasets for training and testing sample generation. For each raw file in the "raw" folder, the first 3 columns are chr, loc, and methylation. The remaining columns are treated as additional channels and will be added to each sample during generation.

'-d' or '--folder' : specify the raw data folder
'-i' or '--index' : which column in a raw file is the methylation array
'-t' or '--tol' : how many missing methylation values are tolerated (we recommend 0 for generating training samples and -1 for generating testing samples; 0 forces the script to select only windows with no missing values, -1 tolerates as many missing values as possible)
'-c' or '--chr' : limit which chromosome to use, default is "chr#" to use all chromosomes
'-w' or '--winsize' : what window size to use, default is 1000
'-m' or '--mincpg' : require each generated window to have a minimum number of CpGs, default is 10
'-n' or '--nsample' : number of samples to generate per chromosome
'-p' or '--output' : samples output folder, default is "out"

Use the script "generate_samples_concat.py" to generate samples from long-range interacting windows such as Hi-C interactions or computed correlations. Check out the example long-range file in the folder "data" to form your own long-range interacting windows for sample generation and concatenation.
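The way the '--winsize', '--mincpg', and '--tol' options of "generate_samples.py" interact can be pictured with the filter below. It is a hedged sketch of the selection rule as described in the option list, not the script's actual logic: the function name, the input format, and the choice to align windows on multiples of the window size are assumptions.

```python
def select_windows(records, winsize=1000, mincpg=10, tol=0):
    """Pick candidate windows on one chromosome.

    records : list of (location, level) pairs for CpG sites, where level
              is None when the methylation value is missing (assumed format)
    tol     : maximum number of missing values allowed per window;
              a negative value disables the check
    Returns a list of (window_start, n_cpg, n_missing) tuples.
    """
    windows = {}
    for loc, level in records:
        # Assumption: windows are aligned on multiples of winsize.
        windows.setdefault(loc // winsize, []).append(level)

    selected = []
    for index in sorted(windows):
        levels = windows[index]
        missing = sum(1 for v in levels if v is None)
        if len(levels) < mincpg:
            continue  # too few CpGs in this window
        if tol >= 0 and missing > tol:
            continue  # too many missing values for this tolerance
        selected.append((index * winsize, len(levels), missing))
    return selected


# Toy data with a 10 bp window so the example stays small.
records = [(1, 0.9), (4, 0.1), (7, None), (12, 0.5), (15, 0.4), (23, 0.2)]
train_windows = select_windows(records, winsize=10, mincpg=2, tol=0)
test_windows = select_windows(records, winsize=10, mincpg=2, tol=-1)
```

With tol=0 only the fully observed window starting at 10 is kept, which suits training; with tol=-1 the window starting at 0, which has one missing value, is kept as well, which suits testing and imputation.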
4.3 Training Script

Use diffusion.py to train and test a DDPM model using the generated samples.

'-t' or '--train_folder' : the folder containing the training samples
'-f' or '--model_folder' : the model folder, will be created if it does not exist
'-w' or '--win_size' : window size of each sample, default is 1000
'-c' or '--channel' : channel size of each sample
'-d' or '--cuda_device' : if you have multiple CUDA GPUs, select which GPU to use, default is 0
'-e' or '--epoch' : how many epochs for training, default is 2000
'-s' or '--earlystop' : whether to use "early stopping" during training, default is False
'-p' or '--patience' : patience for early stopping, default is 10

4.4 Imputation

Use diffusion_inpainting.py to perform imputation on generated samples.

'-t' or '--test_folder' : the folder containing samples for imputation
'-o' or '--out_folder' : imputed output folder name, default is "inpainting_out"
'-w' or '--win_size' : window size of each sample, default is 1000
'-c' or '--channel' : channel size of each sample
'-d' or '--cuda_device' : if you have multiple CUDA GPUs, select which GPU to use, default is 0

Team

If you have any questions or concerns about the project, please contact the following team member:
Fengyao Yan fxy134@miami.edu
