Automated data processing and feature engineering for deep learning and big data applications: a survey

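Computer science, Artificial intelligence, Machine learning, Data preprocessing, Data processing, Big data, Pipeline (software), Raw data, Feature engineering, Automation, Deep learning, Data mining, Database, Engineering, Mechanical engineering, Programming language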
Authors
Alhassan Mumuni, Fuseini Mumuni
Identifier
DOI:10.1016/j.jiixd.2024.01.002
Abstract

The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data. This approach has achieved impressive results and has contributed significantly to the progress of AI, particularly in supervised deep learning. It has also simplified the design of machine learning systems, as the learning process is highly automated. However, not all data processing tasks in conventional deep learning pipelines have been automated. In most cases, data must be manually collected, preprocessed, and further extended through data augmentation before they can be used effectively for training. Recently, special techniques for automating these tasks have emerged. The automation of data processing tasks is driven by the need to utilize large volumes of complex, heterogeneous data for machine learning and big data applications. Today, end-to-end automated data processing systems based on automated machine learning (AutoML) techniques are capable of taking raw data and transforming them into useful features for big data tasks by automating all intermediate processing stages. In this work, we present a thorough review of approaches for automating data processing tasks in deep learning pipelines. These include automated data preprocessing (e.g., data cleaning, labeling, missing data imputation, and categorical data encoding), data augmentation (including synthetic data generation using generative AI methods), and feature engineering (specifically, automated feature extraction, feature construction, and feature selection). In addition to automating specific data processing tasks, we discuss the use of AutoML methods and tools to simultaneously optimize all stages of the machine learning pipeline.
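
As a rough illustration of three of the preprocessing and feature engineering steps named in the abstract (missing data imputation, categorical data encoding, and feature selection), the following is a minimal sketch using only the Python standard library. It is not taken from the survey: the function names, the toy table, and the choice of mean imputation, one-hot encoding, and a variance threshold are all assumptions made for illustration. AutoML tools of the kind the abstract describes would search over such choices rather than fix them by hand.

```python
from statistics import mean, pvariance


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def one_hot(column):
    """Encode a categorical column as one 0/1 feature per category."""
    categories = sorted(set(column))
    return {f"is_{c}": [1 if v == c else 0 for v in column] for c in categories}


def select_by_variance(features, threshold=0.0):
    """Keep only features whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) > threshold
    }


# Hypothetical toy table: one numeric column with a gap, one categorical
# column, and one constant column that carries no information.
raw = {
    "age": [30, None, 50, 40],
    "city": ["A", "B", "A", "A"],
    "constant": [1, 1, 1, 1],
}

features = {"age": impute_mean(raw["age"]), "constant": raw["constant"]}
features.update(one_hot(raw["city"]))
selected = select_by_variance(features)

print(sorted(selected))
```

Here the constant column is dropped because its variance is zero, while the imputed numeric column and the two one-hot columns are kept.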
