Andrew Ng Announces The Launch Of NeurIPS Data-Centric AI Workshop

If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of a machine learning team

DeepLearning.AI’s Andrew Ng recently announced the launch of the NeurIPS Data-Centric AI workshop. The workshop is expected to showcase some of the best academic research work related to data-centric AI. Academic researchers and practitioners can submit their research papers on or before September 30, 2021.

The organising committee includes Google Research’s Lora Aroyo, Stanford University professor Cody Coleman, Landing AI’s Greg Diamos, Harvard University professor Vijay Janapa Reddi, Eindhoven University of Technology researcher Joaquin Vanschoren, and Google’s machine learning product manager, Sharon Zhou.

What is Data-Centric AI

Data-Centric AI, or DCAI, represents the recent shift in focus from modelling to the underlying data used to train and evaluate models. DCAI aims to address the gap in tooling, best practices, and infrastructure for managing data in modern ML systems. It also looks to offer high-productivity, efficient open data engineering tools that make building, maintaining, and evaluating datasets cost-effective and seamless.

With this event, the team aims to cultivate the DCAI community into a vibrant interdisciplinary field and tackle practical data problems, including data collection/generation, data labelling, data preprocessing/augmentation, data quality evaluation, data debt, and data governance. The team believes that many of these areas are still in their early stages and hopes to bridge the gaps by bringing the ML community together.

Call for Papers 

The journey of building and using datasets for AI systems is often artisanal: painstaking and expensive. The ML community lacks high-productivity, efficient open data engineering tools. Accelerating dataset creation and iteration, while increasing the efficiency of use and reuse by democratising data engineering and evaluation, remains a core challenge.

“If 80 per cent of machine learning work is data preparation, then ensuring data quality is the most important work of an ML team and therefore a vital research area,” said the NeurIPS DCAI team. Further, they said human-labelled data has increasingly become the fuel and compass of AI-based software systems, while innovative efforts have mostly focused on models and code.

However, in recent years, the increased focus on the scale, speed, and cost of building and improving datasets has, in turn, affected their quality. Major research work in this area includes ‘Response-based Learning for Grounded Machine Translation,’ ‘Crowdsourcing with Fairness, Diversity and Budget Constraints,’ ‘Excavating AI,’ ‘Data Excellence: Better Data for Better AI,’ and ‘State of the Art: Reproducibility in Artificial Intelligence,’ among others.

“We need a framework for excellence in ‘data engineering’ that does not yet exist,” said the NeurIPS DCAI team, and noted that aspects like maintainability, reproducibility, reliability, validity and fidelity of datasets are often overlooked when releasing the dataset into the market. In this event, the team plans to highlight examples, case studies, and methodologies for excellence in data collection. 

The NeurIPS DCAI team said that building an active research community focused on data-centric AI is critical for defining the core problems and creating ways to measure progress in machine learning through data quality tasks.

Topics

Interested researchers can submit papers on topics that include, but are not limited to, the following:

New datasets in areas such as:

  • Speech, vision, manufacturing, medical, recommendation/personalisation 
  • Science 

Tools and methodologies that:

  • Quantify and accelerate time to source high-quality data 
  • Ensure data is labelled consistently, such as label consensus 
  • Improve data quality more systematically
  • Automate the generation of high quality supervised learning training data from low-quality resources, such as forced alignment in speech recognition 
  • Produce uniform and low noise data samples, or remove labelling noise or inconsistencies from existing data 
  • Control what goes into the dataset and make high-level edits efficiently to very large datasets, like adding new words, languages, etc. 
  • Provide search techniques for finding suitably licensed datasets based on public resources
  • Create training datasets for small data problems or rare classes in the long tail of big data problems 
  • Incorporate timely feedback from production systems into datasets 
  • Understand dataset coverage of important classes and edit datasets to cover newly identified important cases
  • Import datasets by allowing easy combination and composition of existing datasets
  • Export datasets by making the data consumable for models and able to interface with model training and inference systems, such as WebDataset
  • Enable composition of dataset tools like MLCube, Docker, and Airflow
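
As an illustration of the label consensus topic listed above, the following is a minimal sketch of majority-vote consensus across annotators. It is not taken from the workshop materials. The function name `consensus_label`, the `min_agreement` threshold, and the sample annotations are all hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return (label, agreement) for the majority label.

    If the share of annotators backing the top label does not exceed
    min_agreement, the label is None so the item can be sent for review.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement <= min_agreement:
        return None, agreement
    return label, agreement


# Hypothetical annotations: item id -> labels from three annotators
annotations = {
    "clip_001": ["cat", "cat", "dog"],
    "clip_002": ["dog", "cat", "bird"],
}

results = {item: consensus_label(votes) for item, votes in annotations.items()}
for item, (label, agreement) in results.items():
    print(item, label, round(agreement, 2))
```

Items that return `None` have no clear majority and would typically be relabelled or escalated rather than added to the training set.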

Algorithms for working with limited labelled data and improving label efficiency:

  • Data selection techniques like active learning and core-set selection for identifying the most valuable examples to label 
  • Semi-supervised learning, few-shot learning, and weak supervision techniques for maximising the power of limited labelled data 
  • Self-supervised learning and transfer learning approaches for developing powerful representations used for many downstream tasks with limited labelled data
  • Novelty and drift detection to identify when more data needs to be labelled
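
To make the active learning item above concrete, here is a minimal sketch of one common data selection strategy, entropy-based uncertainty sampling. It is an illustrative example only, not a method prescribed by the workshop. The function names, the `budget` parameter, and the prediction values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labelling(predictions, budget):
    """Pick the `budget` unlabelled items the model is least certain about."""
    ranked = sorted(
        predictions,
        key=lambda item: entropy(predictions[item]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical model outputs: item id -> predicted class probabilities
predictions = {
    "img_a": [0.98, 0.01, 0.01],
    "img_b": [0.40, 0.35, 0.25],
    "img_c": [0.70, 0.20, 0.10],
}

to_label = select_for_labelling(predictions, budget=2)
print(to_label)
```

The idea is to spend a limited labelling budget on the examples where the current model's predictions are closest to uniform, since those labels are likely to be the most informative.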

Responsible AI development: 

  • Fairness, bias, and diversity evaluation and analysis for datasets and algorithms/modelling
  • Tools for ‘green AI hardware-software system’ design and evaluation 
  • Scalable, reliable training systems and methods 
  • Tools, methodologies, and techniques for private, secure ML training 
  • Efforts towards reproducible AI (data cards, model cards, etc.)

Instructions for submitting papers

  • Researchers can submit short papers (1-2 pages) and long papers (4 pages), addressing one or more of the topics  
  • Papers need to be formatted as per NeurIPS 2021 guidelines 
  • Papers will be peer-reviewed by the programme committee 
  • Accepted papers will be presented as lightning talks during the workshop

Timeline 

  • Early submission deadline: 17 September 2021
  • Submission deadline: 30 September 2021 
  • Notification of acceptance: 22 October 2021 
  • Workshop: 14 December 2021 
