Mint: India is building a database for companies to train AI models: Rajeev Chandrasekhar

Mint: India is building a database for companies to train AI models: Rajeev Chandrasekhar. “India is building a large database of anonymized non-personal data for Indian companies and startups that are using artificial intelligence (AI), said Rajeev Chandrasekhar, minister of state (MoS) for Electronics and Information Technology, at the Global Fintech Fest (GFF), an industry event, held in Mumbai on Wednesday.”

Analytics India: Google AI researchers present a new method to train models, ‘DeepCTRL’

Analytics India: Google AI researchers present a new method to train models, ‘DeepCTRL’. “Google Cloud AI researchers have offered a unique deep learning training approach that incorporates rules so that the strength of the rules may be controlled at inference. DeepCTRL (Deep Neural Networks with Controllable Rule Representations) combines a rule encoder and a rule-based objective into the model, allowing for a shared representation for decision-making. Data type and model architecture are unimportant to DeepCTRL.”

University of California Riverside: Wildfire dataset could help firefighters save lives and property

University of California Riverside: Wildfire dataset could help firefighters save lives and property. “The dataset can be used to simulate the spread of wildfires to help firefighters plan emergency response and conduct evacuation. It can also help simulate how fires might spread in the near future under the effects of deforestation and climate change, and aid risk assessment and planning of new infrastructure development. The open-source dataset, named WildfireDB, contains over 17 million data points that capture how fires have spread in the contiguous United States over the last decade. The dataset can be used to train machine learning models to predict the spread of wildfires.”

Georgia Tech: Through Another’s Eyes: University Researchers, Facebook Release Massive Dataset to Expand Innovation in AI

Georgia Tech: Through Another’s Eyes: University Researchers, Facebook Release Massive Dataset to Expand Innovation in AI. “Imagine a collection of assistive technologies that could help a user learn a new skill, assist an elder individual with a task around the home, or help detect autism in early childhood. There exists an endless list of possibilities where artificial intelligence could impact humanity, but to do so it must see the world as we do — in the first person. A consortium of universities brought together by Facebook AI, including Georgia Tech, has collaborated to compile the largest dataset ever collected on egocentric computer vision — or computer vision from the first-person point of view.”

The Register: We spoke to a Stanford prof on the tech and social impact of AI’s powerful, emerging ‘foundation models’

The Register: We spoke to a Stanford prof on the tech and social impact of AI’s powerful, emerging ‘foundation models’. “Typically, these models are giant neural networks made up of millions and billions of parameters, trained on massive amounts of data and later fine-tuned for specific tasks. For example, OpenAI’s enormous GPT-3 model is known for generating prose from prompts, though it can be adapted to translate between languages and output source code for developers. These models – drawing from vast datasets – can therefore sit at the heart of powerful tools that may disrupt business and industries, life and work. Yet right now they’re difficult to understand and control; they are imperfect; and they exhibit all sorts of biases that could harm us. And it has already been demonstrated that all of these problems can grow with model size.”

Google AI Blog: A Dataset for Studying Gender Bias in Translation

Google AI Blog: A Dataset for Studying Gender Bias in Translation. “To help facilitate progress against the common challenges on contextual translation (e.g., pronoun drop, gender agreement and accurate possessives), we are releasing the Translated Wikipedia Biographies dataset, which can be used to evaluate the gender bias of translation models. Our intent with this release is to support long-term improvements on ML systems focused on pronouns and gender in translation by providing a benchmark in which translations’ accuracy can be measured pre- and post-model changes.”

Analytics India: What Happened When Google Threw All Voice Data To The Blender. Answer: SpeechStew

Analytics India: What Happened When Google Threw All Voice Data To The Blender. Answer: SpeechStew. “Training large models is a massive challenge as it requires collecting and annotating vast amounts of data. It is particularly challenging in the case of speech recognition models. To overcome this challenge, a team from Google Research and Google Brain have introduced an AI model, SpeechStew. The model is trained on a combination of datasets to achieve state-of-the-art results on various speech recognition benchmarks.”

The Register: How Facebook uses public videos to train, deploy machine-learning models and harvest those eyeballs

The Register: How Facebook uses public videos to train, deploy machine-learning models and harvest those eyeballs. “Facebook this week revealed an internal project to create machine-learning models that can understand visual, audio, and written content from videos publicly uploaded to its social network. One of the models, known as Generalized Data Transformations (GDT), is now used on Instagram. Users viewing short video recordings, or Reels, can quickly find other Reels they might like to watch, thanks to an AI-powered recommender system that picks similar clips that might be interesting.”

Engadget: Google wants you to train its AI by lip syncing ‘Dance Monkey’ by Tones and I

Engadget: Google wants you to train its AI by lip syncing ‘Dance Monkey’ by Tones and I. “Google is asking users to help teach its AI how to speak. A new ‘Experiments with Google’ called LipSync asks users to lip sync a small part of ‘Dance Monkey’ by Tones and I, Android Police reports. LipSync, which is built by YouTube for Chrome on desktop, will score your performance. It will then feed the video to Google’s AI — it doesn’t record any audio.”

Carnegie Mellon University: Live-Streamed Game Collects Sounds To Help Train Home-Based Artificial Intelligence

Carnegie Mellon University: Live-Streamed Game Collects Sounds To Help Train Home-Based Artificial Intelligence. “From yawning to closing the fridge door, a lot of sounds occur within the home. Such sounds could be useful for home-based artificial intelligence applications, but training that AI requires a robust and diverse set of samples. A video game developed by Carnegie Mellon University researchers leverages live streaming to collect sound donations from players that will populate an open-source database.”

The Register: MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs

The Register: MIT apologizes, permanently pulls offline huge dataset that taught AI systems to use racist, misogynistic slurs. “The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT’s cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word.”

CNET: Your face mask selfies could be training the next facial recognition tool

CNET: Your face mask selfies could be training the next facial recognition tool. “Your face mask selfies aren’t just getting seen by your friends and family — they’re also getting collected by researchers looking to use them to improve facial recognition algorithms. CNET found thousands of face-masked selfies up for grabs in public data sets, with pictures taken directly from Instagram.”

Quantum Stat: 100s of datasets for machine learning developers (and counting)

From last month, but I just learned about it today. Quantum Stat: 100s of datasets for machine learning developers (and counting). “With the advent of deep learning and the necessity for more and diverse data, researchers are constantly hunting for the most up-to-date datasets that can help train their ML model. Currently, NLP data seems to be scattered across several 3rd party libraries, Reddit, or in the research arms of big tech. And while these mediums are useful, there doesn’t seem to be a central hub for housing NLP data that can be easily reached and searched by the ML engineer. As a result, we’ve created the ‘Big Bad NLP Database,’ the world’s largest data library in natural language processing.”