Saturday, April 22, 2023

Secret list of ChatGPT: Smithsonian Puts 4.5 Million High-Res Images Online and Into the Public Domain Making Them Free to Use


Inside the secret list of websites that make AI like ChatGPT sound smart

Washington Post: “AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations. Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet. This text is the AI’s main source of information about the world as it is being built, and it influences how it responds to users. If it aces the bar exam, for example, it’s probably because its training data included thousands of LSAT practice sites. Tech companies have grown secretive about what they feed the AI. 

So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data. To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.) The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet.

Those are not shown. We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information, typically a word or phrase.”

  • Note: you can search for a specific website midway through the article.

Smithsonian Puts 4.5 Million High-Res Images Online and Into the Public Domain Making Them Free to Use (Open Culture)

Open Culture: “That vast repository of American history that is the Smithsonian Institution evolved from an organization founded in 1816 called the Columbian Institute for the Promotion of Arts and Sciences. Its mandate, the collection and dissemination of useful knowledge, now sounds very much of the nineteenth century — but then, so does its name. 

Columbia, the goddess-like symbolic personification of the United States of America, is seldom directly referenced today, having been superseded by Lady Liberty. Traits of both figures appear in the depiction on the nineteenth-century fireman’s hat above, about which you can learn more at Smithsonian Open Access, a digital archive that now contains some 4.5 million images.”

Teeth-grindingly awful

Another nail in the coffin. Google’s MusicLM generates relatively convincing audio from text descriptions (via MeFi and MusicRadar). You know the kind of thing it’ll be used for: ‘Hans Zimmer-style epic soundtrack’, ‘romantic music for a beachside sunset’, ‘horror movie atmosphere’. Commercial composers can join the queue behind copywriters and artists / see also How to Spot AI-Generated Art, According to Artists.


In Battersea, Owen Hatherley takes a turn around London’s latest over-polished piece of privatised heritage, ‘a chaos of luxury investment vehicles’ / Hookland, ‘High Weirdness from the Lost County of England’ / a collection of London Housing Architects / the Boden Fortress in Sweden / ChronoPhoto, fun game (via b3ta) / a good eye and ear at work at Visual Atelier 8 / Geometric Primes, beautiful design work / a collection of vintage posters.


Things You Can Buy suggests you take Perambulation #19, Maze Hill to Deptford, South London / staying in South London, hopefully this slice of urban greenery is now preserved: Gorne Wood, ‘the closest Ancient Woodland to the City of London’ / related, Five Forests in Literature (at the excellent Peter Harrington Journal).


A very complicated house? For sale at Savills – feels strangely sterile / Lautner for sale / Eton and all the murder / everyone dunks on the Rezvani Vengeance, also discussed, amusingly, at MeFi. See this Axios story on the evolution of the Ford pick-up truck / the Wonders of Street View (via Kottke) / the story of Where is my mind? / the amazing Open Reel Ensemble / cinematic music by Forest Little / post-rockery from Norway’s Liongeist.