Two Hundred Million Bluesky posts scraped by 2 different groups
Failla A, Rossetti G (2024) “I’m in the Bluesky Tonight”: Insights from a year worth of social data. PLoS ONE 19(11): e0310330. https://doi.org/10.1371/journal.pone.0310330 “Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern.
However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole.
We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and like feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions.
This dataset allows novel analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.”
- Hugging Face: “Dataset Card for 1 Million Bluesky Posts – This dataset contains 1 million public posts collected from Bluesky Social’s firehose API, intended for machine learning research and experimentation with social media data…” Daniel van Strien @danielvanstrien.bsky.social: “First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts. 1M public posts from Bluesky’s firehose API. Includes text, metadata, and language predictions.” BUT then… “I’ve removed the data from this dataset since there was a lot of community pushback about its creation/uploading. I will leave the dataset repository up to allow room for discussion of how datasets can be used to help improve Bluesky and allow people to build the tools they need to build their own open models and approaches to creating feeds that work for their needs. Please feel free to continue to leave feedback in the discussions here…”
- BUT… See also Dr. Casey Fiesler @cfiesler.bsky.social: “…Researchers have been using social media content without your consent for a LONG time. Not just AI/ML research, of course, all kinds. There is a non-zero chance that one of your tweets or reddit comments is quoted in a research paper somewhere. www.howwegettonext.com/scientists-l… And Twitter for a long time was absolutely the biggest source of social media data for research. @zey.bsky.social once called Twitter the “model organism” of social media research: researchers used Twitter because like the fruit fly, the platform and its users were just so easy to study…In 2016 my collaborator @profprof.bsky.social and I surveyed Twitter users about how they felt about researchers using their tweets. And one of the findings was that most of them had no idea this was happening. But when they found out… they cared. journals.sagepub.com/doi/10.1177/…”