
I'm slightly creeped out but not surprised. I was editing a music score on my laptop recently and added an instruction to play the piece "robotic". The next time I logged into Indeed, the first job recommendation that came up was for a Robotics Operator. Is Indeed scraping my recent documents for keywords?

Always check your firewall.

#scraping #datascraping

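Wikipedia's major shift after 25 years: paid contracts signed with AI companies

For the first time in 25 years, Wikipedia has signed paid contracts with AI companies as it looks for a survival strategy in the AI era. The commercial agreements are meant to address problems such as rising server costs from mass scraping by AI bots, declining visitor numbers, and deteriorating content quality.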

https://news.hada.io/topic?id=25976

#wikipedia #ai #openknowledge #datascraping #commercialcontract

Get Property Intelligence Powered by Real Estate Data Scraping Services

Explore how quality data collection can elevate your real estate strategy: https://www.hitechbpo.com/real-estate-data-scraping-services.php

#realestatedata #datascraping #propertyinsights #datasolutions

🚀 Want to dive into data scraping? Check out MediaCrawler by NanmiCoder! This powerful tool lets you harvest comments and content from popular platforms like Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Baidu Tieba, and Zhihu. Perfect for learning and research—just remember to use it responsibly! 📊💻

Explore more here: https://github.com/NanmiCoder/MediaCrawler

#DataScraping #TechTools #OpenSource

The “17.5 million Instagram user data leak” making the rounds in 2026? Old news.

The data from 2022 was already leaked in 2023.

We broke down all 3 dumps - same records

Don’t fall for clickbait reports!

Read: https://hackread.com/instagram-user-data-leak-scraped-records-2022/

#Instagram #DataLeak #Cybersecurity #Privacy #DataScraping

How Data Scraping Powers Dynamic Pricing

Data scraping helps businesses track market trends, competitor prices, and demand changes in real time. Access to structured, reliable data supports smarter pricing decisions, improves analytics, and strengthens AI models while maintaining data quality and compliance.

#DataScraping #BusinessIntelligence #Ecommerce

Free, self-hosted browser automation: **Doppelgänger** addresses the limitations of paid platforms such as Apify with: ✅ No per-run fees ✅ Self-hosted, so your data stays secure ✅ Flexible JSON and JavaScript support. It improves data extraction workflows that are missing data or that include pages requiring login. An open-source project built on Playwright. Well suited to complex, repetitive task flows.
#CongNghe #TirungTo #OpenSource #LapTrinh #DataScraping #Doppelgänger #TirungTrinhDuyet #PhanTichDuLieu

LinkedIn's 2025 Data Crisis: 4.3 Billion Records Leaked, Risks Rise https://www.webpronews.com/linkedins-2025-data-crisis-4-3-billion-records-leaked-risks-rise/ #cybersecurity #LinkedIn #DataTheft #scams #spam #DataScraping

A new standard, the Site Content Protocol (SCP), has been proposed to address problems in collecting data for AI. SCP lets websites serve structured content optimized specifically for AI, improving data quality, efficiency, and legal transparency, instead of having data scraped from ordinary HTML.

#AI #DataScraping #WebDev #SCPProtocol #Efficiency
#ThuThapDuLieu #PhátTriểnWeb #GiaoThucSCP #HieuQua

https://www.reddit.com/r/programming/comments/1puyk3x/specification_addressing_ineffic

A new tool extracts high-quality, manually written subtitles from YouTube, ideal for building fine-tuning datasets for Llama/Mistral. It automatically distinguishes human-written subtitles from auto-generated ones and handles IP rotation to avoid being blocked. Very useful for AI developers!

#AI #YouTube #DataScraping #LLM #MachineLearning #DữLiệu #HọcMáy

https://www.reddit.com/r/LocalLLaMA/comments/1pt9njz/tool_for_scraping_highquality_youtube_datasets/

Starting from a freelance project scraping Substack, one person turned a one-off solution into a self-service tool, opening up a market opportunity. A story of moving from working for hire to building a product. #FreelanceTips #ProductBuilding #Substack #DataScraping #StartupViecles #TaoSanPham #KinhNghiemTuDo

https://www.reddit.com/r/SideProject/comments/1pqwve3/a_peopleperhour_gig_taught_me_to_think/

**AI's public debt: how LLM training tools break the social contract of open source**
AI learns from open source but does not fulfill its obligations, which causes problems for the community. LLM (Large Language Model) projects "build on" public data but take lightly their responsibilities for security, for crediting authors, and for the long-term benefit of open software. A reorientation is needed so that the technology develops sustainably.

#AI #Mãnguồnmở #Đàotạocôngnghệ #Bềnvững #ĐạođứcAI #OpenSource #SocialContract #TechEthics #AIdebt #DataScraping


https://winbuzzer.com/2025/12/08/new-york-times-sues-perplexity-ai-for-copyright-infringement-and-trademark-tarnishment-xcxwbn/

New York Times Sues Perplexity AI for Copyright Infringement and ‘Trademark Tarnishment’

#AI #Copyright #PerplexityAI #NYT #GenAI #SearchEngines #RAG #MediaLaw #IntellectualProperty #Hallucinations #Journalism #DataScraping

The development of modern AI heavily depends on reliable training data, and web scraping provides it at scale. Scraping gathers real-world content that helps LLMs understand grammar, sentiment, trends, and domain knowledge. With well-processed scraped datasets, AI models become more accurate, more knowledgeable, and more capable of solving real challenges.

https://www.webscreenscraping.com/web-scraping-role-ai-training-llm-models.php

#aiwebscraping #datascraping

Reddit Sues Perplexity and Data Scrapers for 'Industrial-Scale' AI Content Theft

#AI #Reddit #Perplexity #Lawsuit #DataScraping #Copyright #TechLaw #DMCA #AIEthics #BigTech #IntellectualProperty #SerpApi #Oxylabs #Litigation

https://winbuzzer.com/2025/10/23/reddit-sues-perplexity-and-data-scrapers-for-industrial-scale-ai-content-theft-xcxwbn

Reddit sues data scrapers and Perplexity over unauthorized content access: Reddit filed a lawsuit on October 22, 2025, against SerpApi, Oxylabs, AWMProxy, and Perplexity AI for circumventing security measures to scrape platform data. https://ppc.land/reddit-sues-data-scrapers-and-perplexity-over-unauthorized-content-access/ #Reddit #Lawsuit #DataScraping #Privacy #Cybersecurity

Top Data Extraction & Web Scraping Companies in 2026 | TagX

Discover the leading data extraction and web scraping companies for 2026 offering advanced AI-powered tools, scalable APIs, and automation services. Compare top providers like TagX, Octoparse, Scrapy, and more to choose the best solution for efficient data collection and insights.
#tagx #datascraping
#webscraping

https://www.tagxdata.com/top-data-extraction-and-web-scraping-companies-in-2026

Since it will be introduced in the EEA, Switzerland, Canada and Hong Kong in a few weeks,
make sure to opt out of LinkedIn's AI data scraping agreement (unless you want your data to be used as training data).

https://www.linkedin.com/mypreferences/d/settings/data-for-ai-improvement

#ai #linkedin #datascraping #psa #eu #europe #eea #switzerland #schweiz #suisse #canada #hongkong #information #important

Hiring: a job to scrape 300,000 PDF book titles from AbeBooks and find the files via the Wayback Machine/Anna's Archive. The total of 4TB of data will be stored on 128GB optical discs (Verbatim/Panasonic) to ensure it stays readable for 100 years. Budget: $700 (excluding materials).

#TuyểnDụng #Scraping #LưuTrữDữLiệu #PDF #AbeBooks
#Hiring #DataScraping #DataArchiving #PDF

https://www.reddit.com/r/programming/comments/1o4te1o/hiring_scrape_300000_pdfs_and_archive_to_128_gb/

#DataScraping
