
Competitor Price Cache with Failover

Stay fresh even when the scrape target goes down.

#php #python #caching #scraping #failover #pricing #performance #reliability #growthhacks #viralcoding

https://www.youtube.com/watch?v=9orrqrZ0mFQ
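
The post gives only the title and tagline, so the video's actual implementation is not known here. One common way to "stay fresh even when the scrape target goes down" is a stale-on-error cache: serve cached prices while they are within a time-to-live, refetch when they expire, and fall back to the last known price if the refetch fails. The following is a minimal Python sketch of that pattern; `PriceCache`, `fetch`, `ttl`, and `clock` are hypothetical names chosen for illustration, not taken from the video.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PriceCache:
    """Hypothetical stale-on-error cache for scraped competitor prices."""

    def __init__(
        self,
        fetch: Callable[[str], float],  # assumed scraper: SKU -> price, raises on failure
        ttl: float = 300.0,             # seconds a cached price counts as fresh
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        # Maps SKU -> (price, time the price was fetched).
        self._store: Dict[str, Tuple[float, float]] = {}

    def get(self, sku: str) -> Tuple[Optional[float], bool]:
        """Return (price, is_stale). Price is None if nothing was ever cached."""
        now = self._clock()
        cached = self._store.get(sku)

        # Fresh hit: answer from the cache without touching the scrape target.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0], False

        try:
            price = self._fetch(sku)
        except Exception:
            # Failover: the target is down, so serve the last known price
            # and flag it as stale instead of failing the caller.
            if cached is not None:
                return cached[0], True
            return None, True

        self._store[sku] = (price, now)
        return price, False
```

A caller would wrap its real scraper in `fetch` and treat `is_stale` as a signal to show a "last updated" notice or to trigger an alert. Whether stale prices are acceptable at all depends on the use case, so the fallback is a design choice rather than a given.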

Blocked 👎 bye bye #contenttheft #scraping

A screenshot of an account that scrapes and steals content for fun

P.S., the body of the parent #toot was created by a simple #shell #function:

function apod {
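    # Today's NASA Astronomy Picture of the Day info-fetcher.
    # Takes the first dated entry from the archive index and reshapes it into
    # "Date:", "URL:" and "Title:" lines. The \n in the sed replacements
    # assumes GNU sed.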
    curl -sL 'https://apod.nasa.gov/apod/archivepix.html' \
        |grep -m1 "[0-9][0-9]:" \
        |sed 's/^/Date: /;
            s|: *<a href="|\nURL: https://apod.nasa.gov/apod/|;
            s/">/\nTitle: /; s/<.*$//'
    echo
    echo "#NASA #Astronomy #PictureOfTheDay"
}

#bash #ksh #mksh #shellScripting #unix #UnixShell #WebScraping #Scraping #HTML

#OpenStreetMap is worried: Thousands of AI bots are harvesting data | heise online https://www.heise.de/news/OpenStreetMaps-sorgt-sich-Tausende-KI-Bots-erfassen-Daten-11156876.html #ArtificialIntelligence #scraping

Basic day when you have a small photography website that's been around for a while.

#Cloudflare #WAF #Firewall #Webmaster #AI #Scraping #AIMustDie #NoAI #No2AI

Screenshot showing a detail of Cloudflare web application firewall statistics, with 33440 requests mitigated by the firewall.

I found this site:
www.newbohemia.art

They claim to be a "human only" art site. They verify you create human only art.

Cool. I joined to see what they're about.

They allow ANY user to share your work on all of meta's platforms, xitter, pinterest, tiktok, and tumblr. You, the user, cannot stop anyone from posting your work on the worst sites for AI training.

I asked them about it, and they said: "We're looking into it. We may allow an individual to turn that off in the future."

I pointed out the hypocrisy and stupidity of making this the default.

They banned me.

@eff @privacyguides
You should do a piece on these assholes.

@Curator You may wish to warn your users about this.

#newbohemiaart #art #ai #scraping #privacy #dataprivacy #stopthesteal #blockmeta #blockxitter #blockpinterest #blockticktok #blockai #artists

Scrapers vs. OpenStreetMap: Bots of some sort are hammering OSM's servers instead of just downloading the bulk free data dump
https://www.linkedin.com/posts/open-street-map_opendata-osm-openstreetmap-activity-7422084150360408064-ews2/
#openstreetmap #scraping #hosting #bots #osm #ai

OSM scraping: An #OpenStreetMap-themed account on LinkedIn is calling on journalists to investigate a surge in what seems to be coordinated #scraping from hundreds of thousands of IP addresses, suspected to be linked to #AI #data collection. This incident...
https://spatialists.ch/posts/2026/01/28-osm-scraping/ #GIS #GISchat #geospatial #SwissGIS

#Paper: Researchers scraped the public profiles of 3.5 billion #WhatsApp users simply by querying WhatsApp's own public registry, which maps phone numbers to WhatsApp profiles.
They did it with merely 5 WhatsApp accounts, all operating from the same IP and running thousands of processes and threads in parallel. WhatsApp's servers didn't rate-limit them.
They ended up with a database mapping profile pictures to phone numbers, so that with face recognition software one could in principle automatically figure out someone's phone number from a photo of their face, and vice versa (if that person also uses their own face as a profile picture in WhatsApp).

https://github.com/sbaresearch/whatsapp-census/blob/main/Hey_there_You_are_using_WhatsApp.pdf
#Scraping #Privacy

🚀 Newly launched API "Article Extractor & AI Summarizer"! It automatically extracts clean content from any URL, strips ads, supports a JS fallback, then generates AI summaries, comparisons, and article rewrites. Works well on most news/blog sites and is cheap on RapidAPI. The community is invited to share feature requests, pain points when scraping/summarizing, and ideas for improvement. #AI #API #Scraping #Summarizer #CôngNghệ #TinTức #MachineLearning #TechVietnam 🌐✨

https://www.reddit.com/r/SaaS/comments/1qpa2r9/i_built_an_article_extrac

🔥 Just launched Divparser, an AI scraper tool that turns any web page into clean JSON with a single prompt. Google indexed it right away and people are already trying it out. If you're interested in scraping, automation, or AI data extraction, please share your feedback! #AI #Scraping #Automation #DataExtraction #TríTuệNhânTạo #ThuThậpDữLiệu #TựĐộng #CôngCụ

https://www.reddit.com/r/SaaS/comments/1qo2uvv/just_launched_divparser_last_week_an_aipowered/

Any solution to get more SERP results from Google? Any hacks/tricks? #BuildInPublic #scraping #scrapers #python

Building a Tavily alternative that gives local LLM systems direct web access without hiding information. It lets you: 1) Search Bing/DuckDuckGo or any SERP via scraping, 2) Choose for yourself which URLs to fetch content from (no dependence on the provider's ranking), 3) Receive the content as HTML, Markdown, or plain text. 10K free API credits every month. #LocalLLM #WebScraping #AI #RAG #CôngCụAI #TríTuệNhânTạo #Scraping #HệThốngLLM

https://www.reddit.com/r/LocalLLaMA/comments/1q

I'm slightly creeped out but not surprised. I was editing a music score on my laptop recently and I added an instruction to play the piece "robotic". The next time I logged into Indeed, the first job recommendation to come up is for Robotics Operator. Is Indeed scraping data from my recent documents for keywords?

Always check your firewall.

#scraping #datascraping

(How) Is legally compliant scraping of websites possible?

Scraping websites often has negative connotations. Those affected cite data protection and informational self-determination as arguments against its permissibility. Often rightly so, but not always. Using the 2024 CNIL case against the provider KASPR as an example, the article (...)
https://www.dr-datenschutz.de/wie-ist-ein-rechtskonformes-scraping-von-webseiten-moeglich/

#Scraping #Webseite

LeadFoxy: a tool that scrapes and verifies leads right during the search. Supports LinkedIn and web sources, real-time SMTP/DNS checks, CSV/Excel export, 2,500 credits/month, unlimited inbox warm-up, and a REST API. Free trial, no card required. Feedback wanted on the "Newly Registered Domains" feature. #LeadFoxy #LeadGeneration #Scraping #Marketing #CôngCụ #TiếpThị

https://www.reddit.com/r/SaaS/comments/1qiu7rr/i_built_a_lead_gen_tool_that_verifies_emails/

We're building a community for sharing scraping scripts for websites that have no feed. Don't code alone: swap scripts and tips with each other! Join the Discord to connect and learn.

#scraping #webscraping #côngcụ #cộngđồng #đồnghề #discord #sharecode #automation

https://www.reddit.com/r/selfhosted/comments/1qirs1h/selfhosting_your_own_rss_reader/

🛠️ Fed up with expensive, complicated scraping tools, I built my own contact-scraping app: fast, clean, fairly priced, no artificial limits. It's now open to friends and partners and has received positive feedback. Lesson: simple, transparent, and reasonably priced always wins people over. Have you ever built a tool of your own? #scraping #côngcụ #độnghặtdữliệu #startup #phầnmềm #DIY #frustration

https://www.reddit.com/r/SaaS/comments/1qig943/the_lessons_i_learned_after_building_my_own/

After years of running data collection pipelines, we realized: scripts don't scale. Every website change breaks them, and silent failures corrupt the data; >100k SKUs => full-time QA. So we built a "utility layer" that adapts automatically, self-heals, matches SKUs, and delivers clean feeds, then launched it as a product. How are you handling scraping + SKUs at scale? #scraping #SKU #DataEngineering #Automation #côngnghệ #thuậttoán #kekhaithôngtin

https://www.reddit.com/r/Sa

Part of #CoryDoctorow's #39c3 talk about the #AiBubble and #interoperability also relates to his forthcoming book on #Centaurs. It has now been adapted in part for a Guardian piece yesterday. But we find it HIGHLY problematic, as it does not include any mention of defending against #surveillance #capitalism or of protecting our rights against ongoing #BigData #Scraping (for #AI monstrosities)! How is it possible not to realize that this is at the core of predatory #Bigtech #colonialism and #monopoly!? @pluralistic

#scraping
