#semanticsearch

N-gated Hacker News ngate
2025-05-20

🔍 A "semantic search engine" for academic papers, because apparently, reading titles is just too darn hard for today's scholars. 🤔 Built with AI magic and a sprinkle of open access, because who needs good old-fashioned research skills anymore? 🙄
arxivxplorer.com/

Hacker News h4ckernews
2025-05-20

Semantic search engine for arXiv, bioRxiv and medRxiv

arxivxplorer.com/

Miguel Afonso Caetano remixtures@tldr.nettime.org
2025-05-16

"I presented a three hour workshop at PyCon US yesterday titled Building software on top of Large Language Models. The goal of the workshop was to give participants everything they needed to get started writing code that makes use of LLMs.

Most of the workshop was interactive: I created a detailed handout with six different exercises, then worked through them with the participants. You can access the handout here—it should be comprehensive enough that you can follow along even without having been present in the room.

Here’s the table of contents for the handout:

- Setup—getting LLM and related tools installed and configured for accessing the OpenAI API
- Prompting with LLM—basic prompting in the terminal, including accessing logs of past prompts and responses
- Prompting from Python—how to use LLM’s Python API to run prompts against different models from Python code
- Building a text to SQL tool—the first building exercise: prototype a text to SQL tool with the LLM command-line app, then turn that into Python code.
- Structured data extraction—possibly the most economically valuable application of LLMs today
- Semantic search and RAG—working with embeddings, building a semantic search engine
- Tool usage—the most important technique for building interesting applications on top of LLMs. My LLM tool gained tool usage in an alpha release just the night before the workshop!

Some sections of the workshop involved me talking and showing slides. I’ve gathered those together into an annotated presentation below.

The workshop was not recorded, but hopefully these materials can provide a useful substitute. If you’d like me to present a private version of this workshop for your own team please get in touch!"

simonwillison.net/2025/May/15/

#AI #GenerativeAI #LLMs #Programming #SoftwareDevelopment #Python #OpenAI #SQL #RAG #SemanticSearch

Data for Breakfast data.blog@data.blog
2025-05-12

AI-powered Typo Hunting: Trust Your Docs, Readers Will

Our documentation has a trust problem, and I just found 142 reasons why. It started with a silly typo I noticed on one of the pages – something like “cotnact” instead of “contact”. It was quick to fix, but it got me thinking: are there more?
Third‑party writing assistants are available as browser extensions, and we also have a spelling mistake checker available within Jetpack. With such tools, it’s easy to catch typos when editing pages, but only while you are on a specific page in edit mode.

Why it’s a problem

Typos can negatively impact our company’s credibility, giving the impression of negligence or lack of expertise.

Solution

I wanted a better approach—proactive rather than reactive.

Fortunately, the WordPress.com public API made it easy to build an automated solution. Using it, I scanned all Jetpack.com support pages and sent their content to the GPT‑4o model (a multilingual, multimodal generative pre‑trained transformer developed by OpenAI) with this prompt:

prompt = """Your task is to check the provided text in American English for accidental typos. 
List all obvious typo errors in the provided text and propose a replacement.

Do not list any of those:
- punctuation errors,
- grammar errors,
- typos in html attributes,
- typos in code snippets,
- words including HTML special characters.
"""

Results and next steps

I ended up with 142 pages that required our attention. Some of the detected typos may be false positives, some may need a review by a native speaker, but many are accurately identified typos (“Keet”, “Nexdoor”, “perfomance”).

Cleaning up typos in the Jetpack.com documentation – work in progress.

Curious about the technical details? Here’s the code I used:

from openai import OpenAI
import json
import requests
import pandas as pd

client = OpenAI()


def get_wp_posts(id, type):
    # Base URL for the WordPress.com API request
    base_url = f"https://public-api.wordpress.com/rest/v1.1/sites/{id}/posts/?type={type}"
    page = 1  # Start from the first page
    # List to store the post IDs and URLs
    posts_data = []

    while True:
        # Append the page number to the base URL
        url = f"{base_url}&page={page}"
        response = requests.get(url)
        if response.status_code == 200:
            data = response.json()
            posts = data.get('posts', [])
            if not posts:
                break  # Break the loop if no posts are returned

            for post in posts:
                posts_data.append({'id': post['ID'], 'url': post['URL'], 'content': post['content']})

            page += 1  # Increment the page number for the next request
        else:
            print(f"Failed to retrieve data: {response.status_code} {response.text}")
            break

    # Convert list of posts to DataFrame
    return pd.DataFrame(posts_data)


def find_typos(x):
    prompt = """Your task is to check the provided text in American English for accidental typos.
List all obvious typo errors in the provided text and propose a replacement.

Do not list any of those:
- punctuation errors,
- grammar errors,
- typos in html attributes,
- typos in code snippets,
- words including HTML special characters.
"""
    response = client.responses.create(
        model="gpt-4o-2024-08-06",
        input=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": x}
        ],
        text={
            "format": {
                "type": "json_schema",
                "name": "typos",
                "schema": {
                    "type": "object",
                    "properties": {
                        "typos": {
                            "type": "array",
                            "items": {
                                "type": "string"
                            }
                        },
                        "replacements": {
                            "type": "array",
                            "items": {
                                "type": "string"
                            }
                        },
                    },
                    "required": ["typos", "replacements"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    )
    print(response.output_text)
    return json.loads(response.output_text)


df = get_wp_posts(20115252, "jetpack_support")
df["typos"] = df["content"].apply(find_typos)
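The post notes that some detections may be false positives and that the results ended up in a spreadsheet. As a rough sketch of that last step (not the author's code), the per-page results could be flattened into one row per typo, with two cheap sanity checks applied along the way. The function names flatten_typos and rows_to_csv are hypothetical, and the sketch assumes each record has the "url", "content" and "typos" fields produced by the code above.

```python
import csv
import io


def flatten_typos(records):
    """Turn per-page results into one row per (url, typo, replacement).

    `records` is an iterable of dicts with "url", "content" and "typos" keys,
    where "typos" is the dict returned by find_typos() (assumed shape).
    """
    rows = []
    for record in records:
        typos = record["typos"]["typos"]
        replacements = record["typos"]["replacements"]
        # The schema uses two parallel arrays; skip the page if they disagree.
        if len(typos) != len(replacements):
            continue
        for typo, replacement in zip(typos, replacements):
            # Drop suggestions that cannot be right: the reported word is not
            # in the page text, or the replacement is identical to it.
            if typo not in record["content"] or typo == replacement:
                continue
            rows.append({"url": record["url"], "typo": typo, "replacement": replacement})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "typo", "replacement"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With the DataFrame from the code above, the input could be built with df.to_dict("records"). The substring check only removes suggestions that are clearly wrong; anything that survives still needs a human review.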

Your turn, it is.

Typos might seem small, but they speak volumes about professionalism and attention to detail. How confident are you about your own content? Have you thought about doing something similar on your site or blog? What approach did you take? Are you ready to try this method? Or maybe you have AI prompt ideas beyond spell‑checking? Let us know in the comments!

#ArtificialIntelligence #NaturalLanguageProcessing #SemanticSearch #WordPressCom

A screenshot of a Google Sheets document displaying potential typos in Jetpack support articles, including URLs and suggested replacements.

:rss: Qiita - 人気の記事 qiita@rss-mstdn.studiofreesia.com
2025-04-23

Paolo Melchiorre paulox@fosstodon.org
2025-04-01

I'm happy to share with you that I'll present my talk "A Pythonic Semantic Search" at EuroPython 2025. 🐍

See you in Prague, next July, for the largest Python conference in Europe. 🇪🇺

paulox.net/2025/07/16/europyth

#EuroPython #PyCon #Europe #Python #SemanticSearch #Django

2025-03-25

Lilbits: AI-enhanced search for Windows PCs and Amazon’s Big Spring Sale

Many of the latest laptop processors from Intel, AMD, and Qualcomm have high-performance neural processing units that are supposed to let you run AI applications locally without sending any data to the cloud. But so far Microsoft and PC makers have had a hard time coming up with things that you can actually do with all that on-device AI processing power.

Originally Microsoft’s biggest […]

#amazon #bargains #copilotPlusPc #lilbits #microsoft #mobileLinux #PineNote #semanticSearch

Read more: liliputing.com/lilbits-ai-enha

"PubMed and beyond: biomedical literature search in the age of artificial intelligence"

Great overview article on biomedical literature search resources.

thelancet.com/journals/ebiom/a

#research #medicine #AI
#semanticSearch #LLMsearch

Screenshot showing a partial list of biomedical literature search resources from a table in this article.

N-gated Hacker News ngate
2025-03-08

👨‍💻Ah, the classic tale of a dev who bravely attempted to revolutionize with semantic search but ended up just building another Notion clone. 🚀 Maybe next time try solving a problem that isn't already solved by 12 other apps on the App Store.📚
tzx.notion.site/What-I-Learned

2025-03-07

Discovered on LinkedIn: meaningfully. That link goes to a GitHub repository. From the readme: “Meaningfully is a semantic search tool for text data in spreadsheets. Keyword searching in Excel or Google Sheets is painful because text data is displayed awkwardly and because keywords miss circumlocutions, typos, unexpected wording and foreign-language data. Semantic search solves all of that. […]

https://rbfirehose.com/2025/03/06/semantic-search-in-spreadsheet-data-meaningfully/

Paolo Melchiorre paulox@fosstodon.org
2025-02-15

I'm happy to share that I'll be speaking at PyCon Italia 2025 🎉

I'll show you how to implement a semantic search with Python, Django and PostgreSQL 🤖

See you in Bologna from 29 May 2025 🇮🇹

Info 👇
paulox.net/2025/05/29/pycon-it

#PyCon #PyConItalia #PyConIT #Python #SemanticSearch #Django #PostgreSQL #PGvector #SentenceTransformers #OpenSource #FreeSoftware #AI #FOSS

CC @pycon
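
The hashtags above point to Sentence Transformers for embeddings and pgvector for storage; the talk's own code is not shown here. As a minimal sketch of the idea those tools build on, the snippet below ranks documents by cosine similarity between vectors. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the names cosine_similarity and semantic_search are illustrative, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction to compare.
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, documents, top_k=3):
    """Return the top_k (score, text) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hand-made vectors standing in for embeddings (assumption for illustration).
documents = [
    ("How to reset your password", [0.9, 0.1, 0.0]),
    ("Recovering a forgotten login", [0.8, 0.2, 0.1]),
    ("Quarterly sales report", [0.0, 0.1, 0.9]),
]
query = [0.85, 0.15, 0.05]  # Pretend embedding of "I can't sign in"
results = semantic_search(query, documents, top_k=2)
```

The query shares no keywords with either login-related document, yet both outrank the sales report because their vectors point in a similar direction. In a real setup an embedding model would produce the vectors and the database would do the nearest-neighbour lookup.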

Harald Klinke HxxxKxxx@det.social
2025-02-15

Publication:
Academic libraries face challenges in managing diverse data collections. A new ontology-driven knowledge base improves semantic search, integrating structured and unstructured data. Using SPARQL queries from LLM prompts, it enhances knowledge retrieval. Evaluated for accuracy and scalability, it advances computational archival science.
#DigitalHumanities #Ontology #SemanticSearch #AcademicLibraries
ieeexplore.ieee.org/document/1

2025-02-05

Spotted on Calishat Snaps: StartupSeeker, with a tag line of “Semantic Search for Startups”. Over 135K startups are indexed in the engine. I did a search for “agrivoltaics companies started before 2023” and got 23 useful results.

https://rbfirehose.com/2025/02/05/startupseeker-search-engine/

Microsoft DevBlogs msftdevblogs@dotnet.social
2025-01-25

Discover how to use Microsoft.Extensions.VectorData to implement semantic search with Qdrant and Azure AI Search. 🧭 Let's explore how semantic search is reshaping data interpretation by focusing on meaning instead of just keywords. #SemanticSearch #DotNet #AI

2025-01-06

Does anyone know of an #OpenAccess full-text #PDF #search engine/tool that I can use to search for relevant PDFs in a self-hosted #database?

Context: we have a curated database of #research articles but so far our search capability has been limited to tagged keywords or title and abstract field search only. We'd like to be able to search the entire PDF.

Side note: I know that PDFs are not a great way to store scientific information. I'd prefer not to use a proprietary #LLM if possible

#LexicalSearch #SemanticSearch #AskAcademia #academia #science #sciences #ScienceMastodon #AskFedi #OpenScience

Brandon H :csharp: :verified: bc3tech@hachyderm.io
2024-12-16
