Ryan Marcus

Asst prof computer science @ UPenn. Machine learning for systems. Databases. He/him.

2025-06-13

@nholzschuch Absolutely -- emails are tricky since many conferences require the email listed on a paper to match your institution. But some institutions don't give permanent email addresses, as you said.

A good workaround I've found is creating a mail alias at the institution that can point to your institutional address while you are there, and can be repointed to your personal email afterwards.

But really, academic institutions need to support lifelong email addresses, or at least forwarding, for scientists. A new subdomain could be used to avoid polluting the undergrad email namespace.

2025-06-12

This is your yearly reminder that anyone who publishes CS papers should have a personal website that lists their current position, research interests, publications, and email address.

If you don't, it's basically impossible for me to invite you to a PC, invite you to give a talk, ask a question about your work, or recommend you to others when asked.

A meme featuring Bernie Sanders standing outdoors in a winter coat, speaking directly to the camera. The caption reads, "I am once again asking PhD students to make a damn website."
2025-06-03

OLAP workloads are dominated by repetitive queries -- how can we optimize them?

A promising direction is to do 𝗼𝗳𝗳𝗹𝗢𝗻𝗲 query optimization, allowing for a much more thorough plan search.

Two new SIGMOD papers! ⬇️

LimeQO (by Zixuan Yi), a π‘€π‘œπ‘Ÿπ‘˜π‘™π‘œπ‘Žπ‘‘-𝑙𝑒𝑣𝑒𝑙 approach to query optimization, can use neural networks or simple linear methods to find good query hints significantly faster than a random or brute force search.

πŸ“„ rm.cab/limeqo

For that one query that must go π‘Ÿπ‘’π‘Žπ‘™π‘™π‘¦ π‘“π‘Žπ‘ π‘‘, BayesQO (by Jeff Tao) finds superoptimized plans using Bayesian optimization in a learned plan space. It's costly, but the results can train an LLM to speed things up next time.

πŸ“„ rm.cab/bayesqo

Zixuan's website: zixy17.github.io/
Jeff's website: speculative.tech/

Infographic describing LimeQO, a workload-level, offline, learned query optimizer. On the left, it shows a workload consisting of multiple queries (q₁ to q₄), each with a default execution time (3s, 9s, 12s, 22s respectively). On the right, alternate plans (h₁, h₂, h₃) show varying execution times for each query, with some entries missing (represented by question marks). For example, q₁ takes 1s under h₂, much faster than the 3s default. A specific callout highlights that for q₃, plan h₃ reduced the time from 12s to 3s but took 18s to find, resulting in a benefit of 9s gained / 18s search. The image poses the question: "Where should we explore next to maximize benefit?" The image credits Zixuan Yi et al., SIGMOD '25, and provides a link: https://rm.cab/limeqo

Infographic describing BayesQO, an offline, multi-iteration learned query optimizer. On the left, it shows a Variational Autoencoder (VAE) being pretrained to reconstruct query plans from vectors, using orange-colored plan diagrams. The decoder part of the VAE is retained. In the center and right, the image shows Bayesian optimization being performed in the learned vector space: new vectors are decoded into query plans, tested for latency, and refined iteratively. At the bottom, a library of optimized query plans is used to train a robot labeled "LLM," which can then generate new plans directly. The caption reads: "We get a fast query, but also a library of high-quality plans. We can train an LLM to speed up the process for next time!" The image credits Jeff Tao et al., SIGMOD '25, and links to https://rm.cab/bayesqo
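The exploration idea in the infographic can be sketched with the "simple linear methods" the post mentions: treat the (query, hint) latencies as a partially observed matrix and complete it with low-rank alternating least squares. A toy sketch (all numbers hypothetical; LimeQO's actual algorithm differs in its details):

```python
import numpy as np

def complete_latencies(L, rank=2, iters=50, reg=0.1):
    """Fill in missing (query, hint) latencies via low-rank
    alternating least squares; NaN marks untried combinations."""
    mask = ~np.isnan(L)
    n_q, n_h = L.shape
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(n_q, rank))
    H = rng.normal(size=(n_h, rank))
    for _ in range(iters):
        for i in range(n_q):  # refit each query's factor
            obs = mask[i]
            A = H[obs].T @ H[obs] + reg * np.eye(rank)
            Q[i] = np.linalg.solve(A, H[obs].T @ L[i, obs])
        for j in range(n_h):  # refit each hint's factor
            obs = mask[:, j]
            A = Q[obs].T @ Q[obs] + reg * np.eye(rank)
            H[j] = np.linalg.solve(A, Q[obs].T @ L[obs, j])
    return Q @ H.T

# Hypothetical 4-query x 3-hint latency matrix (seconds); NaN = untried.
L = np.array([[3.0, 1.0, np.nan],
              [9.0, np.nan, 4.0],
              [np.nan, 12.0, 3.0],
              [22.0, np.nan, np.nan]])
pred = complete_latencies(L)
# Explore next where predicted latency is far below the current best plan.
```

The predicted entries answer the infographic's question: a (query, hint) cell whose prediction is far below the query's default time is the most promising place to spend search effort.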
2025-06-01

The abstract deadline for SoCC '25 is about a month away! This year's event is fully online.

acmsocc.org/2025/papers.html

2025-05-29

Kept writing bad code today, an expert had to take over and guide my hand.

A small, colorful bird, a Bourke's parakeet, with a mix of pastel pink, blue, green, and gray feathers is perched on a person's arm while they type on a keyboard. The setting appears to be a workspace with a computer mouse, mouse pad, and other desk items visible in the background.
2025-05-24

"Database venues are just about LLMs now!" -- LLMs are certainly on the rise, but this claim isn't supported by the data.

Papers about LLMs are certainly growing very quickly, but last year there were:

* Almost 3x more papers about indexing,
* Almost 4x more papers about query optimization,
* An equal number of papers about transaction processing

"Only" 10% of database papers discussed LLMs in 2024.This percentage will almost certainly grow in future years. I'll register a guess that this number won't cross 30% by 2030.

Data collected from semantic scholar and labeled with `gpt-4.1-mini`. Labeled data is here: rmarcus.info/llm_paper_topics.

A table and a line chart depict trends in database research papers from conferences such as VLDB, SIGMOD, CIDR, and PODS.

The table on the left shows the number of papers from 2017 to 2024, categorized by topic: all papers, query optimization, transaction processing, indexing, and large language models (LLMs). It shows that overall paper counts vary each year, peaking at 778 in 2023. Query optimization papers consistently appear in large numbers, while LLM-related papers increase dramatically, from 0 in 2017 to 57 in 2024, indicating growing interest in LLMs in database research.

The line chart on the right shows the historical trend from 1970 to 2024 for each category. Total papers increase exponentially over time. Query optimization, transaction processing, and indexing papers also rise gradually. LLM papers remain flat until around 2018, then begin a sharp upward trend, reflecting their recent emergence in the field.
2025-05-08

When modern analytic databases process `GROUP BY` queries, they tend to use a partitioning strategy for parallelism. The conventional wisdom is that partitioning has better scalability due to lower contention.

But is this wisdom still true in 2025? Penn undergrad Daniel Xue discovered that a surprisingly simple, but purpose-built, concurrent hash table can provide performance on par with more complex partitioning-based techniques.

Check out a preprint of Daniel's paper: arxiv.org/abs/2505.04153

A diagram showing a two-stage parallel processing system involving key-value pairs. On the left, a column labeled "K V" holds key-value pairs divided into morsels (small batches) for processing. Stage 1 is labeled "ticketing", where keys are matched against a shared hash table (middle column labeled "K T") to obtain a ticket (index). These tickets help arrange the data into a new table (right-middle column labeled "T K V"). Stage 2, labeled "update", uses the ticket to update values into a shared result vector (rightmost column labeled "V") at the index specified by the ticket. Blue and red colors distinguish different morsels processed concurrently.
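The two stages in the diagram can be sketched in a few lines. This is a single-threaded toy (names are illustrative, not the paper's API): in the real system, stage 1's table is the shared concurrent hash table, and stage 2's dense vector lets workers update with per-slot atomics instead of contending on the table itself.

```python
def ticketed_group_by_sum(pairs):
    """Single-threaded sketch of the two-stage 'ticketing' strategy."""
    # Stage 1 (ticketing): map each distinct key to a dense ticket index.
    # In the real system this insert would be a CAS into a shared table.
    tickets = {}
    ticketed = []
    for k, v in pairs:
        t = tickets.get(k)
        if t is None:
            t = tickets[k] = len(tickets)
        ticketed.append((t, v))
    # Stage 2 (update): aggregate into a dense result vector indexed
    # by ticket, so updates never touch the hash table again.
    sums = [0] * len(tickets)
    for t, v in ticketed:
        sums[t] += v
    keys = [None] * len(tickets)
    for k, t in tickets.items():
        keys[t] = k
    return dict(zip(keys, sums))

ticketed_group_by_sum([("a", 1), ("b", 2), ("a", 3)])  # → {"a": 4, "b": 2}
```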
2025-05-02
A meme featuring the "two buttons" format. In the top panel, a hand hovers hesitantly between two red buttons. One button is labeled "diatribe: peer review is broken!" and the other "our lab is proud to present...". In the bottom panel, a distressed man in a superhero outfit wipes sweat from his forehead, clearly anxious. The caption reads: "academics when 1 paper gets accepted and 1 paper get rejected". The meme highlights the conflicting emotions researchers face when dealing with peer review outcomes.
2025-04-17

The NSF GRFP, a training grant awarded to promising American students at public and private colleges looking to earn a PhD, was cut in half this year.

When faculty admit a PhD student, they are committing to raising ~$500k to fund that student. The GRFP provides students with three years of funding, making GRFP recipients extremely attractive admits.

Stacked bar chart titled "NSF GRFP Recipients" showing the number of recipients from private and public institutions for each year from 2015 to 2025. Each year is represented by a stacked bar with blue indicating private institutions and orange indicating public institutions. The total number of recipients remains relatively stable around 2000–2100 per year, with a noticeable peak in 2023 reaching above 2500 and a sharp drop in 2025 to about 1000. The distribution between private and public institutions varies slightly each year, with public institutions generally having a slightly higher share.
2025-03-24

The deadline for aiDM 2025 -- the SIGMOD workshop on Exploiting Artificial Intelligence Techniques for Data Management -- has been extended to this Friday, March 28th. If you were on the fence about a submission, now is your chance to make it!

aidm-conf.org/#dates

2025-03-13

@zacchiro Yep. Unfortunately, you forgot to cite the more recent paper from Morons et al., and since they're close friends of mine, I had to give a strong reject.

I suggest reviewing the literature and making sure you cite all the morons associated with the program committee at least once!

2025-03-13

@dlaehnemann @ndw nothing super exciting, unfortunately -- just someone who used the cite key from the OP. I kindly suggested they rename it, so I imagine it won't make it into the camera ready.

2025-03-13

@ndw Yes! :D

I do think it would be fun if we named our citation keys with a sort of XKCD-hovertext flair. Probably someone would ruin it, though.

2025-03-12

@wollman Yes! :D

(actually, I don't think it is really a bug at all... perhaps a "surprise" but certainly not counter-productive behavior)

2025-03-12

Just a reminder that the names of your bibtex citations get included in the PDF (both as the link anchor name and in the metadata), so if you name a paper `morons_who_copied_us`, reviewers and readers will be able to see that...
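Concretely, the key is the first field of the entry. With a (hypothetical) entry like the one below, the string `morons_who_copied_us` is embedded in the generated PDF as the hyperlink anchor for every citation of it:

```bibtex
@inproceedings{morons_who_copied_us,
  author    = {Doe, Jane},
  title     = {A Hypothetical Paper},
  booktitle = {Some Conference},
  year      = {2025}
}
```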

2025-03-06

A very nice paper from UMD about catalog storage on data lakes. While I'm not totally sold on their solution (I have some doubts about the hierarchical data model), the discussion of various tradeoffs and design principles is top notch.

I think there's clear space for major innovation in the "lake house catalog" world, far beyond Hive and Iceberg.

arxiv.org/pdf/2503.02956

2025-02-15

Pair(akeet) programming.

2025-02-03

I made a simple tool to look for related database papers (VLDB, SIGMOD, CIDR, PODS) given a new paper's title and abstract using vector embeddings. It highlights authors who are currently in the reviewer pool.

I'm sure my horrible Python hack will break at some point, but give it a shot if you want:

rmarcus.info/blog/2025/02/02/r

A screenshot of the linked webpage.
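The core of such a tool fits in a few lines: turn each paper's title + abstract into a vector and rank the corpus by cosine similarity to the new paper. The real tool uses learned embeddings; this sketch substitutes plain bag-of-words counts, and the papers below are made up:

```python
import math
from collections import Counter

def bow_vector(text):
    """Bag-of-words counts as a stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_related(new_paper, corpus):
    """Rank (title, abstract) pairs by similarity to the new paper,
    given as a single 'title abstract' string."""
    q = bow_vector(new_paper)
    scored = [(cosine(q, bow_vector(t + " " + a)), t) for t, a in corpus]
    return sorted(scored, reverse=True)

# Hypothetical toy corpus, not real papers.
corpus = [("Learned query optimization", "hints for query plans"),
          ("Transaction logging", "write-ahead log recovery")]
rank_related("offline query optimization with hints", corpus)
```

Swapping `bow_vector` for a real embedding model (and filtering hits by current PC membership) gets you most of the way to the linked tool.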
2025-01-12

My hot take of the day is that we're pretty good at evaluating PhD applicants (at least better than random), but the number of qualified applicants greatly exceeds the number of available slots. So there exist pairs of students (X, Y) where X is admitted and Y isn't admitted, but the difference between their applications is trivial. Externally, this is (reasonably) seen as a trivial distinction in inputs leading to a non-trivial distinction in outputs.

2024-12-23

@krismicinski @chrisamaphone @regehr @dysfun There's a population of undergrads that are energy minimizers. They'll plot a course through a major that is as "easy" (from their point of view) as possible.

I suspect a lot of complaints from academics about what LLMs are doing stem from the fact that mindless LLM use is highly correlated with such energy minimization.

I used to strongly believe we needed to "patch" these routes through degree programs. Now I'm not sure what the actual consequences of such a "patch" would be.
