#CSVDiff

Jan :rust: :ferris:janriemer@floss.social
2025-05-18

#Fuzzing along in #CSVDiff :awesome:

In the second screenshot I've highlighted some interesting parts:

The key field indices are 2 and 3, so the two records whose key fields are highlighted will be diffed as `Modify`, because:
- key fields are equal between left and right record
- other fields are unequal between left and right record

The other two records on the right have no corresponding left record - so those are `Add`ed records
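
A minimal sketch of the classification described above, not csv-diff's actual implementation. It assumes records are plain `Vec<String>` values (the fuzzer in the screenshot generates raw bytes), and the names `Change`, `key_of`, `classify_right` and `record` are made up for illustration. Only right-hand records are classified, so deleted records are out of scope here.

```rust
use std::collections::HashMap;

/// Outcome for a single record on the right-hand side (hypothetical type).
#[derive(Debug, PartialEq)]
enum Change {
    Unchanged,
    Modify,
    Add,
}

/// Builds an owned record from string slices (helper for the examples).
fn record(fields: &[&str]) -> Vec<String> {
    fields.iter().map(|f| f.to_string()).collect()
}

/// Collects the values at the key field indices into a lookup key.
fn key_of(rec: &[String], key_fields: &[usize]) -> Vec<String> {
    key_fields.iter().map(|&i| rec[i].clone()).collect()
}

/// Classifies every right record against the left records:
/// - no left record with the same key fields     => Add
/// - same key fields, all other fields equal     => Unchanged
/// - same key fields, some other field different => Modify
fn classify_right(
    left: &[Vec<String>],
    right: &[Vec<String>],
    key_fields: &[usize],
) -> Vec<Change> {
    // Index the left records by their key fields.
    // Assumption: keys are unique on the left; with duplicates the last one wins.
    let by_key: HashMap<Vec<String>, &Vec<String>> = left
        .iter()
        .map(|rec| (key_of(rec, key_fields), rec))
        .collect();

    right
        .iter()
        .map(|rec| match by_key.get(&key_of(rec, key_fields)) {
            None => Change::Add,
            Some(l) if *l == rec => Change::Unchanged,
            Some(_) => Change::Modify,
        })
        .collect()
}
```

With one left record and three right records, where only the first right record shares the values at indices 2 and 3 with the left one, `classify_right(&left, &right, &[2, 3])` yields one `Modify` followed by two `Add`s.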

#Rust #FuzzTesting #RustLang #PropertyTesting

A screenshot of my terminal that shows output of CSV records that have been generated by a fuzzing library, called bolero (in Rust).

There is one `left` CSV record and three `right` CSV records. The fields of each record are composed of randomly generated bytes with different lengths. However, when looking at the second screenshot, one record on the left and one on the right stand out in that they share the same values in the two fields at field indices 2 and 3. One can see that those are the key fields of the CSVs, because at the very bottom of the terminal output, it says "key_fields: {3, 2}".

Jan :rust: :ferris:janriemer@floss.social
2025-05-02

Huh, seems like I really have been living on the bleeding edge (of #FormalVerification):

github.com/creusot-rs/creusot/

The verification in the prev toot is currently not possible in #Creusot due to missing specs for the `Hash` trait and HashMap more broadly. 😔

Oh well, seems like (at least currently!) I won't be able to fully verify the diffing algorithm of #CSVDiff.🥺

Options I have now are:
- Only verify parts of the algorithm (that don't depend on HashMap ops)
or
- Use fuzzing/property testing

Jan :rust: :ferris:janriemer@floss.social
2025-01-19

Just published a new version of csv-diff (v0.1.1) 🚀

lib.rs/crates/csv-diff

This fixes a nasty bug regarding sort order of modified csv records. 😖

Details in the MR/PR:
gitlab.com/janriemer/csv-diff/

Also, two new incoming PRs for #qsv, the #CSV toolkit:

The first updates to the latest csv-diff, fixing the aforementioned bug:
github.com/dathere/qsv/pull/24

The second fixes a bug regarding conversion from column names to indices:
github.com/dathere/qsv/pull/24

#Rust #RustLang #OpenSource #CSVDiff

Jan :rust: :ferris:janriemer@floss.social
2025-01-18

Ouch, there is another bug and this time it is actually _in #CSVDiff itself_!

It happens with sorting the results of modified rows (urgh, I'm also not happy with the sorting code).😨

Thankfully, datatraveller1 has already found a reproducible example - thank you so much! ❤️

Bug:
github.com/dathere/qsv/issues/

I think I already found a solution, but needs rigorous testing first!

Potential solution:
github.com/dathere/qsv/issues/

#qsv #Bug #csv

Jan :rust: :ferris:janriemer@floss.social
2025-01-16

Nice, I think I found the bug! 🐛

See all the explanation and possible solution here:

=> github.com/dathere/qsv/issues/

Workaround is also present and explained, so should be no blocker for people.

Will prob provide a fix on the weekend. 🤞

#CSVDiff #qsv #Bug #Fix #Bugfix

Jan :rust: :ferris:janriemer@floss.social
2025-01-16

Uh ohhhh, someone reported a bug in qsv's `diff` command.😮 🙈

github.com/dathere/qsv/issues/

Hopefully, we can resolve this soon! 🤞🥺

I have a strong suspicion, but let's see... I need more info first from the OP.

#Bug #Issue #CSVDiff #Diff #CLI #qsv

Jan :rust: :ferris:janriemer@floss.social
2024-02-21

@shuttle I consistently use #TDD, where possible.

Yes, sure, #Rust prevents a lot of bugs at compile time already, but not logic bugs.

For example, in #CSVDiff we have ~70 unit tests and ~12 integration tests. The only "bug report" we have ever gotten was due to a corrupted CSV file (which was mistaken for a bug in diff):

See here (qsv):
github.com/jqnatividad/qsv/iss

csv-diff:
gitlab.com/janriemer/csv-diff

In the future I'd like to add property and mutation testing as well 🤓

#RustLang #Testing #UnitTest

Jan :rust: :ferris:janriemer@floss.social
2023-10-31

#CsvDiff has finally reached v0.1.0, its first ever non-alpha/-beta release! 🎉

New features like getting at the headers from the diff result were needed for the following PR in qsv (which is in final review):
github.com/jqnatividad/qsv/pul

When merged, you'll be able to decide whether the diff result should output headers or not (see examples in the PR). :awesome:

Check out csv-diff's Changelog for the full details:
gitlab.com/janriemer/csv-diff/

#CSV #qsv #CLI #DataScience #DataEngineering

Jan :rust: :ferris:janriemer@floss.social
2023-10-19

And this is why #UnitTests and #TDD are awesome/necessary (even in #Rust/ #RustLang):

The original requirement:
figure out how many columns the _result_ of diffing two CSVs in #CsvDiff has.

Do you see the error-pattern?

It's
- when we have no diff
&&
- at least one CSV has headers

which makes sense, because I've implemented the feature in the diffing logic, but at that point header information is already lost (in some other thread).

Isn't that beautiful!?🥰

#SoftwareEngineering #UnitTest

A screenshot of my terminal showing the test output of running the command `cargo test` on the project `csv-diff`.

70 tests passed and 3 tests failed.
The 3 tests that failed are listed and have pretty descriptive names, which makes it fairly easy to spot the error pattern. The names of the 3 failing tests are:

1. diff_both_empty_but_one_has_header_and_the_other_has_none_both_with_correct_header_flag_no_diff

2. diff_both_empty_but_one_has_header_and_the_other_has_none_both_with_header_flag_true_no_diff

3. diff_empty_with_header_no_diff

Jan :rust: :ferris:janriemer@floss.social
2023-09-10

I can't reproduce the bug. ¯\_(ツ)_/¯

Neither in #CsvDiff ...
gitlab.com/janriemer/csv-diff/

...nor in #qsv
github.com/jqnatividad/qsv/pul

My assumption is that they forgot to specify the option --right-delimiter (or --left-delimiter, respectively) when executing `qsv diff`:
github.com/jqnatividad/qsv/iss
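
To illustrate why a forgotten delimiter option can look like a diffing bug, here is a deliberately naive sketch. It is not how qsv or csv-diff parse CSV: the hypothetical `parse` helper just splits lines on a delimiter and ignores quoting. The point is only that reading a `;`-separated file with `,` collapses every record into a single field, so nothing matches the other side.

```rust
/// Splits each line on `delimiter` (naive: no quoting or escaping support).
fn parse(input: &str, delimiter: char) -> Vec<Vec<String>> {
    input
        .lines()
        .map(|line| line.split(delimiter).map(str::to_string).collect())
        .collect()
}

/// Returns true if both inputs contain the same records
/// once each side is read with its own delimiter.
fn same_records(left: &str, left_delim: char, right: &str, right_delim: char) -> bool {
    parse(left, left_delim) == parse(right, right_delim)
}
```

For the same data written once with `,` and once with `;`, `same_records(left, ',', right, ';')` holds, while `same_records(left, ',', right, ',')` does not.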

Anyway, we now have additional tests in csv-diff and qsv, so definitely a win, regardless of the outcome! 🎉

@floriann FYI

#Rust #RustLang #OpenSource #NotReproducible

Jan :rust: :ferris:janriemer@floss.social
2023-09-09

Oh noes, apparently I haven't considered different delimiters for the left and right #CSV in #CsvDiff.😱

Someone reported a bug in `qsv diff` (which uses csv-diff) with this scenario.

github.com/jqnatividad/qsv/iss

I'll have a look at it tomorrow.

Glad csv-diff is actively used! ❤️

#Rust #RustLang #OpenSource #Bug #Bugs

Jan :rust: :ferris:janriemer@floss.social
2023-03-25

If you want to know how to provide a large resource (such as an owned String) to a criterion benchmark, you can use the `iter_batched` method:

docs.rs/criterion/latest/crite

See an example of this in #CsvDiff

gitlab.com/janriemer/csv-diff/
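
A minimal sketch of `iter_batched` with an owned `String`, not the benchmark from csv-diff. The function `count_records`, the benchmark name and the input data are made up for illustration. It assumes `criterion` is a dev-dependency and that the bench target is declared with `harness = false` in `Cargo.toml`.

```rust
use criterion::{criterion_group, criterion_main, BatchSize, Criterion};

/// Hypothetical routine that takes ownership of its input.
fn count_records(csv: String) -> usize {
    csv.lines().count()
}

fn bench_owned_input(c: &mut Criterion) {
    // The large resource is built once, outside of the measured code.
    let csv = "id,name,value\n".repeat(100_000);

    c.bench_function("count_records (owned String)", |b| {
        b.iter_batched(
            || csv.clone(),               // setup: runs per batch, not measured
            |input| count_records(input), // routine: measured, consumes the input
            BatchSize::LargeInput,        // hint that each input is large
        )
    });
}

criterion_group!(benches, bench_owned_input);
criterion_main!(benches);
```

The setup closure hands a fresh owned value to every call of the routine, so the cost of cloning stays out of the measurement.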

3/3

#Rust #RustLang #Performance #Benchmark #AIIsNotIntelligent

Jan :rust: :ferris:janriemer@floss.social
2023-03-04

Not sure where this will lead to, but it sounds fun and exciting, so let's try! :awesome: :rust: :ferris:

#CSVDiff

A screenshot of my starship terminal that shows the Rust crate `csv-diff` and the current branch I'm on. The branch is called "insert-into-hashmap-in-parallel".

Jan :rust: :ferris:janriemer@floss.social
2023-03-04

Yay! Sorting the #csv diff result by columns has just been merged into #qsv! 🥳

github.com/jqnatividad/qsv/pul

#CSVDiff #DataScience #CLI #Data

Jan :rust: :ferris:janriemer@floss.social
2023-02-19

A new version of csv-diff is out (v0.1.0-beta.2) 🎉

lib.rs/crates/csv-diff

This version adds a method that allows you to sort your diff result by columns (it was already possible to sort by lines).

See the changelog for an example:
gitlab.com/janriemer/csv-diff/

Sorting by columns will soon be integrated into qsv, the #CSV toolkit:
github.com/jqnatividad/qsv/iss

Thank you @jqnatividad for the idea of this feature! 💚

#Rust #RustLang #CSVDiff #DataScience #qsv #CLI

Jan :rust: :ferris:janriemer@floss.social
2023-01-08

@hyde Also check out `qsv`. 🙂

It's an actively maintained fork of xsv (xsv is not maintained anymore).

qsv is _very active_ in development.

And shameless plug in the end 😁
Just a few days ago, `csv-diff` got merged:
github.com/jqnatividad/qsv/pul

csv-diff is a crate for comparing CSVs with ludicrous speed:
gitlab.com/janriemer/csv-diff

So the new command `qsv diff` is now the fastest #CSV differ in the world! 🚀

#BlazinglyFast #CSV #CSVDiff #Data #CLI #Rust #RustLang

Jan :rust: :ferris:janriemer@floss.social
2023-01-05

Announcement 🎉 🥳

csv-diff will be integrated into qsv, the CSV toolkit soon! 🎉 :ferris:

PR:
github.com/jqnatividad/qsv/pul

Comparing the majestic million dataset with 1,000,000 rows x 12 columns takes less than 800 ms and only about 150 MB of RAM!
With this, it is the fastest #CSV differ in the world!🚀

See the following svg recording for a demo:

gist.githubusercontent.com/jan

csv-diff:
gitlab.com/janriemer/csv-diff

#Rust #RustLang #Data #Diff #CsvDiff #Difference #Performance #DataScience #Oxidization

Jan :rust: :ferris:janriemer@floss.social
2022-12-04

🥳 A new version of csv-diff has just been released! 🚀

docs.rs/csv-diff/latest/csv_di

csv-diff is the fastest CSV-diffing library in the world - written in #Rust

It can compare two 1,000,000 rows x 9 columns CSVs in < 600ms!

Note that this is still a beta release and the library itself is still very young.

#RustLang #Release #CSV #CSVDiff #Performance #DataScience #Data #Diff #Difference #OpenSource #Crate
