Skip to content

feat: Allow nullable primary keys #14

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

gab23r
Copy link

@gab23r gab23r commented Apr 24, 2025

fixes #13

Motivation

Allow the user to define a nullable primary key column.

Changes

nullable will default to not primary_key is not provided by the user, otherwise we keep user choice.

@gab23r gab23r changed the title Feat: Allow nullable primary keys feat: Allow nullable primary keys Apr 24, 2025
@github-actions github-actions bot added the enhancement New feature or request label Apr 24, 2025
Copy link

codecov bot commented Apr 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (54cbc75) to head (ad139a0).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main       #14   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        37    +1     
  Lines         1788      1807   +19     
=========================================
+ Hits          1788      1807   +19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@borchero borchero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the immediate contribution following your issue @gab23r, great to see that you're keen on making dataframely work for your use case 😁

I thought about the fact that we're not allowing NULLs in primary keys again and I wanted to ask: are you actually interested in allowing NULL value for primary keys or do you use primary_key=True as a "workaround" for unique=True?

In my mind, this is slightly different. Polars has no concept of a "primary key" so it makes sense that it doesn't treat NULL values specially e.g. when using n_unique() or group_by(). However, a NULL value generally expresses "I don't know the value". Hence, it does feel odd to allow NULL values in a primary key as an "unknown value" is typically not a good identifier for a row (after all, NULL == NULL also evaluates to NULL) 🤔 happy to hear your thoughts @AndreasAlbertQC @delsner

@AndreasAlbertQC
Copy link
Collaborator

AndreasAlbertQC commented Apr 24, 2025

However, a NULL value generally expresses "I don't know the value"

I think that is one half of it. The other half is that NULL means "no match found" when joining. If I think of a lookup table, a row with primary key being NULL would be the equivalent of just not having a row in the lookup table and then having to fill_null after joining onto it. I do see the logic of preferring to have this null-keyed row in the lookup table directly because it means that the logic for determining the default comes with the table. (I think we might be a bit in the weeds here :))

not a good identifier for a row (after all, NULL == NULL also evaluates to NULL)

I think the big implication is that performing a join on this primary key will not return rows, which makes my paragraph above obsolete. In practice, particularly the case of having a Collection of multiple frames with nullable primary keys could get weird because we do heavily rely on primary key joins there

@gab23r
Copy link
Author

gab23r commented Apr 24, 2025

I agree that having nullable primary key feels odd, but in my case I need to validate data from a supplier and I do not really control the content of the table. Unique=True wouldn't make the job. As my table has multiple primary keys ( the column can have multiple nulls )

@AndreasAlbertQC
Copy link
Collaborator

AndreasAlbertQC commented Apr 24, 2025

Unique=True wouldn't make the job. As my table has multiple primary keys ( the column can have multiple nulls )

I think one should be able to implement this pretty straightforwardly by using a custom rule that enforces the composite uniqueness across multiple keys. I feel like this solution might be preferable because it avoids some of the murkiness discussed above:

class MySchema(dy.Schema):
    x = dy.Integer(nullable=True) # part of composite unique key
    y = dy.Integer(nullable=True)  # part of composite unique key
    z = dy.Integer(nullable=False)

    @dy.rule(group_by=["x","y"])
    def composite_unique():
        return pl.len() == 1

While testing this, though, I just noticed that in the special case of null x/y values, this does not appear to work as expected, which I think is a bug:

# Should not pass validation
df = pl.DataFrame(
        {
            "x":[None,None],
            "y":[None,None],
            "z":[5,6],
        }
    )

# Passes validation!
MySchema.validate(df,cast=True)

Tracking this separately as #15

@gab23r
Copy link
Author

gab23r commented Apr 25, 2025

That would work indeed. I have no strong feelings about it even if I don't understand why we couldn't allow the user to have nullable primary keys if he explicitly asks for it. Are there technical things blocking for that ?

@borchero
Copy link
Member

@gab23r we have released #16 in v1.2.1 which allows for the custom workaround proposed by @AndreasAlbertQC. I think the simple fact that joins do not, by default, consider null values to be equal is a strong argument that we should not allow for nullable primary keys. Plenty of the functionality around collections would break if we lifted this requirement. I hope the custom workaround is sufficient for your needs.

That being said, I think we should still fail hard if the user provides both primary_key=True and nullable=True instead of silently ignoring it. Would you be interested in making another PR for that? :)

@gab23r
Copy link
Author

gab23r commented Apr 26, 2025

I understand. Raising an error would be better indeed. I can do a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nullable=True is ignored if primary_key is True
3 participants