Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›229 »

Data Graham: Dec 28, 2009; 📈📊🍪😋

My health care software platform is written in POLIO

# ? Mar 23, 2021 15:57

Adbot: ADBOT LOVES YOU

# ? Apr 25, 2024 13:30

necrotic: Aug 2, 2005; I owe my brother big time for this!

xtal posted:

I honestly don't know that much

Yeah, couldn't tell at all.

# ? Mar 23, 2021 16:36

NinpoEspiritoSanto: Oct 22, 2013

Should be TypeScript obviously

# ? Mar 23, 2021 16:36

Jose Cuervo: Aug 25, 2004

To be clear, this is a research data set which lives on a HIPAA compliant server. It is currently a bunch of excel files that were pulled by the IT team who run the health system databases. I would like to turn the excel files into a database so that things are much more organized and I can learn about databases. Hence why I was asking for help.

Da Mott Man posted:

Its been a while working with sqlalchemy but if I remember correctly your id columns should have autoincrement=True and for speed of lookups you should have an index=True for any column you would use to select records.
You might also want to cascade delete child data if the patent is deleted for some reason.
I would also set the dateofbirth column to a DateTime instead of a string, it would be easier to query for patents of specific age or range of ages.

I have changed the column to DateTime, that was a typo.
Thanks for the keyword 'cascade' - that allowed me to look up what I wanted. I think the new definition of Patient with cascade='all, delete' will accomplish the desired behavior, right?

Python code:

class Patient(Base):
    __tablename__ = 'patients'

    MRN = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship("Diabetic", back_populates='patient', cascade='all, delete')
    lab_values = relationship('LabValues', back_populates='patient', cascade='all, delete')
    hospitalizations = relationship('Hospitalization', back_populates='patient', cascade='all, delete')

    def __repr__(self):
        return f'<Patient({self.last_name}, {self.first_name}' \
               f' ({self.MRN}, {self.date_of_birth})>'

Hollow Talk posted:

I'm not a fan of pro-forma integer id columns. I see these columns often, and they are useless more often than not (auto-incrementing columns are only really useful if you want to check for holes in your sequence, i.e. for deleted data, and that can be solved differently as well).

Patient already has a primary key, so the id column is superfluous. For the other tables, keys depend on how often you expect to load new data. If this only gets new data once a day, just make the combination of MRN + date the table's primary key (i.e. a compound key) -- SQLAlchemy lets you set multiple columns as primary keys, and it just makes a combined key out of it. If you update more frequently than daily, I would change date to a timestamp, and use that. Realistically, depending on your analysis needs, you will probably end up joining and either selecting the full history for a single patient, or the current status for all patients (or a subset thereof), which this would cover either way. For LabValues, you might need MRN + date + name (assuming name is whatever was tested in the lab).

Essentially, I am of the opinion that primary keys should have meaning on their own, which sequences (i.e. what SQLAlchemy will most likely create under the hood if you have an auto-incrementing integer column and if the underlying database supports sequences) most often do not.

Edit: This is on top what has already been said. Also, foreign keys look good to me.

As I elaborated above I am trying place the data stored in Excel files into a database. I expect at some point in the future to have more data to add to the database. The data timestamps are only specific to the date (i.e., I only know that a lab value arrived on 2020/03/21, and not at a specific time on that date).

Would the following change to the Diabetic class be along the lines of what you are suggesting?

Python code:

class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    MRN = Column(Integer, ForeignKey('patients.MRN'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)    
    
    patient = relationship('Patient', back_populates='diabetic')

Can you explain what advantage there is to having the compound key?

M. Night Skymall posted:

You should probably not use the MRN as the identifier throughout your database. It's generally best practice with PHI to isolate out the MRN into a de-identification table and use an identifier unique to your application to identify the patient. It's easy enough to do a join or whatever if someone wants to look people up by MRN, but it's useful to be able to display some kind of unique identifier that isn't immediately under all the HIPAA restrictions as PHI. Even if it's as simple as someone trying to do a bug report and not having to deal with the fact that their bug report must contain PHI to tell you which patient caused it. Not vomiting out PHI 100% of the time in error messages, things like that. Just make some other ID in the patient table and use that as the foreign key elsewhere.

Would you have this same concern given the use case (a research database where the only people which access to it are on the IRB protocol - currently just me for now)?

# ? Mar 23, 2021 17:46

12 rats tied together: Sep 7, 2006

Jose Cuervo posted:

To be clear, this is a research data set which lives on a HIPAA compliant server. It is currently a bunch of excel files that were pulled by the IT team who run the health system databases. I would like to turn the excel files into a database so that things are much more organized and I can learn about databases. Hence why I was asking for help.

There's a pervasive attitude on this website and elsewhere throughout the internet that "important software" should not be written in anything dynamically typed. Usually no consideration is given to the awful paradigms/language sets that previous versions of "important software" have been written in.

# ? Mar 23, 2021 17:52

The March Hare: Oct 15, 2006; _{Je r�ve d'un}
Wayne's World 3; Buglord

Personally I'd go the extra mile and import it all into whatever schemaless JSON store is popular on hackernews right now, just to really rub it in.

(Also work on healthcare data w/ Python, its completely fine.)

# ? Mar 23, 2021 18:16

M. Night Skymall: Mar 22, 2012

Jose Cuervo posted:

Would you have this same concern given the use case (a research database where the only people which access to it are on the IRB protocol - currently just me for now)?

It doesn't matter what the data is used for. The history of MRN as PHI is pretty dumb in my opinion, but per guidance from the government your MRN is as much PHI as your DOB or name. Having a de-identification table isn't a big deal, you can still store all your PHI in the patients table along with your new unique identifier. It's really *just* to remove the awkwardness of having everything in your DB keyed to a piece of PHI. I mean you're right, it's just you and it probably won't affect much now. But making good decisions about your schema is much..much easier now than it is later, and there's basically no way you will live to regret de-identifying your data in advance, and many ways you can live to regret spreading the MRN all over your database.

Anyone who's looking at this going "No no not python!" is just forcing this data to be sucked into a PowerBI tool instead. Don't make this poor person use PowerBI, that's just mean.

Edit: MRN is problematic because it's specifically listed as an identifier. If you have an internal ID in use for patients besides MRN that'd also be better than using the MRN directly. One more dumb thing about MRNs, they're specific to the hospital EMR so they aren't guaranteed to be unique.

M. Night Skymall fucked around with this message at 18:46 on Mar 23, 2021

# ? Mar 23, 2021 18:17

bobmarleysghost: Mar 7, 2006

Is there a good library/framework for creating command line interactive programs?

# ? Mar 23, 2021 19:09

NinpoEspiritoSanto: Oct 22, 2013

bobmarleysghost posted:

Is there a good library/framework for creating command line interactive programs?

click.

Unless you mean TUI type apps in which case urwid is pretty solid.

# ? Mar 23, 2021 19:15

Hollow Talk: Feb 2, 2014

Jose Cuervo posted:

Would the following change to the Diabetic class be along the lines of what you are suggesting?
Python code:
class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    MRN = Column(Integer, ForeignKey('patients.MRN'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)    
    
    patient = relationship('Patient', back_populates='diabetic') 
Can you explain what advantage there is to having the compound key?

Yes, that would be it.

Essentially, the reason why I dislike auto-increment integer columns is that they serve no particular purpose, and they provide no guarantees -- if you want to enforce constraints (i.e. to avoid duplicate data if you run your tool against the Excel files more than once), auto-incrementing columns don't help, because the only constraint for the database is that the primary key is unique (which a sequence ensures), plus any foreign key relations if you specify them. If you insert the same data again, the original option will happily do so, given this data:

Python code:

with engine.begin() as connection:
    _patient = {
        "MRN": 1,
        "first_name": "Hollow",
        "last_name": "Talk",
        "date_of_birth": "1970-01-01",
    }
    _diabetic = [
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
    ]

    connection.execute(insert(Patient).values(_patient))
    connection.execute(insert(Diabetic).values(_diabetic))

If you define an auto-increment column, this will insert even though the diabetic data row is duplicated. If you instead define your primary keys according to your actual uniqueness logic, python will throw an exception for the above duplicate:

Bash code:

sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: diabetic.MRN, diabetic.date
[SQL: INSERT INTO diabetic ("MRN", date, diabetic) VALUES (?, ?, ?), (?, ?, ?)]
[parameters: (1, '2021-03-23', 0, 1, '2021-03-23', 0)]
(Background on this error at: [url]http://sqlalche.me/e/14/gkpj[/url])

In most cases*, I prefer my inserts to fail as early as possible when constraints are violated, so that I can fix my insert/select logic, rather than clean up duplicate and/or otherwise broken data afterwards. This lets you express what can and cannot be inserted more than once, and you don't add any technical columns that don't really serve any analytical purpose (nor any other purpose, really). For this use case, using the auto-increment column as primary key and not using any primary key are functionally identical.

Unrelated: If you can avoid it, try to stick to lowercase column names, so use mrn instead of MRN. Many databases don't care, but for those cases where they do, this saves you from lots of quoting (or general incompatibility).

*: This does not apply for data warehousing, where constraints might not be feasible. Or if you use something like Amazon Redshift, because "we'll just ignore all constraints because that will make things faster" is certainly one way to speed up inserts.

# ? Mar 23, 2021 19:25

bobmarleysghost: Mar 7, 2006

Bundy posted:

click.

Unless you mean TUI type apps in which case urwid is pretty solid.

click definitely looks like what I want, thanks!

# ? Mar 23, 2021 19:44

Jose Cuervo: Aug 25, 2004

M. Night Skymall posted:

It doesn't matter what the data is used for. The history of MRN as PHI is pretty dumb in my opinion, but per guidance from the government your MRN is as much PHI as your DOB or name. Having a de-identification table isn't a big deal, you can still store all your PHI in the patients table along with your new unique identifier. It's really *just* to remove the awkwardness of having everything in your DB keyed to a piece of PHI. I mean you're right, it's just you and it probably won't affect much now. But making good decisions about your schema is much..much easier now than it is later, and there's basically no way you will live to regret de-identifying your data in advance, and many ways you can live to regret spreading the MRN all over your database.

Anyone who's looking at this going "No no not python!" is just forcing this data to be sucked into a PowerBI tool instead. Don't make this poor person use PowerBI, that's just mean.

Edit: MRN is problematic because it's specifically listed as an identifier. If you have an internal ID in use for patients besides MRN that'd also be better than using the MRN directly. One more dumb thing about MRNs, they're specific to the hospital EMR so they aren't guaranteed to be unique.

I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model?

Python code:

class Patient(Base):
    __tablename__ = 'patients'

    pid = Column(Integer, primary_key=True)
    mrn = Column(Integer)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship('Diabetic', 
                            back_populates='patient', 
                            cascade='delete, all')


class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    pid = Column(Integer, ForeignKey('patients.pid'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)  
    
    patient = relationship('Patient', back_populates='diabetic')

However, if I do this, then I need to keep a mapping between the MRN and the pid outside of the database, right? So now I have the MRN in two different places...

# ? Mar 23, 2021 20:03

Jose Cuervo: Aug 25, 2004

Hollow Talk posted:

Yes, that would be it.

Essentially, the reason why I dislike auto-increment integer columns is that they serve no particular purpose, and they provide no guarantees -- if you want to enforce constraints (i.e. to avoid duplicate data if you run your tool against the Excel files more than once), auto-incrementing columns don't help, because the only constraint for the database is that the primary key is unique (which a sequence ensures), plus any foreign key relations if you specify them. If you insert the same data again, the original option will happily do so, given this data:
Python code:
with engine.begin() as connection:
    _patient = {
        "MRN": 1,
        "first_name": "Hollow",
        "last_name": "Talk",
        "date_of_birth": "1970-01-01",
    }
    _diabetic = [
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
        {"MRN": 1, "date": datetime.date(2021, 3, 23), "diabetic": False},
    ]

    connection.execute(insert(Patient).values(_patient))
    connection.execute(insert(Diabetic).values(_diabetic))
If you define an auto-increment column, this will insert even though the diabetic data row is duplicated. If you instead define your primary keys according to your actual uniqueness logic, python will throw an exception for the above duplicate:
Bash code:
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: diabetic.MRN, diabetic.date
[SQL: INSERT INTO diabetic ("MRN", date, diabetic) VALUES (?, ?, ?), (?, ?, ?)]
[parameters: (1, '2021-03-23', 0, 1, '2021-03-23', 0)]
(Background on this error at: [url]http://sqlalche.me/e/14/gkpj[/url])
In most cases*, I prefer my inserts to fail as early as possible when constraints are violated, so that I can fix my insert/select logic, rather than clean up duplicate and/or otherwise broken data afterwards. This lets you express what can and cannot be inserted more than once, and you don't add any technical columns that don't really serve any analytical purpose (nor any other purpose, really). For this use case, using the auto-increment column as primary key and not using any primary key are functionally identical.

Unrelated: If you can avoid it, try to stick to lowercase column names, so use mrn instead of MRN. Many databases don't care, but for those cases where they do, this saves you from lots of quoting (or general incompatibility).

*: This does not apply for data warehousing, where constraints might not be feasible. Or if you use something like Amazon Redshift, because "we'll just ignore all constraints because that will make things faster" is certainly one way to speed up inserts.

Got it. The case of trying to insert data which already exists in the database is something I was thinking about, and the way you suggested defining the compound key neatly solves that issue.

As a follow-on question, would you wrap the insert statement in a try-except block, and catch the "sqlalchemy.exc.IntegrityError" exception to deal with not crashing when trying to insert data which is already in the database?

# ? Mar 23, 2021 20:10

vikingstrike: Sep 23, 2007; whats happening, captain

nvm

# ? Mar 23, 2021 20:11

Bruegels Fuckbooks: Sep 14, 2004; Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

M. Night Skymall posted:

You should probably not use the MRN as the identifier throughout your database. It's generally best practice with PHI to isolate out the MRN into a de-identification table and use an identifier unique to your application to identify the patient. It's easy enough to do a join or whatever if someone wants to look people up by MRN, but it's useful to be able to display some kind of unique identifier that isn't immediately under all the HIPAA restrictions as PHI. Even if it's as simple as someone trying to do a bug report and not having to deal with the fact that their bug report must contain PHI to tell you which patient caused it. Not vomiting out PHI 100% of the time in error messages, things like that. Just make some other ID in the patient table and use that as the foreign key elsewhere.

There's actually a more immediately practical reason - MRNs are assigned per hospital. Therefore, if a patient gets an exam at hospital 1, and goes to hospital 2, the MRN will not match in images taken at hospital 1 vs hospital 2.

# ? Mar 23, 2021 20:32

Hollow Talk: Feb 2, 2014

Jose Cuervo posted:

Got it. The case of trying to insert data which already exists in the database is something I was thinking about, and the way you suggested defining the compound key neatly solves that issue.

As a follow-on question, would you wrap the insert statement in a try-except block, and catch the "sqlalchemy.exc.IntegrityError" exception to deal with not crashing when trying to insert data which is already in the database?

That's an option, yes -- logging the respective rows or writing them into a debug table are both options depending on what is easiest to deal with. Using the engine as a context manager as above takes care of committing or rolling back the current transaction (and it starts a transaction, which can be nested etc.) so that you don't end up with partially written data, but it still raises the exception, so I'd also wrap that if you care most about not crashing.

# ? Mar 23, 2021 21:24

Foxfire_: Nov 8, 2010

Jose Cuervo posted:

I am on board with doing this, but I am having trouble wrapping my head around how I would accomplish this. Do I define a new column, say pid as the primary_key for the Patient model as below, and use pid as the foreign_key in the other models as below for the Diabetic model?

However, if I do this, then I need to keep a mapping between the MRN and the pid outside of the database, right? So now I have the MRN in two different places...

You have a meaningless-outside-your-application patient ID that uniquely identifies a patient. All of your deidentified tables reference that for knowing what patient a row is about. Then you have a table that maps from the the patient ID to all the PHI about that patient (MRN, names, DOB, ...). That table could be either inside the same database or elsewhere if you wanted to segregate the dataset into PHI and deidentified parts.

You might also need multiple PHI tables for some stuff, like if you want to store multiple different-system/different-site MRNs for one actual person. For each table in the schema, you should be able to classify it as PHI or not and shouldn't be mixing PHI and not-PHI together in the same table. If you've done stuff right, you ought to be able to hand someone all your non-PHI tables and they'd be able to follow information about a particular subject, but not have any PHI about them

# ? Mar 23, 2021 21:25

Da Mott Man: Aug 3, 2012

Jose Cuervo posted:

I have changed the column to DateTime, that was a typo.
Thanks for the keyword 'cascade' - that allowed me to look up what I wanted. I think the new definition of Patient with cascade='all, delete' will accomplish the desired behavior, right?

That looks correct to me.

# ? Mar 23, 2021 23:54

Zoracle Zed: Jul 10, 2001

Anyone else noted how awful it is googling for anything python-related these days? SEO means the first couple pages of results are all geeksforgeeks.com and other awful beginner tutorial spam. (Which, like, even for beginners, seem bad.)

Anyway, here's my actual question: in matplotlib.pyplot.plot there's this concept of a format string, so 'r--' means a red dashed line, etc. Somewhere in matplotlib, I assume, there has to be a format string parser that takes in 'r--' and spits out {color: 'r', linestyle: '--'} or whatever. Point me in the right direction, please? Whenever I write convenience plotting methods I end up fighting to balance the convenience of the very-compact format string style and actual structured argument handling.

Zoracle Zed fucked around with this message at 06:32 on Mar 24, 2021

# ? Mar 24, 2021 06:29

Framboise: Sep 21, 2014; To make yourself feel better, you make it so you'll never give in to your forevers and live for always.; Lipstick Apathy

As/for someone with very basic programming knowledge who wants to look into programming as a career: Is this Udemy course https://www.udemy.com/course/100-days-of-code/ worth looking into? Seems like a good deal for 63 hours of content, but if it's not good I don't want to put my time into it.*

* Specifically, it's on sale right now for $12.99 which sounds like a good deal, but that price for the bold claims they make sounds kinda sus to me.

Edit: there's also this one https://stacksocial.com/sales/the-2021-premium-python-certification-bootcamp-bundle from StackSocial that claims to be 41 hours of content worth $2585 for only 34.99 and again, sus.

Framboise fucked around with this message at 13:02 on Mar 24, 2021

# ? Mar 24, 2021 12:55

NinpoEspiritoSanto: Oct 22, 2013

Talk Python has some good practical courses. Can't speak about udemy but I did some Treehouse content a couple of years back and that was solid.

# ? Mar 24, 2021 13:38

DoctorTristan: Mar 11, 2006; I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Zoracle Zed posted:

Anyone else noted how awful it is googling for anything python-related these days? SEO means the first couple pages of results are all geeksforgeeks.com and other awful beginner tutorial spam. (Which, like, even for beginners, seem bad.)

Anyway, here's my actual question: in matplotlib.pyplot.plot there's this concept of a format string, so 'r--' means a red dashed line, etc. Somewhere in matplotlib, I assume, there has to be a format string parser that takes in 'r--' and spits out {color: 'r', linestyle: '--'} or whatever. Point me in the right direction, please? Whenever I write convenience plotting methods I end up fighting to balance the convenience of the very-compact format string style and actual structured argument handling.

You want to know the kwargs alternative to the format string, or the location of the parser module? If the latter why not just call it with an invalid string and see where the exception gets thrown from?

# ? Mar 24, 2021 13:45

M. Night Skymall: Mar 22, 2012

Framboise posted:

As/for someone with very basic programming knowledge who wants to look into programming as a career: Is this Udemy course https://www.udemy.com/course/100-days-of-code/ worth looking into? Seems like a good deal for 63 hours of content, but if it's not good I don't want to put my time into it.*

* Specifically, it's on sale right now for $12.99 which sounds like a good deal, but that price for the bold claims they make sounds kinda sus to me.

Edit: there's also this one https://stacksocial.com/sales/the-2021-premium-python-certification-bootcamp-bundle from StackSocial that claims to be 41 hours of content worth $2585 for only 34.99 and again, sus.

Udemy courses are always "on sale" for like 10-15 dollars or something, regardless of what their claimed price is, so judge all the courses based on their standard pricing being 12 bucks. No idea about that particular one, I've done some Udemy courses and they were useful for getting up to speed on stuff. Highly rated ones like the one you linked tend to be good, but also MIT has a pretty good python course that at least used to be free. I think this one? You can take it on Edx, maybe this one. The edx course should be free if you just want to watch the videos and receive the assignments, you might have to click around a bit to convince it not to charge you. I don't know, it was a lot more obviously free when I did it many years ago.

Either way, Udemy typically just provides video lectures and problems without much interaction/grading, so for something as general as Python you can almost assuredly do just as well with the free offerings from coursera or edx or something I'm pretty sure. In the end it's going to be up to you to actually do the projects and learn things, programming's much more a skill you have to practice than knowledge you memorize or take notes on through lectures.

# ? Mar 24, 2021 13:53

Zoracle Zed: Jul 10, 2001

DoctorTristan posted:

You want to know the kwargs alternative to the format string, or the location of the parser module? If the latter why not just call it with an invalid string and see where the exception gets thrown from?

the latter, and that's exactly what I needed, hah!

# ? Mar 24, 2021 15:44

punished milkman: Dec 5, 2018; would have won

Zoracle Zed posted:

Anyone else noted how awful it is googling for anything python-related these days? SEO means the first couple pages of results are all geeksforgeeks.com and other awful beginner tutorial spam. (Which, like, even for beginners, seem bad.)

I can�t tell if I�m just getting old and pining for the imaginary good ol� days or what, but I definitely feel this. It goes beyond anything Python related, Google takes your search string and returns mostly trash for just about any subject now.

Anyway, I usually use site:stackoverflow.com with my Google searches which helps filter out the fluff (and potentially some other good stuff). The main annoyance here is that the results are not sorted by date so you�ll get answers from like 2009 which usually aren�t all that helpful.

# ? Mar 24, 2021 15:46

Phayray: Feb 16, 2004

punished milkman posted:

I can’t tell if I’m just getting old and pining for the imaginary good ol’ days or what, but I definitely feel this. It goes beyond anything Python related, Google takes your search string and returns mostly trash for just about any subject now.

Anyway, I usually use site:stackoverflow.com with my Google searches which helps filter out the fluff (and potentially some other good stuff). The main annoyance here is that the results are not sorted by date so you’ll get answers from like 2009 which usually aren’t all that helpful.

The thing annoying me about this recently is the top 1 or 2 hits are just pages that poorly scrape the top SO answer and put it on a page filled with shady ads. If you're used to just quickly clicking the first or second result, since it's almost always SO...surprise!

# ? Mar 24, 2021 16:45

Neslepaks: Sep 3, 2003

Yeah. Nobody's feeling lucky these days.

# ? Mar 24, 2021 17:53

Jose Cuervo: Aug 25, 2004

Foxfire_ posted:

You have a meaningless-outside-your-application patient ID that uniquely identifies a patient. All of your deidentified tables reference that for knowing what patient a row is about. Then you have a table that maps from the the patient ID to all the PHI about that patient (MRN, names, DOB, ...). That table could be either inside the same database or elsewhere if you wanted to segregate the dataset into PHI and deidentified parts.

You might also need multiple PHI tables for some stuff, like if you want to store multiple different-system/different-site MRNs for one actual person. For each table in the schema, you should be able to classify it as PHI or not and shouldn't be mixing PHI and not-PHI together in the same table. If you've done stuff right, you ought to be able to hand someone all your non-PHI tables and they'd be able to follow information about a particular subject, but not have any PHI about them

I guess I am struggling to understand how I would structure this. Based on what has been said I would only need the MRN of the patient and the health system the patient belongs to to uniquely identify them. In the definition of the Patient model below I have set the primary key to be a compound key (pid, mrn, and health_system).

Python code:

class Patient(Base):
    __tablename__ = 'patients'

    mrn = Column(Integer, primary_key=True)
    health_system = Column(String, primary_key=True)
    pid = Column(Integer)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship('Diabetic', 
                            back_populates='patient', 
                            cascade='delete, all')


class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    pid = Column(Integer, ForeignKey('patients.pid'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)  
    
    patient = relationship('Patient', back_populates='diabetic')

However,
1. Where do I get the value for pid from? Do I need to generate that separately each time I want to enter a new patient?
2. Suppose I now have data from a given patient (I know their MRN and health system). In order to get their pid, do I first perform a search in Patient to find the pid, and then use the pid to insert their data into the relevant tables?

EDIT: If the above is correct, I just tried adding data to the database while adding "autoincrement=True" to the pid:

Python code:

class Patient(Base):
    __tablename__ = 'patients'

    health_system = Column(String, primary_key=True)
    mrn = Column(Integer, primary_key=True)   
    pid = Column(Integer, autoincrement=True)
...

However, this does not automatically add a new unique integer to the pid column each time a new patient is added. Not sure what I am doing incorrectly here.

Jose Cuervo fucked around with this message at 20:55 on Mar 24, 2021

# ? Mar 24, 2021 20:41

death cob for cutie: Dec 30, 2006; dwarves won't delve no more
too much splatting down on Zot:4

I need a confirmation that my hunch is right here, as it's Python-related but involves some stuff I don't normally have to think about :

I teach a Python web development course that uses bcrypt - CFFI is a dependency of bcrypt. I have a student on the new M1 Mac who, when running a Django project where bcrypt gets imported in views, gets a nice long exception that terminates with this:

code:

ImportError: dlopen(/Users/username/Python_stack/django/my_environments/django2/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so, 2): no suitable image found.  Did find:
    /Users/username/Python_stack/django/my_environments/django2/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so: mach-o, but wrong architecture
    /Users/username/Python_stack/django/my_environments/django2/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so: mach-o, but wrong architecture

CFFI appears to not have an ARM64 release, unless I'm misreading it - so short of compiling it herself for her architecture, she doesn't really have any workaround for this, right?

(it feels like the obvious answer is "yes, she's boned" but I also feel like I'm missing something here)

# ? Mar 24, 2021 21:46

necrotic: Aug 2, 2005; I owe my brother big time for this!

Yup. Im surprised a package like bcrypt has a dep that hasn't been packaged for the m1 yet, but without that you'll have to build it yourself and hope it doesn't need any changes to compile and work.

# ? Mar 24, 2021 21:58

death cob for cutie: Dec 30, 2006; dwarves won't delve no more
too much splatting down on Zot:4

I guess there's also having her run it under Rosetta but that's kind of tedious

# ? Mar 24, 2021 22:00

vikingstrike: Sep 23, 2007; whats happening, captain

Jose Cuervo posted:

I guess I am struggling to understand how I would structure this. Based on what has been said I would only need the MRN of the patient and the health system the patient belongs to to uniquely identify them. In the definition of the Patient model below I have set the primary key to be a compound key (pid, mrn, and health_system).
Python code:
class Patient(Base):
    __tablename__ = 'patients'

    mrn = Column(Integer, primary_key=True)
    health_system = Column(String, primary_key=True)
    pid = Column(Integer)
    first_name = Column(String)
    last_name = Column(String)
    date_of_birth = Column(Date) 

    diabetic = relationship('Diabetic', 
                            back_populates='patient', 
                            cascade='delete, all')


class Diabetic(Base):
    __tablename__ = 'diabetic'
    
    pid = Column(Integer, ForeignKey('patients.pid'), primary_key=True)
    date = Column(Date, primary_key=True)
    diabetic = Column(Boolean)  
    
    patient = relationship('Patient', back_populates='diabetic')   
However,
1. Where do I get the value for pid from? Do I need to generate that separately each time I want to enter a new patient?
2. Suppose I now have data from a given patient (I know their MRN and health system). In order to get their pid, do I first perform a search in Patient to find the pid, and then use the pid to insert their data into the relevant tables?

EDIT: If the above is correct, I just tried adding data to the database while adding "autoincrement=True" to the pid:
Python code:
class Patient(Base):
    __tablename__ = 'patients'

    health_system = Column(String, primary_key=True)
    mrn = Column(Integer, primary_key=True)   
    pid = Column(Integer, autoincrement=True)
...
However, this does not automatically add a new unique integer to the pid column each time a new patient is added. Not sure what I am doing incorrectly here.

I haven't been following in detail, but I suppose the answers to your questions are:

1) If (MRN, Health System) are how you uniquely identify a Patient, then you would have a (MRN, Health System) → PID mapping that is stored its own table. That mapping lives in its own table and the Patient table only includes your PID. You create the PID however you like. If you needed to use the MRN or Health System variables in something, you would join that PHI table into the non-PHI data on the PID mapping you've created.
2) You would lookup the correct PID for the Patient in the table that contains the mapping.

# ? Mar 24, 2021 22:34

Wallet: Jun 19, 2006

vikingstrike posted:

However, this does not automatically add a new unique integer to the pid column each time a new patient is added. Not sure what I am doing incorrectly here.

I also have not been following that closely but you may want to check this out since my recollection is that you're using SQLite.

# ? Mar 25, 2021 00:54

Foxfire_: Nov 8, 2010

Jose Cuervo posted:

What you're trying to seems generally right to me. Would toss in a unique constraint on pid if you want to use (MRN, HealthSystem) as the primary key. Also, MRNs probably should be strings, not integers. "012345" and "12345" are probably distinct MRNs.

For the autoincrement, like Wallet said, SQLite has weird autoincrement behavior and specific workarounds in sqlalchemy. Since you don't particularly care that the values are ascending integers as long as they're unique, a reasonable portable thing would be to externally make a UUID and use that. Or you could beat sqlite into giving you a unique integer, but you'll have something somewhat db engine dependent.

e: also be aware that foreign key constraints in SQLite by default do nothing.

Foxfire_ fucked around with this message at 05:07 on Mar 25, 2021

# ? Mar 25, 2021 05:04

Hollow Talk: Feb 2, 2014

Postgres even has a UUID type, but sqlite doesn't support that. :saddowns:

# ? Mar 25, 2021 18:20

Jose Cuervo: Aug 25, 2004

Hollow Talk posted:

Postgres even has a UUID type, but sqlite doesn't support that.

Ok, so to get around the autoincrement issue in SQLite I decided to simply generate the unique ID (integer) myself (see check_add_patient below).

I am now working on trying to put everything together with code to add data to the database:

Python code:

def check_add_patient(provider, MRN, df, session):
    """
    Checks the 'patients' table to see if the
    patient exists and adds them if they do not.
    Returns the patient.
    """
    pt = session.query(ddb.Patient) \
                .filter(and_(ddb.Patient.provider == provider, 
                             ddb.Patient.mrn == MRN)) \
                .first()
    if pt:
        return pt
    
    session.add(ddb.Patient(mrn=int(MRN), provider=provider, 
                            pid=session.query(ddb.Patient).count() + 1, 
                            first_name=df.iloc[0]['First_Name'],
                            last_name=df.iloc[0]['Last_Name']))
    session.commit()
    return session.query(ddb.Patient) \
                .filter(and_(ddb.Patient.provider == provider, 
                             ddb.Patient.mrn == MRN)) \
                .first()
    

def add_dose_data(provider, MRN, df, session):
    pt = check_add_patient(provider, MRN, df, session)
    print(pt)

    dosing = []
    for _, row in df.iterrows():
        pt.dosing.append(ddb.Dose(date=row['Admin_Date'],
                                  med_name='ESA',
                                  units='Units',
                                  value=row['ESA_Dose'],
                                  description=row['Description']))
        session.commit()


Session = sessionmaker(bind=engine)
for MRN, MRN_df in esa.groupby('MRN'):
    add_dose_data('Health_System', MRN, MRN_df, Session())

The check_add_patient function successfully checks to see if a patient is already in the 'patients' table and does not throw the IntegrityError because I am trying to add someone who already exists. Is the way I have written that function un-pythonic or not the best way for a database?

In the add_dose_data function, should I only perform a session.commit() call once I have added all the rows of data?

I have not been able to figure out how to use context managers with this code - the documentation seems to say I should be using a context manager... Any pointers?

Finally, is there try-except block a standard way to deal with the IntegrityError which is thrown when I try to add in a row of data which has previously been added?

Thanks for all of the help so far.

# ? Mar 25, 2021 19:59

QuarkJets: Sep 8, 2008

Simplest way to implement a context manager: open a single session outside of your for loop with one. Example:

Python code:

Session = sessionmaker(bind=engine)
with Session() as session:
    for MRN, MRN_df in esa.groupby('MRN'):
        add_dose_data('Health_System', MRN, MRN_df, session)

You were opening 1 session per patient, and you weren't closing them. With a context manager you can just open 1 session, do all of your work, and then the context manager closes it for you. Or if you want 1 session per patient, you can just use the context manager inside of the for loop instead of outside of it.

To get even finer-grain than that, you could pass the Session (e.g. the session maker) and create sessions on the spot with a context manager using Session.begin(); then when the context closes the transaction is also committed for you, or a rollback occurs if an exception is raised. Much more slick than doing your own commits or rollbacks. Example:

Python code:

# Note: this function expects a session maker now
# e.g. Session instead of Session()
# You should probably call it something else for easy reading, like session_maker
def check_add_patient(provider, MRN, df, Session):
    ...
    with Session.begin() as session:
        session.add(ddb.Patient(mrn=int(MRN), provider=provider, 
                                pid=session.query(ddb.Patient).count() + 1, 
                                first_name=df.iloc[0]['First_Name'],
                                last_name=df.iloc[0]['Last_Name']))
    # Session is committed and closed here, no need commit
    ...

# ? Mar 25, 2021 23:55

Jose Cuervo: Aug 25, 2004

QuarkJets posted:

Simplest way to implement a context manager: open a single session outside of your for loop with one. Example:
Python code:
Session = sessionmaker(bind=engine)
with Session() as session:
    for MRN, MRN_df in esa.groupby('MRN'):
        add_dose_data('Health_System', MRN, MRN_df, session)
You were opening 1 session per patient, and you weren't closing them. With a context manager you can just open 1 session, do all of your work, and then the context manager closes it for you. Or if you want 1 session per patient, you can just use the context manager inside of the for loop instead of outside of it.

To get even finer-grain than that, you could pass the Session (e.g. the session maker) and create sessions on the spot with a context manager using Session.begin(); then when the context closes the transaction is also committed for you, or a rollback occurs if an exception is raised. Much more slick than doing your own commits or rollbacks. Example:
Python code:
# Note: this function expects a session maker now
# e.g. Session instead of Session()
# You should probably call it something else for easy reading, like session_maker
def check_add_patient(provider, MRN, df, Session):
    ...
    with Session.begin() as session:
        session.add(ddb.Patient(mrn=int(MRN), provider=provider, 
                                pid=session.query(ddb.Patient).count() + 1, 
                                first_name=df.iloc[0]['First_Name'],
                                last_name=df.iloc[0]['Last_Name']))
    # Session is committed and closed here, no need commit
    ...

So (I think) I tried what you suggested, by wrapping the call to check_add_patient in a context manager and wrapping the .add() in a context manager as well:

Python code:

def check_add_patient(provider, MRN, df, session):
    """
    Checks the 'patients' table to see if the
    patient exists and adds them if they do not.
    Returns the patient.
    """
    pt = session.query(ddb.Patient) \
                        .filter(and_(ddb.Patient.provider == provider, 
                                     ddb.Patient.mrn == MRN)) \
                        .first()
    if pt:
        return pt
    
    session.add(ddb.Patient(mrn=int(MRN), provider=provider, 
                            pid=session.query(ddb.Patient).count() + 1, 
                            first_name=df.iloc[0]['First_Name'],
                            last_name=df.iloc[0]['Last_Name']))
    session.commit()
    return session.query(ddb.Patient) \
                    .filter(and_(ddb.Patient.provider == provider, 
                                 ddb.Patient.mrn == MRN)) \
                    .first()
    

def add_esa_data(provider, MRN, df, session_maker):
    """
    """
    with session_maker.begin() as session:
        pt = check_add_patient(provider, MRN, df, session)
    print(pt)

    dosing = []
    for _, row in df.iterrows():
        with session_maker.begin() as session:
            pt.dosing.append(ddb.Dose(date=row['Admin_Date'],
                                      med_name='ESA',
                                      units='Units',
                                      value=row['ESA_Dose'],
                                      description=row['Description']))


session_maker = sessionmaker(bind=engine)
for MRN, MRN_df in esa.groupby('MRN'):
    print(MRN)
    add_esa_data('Health_System', MRN, MRN_df, session_maker)

But running this produces a ResourceClosedError, which as far as I can tell is referring to the session being closed? But where is it being closed?

# ? Mar 26, 2021 01:44

QuarkJets: Sep 8, 2008

From the docs, "When you write your application, the sessionmaker factory should be scoped the same as the Engine object created by create_engine(), which is typically at module-level or global scope". So maybe I made the mistake of having you pass the sessionmaker to a function, when instead it should just be called directly; my bad

So since your engine is global scope, the sessionmaker should be global scope as well. Wherever you instantiate the engine, also instantiate the sessionmaker. Then just have your functions invoke the session_maker in a context manager whenever a session is needed.

Their example looks like this:

Python code:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# an Engine, which the Session will use for connection
# resources
engine = create_engine('postgresql://scott:tiger@localhost/')

# a sessionmaker(), also in the same scope as the engine
Session = sessionmaker(engine)

# we can now construct a Session() and include begin()/commit()/rollback()
# at once
with Session.begin() as session:
    session.add(some_object)
    session.add(some_other_object)
# commits the transaction, closes the session

You should be able to create those context managers within your functions from a global scope sessionmaker

I'm also spotting an append() being used with data returned from a closed context manager; be careful with that, I think that's likely the source of your error. Try this:

Python code:

    with session_maker.begin() as session:
        pt = check_add_patient(provider, MRN, df, session)
        print(pt)

        dosing = []
        for _, row in df.iterrows():
            pt.dosing.append(...)

(did you want to do something with that dosing list? It's not currently being used)

QuarkJets fucked around with this message at 02:21 on Mar 26, 2021

# ? Mar 26, 2021 02:18

Adbot: ADBOT LOVES YOU

# ? Apr 25, 2024 13:30

Hughmoris: Apr 21, 2007; Let's go to the abyss!

Anyone ever futz with extending Python using C or Rust to gain performance?

I've been reading some articles on it and the idea seems intriguing but I'm sure the reality can be a headache.

# ? Apr 4, 2021 13:43

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›229 »