Ihmemies
Oct 6, 2012

So uh in my youth I played World of Warcraft and did not study in university, although I was enrolled there for quite a few years. Later I studied to be an X-ray technician. I graduated in 2016 and now I am very loving bored indeed. I have tried to get work closer to home, but my commute is still 95 kilometers one way (a long trip), even after four years. There simply are no jobs closer to home (at least for me).

Maybe coding wouldn't be so bad after all, instead of this healthcare poo poo? Lots of junior openings near where I live. Tons. Massive amounts of all kinds of programming job openings. Maybe I could be a Data Engineer or something instead of fullstack?! I know some basic SQL, but they do something with Python so I need to learn Python.

I have not done any programming in 10 years. I looked at options like freeCodeCamp and did the first 2 exercises yesterday and today:
https://repl.it/@Ihmemies/MeagerNeatPerl
https://repl.it/@Ihmemies/fcc-time-calculator

The unit tests pass so they "work" -> send it. Maybe I'll do those FCC Python exercises because I don't know what else to do.

How to learn Python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent-quality code to impress some dudes so they would hire me.

Then there is this: https://www.datacamp.com/tracks/data-engineer-with-python
And this: https://www.codewithrepl.it/python-projects-for-beginners.html
Maybe this: https://www.udemy.com/course/complete-python-bootcamp/


CarForumPoster
Jun 26, 2013

⚡POWER⚡

Ihmemies posted:

How to learn Python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent-quality code to impress some dudes so they would hire me.

If you're learning to get a job, find a job req for a position that looks good and go down the list studying all of the needed skills/technologies. Build a project related to that job.

Python may be great for a data engineer position. IMO the best way to learn is to start with a personal project you care about and struggle through it and the tutorials needed to complete it. Ask questions here along the way. Starting with a problem in a domain you're knowledgeable about, like one of your hobbies or your current job, is a good bet.

Also, download PyCharm; it will teach you most of the good habits you need to succeed by yelling at you with squiggly lines.

unpacked robinhood
Feb 18, 2013

by Fluffdaddy

CarForumPoster posted:

TLDR: How can I figure out the minimum number of packages needed in a venv/my environment?

I'm not sure I understand your question, but aren't you supposed to use layers for that stuff?
Asking cause at some point I made a script that takes a requirements.txt and spits out a layer.zip that you can push to AWS.
e: work proxy won't let me pastebin but I can upload it later.
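Roughly, that kind of layer-builder only needs the stdlib (this is an illustrative sketch, not the actual script; the one assumption is that Lambda wants the layer contents under a top-level python/ folder):

Python code:
```python
import shutil
import subprocess
import sys
from pathlib import Path

def build_layer(requirements="requirements.txt", out="layer"):
    """Install the pinned packages into the python/ folder layout that
    Lambda layers expect, then zip the whole thing up as <out>.zip."""
    target = Path(out) / "python"
    target.mkdir(parents=True, exist_ok=True)
    # install into the layer directory instead of the active environment
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r", requirements,
         "--target", str(target)],
        check=True,
    )
    # make_archive appends .zip to the base name itself
    return Path(shutil.make_archive(out, "zip", out))
```
From there it's something like aws lambda publish-layer-version --zip-file fileb://layer.zip, or whatever your deployment tooling does with it.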

unpacked robinhood fucked around with this message at 15:23 on Aug 26, 2020

death cob for cutie
Dec 30, 2006

dwarves won't delve no more
too much splatting down on Zot:4

CarForumPoster posted:

The venv folder that pycharm creates is 75MB with only pip20 and setuptools...is that normal?

I can't speak for anything else, but I just looked at the folder sizes for two virtual environments I have - one has pretty much nothing and is 9.4MB, and one is an environment I use to teach Django, and that's 48MB (Django is about 33MB of that). To test something out, I started a new PyCharm project and had it make a venv, and it's 14.9MB. It looks like PyCharm's process for creating a venv is shoving some extra stuff in there. pip list shows not much installed, but some of the folders inside the venv have some extra stuff. I'm not familiar enough with how venvs are made to explain why PyCharm's venv creation inflates the size of the environment, but there's a difference for sure. It shouldn't be 75MB, though. Unless it's a difference of OS?

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Appreciate the replies. I'll def try creating my venvs outside of PyCharm next time. I ended up giving up, moving to a Linux development environment, and got it deployed.

Hed
Mar 31, 2004

Fun Shoe
Can you talk about your Zappa experience? I’m curious about using it for a Django problem but last I looked the maintainer had kind of noped out from appearances.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Hed posted:

Can you talk about your Zappa experience? I’m curious about using it for a Django problem but last I looked the maintainer had kind of noped out from appearances.

I'm still an AWS novice. I prefer not to spend my time server-fuckling which is why I was drawn to Lambda/Zappa.

It kinda seems less actively supported to me as well, but if you're doing the things it is designed to do, such as putting a Django site on Lambda, it can be pretty great. Having never made a serverless app before (our other Django site is on EBS), I fired up a CloudFormation template and had a serverless Django site with a corny Bootstrap template, an RDS DB, django CMS, and a REST API on an actual domain with an ACM SSL cert in no time (maybe 1-2 days?). It went fast. It runs locally with no issues. I'd say the learning curve was maybe slightly less than with deploying to EBS via the AWS CLI.

When you're not doing what it does well, or when it just decides to gently caress you, it's hard to diagnose what's up. It's not documented very well and the error handling isn't great. I've wasted easily 10+ hours on issues whose symptoms turned out to have absolutely positively nothing to do with the root cause. That GitHub readme is it as far as I can tell. It isn't ubiquitous enough that every error is well documented, and the errors usually come in the form of a CloudWatch log that says "NoneType is not subscriptable".

I may have had less of a hard time if I had properly set up AM2 from the start, but I mostly develop on Win10 (yea I know)

KICK BAMA KICK
Mar 2, 2009

Feel like some might consider this golf-y compared to a full try/except, opinions?
Python code:
value = default
with contextlib.suppress(PossibleException):
    value = operation_that_might_raise()
Just came to mind cause it's the one actual PyCharm bug with a long-open report that I've encountered that they haven't fixed (the first assignment of value will get flagged as having no effect, not understanding that the second assignment is in a block where failure might be expected).
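For anyone comparing, here are the two spellings side by side (the toy parsing function is mine, just for illustration):

Python code:
```python
import contextlib

def parse_port(text, default=8080):
    # suppress() style: assign the fallback first, overwrite on success
    value = default
    with contextlib.suppress(ValueError):
        value = int(text)
    return value

def parse_port_verbose(text, default=8080):
    # equivalent try/except style
    try:
        return int(text)
    except ValueError:
        return default
```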

Phobeste
Apr 9, 2006

never, like, count out Touchdown Tom, man

KICK BAMA KICK posted:

Feel like some might consider this golf-y compared to a full try/except, opinions?
Python code:
value = default
with contextlib.suppress(PossibleException):
    value = operation_that_might_raise()
Just came to mind cause it's the one actual PyCharm bug with a long-open report that I've encountered that they haven't fixed (the first assignment of value will get flagged as having no effect, not understanding that the second assignment is in a block where failure might be expected).

From both a performance standpoint (admittedly probably meaningless) and a readability standpoint (much more important), I would only do this if it were extremely likely that, in the normal course of operation, that exception would happen. Like at least 50/50, and maybe more often than not. It adds so much emphasis to saying yup, this might happen, yup, it for sure might, compared to a try/except (and as for perf, a try/except will typically be implemented to have little effect in the no-exception-raised path, since that's the more common case).

Everything about try/except style exception handling is designed for exceptions as, well, exceptions - things that can happen that aren't intended but can still be accounted for on an "oh, that's not what I wanted, I guess I'll try this other thing instead" basis. That's why the performance is designed to have little or no impact on the no-exception path, and that's why the weight of the syntax is in the exception handler. Switching that up so that the work is done in either case doesn't make me think "oh, this code is particularly conscientious", it makes me think "this exception must happen constantly, I wonder if somebody's using it for flow control".

Hollow Talk
Feb 2, 2014

Ihmemies posted:

So uh in my youth I played World of Warcraft and did not study in university, although I was enrolled there for quite a few years. Later I studied to be an X-ray technician. I graduated in 2016 and now I am very loving bored indeed. I have tried to get work closer to home, but my commute is still 95 kilometers one way (a long trip), even after four years. There simply are no jobs closer to home (at least for me).

Maybe coding wouldn't be so bad after all, instead of this healthcare poo poo? Lots of junior openings near where I live. Tons. Massive amounts of all kinds of programming job openings. Maybe I could be a Data Engineer or something instead of fullstack?! I know some basic SQL, but they do something with Python so I need to learn Python.

I have not done any programming in 10 years. I looked at options like freeCodeCamp and did the first 2 exercises yesterday and today:
https://repl.it/@Ihmemies/MeagerNeatPerl
https://repl.it/@Ihmemies/fcc-time-calculator

The unit tests pass so they "work" -> send it. Maybe I'll do those FCC Python exercises because I don't know what else to do.

How to learn Python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent-quality code to impress some dudes so they would hire me.

Then there is this: https://www.datacamp.com/tracks/data-engineer-with-python
And this: https://www.codewithrepl.it/python-projects-for-beginners.html
Maybe this: https://www.udemy.com/course/complete-python-bootcamp/

Data Engineering is a broad field, but SQL is a good start. In terms of Python, here is what I look for when hiring people into junior roles that don't necessarily require a lot of experience, but where I look for potential and interest more than anything:

- Interact with a database from Python (SQLalchemy Core or ORM help here)
- Get data from APIs (requests, httpx etc. help, as does knowledge of HTTP, REST etc.)
- Parse/write JSON
- Parse/write XML
- Parse/write CSV
- Loops and list/dict comprehensions (i.e. incremental changes over sequences of uniform data)
- Iterators/lazy evaluation
- Threading/Multiprocessing/Queues
- Dataframes (mainly stuff like pandas)
- I/O and file system interaction
- Regex

If you want to combine these things in a little project, maybe have a look if you can find an API or two which provide data about something you are interested in. Build a process that reads, parses, cleans, and transforms that data, maybe enriches it with some additional data, then write it all into a database (look at normalisation and table design as well). Do some calculations on top of the data, or read data from the database again for further enriching. Sports, gaming, stock markets, public transport etc. all work well for this kind of thing.

I think of data as potentially infinite sequences of key/value pairs (or sequences thereof), which means I ultimately want to operate on each entry/row individually, applying something to the row, and basically compose these transformations together to get a pipeline. Essentially, have a look at the idea of map/reduce/filter.

Basically, learn functional programming in a proper language. :v: I am half kidding, but the concepts help a ton, though Python makes sense for the domain because data scientists etc. use it a fair bit as well.
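As a tiny illustration of that compose-the-transformations idea (made-up data; sqlite3 standing in for a real database):

Python code:
```python
import io
import json
import sqlite3

# toy stand-in for an API response: JSON lines of weather-ish records
raw = io.StringIO(
    '{"city": "Oulu", "temp_c": -3.5}\n'
    '{"city": "Helsinki", "temp_c": 2.1}\n'
    '{"city": "Oulu", "temp_c": null}\n'
)

def parse(lines):
    return (json.loads(line) for line in lines)

def clean(rows):
    # filter: drop rows with missing measurements
    return (r for r in rows if r["temp_c"] is not None)

def enrich(rows):
    # map: add a derived column without mutating the original row
    return ({**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (city TEXT, temp_c REAL, temp_f REAL)")
db.executemany(
    "INSERT INTO readings VALUES (:city, :temp_c, :temp_f)",
    enrich(clean(parse(raw))),  # the whole pipeline stays lazy until here
)
```
Each stage is a generator, so nothing holds more than one row in memory at a time - the same shape scales to those potentially infinite sequences.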

Hollow Talk fucked around with this message at 08:30 on Aug 27, 2020

Ihmemies
Oct 6, 2012

Thanks for the good answers! I could make a demo project along the lines of "code with repl.it python projects for beginners". I could use the Finnish Meteorological Institute's open data API to get some historical weather data, for example, then parse & store it and allow downloading customizable parts of the data as CSV/XML/something. And make some tables/charts from the data or otherwise process it.

Ihmemies
Oct 6, 2012

Eval is not very good. But it works here, and this accepts no user input. So... could I use something else instead? Any ideas? I'm calling numpy functions with the dictionary key names...

death cob for cutie
Dec 30, 2006

dwarves won't delve no more
too much splatting down on Zot:4
Couldn't you just use the functions themselves as dictionary keys? I'm lying in bed and way too lazy to open repl.it on my phone, but I'm pretty sure Python dictionary keys can be any kind of object, including a function - you just wouldn't call it in the dictionary.

e: since you're returning the dictionary that sounds non-ideal - you could create a list of function/string tuples and iterate over that; call the first part of the tuple and update the dictionary using the key of the second part

death cob for cutie fucked around with this message at 10:07 on Aug 28, 2020

Ihmemies
Oct 6, 2012

Epsilon Plus posted:

Couldn't you just use the functions themselves as dictionary keys? I'm lying in bed and way too lazy to open repl.it on my phone, but I'm pretty sure Python dictionary keys can be any kind of object, including a function - you just wouldn't call it in the dictionary.

Yes this seems to be possible:



quote:

e: since you're returning the dictionary that sounds non-ideal - you could create a list of function/string tuples and iterate over that; call the first part of the tuple and update the dictionary using the key of the second part

And this possibly too... I will try this now :v: Thanks for the help. At least in this case, even though the original solution "works", it was so bad there was good reason to try to make it better.

Edit: I guess you meant something like this.

Ihmemies fucked around with this message at 10:42 on Aug 28, 2020

death cob for cutie
Dec 30, 2006

dwarves won't delve no more
too much splatting down on Zot:4
To be more specific I was thinking something like this:

code:
calc = {"mean": 0, "variance": 0, ...}

ops = [(np.mean, "mean"), (np.var, "variance") ...]
since if you're going to be passing that dictionary anywhere else, having the functions themselves as dictionary keys, or having them involved in any way, is just going to be kinda confusing for whoever/whatever else is going to be handling them (IMO, at least)
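Filled out, that (function, label) list might drive the dict like this (assuming numpy, since that's what the exercise imports):

Python code:
```python
import numpy as np

def summarize(data):
    # (function, label) pairs instead of eval()-ing names out of a dict
    ops = [(np.mean, "mean"), (np.var, "variance"), (np.std, "standard deviation")]
    return {label: float(func(data)) for func, label in ops}
```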

shoeberto
Jun 13, 2020

which way to the MACHINES?

Ihmemies posted:

How to learn Python? What would be important for a Data Engineer, where to concentrate? I need to be able to produce decent-quality code to impress some dudes so they would hire me.
Absolutely do the things you're already looking at, but also, start learning about Pandas and Jupyter Notebook:
https://towardsdatascience.com/pandas-for-newbies-an-introduction-part-i-8246f14efcca
SQL is going to be super important, but Pandas is how data science people do the data science. The more you understand how people are going to actually *use* the data, the more you can understand the infrastructure required to support that. Plus Pandas dataframes are loving magic.



Just discovered this thread. Love Python. I work for a primarily PHP-oriented shop, where all of our ETL stuff was shoehorned into PHP for ages (in the interest of homogeneous projects, I guess). The past few years I've been trying to migrate that side of our infrastructure to Python and it's been so much fun.

Ihmemies
Oct 6, 2012

Epsilon Plus posted:

To be more specific I was thinking something like this:

code:
calc = {"mean": 0, "variance": 0, ...}

ops = [(np.mean, "mean"), (np.var, "variance") ...]
since if you're going to be passing that dictionary anywhere else, having the functions themselves as dictionary keys, or having them involved in any way, is just going to be kinda confusing for whoever/whatever else is going to be handling them (IMO, at least)

I finally realized that too... So I did it like that! 26 lines of code took many hours. I guess it was a good exercise :v: ( https://repl.it/@Ihmemies/fcc-mean-var-std#mean_var_std.py )

The next one is using Pandas with a csv of 35000 census entries. I don't currently like pandas but maybe I'll learn to like them more after a while.

Ihmemies fucked around with this message at 07:02 on Aug 29, 2020

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Ihmemies posted:

The next one is using Pandas with a csv of 35000 census entries. I don't currently like pandas but maybe I'll learn to like them more after a while.

Pandas is the bee's knees for getting your data in order or doing some data/feature engineering before passing it to the DL/ML/NLP library of your choice

Hed
Mar 31, 2004

Fun Shoe
How can I fire off a subprocess, move on with execution, and have a callback or do something small post-execution?

I have a bunch of finite state machines that I go through and update every second in the main program loop. Sometimes the FSMs, which are Python objects, run a function that does a subprocess call out to a system binary that isn't computationally expensive but can easily take several seconds to complete waiting on a response. This sucks and blocks things up. Right now that blocking call is in a subprocess.run(). I've looked at subprocess.Popen(), but it's kind of fire and forget... I guess I could Popen, then track state and see when it's done and whether it had the desired effect, but I feel like I'm missing something more obvious.

Something that allowed me to fire off a subprocess and do a callback or change a flag later so I can keep the main loop going would be awesome. I've looked at asyncio and Popen like I mentioned but I feel like I'm missing something obvious in the std lib.

Bruegels Fuckbooks posted:

I'd just use mmap for this - memory mapped files generally work well for this use case.

I forgot to respond to this but thank you--this worked splendidly for what was essentially sending over a sequence of integers.

QuarkJets
Sep 8, 2008

ThreadPoolExecutor from concurrent.futures (which is in vanilla Python) will let you attach a callback function to the Future object that gets returned when you submit a task; this is the easiest way to perform a "fire and forget" operation that doesn't block your main thread. This sounds like exactly what you want.
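A minimal sketch of that, with the slow system binary faked by a python -c call:

Python code:
```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)
done_flags = []

def run_binary(args):
    # blocks a worker thread instead of the main loop
    return subprocess.run(args, capture_output=True, text=True)

def on_done(future):
    # callbacks run in the worker thread, so just record something for
    # the main loop to pick up rather than poking FSM state directly
    done_flags.append(future.result().returncode)

future = pool.submit(run_binary, [sys.executable, "-c", "print('hi')"])
future.add_done_callback(on_done)
# ...main loop keeps ticking the FSMs while the call runs...
```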

Macichne Leainig
Jul 26, 2012

by VG

CarForumPoster posted:

Pandas is the bee's knees for getting your data in order or doing some data/feature engineering before passing it to the DL/ML/NLP library of your choice

Just use it for structured data of any kind. I deal with a ton of CSVs, and filtering, sorting, etc. are all dead easy with Pandas. The worst part might be having to define the dtypes for a CSV with many columns (and that's really not that bad).
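For what it's worth, the dtype map also guards against silent type mangling (columns here are made up for illustration):

Python code:
```python
import io
import pandas as pd

csv_text = "id,qty,price\n001,5,19.99\n002,3,4.50\n"

# without dtype, "id" gets parsed as an integer and loses its leading zeros
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str, "qty": "int64", "price": "float64"},
)
```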

abelwingnut
Dec 23, 2002


you prefer it over the 'csv' module? i've recently been learning how to deal with csv files in python and most things i found deal with that module. i'm simply reading from existing csvs and writing to new ones.

shoeberto
Jun 13, 2020

which way to the MACHINES?

abelwingnut posted:

you prefer it over the 'csv' module? i've recently been learning how to deal with csv files in python and most things i found deal with that module. i'm simply reading from existing csvs and writing to new ones.

Use the right tool for the job. Basic CSV processing doesn't require fancy Pandas wrapping. Pandas dataframes are more for major data manipulation - subsetting, reorganizing, sorting, filtering, summary stats, etc.

That said if your task is "read a CSV, drop a column, and write it out," pandas might save you some headache.

Zoracle Zed
Jul 10, 2001
xarray is pretty sick for multidimensional arrays where a 2d table just isn't enough

CarForumPoster
Jun 26, 2013

⚡POWER⚡

abelwingnut posted:

you prefer it over the 'csv' module? i've recently been learning how to deal with csv files in python and most things i found deal with that module. i'm simply reading from existing csvs and writing to new ones.

I use pandas for any data that is table-like. pd.read_csv() has some very useful params for dealing with CSV issues like multi-GB files, corruption, and non-comma-delimited database dumps with line errors.
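For example (toy pipe-delimited dump with one malformed row; note on_bad_lines needs pandas >= 1.3, older versions spell it error_bad_lines):

Python code:
```python
import io
import pandas as pd

messy = "a|b\n1|2\n3|2|EXTRA\n4|5\n"

# sep handles non-comma delimiters, on_bad_lines drops malformed rows,
# and chunksize streams a multi-GB file instead of loading it at once
chunks = pd.read_csv(io.StringIO(messy), sep="|", on_bad_lines="skip", chunksize=2)
df = pd.concat(chunks, ignore_index=True)
```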

Lib and let die
Aug 26, 2004

Hey y'all -

Looks like I came here at just the right time, because I, too, am dealing with..."issues" with a CSV file.

So, the long and short of it is that I'm trying to automate my job on the sly, because gently caress breaking my back so Jeff CEO can buy another loving yacht. One of our vendors offers a Quote Report that we use to quote our customers in a variety of formats (all of which are Extremely hosed Up And Stupid, as I'll demonstrate shortly)

So I export this CSV version of the report and am expecting something simple, like:

code:
Quote Number, Quote Description, Customer Name, Customer ID
SAD123456, 24x7 REMOTE TECH SUPPORT, UNITED STATES POSTAL SERVICE, 0001
What I have instead is....((mostly)verbatim)

code:
Quote Detail Report,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
Quote Number,,,,,SAD2020437,,,,,,,,,,,,,,,,,,,,,,,,
Quote Description,,,,,THECUSTOMERSNAMETHATIVEREDACTED,,,,,,,,,,,,,,,,,,,,,,,,
Reference ID,,,,,,,,,,,,,,,,,,,,,,,,
Quote Creation Date,,,,,"Aug 19, 2020",,,,,,,,,,,,,,,,,,,,,,,,
What in the gently caress can I do with Python to make sense of this miscarriage of a CSV file?

death cob for cutie
Dec 30, 2006

dwarves won't delve no more
too much splatting down on Zot:4
I think something like re.sub() handles this, but I haven't used it before so I can't provide an offhand example for you
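A guess at what that could look like, though beware: this will mangle quoted fields that legitimately contain commas, like the "Aug 19, 2020" dates, which the csv module handles properly:

Python code:
```python
import re

line = 'Quote Number,,,,,SAD2020437,,,,,,,,,,,,,,,,,,,,,,,,'

# collapse runs of commas to one, strip stragglers, then split
fields = re.sub(r",{2,}", ",", line).strip(",").split(",")
# fields is now ['Quote Number', 'SAD2020437']
```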

a foolish pianist
May 6, 2007

(bi)cyclic mutation

Marx Was A Lib posted:

Hey y'all -

Looks like I came here at just the right time, because I, too, am dealing with..."issues" with a CSV file.

So, the long and short of it is that I'm trying to automate my job on the sly, because gently caress breaking my back so Jeff CEO can buy another loving yacht. One of our vendors offers a Quote Report that we use to quote our customers in a variety of formats (all of which are Extremely hosed Up And Stupid, as I'll demonstrate shortly)

So I export this CSV version of the report and am expecting something simple, like:

code:
Quote Number, Quote Description, Customer Name, Customer ID
SAD123456, 24x7 REMOTE TECH SUPPORT, UNITED STATES POSTAL SERVICE, 0001
What I have instead is....((mostly)verbatim)

code:
Quote Detail Report,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
Quote Number,,,,,SAD2020437,,,,,,,,,,,,,,,,,,,,,,,,
Quote Description,,,,,THECUSTOMERSNAMETHATIVEREDACTED,,,,,,,,,,,,,,,,,,,,,,,,
Reference ID,,,,,,,,,,,,,,,,,,,,,,,,
Quote Creation Date,,,,,"Aug 19, 2020",,,,,,,,,,,,,,,,,,,,,,,,
What in the gently caress can I do with Python to make sense of this miscarriage of a CSV file?

Are they actually that bad, i.e. most fields missing, some fields aligned with their headers in rows, some misaligned? If you've just trimmed some stuff and "Quote Description" actually should be the description instead of the customer name in your example, then you might be able to go line by line through the records and do something like this:
code:
In [41]: test = 'Quote Creation Date,,,,,"Aug 19, 2020",,,,,,,,,,,,,,,,,,,,,,,,'

In [43]:  linelist = [word for word in test.split(",") if word != '']

In [45]: if 'Date' in linelist[0]:
    ...:     linelist = [linelist[0], linelist[1]+linelist[2]]
    ...:

In [46]: linelist
Out[46]: ['Quote Creation Date', '"Aug 19 2020"']

then fix any comma-separated dates as necessary (as above, just trivially), then assign your fields based on the resulting lists.

a foolish pianist fucked around with this message at 23:15 on Sep 2, 2020

Zoracle Zed
Jul 10, 2001
imo any ad hoc bandaid for hosed up data like that is going to be an accelerating spiral of immiserating maintenance work--just say no!

QuarkJets
Sep 8, 2008

This kind of data doesn't look like too much effort to work with. A lot of "data analysis" work is just figuring out the most effective way of ingesting the data. Sometimes that means dealing with spreadsheets that were inconsistently managed by a salesman with advancing dementia; that's part of the job

Wallet
Jun 19, 2006

Data in a dumb format isn't a big deal to ingest as long as you know what the dumb format is. Data in an inconsistent format is a bigger pain in the rear end.

shoeberto
Jun 13, 2020

which way to the MACHINES?

Marx Was A Lib posted:

Hey y'all -

Looks like I came here at just the right time, because I, too, am dealing with..."issues" with a CSV file.

So, the long and short of it is that I'm trying to automate my job on the sly, because gently caress breaking my back so Jeff CEO can buy another loving yacht. One of our vendors offers a Quote Report that we use to quote our customers in a variety of formats (all of which are Extremely hosed Up And Stupid, as I'll demonstrate shortly)

So I export this CSV version of the report and am expecting something simple, like:

code:
Quote Number, Quote Description, Customer Name, Customer ID
SAD123456, 24x7 REMOTE TECH SUPPORT, UNITED STATES POSTAL SERVICE, 0001
What I have instead is....((mostly)verbatim)

code:
Quote Detail Report,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
Quote Number,,,,,SAD2020437,,,,,,,,,,,,,,,,,,,,,,,,
Quote Description,,,,,THECUSTOMERSNAMETHATIVEREDACTED,,,,,,,,,,,,,,,,,,,,,,,,
Reference ID,,,,,,,,,,,,,,,,,,,,,,,,
Quote Creation Date,,,,,"Aug 19, 2020",,,,,,,,,,,,,,,,,,,,,,,,
What in the gently caress can I do with Python to make sense of this miscarriage of a CSV file?
Here's what I would recommend to make life easier in the long run:
1. Install Jupyter Notebook locally. Look up some basic tutorials on using it with Python, but it's not anything other than a visual interactive Python shell with some GUI niceties thrown in. (https://jupyter.org/)
2. Get Pandas and its dependencies installed in your Python environment. Also plenty of tutorials out there for that but let me know if you want anything specific.
3. Fire up your Jupyter Notebook environment and ingest your data source into a Pandas dataframe, ie:
code:
df = pd.read_csv('path/to/my.csv')
And just start dicking around with it. Jupyter makes it really easy to visualize dataframes. So to see a quick summary of what your dataframe looks like, you can just give yourself head:
code:
df.head()
Usually that will provide some context on what's going on. And if there are extra bullshit columns, they're easy enough to remove:
code:
del df['my_bullshit_column']
Once you've sliced your dataframe down to the format you want, just write it back out:
code:
df.to_csv('path/to/my/fixed.csv')
This is a gross over-simplification of it, but it should be enough to get you going. And any of the code you write in your Jupyter notebook is going to be portable to your normal day-to-day code. Jupyter just makes it faster to iterate on the data manipulation side of things because it's so interactive and visual.

Lib and let die
Aug 26, 2004

Thanks for the suggestions, all! After reading, I've noticed a few things about this specific CSV file: it's always either 4 or 23 extra commas, and it seems as though this csv file is used to somehow create the (even more hosed up) Excel version of the quote. Our vendor must have some kind of proprietary software that reads the csv file and executes some kind of M Code in Excel to create the "pretty" (and I use that term EXTREMELY loosely) version of the quote that I guess would be customer-facing if we were reselling a direct contract?

As per your suggestion, shoeberto, I've got a nice little miniconda setup with a local jupyter notebook so I think I'll start with pandas.

Quite honestly, the only thing I really care about in the entirety of the CSV file is about halfway down into the file:

code:
"Total for Functional Location : 5385643   SOMETHINGAWFULDOTCOM, 62200 NORTH AVE STE 124, STATEN ISLAND, NY, US, 12345",,,,,,,,,,,,,,,,,,,,,,,,
Total:348.00
Eventually, this will be set up in a for loop so I can work with x number of contracts for a given customer at a time, but the for loop will come later, once I determine how to make sense of a single site in this format.

I sanitized the file of any information that could be used to identify me or the companies involved (I definitely don't work for or adjacent to CISCO, I can say that with 100% accuracy - and to be clear, any additional spaces or line breaks out of place were in the original file; I've done nothing but change names and addresses to protect the innocent). Honestly, it's quite a wonder to behold, and it almost makes sense if you look at it as formatted for a context in which it would be more easily human-readable.

Appreciate the input!

code:
Quote Detail Report,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
Quote Reference Number,,,,,SAD2020437,,,,,,,,,,,,,,,,,,,
Quote Description,,,,,SOMETHINGAWFULDOTCOM,,,,,,,,,,,,,,,,,,,
Reference Id,,,,,,,,,,,,,,,,,,,,,,,,
Quote Creation Date,,,,,"Aug 19, 2020",,,,,,,,,,,,,,,,,,,
Quote Modification Date,,,,,"Aug 19, 2020 10:18 AM",,,,,,,,,,,,,,,,,,,
Currency,,,,,USD,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
Quote Expiry Date,,,,,"Dec 17, 2020",,,,,,,,,,,,,,,,,,,
Service Agreement Start Date,,,,,"Dec 01, 2020",,,,,,,,,,,,,,,,,,,
Created By,,,,,DONTDOXXMEBRO,,,,,,,,,,,,,,,,,,,
Master Agreement Number,,,,,,,,,,,,,,,,,,,,,,,,
Co-Term Contract Number,,,,,,,,,,,,,,,,,,,,,,,,
Opportunity ID,,,,,0003166814,,,,,,,,,,,,,,,,,,,
Selling Price Label,,,,,BP Cost Less All Discounts incl. All Discounts,,,,,,,,,,,,,,,,,,,
CM Svr/Gtwy Polling Data,,,,,,,,,,,,,,,,,,,,,,,,
Quote Status,,,,,Ready to order,,,,,,,,,,,,,,,,,,,
Sales Model,,,,,Indirect - RNEW,,,,,,,,,,,,,,,,,,,
SBA Number,,,,,,,,,,,,,,,,,,,,,,,,
Customer Name,,,,,,,,,,,,,,,,,,,,,,,,
Sold To,,,,,2643057,,,,,,,,,,,,,,,,,,,
Ship-To,,,,,5385643,,,,,,,,,,,,,,,,,,,
Company Name,,,,,SOMETHINGAWFULDOTCOM (101192),,,,,,,,,,,,,,,,,,,
Company Link ID,,,,,548 (Distrib),,,,,,,,,,,,,,,,,,,
Business Partner Link ID,,,,,933 (Tier1),,,,,,,,,,,,,,,,,,,
Business Partner Name,,,,,"Ferris Wheel Industries,Inc (655306)",,,,,,,,,,,,,,,,,,,
Business Partner Medal Level,,,,,Diamond,,,,,,,,,,,,,,,,,,,
 ,,,,, ,,,,,,,,,,,,,,,,,,,
Prior Contracts ID(s),,,,,51413675,,,,,,,,,,,,,,,,,,,
 ,,,,, ,,,,,,,,,,,,,,,,,,,
Quote Total Price,,,,,235.80,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,
"Functional Location : 1234567   SOMETHINGAWFULDOTCOM, 62200 NORTH AVE STE 124, STATEN ISLAND, NY, US, 12345",,,,,,,,,,,,,,,,,,,,,,,,
Sales Org: B001 Crisco Inc.,,,,,,,,,,,,,,,,,,,,,,,,
Parent QRN : SAD2020437,,,,,,,,,,,,,,,,,,,,,,,,
"Functional
Location",,,,,Maintenance Price Class  Description,"Service
Description","Service
Code","Svc
Qty","Product
Description","Product
Code","Prod
Qty",Identifier,"Billing
Frequency","Service Term
(months)",MPG,"Unit
List
Price","Total
List
Price",Service Unit BP Cost Less All Discounts ,Service Initial Invoice BP Cost Less All Discounts incl. All Discounts,Service Subsequent Invoice(s) BP Cost Less All Discounts incl. All ,Total BP Cost Less All Discounts incl. All Discounts,Start Date,End Date,"FED 
Flag"
,,,,,,,,,,,,,,,,,,,,,,,,
IP Office Support Services,,,,,,,,,,,,,,,,,,,,,,,,
5385643,,,,,Remote Technical Support 8 X 5 + APR NBD - CODEL,IPO C/D RTS 8X5 APR NBD - 500 V2 1YPP,271638,1,IPO IP500 V2 CNTRL UNIT,700476005,1,,Prepaid,12,1S,29.00,348.00,19.65,235.80,0.00,235.80,01-Dec-2020,30-Nov-2021,Y
"Total for Functional Location : 5385643   SOMETHINGAWFULDOTCOM, 62200 NORTH AVE STE 124, STATEN ISLAND, NY, US, 12345",,,,,,,,,,,,,,,,,,,,,,,,
Total:348.00,,,,,,,,,,,,,,,,,,,,,,,,
Grand Total  348.00,,,,,,,,,,,,,,,,,,,,,,,,
Quotation Key,,,,,,,,,,,,,,,,,,,,,,,,
,,"#                  Coverage for this MPC/Material has lapsed beyond the grace period, therefore, charges apply and are included in the Re-Initiation fee.",,,,,,,,,,,,,,,,,,,,,,
,,, ,,,,,,,,,,,,,,,,,,,,,
,,, ,,,,,,,,,,,,,,,,,,,,,
,, ,,,,,,,,,,,,,,,,,,,,,,
, ,,,,,,,,,,,,,,,,,,,,,,,
,Quotation Terms,,,,,,,,,,,,,,,,,,,,,,,
,,"Estimated Subsequent Invoice Amount is for informational purposes only & does not reflect
1) changes associated with utility measurements gathered at true-up
2) subsequent component additions to already covered product categories
3) Invoice amounts for items eligible for coverage in yearly increments.
Estimated subsequent invoice amounts are calculated assuming that coverage is still available in subsequent years.
The quoted items included in this report must be covered by a current, unexpired contract to generate a complete quote detail.",,,,,,,,,,,,,,,,,,,,,,
,,,,"This proposal is subject to, and incorporates by reference, the terms and conditions of the applicable Service Description(s), Terms of Use and SMSS in effect as of the date of this order, unless (i) Customer has a written master agreement in effect and executed by Crisco for such products and services, in which case such written master agreement will govern as to those products and services, or (ii) otherwise set forth herein.  Cloud and hosted services may require online registration to be activated.  “SMSS” means Crisco's Support, Managed Services and Subscription Services Terms (US) found at [url]https://[/url][redacted] or a successor site. “Service Description” means Crisco’s Service Agreement Supplement, statement of work, service description, Subscription License Supplement, or similar document, as posted on [redacted] and available upon request.  “Terms of Use” means the Crisco terms of use located at [url]https://[/url][data expunged] or a successor site. ",,,,,,,,,,,,,,,,,,,,

Lib and let die fucked around with this message at 16:03 on Sep 3, 2020

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?
Phone posting so can’t post sample code, but if the only part of the file you want is:

code:
[Load of crap and commas I don’t care about]
“Total for functional location: [I want everything before the closing quote]”
,,,,[shitload more commas],,,,
Total: [I want this number too]
[More commas and crap I don’t care about]
Then I would probably just load the file as a string and use a regex to extract the relevant parts. The csv reading options mentioned above all assume the file contains data in a (roughly) columnar format where each row contains repeated instances of the same few columns, and this file... isn’t that.
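For what it's worth, a sketch of that regex approach on a stand-in string (the sample text and capture groups are made up to look like the paste above, not taken from the real file):

```python
import re

# Stand-in for the real file contents, loaded as one string.
raw = '''
,,,,[shitload of commas],,,,
"Total for Functional Location : 5385643   SOMETHINGAWFULDOTCOM, 62200 NORTH AVE"
Total:348.00,,,,,,,,
[more commas and crap]
'''

# Everything between the opening quote and the closing quote...
loc = re.search(r'"Total for Functional Location : ([^"]+)"', raw)
# ...and the number right after a line starting with "Total:".
total = re.search(r'^Total:([\d.]+)', raw, re.MULTILINE)

print(loc.group(1))    # 5385643   SOMETHINGAWFULDOTCOM, 62200 NORTH AVE
print(total.group(1))  # 348.00
```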

Macichne Leainig
Jul 26, 2012

by VG
I see another pattern too, a lot of it seems to be *key name*,,,,,*some kind of value here*. It would probably be awful but it looks like you could regex your way out of this.
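A quick sketch of that key,,,,,value pattern (sample lines copied from the paste above; the exactly-five-commas assumption is mine and may not hold everywhere in the file):

```python
import re

lines = [
    "Quote Status,,,,,Ready to order,,,,,,,",
    "Sold To,,,,,2643057,,,,,,,",
    "Quote Total Price,,,,,235.80,,,,,,,",
]

# key, exactly five commas, then the value up to the next comma
pattern = re.compile(r'^([^,]+),{5}([^,]+)')
fields = {m.group(1): m.group(2) for m in map(pattern.match, lines) if m}
print(fields)  # {'Quote Status': 'Ready to order', 'Sold To': '2643057', ...}
```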

shoeberto
Jun 13, 2020

which way to the MACHINES?
Ok bud, not gonna lie, I might have steered you wrong the first time around by suggesting Pandas. That's really for properly-structured tabular data.

In this case, if you can get a hold of the Excel file, you *may* be better off using openpyxl:
https://openpyxl.readthedocs.io/en/stable/
Assuming they are trying to use Excel to basically handle styling something that's not really tabular data, you are probably going to want to just access individual cells and extract the data into a normalized format.

Like you can do what others are suggesting and regex the hell out of it, but that's probably going to be somewhat fragile. If for some reason the Excel file actually merges all of that crap into meaningful cells, it's going to be easier to do like:
code:
from openpyxl import load_workbook

wb = load_workbook('/path/to/my.xlsx')
sheet = wb['Sheet 1']
quote_reference_number = sheet['C37'].value
company_name = sheet['D82'].value
That way you can directly reference the cell IDs in the Excel file for extraction. (those are made up, obviously, but it'd be like column D row 82 = 'D82')

If, from there, you want to transform it into a normalized format, it becomes relatively straightforward, and you can use whatever you want. For example, you could make it into an array of dicts:
code:
def parse_sheet(sheet):
   return {
      'quote_reference_number': sheet['C37'].value,
      ...
   }

rows = []
for file in my_files:
   wb = load_workbook(file)
   rows.append(parse_sheet(wb['Sheet 1']))
Then you can use a csv.DictWriter to spit it out to disk in a nice format. You can also throw it into a DataFrame if you're doing extra slicing-and-dicing, up to you. I had a project like this a few years back and openpyxl really removed a lot of the headache from it.
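A DictWriter sketch for that last step (the field names and rows are made up to match the parse_sheet() example above):

```python
import csv

# rows as produced by the parse_sheet() loop, e.g.:
rows = [
    {'quote_reference_number': 'SAD2020437', 'company_name': 'SOMETHINGAWFULDOTCOM'},
    {'quote_reference_number': 'SAD2020438', 'company_name': 'FERRIS WHEEL'},
]

with open('quotes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['quote_reference_number', 'company_name'])
    writer.writeheader()   # header row from fieldnames
    writer.writerows(rows) # one CSV row per dict
```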

Dr Subterfuge
Aug 31, 2005

TIME TO ROC N' ROLL
Yeah that's a ~25 col wide excel sheet in csv format. Messing with it in terms of excel is probably the way to go.

Lib and let die
Aug 26, 2004

Dr Subterfuge posted:

Yeah that's a ~25 col wide excel sheet in csv format. Messing with it in terms of excel is probably the way to go.

Funny story, I had actually considered using the Excel format, even considered maybe doing some fancy PowerQuery stuff, but ended up going with the CSV file thinking I could somehow manipulate it to be more normal.

The thing with the spreadsheet (and why I quoted the good Dr specifically) is that it's also Extremely hosed Up and Stupid:


But, with what you've all offered here, I think I may end up using the spreadsheet after all, since it will be handy to be able to reference the cells specifically. There are, as you mentioned, ~25 functional columns here, so I think I can work with this. Since the visual layout won't matter, I'll likely just un-merge all the columns and access their values directly.

Thanks again!


e: I've also got the ability to export as an XML file, but it's a SOAP XML file, which, based on what I've looked into so far, requires some kind of additional URL that's some sort of decryption or authorization key? I'm not really sure how to handle that type of XML, and it doesn't seem to behave nicely if I'm trying to read it as a normal XML file.
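(For what it's worth, the extra URL in a SOAP file is usually just an XML namespace, not a key, and the stdlib ElementTree handles it fine if you pass the namespace in. A sketch with a made-up minimal envelope and element names:)

```python
import xml.etree.ElementTree as ET

# Made-up minimal SOAP envelope; the URL here is a namespace, not a key.
soap = '''<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <QuoteTotal>348.00</QuoteTotal>
  </soap:Body>
</soap:Envelope>'''

# Map a prefix to the namespace URL so find() can resolve soap:Body.
ns = {'soap': 'http://schemas.xmlsoap.org/soap/envelope/'}
root = ET.fromstring(soap)
body = root.find('soap:Body', ns)
print(body.find('QuoteTotal').text)  # 348.00
```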

Lib and let die fucked around with this message at 19:08 on Sep 3, 2020

Dr Subterfuge
Aug 31, 2005

TIME TO ROC N' ROLL
Hahaha what an abomination of a sheet. At least the values are always stored in the first cell of a group of merged cells.
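That matches how openpyxl behaves, for what it's worth: only the top-left cell of a merged range holds the value, and the rest read back as None. A quick sketch (the cell contents are made up):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws['A1'] = 'Quote Total Price'
ws.merge_cells('A1:E1')  # merge A1 through E1

print(ws['A1'].value)  # 'Quote Total Price' -- top-left keeps the value
print(ws['B1'].value)  # None -- the rest of the merged range is empty
```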


Macichne Leainig
Jul 26, 2012

by VG
The fact that a spreadsheet like that is possible is both fascinating and frightening.
