The Alt-Ac Job Beat Newsletter Post 20
Hi Everyone,
One of the things I hated the most when I was an academic was rejection. I still get angry thinking about some of the comments I received on papers and grants over the years. (I do stat consulting on NIJ grants, and I may even need to stop that; the juice is not worth the squeeze.)
This is not really advice, just a note that academic rejection is not a normal part of other work cultures. And I am personally better off not having to deal with that emotional burden on a regular basis anymore.
JOBS
Job board link. Recent ones include:
- project manager at Meadows Mental Health
- USAA decision scientist focused on fraud
- Chan Zuckerberg Initiative wants someone with survey stats experience
I tend to focus on tech jobs, but even folks with a qualitative background can apply for positions like project or product manager.
TECH TIP
There are two main ways to scrape web results. One is to use requests (the python library for making GET/POST calls) directly, which works when there is an API under the hood or when you can just parse out the plain HTML. For example, in python this will return data from the Raleigh open data website (that is an ESRI server under the hood):
import requests

# query parameters for the ESRI REST API, asking for 10 records as JSON
parm = {'where': '1=1',
        'outFields': '*',
        'outSR': '4326',
        'resultRecordCount': 10,
        'f': 'json'}
url = ('https://services.arcgis.com/v400IkDOw1ad7Yad/'
       'arcgis/rest/services/Police_Incidents/FeatureServer/0/query')
res = requests.get(url, params=parm).json()  # returns a dictionary with data
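The returned dictionary follows the ESRI JSON layout, with each record's fields nested under features → attributes. Here is a minimal sketch flattening that into a pandas DataFrame (assuming the request above succeeded):

import pandas as pd

# each feature holds its fields in the 'attributes' dictionary
records = [f['attributes'] for f in res['features']]
df = pd.DataFrame(records)
print(df.shape)  # 10 rows, per resultRecordCount above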
The other way, if the site has complicated javascript or ajax frameworks, is to use an emulated browser. I have now migrated most of my code to a python library called playwright when I need the emulated browser approach.
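Before the full example, here is a minimal sketch of the basic sync playwright pattern: launch a browser, load a page, and grab the fully rendered HTML after the javascript has run (the example.com URL is just a placeholder):

from playwright.sync_api import sync_playwright

# launch a headless browser, load the page, grab the rendered HTML
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')  # placeholder URL
    html = page.content()  # HTML after the javascript has executed
    browser.close()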
Here is an example function that goes to Austin PD's site to look up criminal incidents and scrapes through them day by day. You can see I perform different actions, such as hitting enter, inputting values, tabbing/arrowing around the page, etc. (For those who have used selenium before: with playwright you don't need to worry about updating chromedriver on a regular basis.)
Typical warning though with scraping -- be nice to the websites (do not spam them), and make sure you are not violating any TOS by scraping the data. The code below launches the browser with .launch(headless=False), which opens a visible window, so (with some time.sleep steps in between the actions) you can follow along with the browser steps as they are taken. Change it to headless=True to run the browser invisibly.
import pandas as pd
import time
import re
import traceback
from bs4 import BeautifulSoup as bs
from playwright.sync_api import sync_playwright

# Example function getting incident data from the Austin PD website,
# looping over one day at a time
def get_austin(begin, end, init_sleep=3, page_sleep=1, verbose=False):
    url = 'https://services.austintexas.gov/police/reports/search2.cfm'
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    dl = pd.date_range(begin, end)
    data = []
    try:
        page.goto(url)
        time.sleep(init_sleep)
        # click through the acknowledgement splash page
        btn = page.get_by_role("button", name="Acknowledge")
        btn.press("Enter")
        for d in dl:
            ds = d.strftime('%m/%d/%Y')
            if verbose:
                print(f'Getting day {ds}')
            # open the advanced search form
            la = page.locator(':text("Advance Search")')
            la.press("Enter")
            time.sleep(page_sleep)
            # fill in the start date, then select it from the dropdown
            start_date = page.query_selector('id=inputStartDate')
            start_date.fill(ds)
            start_date.press('ArrowDown')
            start_date.press('Enter')
            # pick the city from its dropdown
            city = page.query_selector("[name='city']")
            city.press('Enter')
            city.press('ArrowDown')
            city.press('Enter')
            btnS = page.get_by_role("button", name="Submit")
            btnS.press('Enter')
            time.sleep(init_sleep)
            res = page.content()  # now need to parse out the table info
            soup = bs(res, 'html.parser')
            r1 = soup.findAll(string=re.compile(' Report Number'))
            ld = [parse_table(r) for r in r1]  # helper sketched below
            data.append(pd.DataFrame(ld))
            # go back to the search page for the next day
            gb = page.locator(':nth-match(:text("New Search"),1)')
            gb.press("Enter")
        df = pd.concat(data)
        playwright.stop()
        return df.drop_duplicates(ignore_index=True)
    except Exception:
        # make sure the browser shuts down before re-raising the error
        er = traceback.format_exc()
        playwright.stop()
        raise Exception(er)
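One note: the function calls a parse_table helper whose definition is not shown here. Its real implementation depends on the HTML of the Austin PD results page, but a hypothetical sketch (assuming each matched string sits inside a table row with one field per cell, which I have not verified against the live page) might look like:

# hypothetical parse_table sketch -- the actual field names and HTML layout
# depend on the results page, so treat this as a placeholder
def parse_table(res):
    # res is a bs4 NavigableString matching ' Report Number'; climb to the
    # enclosing table row and pull the text out of each cell
    row = res.find_parent('tr')
    cells = [c.get_text(strip=True) for c in row.find_all(['td', 'th'])]
    return {f'col{i}': v for i, v in enumerate(cells)}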
Best, Andy Wheeler