Hacking Oregon's Hidden Political Connections

A TotalGood project

v0.0.4


Material


Agenda:

For Hack Oregon we explored the data in unusual ways:

  1. Pandas as a DB
  2. Find Connections (FKs, PKs, other DBs)
  3. TFIDF on a DB table
  4. TFIDF similarity
  5. Similarity Similarity

Intro: 1

Pandas as a relational DB

  • Identify foreign keys automatically
  • Use FKs to do join SQL-like queries
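A minimal sketch of both bullets, assuming a foreign key can be guessed by checking whether every value in one column appears in a unique, non-null column of another table. The table and column names here are hypothetical toy data, not the real Hack Oregon schema, and the heuristic will produce false positives on real data.

```python
import pandas as pd

# Hypothetical toy tables standing in for two tables of a campaign-finance DB.
committees = pd.DataFrame({
    "committee_id": [1, 2, 3],
    "name": ["PAC A", "PAC B", "PAC C"],
})
transactions = pd.DataFrame({
    "tran_id": [10, 11, 12, 13],
    "filer_id": [1, 1, 3, 2],
    "amount": [50.0, 75.0, 20.0, 10.0],
})


def is_primary_key(series):
    """A column can act as a primary key if it has no nulls and no repeats."""
    return bool(series.notna().all() and series.is_unique)


def find_foreign_keys(child, parent):
    """Return (child_col, parent_col) pairs where every non-null child value
    appears in a candidate primary-key column of the parent table."""
    pairs = []
    for pcol in parent.columns:
        if not is_primary_key(parent[pcol]):
            continue
        keys = set(parent[pcol])
        for ccol in child.columns:
            values = set(child[ccol].dropna())
            if values and values <= keys:
                pairs.append((ccol, pcol))
    return pairs


fks = find_foreign_keys(transactions, committees)

# Use the detected key pair for a SQL-like LEFT JOIN, then aggregate.
child_col, parent_col = fks[0]
joined = transactions.merge(
    committees, left_on=child_col, right_on=parent_col, how="left"
)
totals = joined.groupby("name")["amount"].sum()
```

The merge plays the role of `SELECT ... FROM transactions LEFT JOIN committees ON filer_id = committee_id`, and the groupby plays the role of `GROUP BY name`.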

Intro: 2

Intersect large sets

  • AM emails in BehindTheCurtain DB?
  • 10 GB MySQL dump → dozens of CSVs
  • Load 50M emails efficiently
  • Intersect emails with public records
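One way to keep memory use low, sketched under the assumption that the public-records list is the smaller set: hold that set in memory and stream the large CSVs one row at a time. The function names, the "email" column name, and the sample data are all hypothetical; the slides do not say how the 50M emails were actually loaded.

```python
import csv
import io


def iter_emails(csv_file, column="email"):
    """Yield normalized (stripped, lowercased) emails from an open CSV file,
    one row at a time, skipping blank values."""
    for row in csv.DictReader(csv_file):
        email = (row.get(column) or "").strip().lower()
        if email:
            yield email


def intersect_streamed(big_csv_file, small_set, column="email"):
    """Stream the large file and keep only emails found in the small set."""
    return {e for e in iter_emails(big_csv_file, column) if e in small_set}


# Hypothetical data: one dumped CSV and a small public-records email set.
dump_csv = io.StringIO(
    "id,email\n"
    "1,Alice@Example.com\n"
    "2,bob@example.org\n"
    "3,\n"
    "4,carol@example.net\n"
)
public_records = {"alice@example.com", "dave@example.com", "carol@example.net"}

matches = intersect_streamed(dump_csv, public_records)
```

Set membership tests are constant time on average, so the cost is one pass over the large file regardless of how many CSVs the dump was split into.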

Intro: 3

Restructure a DB

  • Why?
  • How?
  • Restructure (TFIDF)
    • Raw python
    • Sklearn

Intro: 4

TFIDF to detect similarity between records

  • cluster Oregon PACs by their “mission”
  • d3 force-directed graph of PAC similarity
  • compare to directed graph (DG) of financial transactions
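A rough sketch of how a similarity matrix might be turned into input for a d3 force-directed graph. The nodes/links JSON layout is a common convention in d3 examples rather than a requirement, and the function name, threshold, and PAC names are hypothetical.

```python
import json


def similarity_to_graph(names, sim, threshold=0.5):
    """Build a nodes/links dict, keeping one link per pair of records
    whose similarity is at or above the threshold."""
    nodes = [{"id": name} for name in names]
    links = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] >= threshold:
                links.append(
                    {"source": names[i], "target": names[j], "value": sim[i][j]}
                )
    return {"nodes": nodes, "links": links}


# Hypothetical PAC names and similarity scores.
pac_names = ["PAC A", "PAC B", "PAC C"]
pac_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

graph = similarity_to_graph(pac_names, pac_sim, threshold=0.5)
graph_json = json.dumps(graph)
```

Thresholding keeps the graph sparse enough to draw; without it every pair of PACs would be linked.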

Intro: 5

Similarity between similarity matrices

SAY
(TFIDF)

vs.

DO
(Transactions)


3. Restructure DB

Why?

  • Squish fields into a string?
  • Vectorizing later anyway, right?

Because

  • Dimensions are vaguely defined/understood
  • Information “smear” across fields/dimensions

3. Restructure DB: How?

  1. Ignore numbers/dates
  2. Stringify each field
  3. Stem words
  4. Ignore words (are you sure?)
  5. Concatenate
  6. Split
  7. Vectorize/Count
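The seven steps could look roughly like this in plain Python. STOPWORDS and naive_stem are hypothetical stand-ins (a real project would likely use a proper stemmer and a fuller stopword list), the record is made up, and stemming is applied after splitting purely for convenience.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (step 4: "are you sure?").
STOPWORDS = {"the", "of", "for", "and", "a", "to", "in"}


def naive_stem(word):
    """Crude suffix stripper for illustration only; not a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def record_to_counts(record):
    """Turn one DB row (a dict) into a bag-of-stems Counter."""
    parts = []
    for value in record.values():
        if value is None or isinstance(value, (int, float)):
            continue                      # 1. ignore numeric fields
        parts.append(str(value))          # 2. stringify each field
    text = " ".join(parts)                # 5. concatenate
    # 6. split; keeping only letter runs also drops dates stored as strings
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [
        naive_stem(t)                     # 3. stem words
        for t in tokens
        if t not in STOPWORDS             # 4. ignore words
    ]
    return Counter(stems)                 # 7. vectorize/count


# Hypothetical record; field names are not from the real schema.
record = {
    "committee_name": "Friends of Oregon Teachers",
    "purpose": "Supporting teachers and schools",
    "total": 1250.0,
    "filed": "2014-03-01",
}
counts = record_to_counts(record)
```

Because all text fields are merged before counting, a word contributes the same way whether it appeared in the name or the purpose field, which is the point of the "smear" argument above.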

3. Restructure DB: TFIDF

  • Must be sparse to fit in memory
  • Explicit python builtins: Counter, defaultdict
  • sklearn
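A minimal sketch of the builtins route, using the textbook weighting tf × log(N / df) and storing each row as a dict so that zeros take no memory. The function name and sample tokens are hypothetical. sklearn's TfidfVectorizer produces a sparse matrix for the same purpose, but by default it uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs_tokens):
    """docs_tokens: list of token lists, one per record.
    Returns one sparse row per record as a {term: weight} dict."""
    # Document frequency: in how many records does each term appear?
    doc_freq = defaultdict(int)
    for tokens in docs_tokens:
        for term in set(tokens):
            doc_freq[term] += 1

    n_docs = len(docs_tokens)
    rows = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        row = {}
        for term, count in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            if idf > 0:  # terms found in every record carry no weight
                row[term] = (count / total) * idf
        rows.append(row)
    return rows


# Hypothetical stemmed tokens for three records.
docs = [
    ["teach", "school", "oregon"],
    ["timber", "oregon"],
    ["teach", "union", "oregon"],
]
rows = tfidf(docs)
```

Here "oregon" appears in every record, so it drops out of every row, while "school" (one record) outweighs "teach" (two records).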

4. TFIDF Similarity

Large dimensions are scary

  • Everything is far apart
  • Euclidean distance is meaningless
  • Our brains fail
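One way to see the problem is to measure how much the nearest and farthest random points differ as the number of dimensions grows. This is an illustrative experiment on uniform random data, not something taken from the talk, and the function name and parameters are hypothetical.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from one random
    query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
```

In two dimensions the farthest point is many times farther away than the nearest; in a thousand dimensions the contrast shrinks sharply, so "nearest" says much less.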

4. TFIDF Similarity

Vector distances


4. TFIDF Similarity

Cosine Similarity

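(similarity is the inverse of distance; cosine distance = 1 − cosine similarity)

  • Equivalent:
    • Pearson correlation (when vectors are mean-centered)
    • v_1 dot v_2 (projection), for unit-length vectors
    • cosine of the angle between v_1 and v_2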
  • Bounded: [-1, +1]
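A small sketch of cosine similarity on sparse {term: weight} rows, with Pearson correlation written as the cosine similarity of mean-centered vectors to show the link between the two. The function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation as cosine similarity of mean-centered vectors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    centered_x = {i: x - mean_x for i, x in enumerate(xs)}
    centered_y = {i: y - mean_y for i, y in enumerate(ys)}
    return cosine_similarity(centered_x, centered_y)


same = cosine_similarity({"teach": 1.0, "school": 1.0},
                         {"teach": 2.0, "school": 2.0})
unrelated = cosine_similarity({"teach": 1.0}, {"timber": 1.0})
partial = cosine_similarity({"teach": 1.0}, {"teach": 1.0, "school": 1.0})
angle_degrees = math.degrees(math.acos(partial))
```

Vector length does not matter, only direction: doubling every weight leaves the similarity unchanged. TFIDF weights are never negative, so in practice scores for TFIDF rows fall in [0, 1].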

5. Similarity Similarity

Cluster Oregon PACs by their “mission”

  • d3 force-directed graph of PAC similarity
  • compare to directed graph (DG) of financial transactions
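The slides do not say how the two matrices were compared. One possible approach, sketched here as an assumption, is to correlate their off-diagonal entries: a score near +1 would mean PACs that "say" similar things also "do" similar things. Both matrices below are made up, and the transaction matrix is assumed to have been symmetrized already.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def matrix_similarity(a, b):
    """Pearson correlation between the off-diagonal entries of two
    symmetric similarity matrices over the same records."""
    xs, ys = upper_triangle(a), upper_triangle(b)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # convention chosen here when a matrix is constant
    return cov / (spread_x * spread_y)


# Hypothetical "SAY" (TFIDF) and "DO" (transactions) similarity matrices.
say = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
do = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
score = matrix_similarity(say, do)
```

The diagonal is skipped because every record is trivially identical to itself, which would inflate the score.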

Thank You!


Written on October 27, 2015