aswan

Collect and organize data into a T1 data depot. The package is named after the Aswan Dam.

Collect and compress data from the internet for later parsing

  • quick, parallel, customizable to collect

  • compressed to store

  • quick to sync with a remote store

    • sync to continue collecting

    • sync to parse

  • immutable collection

To Set Up a Remote

Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS as described by the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
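A minimal sketch of a pre-flight check for these variables, assuming only the three names given above. The helper `missing_remote_vars` is hypothetical and not part of aswan; the expected format of the values is defined by zimmauth and is not checked here.

```python
import os

# The three variable names mentioned in the setup instructions.
REQUIRED_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; it does not validate the values.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_remote_vars()
    if missing:
        print("missing environment variables:", ", ".join(missing))
    else:
        print("remote configuration variables are set")
```

Running such a check before a sync makes a misconfigured remote fail early with a readable message.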

Concepts

  • objects

    • saved by collection events

  • events

    • collection

    • registration (v2: registration for parsing)

    • (v2) parsing

  • runs

    • manual run vs automated run

      • makes manually adding URLs easy but revertible

    • has unique id

    • generates events

    • linked to a specific version of the code

      • ideally commit hash + pip freeze

  • statuses

    • determined by base status + runs integrated

    • contains

      • which URLs need to be collected

      • (v2) which collected objects need to be parsed

    • sqlite file, constantly trimmed
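The relation "status = base status + integrated runs" can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not aswan's implementation: the classes `Event` and `Run`, the function `pending_urls`, and the rule that a registration event adds a URL to the queue while a collection event removes it are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str  # assumed to be "collection" or "registration"
    url: str


@dataclass
class Run:
    run_id: str  # each run has a unique id
    events: list = field(default_factory=list)  # events generated by the run


def pending_urls(base_pending, runs):
    """Derive the URLs still to be collected from a base status and runs.

    Assumption: registration queues a URL, collection marks it as done.
    """
    pending = set(base_pending)
    for run in runs:
        for event in run.events:
            if event.kind == "registration":
                pending.add(event.url)
            elif event.kind == "collection":
                pending.discard(event.url)
    return pending


if __name__ == "__main__":
    run = Run("r1", [Event("registration", "a"), Event("collection", "a"),
                     Event("registration", "b")])
    print(sorted(pending_urls({"c"}, [run])))
```

Because a status is derived purely from a base and a list of runs, dropping a run from the list reverts its effect, which is what makes manual runs revertible in this model.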

Structure

  • objects

    • 00, 01, …

  • runs

    • run-hash

      • context.yaml

        • commit-hash, pip-freeze, …

      • events.zip

  • statuses

    • status-hash

      • context.yaml

        • parent-status, integrated

      • db.sqlite.zip

  • current-run

    • context.yaml

    • events

      • these to be compressed into ../runs

    • status.sqlite

  • there is a ‘TEST’ status

    • whatever is based on it cannot be integrated

    • a test run can be made on it…
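The top-level layout above can be sketched with `pathlib`. This only creates the four directories named in the outline; the function `init_depot` is hypothetical, and how aswan names the object buckets, run hashes, and status hashes is not assumed here.

```python
from pathlib import Path
import tempfile


def init_depot(root):
    """Create the top-level depot directories named in the outline."""
    root = Path(root)
    for sub in ("objects", "runs", "statuses", "current-run/events"):
        # parents=True also creates current-run when making current-run/events
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    depot = init_depot(tempfile.mkdtemp())
    for path in sorted(depot.rglob("*")):
        print(path.relative_to(depot).as_posix())
```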

When starting a run:

  • check if current-run is empty

    • if not, fail with

  • find latest status

    • if it has not integrated all past runs, create a new status that has

  • start collection (+ registration)

  • whether the run stops or breaks, all events and objects are saved to disk

  • if it stops properly, move and compress the run's files

    • based on the status the run started from and the current run id
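Two of the steps above, the emptiness check and the move-and-compress step, can be sketched with the standard library. This is an illustration under assumptions, not aswan's code: the function names, the `RuntimeError` and its message, and the choice to reset `current-run` after archiving are all hypothetical. Only the paths `current-run/events`, `current-run/context.yaml` and `runs/<run id>/events.zip` come from the structure described earlier.

```python
import shutil
import zipfile
from pathlib import Path


def assert_current_run_empty(root):
    """Fail if current-run still holds files from an earlier run."""
    current = Path(root) / "current-run"
    if current.exists() and any(current.iterdir()):
        # Exception type and message are placeholders for illustration.
        raise RuntimeError("current-run is not empty")


def archive_current_run(root, run_id):
    """Compress current-run/events into runs/<run_id>/events.zip."""
    root = Path(root)
    current = root / "current-run"
    events_dir = current / "events"
    target = root / "runs" / run_id
    target.mkdir(parents=True, exist_ok=False)  # a run id must be unique
    with zipfile.ZipFile(target / "events.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(events_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(events_dir).as_posix())
    context = current / "context.yaml"
    if context.exists():
        shutil.copy2(context, target / "context.yaml")
    # Assumed cleanup: leave an empty current-run for the next run.
    shutil.rmtree(current)
    current.mkdir()
    return target
```

Creating the target with `exist_ok=False` means an accidental reuse of a run id raises instead of overwriting an archived run, which fits the "immutable collection" goal.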

Pre v1.0 laundry list

  • parallelize push / pull

  • parsing/connection/broken session error docs

  • transferring / ignoring cookies

  • template projects

    • oddsportal

      • updating thingy, based on latest match in season

    • footy

    • rotten

    • boxoffice
