aswan


Collect and organize data into a T1 data depot. The package is named after the Aswan Dam.

Collect and compress data from the internet for later parsing

  • quick, parallel, customizable to collect

  • compressed to store

  • quick to sync with a remote store

    • sync to continue collecting

    • sync to parse

  • immutable collection

To Set Up a Remote

Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS according to the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
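
Before syncing, it can help to confirm that all three variables are present. The snippet below is a minimal sketch using only the standard library. The variable names come from the paragraph above, while the helper `missing_remote_vars` is hypothetical and not part of aswan or zimmauth.

```python
import os

# Names taken from the setup instructions; the helper itself is hypothetical.
REQUIRED_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_remote_vars()
if missing:
    print("missing:", ", ".join(missing))
else:
    print("remote configuration looks complete")
```

The check only tests for presence. Whether the values are valid is decided by zimmauth.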

Concepts

  • objects

    • saved by collection events

  • events

    • collection

    • registration (v2: registration for parsing)

    • (v2) parsing

  • runs

    • manual run vs automated run

      • makes adding URLs manually easy but revertible

    • has unique id

    • generates events

    • linked to a specific version of the code

      • ideally commit hash + pip freeze

  • statuses

    • determined by base status + runs integrated

    • contains

      • what urls need to be collected

      • (v2) what collected objects need to be parsed

    • sqlite file, constantly trimmed
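
The idea that a status is "determined by base status + runs integrated" can be illustrated with a small in-memory model. This is only a sketch: the `Event` and `Status` classes and the `integrate` function are hypothetical names, and the rule that a registration adds a URL to collect while a collection removes it is an assumption. aswan itself keeps the status in a sqlite file.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    kind: str    # "collection" or "registration"
    url: str
    run_id: str


@dataclass(frozen=True)
class Status:
    parent: Optional[str]         # id of the status this one is based on
    integrated: Tuple[str, ...]   # ids of the runs folded into it
    pending_urls: FrozenSet[str]  # URLs that still need to be collected


def integrate(status_id: str, status: Status, run_id: str,
              events: Iterable[Event]) -> Status:
    """Derive a new status from a base status and the events of one run."""
    events = list(events)
    registered = {e.url for e in events if e.kind == "registration"}
    collected = {e.url for e in events if e.kind == "collection"}
    # Assumption: registering adds work, collecting completes it.
    pending = (status.pending_urls | registered) - collected
    return Status(
        parent=status_id,
        integrated=status.integrated + (run_id,),
        pending_urls=frozenset(pending),
    )
```

Because each status is an immutable value derived from its parent, dropping a run (for example a manual one) amounts to rebuilding the status without it.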

Structure

  • objects

    • 00, 01, …

  • runs

    • run-hash

      • context.yaml

        • commit-hash, pip-freeze, …

      • events.zip

  • statuses

    • status-hash

      • context.yaml

        • parent-status, integrated

      • db.sqlite.zip

  • current-run

    • context.yaml

    • events

      • these to be compressed into ../runs

    • status.sqlite

  • there is a ‘TEST’ status

    • anything based on it cannot be integrated

    • a test run can be made on it…
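
The objects directory in the layout above can be pictured as a compressed, write-once store. The following is a rough sketch, not aswan's ObjectStore. It assumes that 00, 01, … are shard directories named after the first two hex characters of a content hash, and it uses SHA-256 and gzip as stand-ins for whatever hash function and compression the library is configured with. `save_object` and `load_object` are hypothetical names.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path


def save_object(root: Path, content: bytes) -> str:
    """Store content under objects/<2 hex chars>/<rest of hash>, compressed."""
    digest = hashlib.sha256(content).hexdigest()
    shard = root / "objects" / digest[:2]  # assumed meaning of 00, 01, ...
    shard.mkdir(parents=True, exist_ok=True)
    target = shard / digest[2:]
    if not target.exists():  # immutable collection: never overwrite
        target.write_bytes(gzip.compress(content))
    return digest


def load_object(root: Path, digest: str) -> bytes:
    """Read back and decompress an object by its hash."""
    path = root / "objects" / digest[:2] / digest[2:]
    return gzip.decompress(path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    key = save_object(Path(tmp), b"<html>example</html>")
    assert load_object(Path(tmp), key) == b"<html>example</html>"
```

Addressing objects by hash means a collection event only needs to record the hash, and saving the same content twice leaves a single file on disk.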

when starting a run:

  • check if current-run is empty

    • if not, fail with

  • find latest status

    • if it has not integrated all past runs, create a new status that has

  • start collection (+ registration)

  • whether the run stops properly or breaks, all events and objects are saved to disk

  • if it stops properly, move and compress what it produced

    • based on the status the run started from, and the current run id
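
The guard on current-run and the final move-and-compress step above can be sketched with the standard library alone. This is an illustration under assumptions, not aswan's code: `start_run`, `finish_run` and `CurrentRunNotEmpty` are hypothetical names, and status handling and context.yaml are left out.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


class CurrentRunNotEmpty(RuntimeError):
    """Hypothetical error raised when a previous run left files behind."""


def start_run(depot: Path) -> Path:
    """Refuse to start if current-run has content, else prepare it."""
    current = depot / "current-run"
    current.mkdir(parents=True, exist_ok=True)
    if any(current.iterdir()):
        raise CurrentRunNotEmpty(str(current))
    (current / "events").mkdir()
    return current


def finish_run(depot: Path, run_id: str) -> Path:
    """Compress current-run/events into runs/<run_id>/events.zip and reset."""
    current = depot / "current-run"
    run_dir = depot / "runs" / run_id
    run_dir.mkdir(parents=True)
    with zipfile.ZipFile(run_dir / "events.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted((current / "events").iterdir()):
            zf.write(path, arcname=path.name)
    shutil.rmtree(current)  # leave an empty current-run for the next start
    current.mkdir()
    return run_dir
```

A run that breaks never reaches `finish_run`, so its events stay in current-run and the next `start_run` fails until they are dealt with.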

Pre v1.0 laundry list

  • parallelize push / pull

  • parsing/connection/broken session error docs

  • transferring / ignoring cookies

  • template projects

    • oddsportal

      • updating thingy, based on latest match in season

    • footy

    • rotten

    • boxoffice

Installation:

using pip

pip install aswan

Quickstart

from aswan import __version__

API

aswan Package

Data collection manager

Functions

add_url_params(url, params)

get_json(url[, params])

get_soup(url[, params, browser, headless, ...])

run_simple_project(urls_for_handlers, name)

Classes

AswanDepot(name[, local_root])

BrokenSessionError

BrowserHandler()

BrowserJsonHandler()

BrowserSoupHandler()

ConnectionError

ConnectionSession(depot_path[, is_browser, ...])

ObjectStore(root[, hash_fun, compression, ...])

class for storing and retrieving objects downloaded

ParsedCollectionEvent(cev, store)

Project(name[, local_root, distributed_api, ...])

ProxyAuth(user, password)

ProxyBase()

RequestHandler()

RequestJsonHandler()

RequestSoupHandler()

Statuses()

WebExtHandler()

Class Inheritance Diagram

[Inheritance diagram; arrows point from base class to subclass]

  • ABC → ActorBase → ConnectionSession

  • DepotBase → RemoteMixin → AswanDepot

  • UrlHandlerBase → BrowserHandler, RequestHandler, WebExtHandler

  • BrowserHandler, _JsonMixin → BrowserJsonHandler

  • BrowserHandler, _SoupMixin → BrowserSoupHandler

  • RequestHandler, _JsonMixin → RequestJsonHandler

  • RequestHandler, _SoupMixin → RequestSoupHandler

  • shown without base classes: BrokenSessionError, ConnectionError, ObjectStore, ParsedCollectionEvent, Project, ProxyAuth, ProxyBase, Statuses

Release Notes

v0.0.0

  • first release of aswan, yay!!

v0.0.1

  • revert to 3.8 for ray

v0.1.0

integrate atqo

v0.1.1

rm ray from basic

v0.2.0

major simplification

v0.3.0

for the devs

v0.3.1

even better

v0.4.0

limit parallelism and some efficiency

v0.4.1

improve response handling

v0.4.2

align with dz

v0.4.3

silence

v0.4.4

silence

v0.4.5

hope

v0.5.0

adapt to atqo

v0.5.1

fix parallel processing

v0.5.2

expose monitor better

v0.5.3

monitor improvement, more parallelization

v0.5.4

rm recursion

v0.5.5

caching in depot

v0.5.6

can't download to Path

v0.5.7

abs path

v0.5.8

post_status kwarg and error handling

v0.5.9

better pulling

v0.5.10

logging and slight remote fixes

v0.5.11

dependency fix

v0.5.12

dependency fix

v0.5.13

add brotli req

v0.5.14

webext handler
