aswan
Collect and organize data into a T1 data depot. The package is named after the Aswan Dam.
Collect and compress data from the internet for later parsing
quick, parallel, customizable to collect
compressed to store
quick to sync with a remote store
sync to continue collecting
sync to parse
immutable collection
To Set Up a Remote
Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS according to the zimmauth package, and ASWAN_REMOTE with the name of the default remote.
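As a minimal sketch of the setup above, the three variables can be set from Python before the depot is used. The values shown are placeholders, and the helper `missing_remote_vars` is hypothetical (it is not part of aswan); the real values come from your zimmauth configuration.

```python
import os

# Placeholder values only: the real ones depend on your zimmauth setup.
os.environ["ASWAN_AUTH_HEX"] = "<value expected by zimmauth>"
os.environ["ASWAN_AUTH_PASS"] = "<value expected by zimmauth>"
os.environ["ASWAN_REMOTE"] = "my-remote"  # hypothetical remote name

REQUIRED = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# An empty list means all three variables are present.
print(missing_remote_vars())
```

Checking the variables up front gives a clearer failure than an authentication error in the middle of a sync.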
Concepts
objects
saved by collection events
events
collection
registration (v2: registration for parsing)
(v2) parsing
runs
manual run vs automated run
makes manually adding urls easy but revertible
has unique id
generates events
linked to a specific version of the code
ideally commit hash + pip freeze
statuses
determined by base status + runs integrated
contains
what urls need to be collected
(v2) what collected objects need to be parsed
sqlite file, constantly trimmed
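The relationship between these concepts can be sketched in a few lines. This is an illustration only, not aswan's implementation: the `Event` and `Run` classes and the `pending_urls` function are hypothetical, and it assumes that a registration event adds a url to the to-collect set while a collection event removes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    url: str
    kind: str  # "registration" or "collection" (assumed event kinds)


@dataclass(frozen=True)
class Run:
    run_id: str  # every run has a unique id
    events: tuple = ()  # events generated by the run


def pending_urls(base_pending, integrated_runs):
    """Derive a status: the base status plus the runs integrated into it."""
    pending = set(base_pending)
    for run in integrated_runs:
        for event in run.events:
            if event.kind == "registration":
                pending.add(event.url)
            elif event.kind == "collection":
                pending.discard(event.url)
    return pending


base = {"https://example.com/a", "https://example.com/b"}
run = Run(
    run_id="run-1",
    events=(
        Event("https://example.com/a", "collection"),
        Event("https://example.com/c", "registration"),
    ),
)
print(sorted(pending_urls(base, [run])))
```

Because the status is derived purely from a base status and a list of runs, dropping a run from the list reverts its effect, which is what makes manual runs revertible in this sketch.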
Structure
objects
00, 01, …
runs
run-hash
context.yaml
commit-hash, pip-freeze, …
events.zip
statuses
status-hash
context.yaml
parent-status, integrated
db.sqlite.zip
current-run
context.yaml
events
these to be compressed into ../runs
status.sqlite
there is a ‘TEST’ status
whatever is based on it cannot be integrated
a test run can be made on it…
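The layout above can be mocked up with `pathlib` to make the nesting concrete. This is a sketch under assumptions: the function names are hypothetical, and placing `current-run` next to `runs` is inferred from the note that its events are compressed into `../runs`.

```python
import tempfile
from pathlib import Path


def make_depot_layout(root: Path) -> None:
    """Create the empty top-level directories of a depot (illustrative)."""
    for sub in ("objects", "runs", "statuses", "current-run/events"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # The context file of the run in progress.
    (root / "current-run" / "context.yaml").touch()


def describe(root: Path) -> list:
    """List every path under root, relative to it, in sorted order."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    depot_root = Path(tmp)
    make_depot_layout(depot_root)
    layout = describe(depot_root)

print("\n".join(layout))
```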
when starting a run:
check if current-run is empty
if not, fail with
find latest status
if it has not integrated all past runs, create a new status that has
start collection (+ registration)
whether it stops or breaks, all events and objects are saved to disk
if it stops properly, move and compress the run's files
based on one that was the starter, and current run id
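Part of the procedure above can be sketched as follows. It covers only the empty-check on `current-run` and the final move-and-compress step; status handling is left out. The function names, the error message, and the use of `uuid` for the run id are assumptions made for illustration, not aswan's actual behavior.

```python
import shutil
import tempfile
import uuid
import zipfile
from pathlib import Path


def start_run(root: Path) -> str:
    """Refuse to start if current-run still holds data from an earlier run."""
    current = root / "current-run"
    current.mkdir(parents=True, exist_ok=True)
    if any(current.iterdir()):
        raise RuntimeError("current-run is not empty")  # hypothetical message
    run_id = uuid.uuid4().hex
    (current / "events").mkdir()
    (current / "context.yaml").write_text(f"run-id: {run_id}\n")
    return run_id


def finish_run(root: Path, run_id: str) -> Path:
    """Compress the events of a properly stopped run into runs/<run-id>."""
    current = root / "current-run"
    target = root / "runs" / run_id
    target.mkdir(parents=True)
    with zipfile.ZipFile(target / "events.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for event_file in sorted((current / "events").iterdir()):
            zf.write(event_file, arcname=event_file.name)
    shutil.move(str(current / "context.yaml"), str(target / "context.yaml"))
    shutil.rmtree(current)
    current.mkdir()  # leave an empty current-run for the next run
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    run_id = start_run(root)
    (root / "current-run" / "events" / "0001.json").write_text("{}")
    try:
        start_run(root)  # a second start must fail while a run is in progress
        second_start_failed = False
    except RuntimeError:
        second_start_failed = True
    target = finish_run(root, run_id)
    with zipfile.ZipFile(target / "events.zip") as zf:
        archived = zf.namelist()
    current_is_empty = not any((root / "current-run").iterdir())
```

If a run breaks instead of stopping properly, `finish_run` is never called in this sketch, so the events stay on disk in `current-run` and the next `start_run` fails until they are dealt with.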
Pre v1.0 laundry list
parallelize push / pull
parsing/connection/broken session error docs
transferring / ignoring cookies
template projects
oddsportal
updating thingy, based on latest match in season
footy
rotten
boxoffice
Installation:
using pip
pip install aswan
Quickstart
from aswan import __version__
API
aswan Package
Data collection manager
Functions
Classes
ObjectStore: class for storing and retrieving objects downloaded
Class Inheritance Diagram
ABC -> ActorBase -> ConnectionSession
DepotBase -> RemoteMixin -> AswanDepot
UrlHandlerBase -> BrowserHandler
UrlHandlerBase -> RequestHandler
UrlHandlerBase -> WebExtHandler
BrowserHandler, _JsonMixin -> BrowserJsonHandler
BrowserHandler, _SoupMixin -> BrowserSoupHandler
RequestHandler, _JsonMixin -> RequestJsonHandler
RequestHandler, _SoupMixin -> RequestSoupHandler
Shown without a base class: BrokenSessionError, ConnectionError, ObjectStore, ParsedCollectionEvent, Project, ProxyAuth, ProxyBase, Statuses
Release Notes
v0.0.0
first release of aswan, yay!!
v0.0.1
revert to 3.8 for ray
v0.1.0
integrate atqo
v0.1.1
rm ray from basic
v0.2.0
major simplification
v0.3.0
for the devs
v0.3.1
even better
v0.4.0
limit parallelism and some efficiency
v0.4.1
improve response handling
v0.4.2
align with dz
v0.4.3
silence
v0.4.4
silence
v0.4.5
hope
v0.5.0
adapt to atqo
v0.5.1
fix parallel processing
v0.5.2
expose monitor better
v0.5.3
monitor improvement, more parallelization
v0.5.4
rm recursion
v0.5.5
caching in depot
v0.5.6
can't download to Path
v0.5.7
abs path
v0.5.8
post_status kwarg and error handling
v0.5.9
better pulling
v0.5.10
logging and slight remote fixes
v0.5.11
dependency fix
v0.5.12
dependency fix
v0.5.13
add brotli req
v0.5.14
webext handler