History of Spark origins. Structured Streaming is already used extensively.
ML Framework of Spark
Two challenges: communication between ML algorithms and Spark, and marrying ML gang scheduling with Spark's parallel task model.
- Initial approach: a Python UDF applied row by row.
- Vectorized approach: execute a Python UDF on batches of rows.
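The two approaches can be contrasted with a plain-Python sketch. This is a conceptual stand-in, not the PySpark API: in Spark the batched variant is a pandas UDF receiving pandas Series.

```python
# Conceptual sketch (plain Python, not the Spark API): the same transform
# as a row-by-row UDF vs. a batched ("vectorized") UDF.
def scalar_udf(x):
    # called once per row -> per-call serialization overhead for every row
    return x * 2 + 1

def batched_udf(batch):
    # called once per batch of rows -> overhead amortized across the batch
    return [x * 2 + 1 for x in batch]

rows = [1, 2, 3, 4]
row_by_row = [scalar_udf(x) for x in rows]   # one Python call per row
vectorized = batched_udf(rows)               # one Python call per batch
assert row_by_row == vectorized == [3, 5, 7, 9]
```

Same result either way; the win is purely in how often the Python boundary is crossed.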
Spark tasks are independent and completely parallel with no communication. Distributed ML needs inter-task communication - tasks are not completely parallel.
Unification by Project Hydrogen:
- Load phase - independent tasks
- Gang phase - intercommunication
- Sink phase - independent tasks
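The three phases can be sketched with a stdlib barrier. This is a conceptual stand-in, not Spark's barrier execution API: load and sink run as independent tasks, while the gang phase requires every task to rendezvous.

```python
import threading

# Sketch of Project Hydrogen's phases (NOT Spark's barrier API):
NUM_TASKS = 4
gang_barrier = threading.Barrier(NUM_TASKS)
results = []
lock = threading.Lock()

def task(i):
    data = i * 10          # load phase: independent work per task
    gang_barrier.wait()    # gang phase: all tasks must be scheduled together
    with lock:             # sink phase: independent writes
        results.append(data)

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [0, 10, 20, 30]
```

The barrier is the crux: if any one task is not running, the rest block, which is exactly why gang scheduling conflicts with Spark's independent-task model.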
Zaharia's Talk on MLflow
- collect raw data
- prepare the data
Facebook, Uber, and Google have custom standardizations of the ML lifecycle, but they are limited to a few hand-integrated algorithms tied to the companies' infrastructure. MLflow thus attempts to establish a public standard that anyone can integrate with.
Introducing MLflow:
- works with any ML library and language
- cross-cloud support
- scales to big data
Example without MLflow:
- Running experiments with different params, versions, code changes.
- The developer loses track of what happened, and handing over is impossible.
- Adopting MLflow can capture all of the above by tracking ML jobs and allows reproduction.
- data is not ready
- data is siloed
- data science and data engineering are siloed
Schema enforcement, ACID transactions, query performance,
Databricks Runtime for ML
- LTS for OSS ML frameworks and engines
Data Science and Data Engineering Silos
Engineering gives a platform to data science; data science writes the model script; engineering deploys it.
Uses MLflow, Delta, and the ML Runtime.
Issues with a prediction: use MLflow to analyze past performance, revert to older data, find that non-normalized data was loaded, review the transactions, and rerun the previous job with normalized data.
Deep Dive Spark SQL
SQL: Structured Query Language. Schema: name, type, nullability. Why structure matters: storage optimization, computation optimization (Tungsten). Basics.
Common Statistical Pitfalls in Data Science
Car stopping distance:
- linear regression of one variable on the other.
- lin reg speed -> distance vs distance -> speed gives different lines
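A quick numeric sketch (toy numbers, hypothetical data, not the original dataset) of why the regression direction matters: the slope of distance ~ speed is not the reciprocal of the slope of speed ~ distance unless the correlation is perfect.

```python
# Toy data (hypothetical):
speed    = [4, 7, 8, 9, 10, 12, 13, 15]
distance = [2, 4, 16, 10, 18, 24, 34, 26]

def slope(xs, ys):
    # least-squares slope of y ~ x: cov(x, y) / var(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

b_yx = slope(speed, distance)   # distance ~ speed
b_xy = slope(distance, speed)   # speed ~ distance
# The two fits disagree: 1 / b_xy != b_yx for imperfectly correlated data.
assert abs(b_yx - 1 / b_xy) > 0.1
```

The gap between the two lines grows as the correlation weakens, which is why the choice of regression direction is a modeling decision, not a formality.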
Treatment A vs B:
- need to control for the hard cases to find the overall better treatment.
- on the other hand, sometimes it is not a good idea to control for that
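The kidney-stone numbers from the classic study (Charig et al., 1986) show the paradox concretely: treatment A wins within each stone-size group, yet pooling makes B look better. A minimal check:

```python
# (successes, trials) from the classic kidney-stone study:
success = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

def rate(treatment, size=None):
    # success rate for a treatment, optionally restricted to a stone size
    keys = [k for k in success if k[0] == treatment and (size is None or k[1] == size)]
    s = sum(success[k][0] for k in keys)
    n = sum(success[k][1] for k in keys)
    return s / n

assert rate("A", "small") > rate("B", "small")  # A better on small stones
assert rate("A", "large") > rate("B", "large")  # A better on large stones
assert rate("B") > rate("A")                    # ...yet B better overall
```

The flip happens because stone size influences both the treatment choice and the outcome, which is exactly the confounding structure discussed below.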
Carburetors and axle ratio:
- unsplit data shows a single linear regression
- split data has a different correlation
The resolution of both is due to causation:
- humans reason via causation
- data may not contain causal info
- correct data inference requires causal model
Car stopping: speed causes stopping distance. Kidney stone: size -> treatment, size -> outcome, treatment -> outcome
- treatment -> pH -> outcome
- axle ratio -> cylinders
- carburetors -> cylinders
Recommended book: The Book of Why:
P(Y | X) != P(Y | do(X))
Smoking causal model:
- genes -> cancer; genes -> smoking -> tar -> cancer
- causality is critical
- confounding variables
- PGMs help
- do-calculus clarifies
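A tiny numeric sketch of the difference (all probabilities hypothetical), using the back-door adjustment over the confounding gene G: conditioning on X mixes in the gene's effect, while intervening averages over the population's gene distribution.

```python
# Toy confounded model: G -> X (smoking), G -> Y (cancer), X -> Y.
p_g = {1: 0.3, 0: 0.7}            # P(G)
p_x_given_g = {1: 0.8, 0: 0.2}    # P(X=1 | G)
p_y = {(1, 1): 0.9, (1, 0): 0.4,  # P(Y=1 | X, G)
       (0, 1): 0.6, (0, 0): 0.1}

# Observational: P(Y=1 | X=1), summing out G with Bayes' rule
p_x1 = sum(p_x_given_g[g] * p_g[g] for g in (0, 1))
p_y_given_x1 = sum(p_y[(1, g)] * p_x_given_g[g] * p_g[g] for g in (0, 1)) / p_x1

# Interventional: P(Y=1 | do(X=1)) via the back-door adjustment over G
p_y_do_x1 = sum(p_y[(1, g)] * p_g[g] for g in (0, 1))

assert abs(p_y_do_x1 - 0.55) < 1e-9
assert p_y_given_x1 > p_y_do_x1   # P(Y|X) != P(Y|do(X))
```

Here the observational conditional overstates smoking's effect because gene carriers both smoke more and get cancer more; the do-expression removes that backdoor path.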
SimilarWeb Interaction-Based Feature Extraction
We know unique-user traffic per website. For some websites we know the users' gender; for others we don't. With so many websites and users, the problem is large. PCA and matrix factorization did not work. They selected features representing buckets of the gender distribution, projected the problem into that space, and were able to resolve it. Predicting at the individual-user level was not possible due to large noise.
Integrates Spark with Avro. Supports Structured Streaming, schema management, schema retention, and Confluent Kafka Avro schemas.
OASIS—Collaborative Data Analysis Platform Using Apache Spark
Improves over Zeppelin in security and scale. Extracted data can be visualized and shared within the team. Will be open-sourced soon.
Streaming Random Forest
- streaming has extra challenges
- output is trained model
Stationary data: data distribution unknown but static in time. Non-stationary: unknown and dynamic in time => features are dynamic.
Labeling can be available early or delayed; categorize the types of concept drift.
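As a toy illustration of non-stationarity, a sliding-window mean comparison flags a shift in the data distribution. The window size and threshold here are arbitrary choices for the sketch, not from the talk:

```python
from collections import deque

def drift_points(stream, window=50, threshold=0.5):
    # Flag indices where the current window's mean departs from a
    # reference window by more than `threshold`.
    ref, cur, drifts = deque(maxlen=window), deque(maxlen=window), []
    for i, x in enumerate(stream):
        if len(ref) < window:
            ref.append(x)        # fill the reference window first
            continue
        cur.append(x)
        if len(cur) == window:
            if abs(sum(cur) / window - sum(ref) / window) > threshold:
                drifts.append(i)
                # adopt the new regime as the reference
                ref, cur = deque(cur, maxlen=window), deque(maxlen=window)
    return drifts

# Distribution shifts from mean 0.0 to mean 2.0 halfway through:
stream = [0.0] * 200 + [2.0] * 200
assert drift_points(stream) != []       # drift detected after the shift
assert drift_points([0.0] * 400) == []  # no drift on stationary data
```

Real drift detectors (e.g. ADWIN, DDM) are more principled, but the mechanics are the same: compare recent data against a reference and adapt when they diverge.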
StreamingDecisionTree: standard tree, intuitively splits after seeing even a small amount of data. Online bagging:
StreamingLogisticRegressionWithSGD has a similar API. Training:
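The online bagging mentioned above is typically Oza and Russell's scheme: show each incoming example to each ensemble member k times, with k ~ Poisson(1), approximating bootstrap resampling on a stream. A sketch with a trivial counting base learner as a stand-in for a streaming tree (not the talk's implementation):

```python
import random

def poisson1(rng):
    # Knuth's algorithm for sampling Poisson(lambda=1)
    k, p = 0, rng.random()
    while p > 0.36787944117144233:  # e^-1
        k += 1
        p *= rng.random()
    return k

class MajorityLearner:
    # Trivial base learner: remembers per-class counts.
    def __init__(self):
        self.counts = {}
    def learn(self, label):
        self.counts[label] = self.counts.get(label, 0) + 1
    def predict(self):
        return max(self.counts, key=self.counts.get) if self.counts else None

def online_bagging(labels, n_models=10, seed=0):
    rng = random.Random(seed)
    models = [MajorityLearner() for _ in range(n_models)]
    for y in labels:
        for m in models:
            for _ in range(poisson1(rng)):  # weight this example k times
                m.learn(y)
    votes = [m.predict() for m in models]
    return max(set(votes), key=votes.count)  # ensemble majority vote

stream = ["spam"] * 70 + ["ham"] * 30
assert online_bagging(stream) == "spam"
```

The Poisson(1) weighting is the whole trick: over a long stream each example's multiplicity distribution converges to what offline bootstrap sampling would give.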
Indexing and Ranking via Spark
hard to understand for me personally
Click-through data gives relevance - the search term determines the ranking of the search results.
Idea of the method in the talk:
Rank datasets on Elasticsearch using LTR. Extract relevance features. Bootstrap on dataset load time the search …
Spatio-Temporal Modeling
- factor analysis (looking for latent variables) - preliminary regression
- time: auto-correlation regression
- spatial correlation regression
- time: auto-correlation regression
- factor analysis (looking for latent variables) - preliminary regression
this single iteration across the domains gives good enough results
Morpheus, a Spark version of Neo4j.
Graph DBs have growing usage in industry. Nodes have a name, properties, and relationships. Relationships have a type and properties. Neo4j runs on some backend DB like Hadoop, Hive, or Spark. Attempt to create GQL.
- Load CSV into Hive.
- Create views per relationship.
- Init Cypher via query on relationships.
- Execute Neo4j graph merge tool.
Query “match customer interact customerRep where type in cancel”.
- Create new graph with same schema
Graphs and tables can be loaded into Cypher, queried, and stored back to a graph or table. Thus we can compose queries into other queries, e.g. match socialNetwork sn match person p where p.email = sn.email
- graph matching: returns results
- graph construction: takes outputs and constructs a new queryable graph
RDBMS vs graph: nodes may have no properties or relationships.
Cypher: SQL for graphs and the start of GQL. Cypher for Spark runs on Spark SQL (DataFrames).
Jacek’s Spark SQL Bucketing
Bucketing is similar to partitioning, but with a given number of buckets. It pre-shuffles the tables for future joins - the more joins, the more gains. Enabled by default. For file-based data sources: DataFrameWriter#bucketBy(numBuckets, colName, colNames). In SQL one can use DISTRIBUTE BY.
large.write.bucketBy(4, "id").mode(SaveMode.Overwrite).saveAsTable("bucketed")
val bucketed = spark.table("bucketed")
bucketed.join(...)
Thanks to bucketing, SortMergeJoin avoids exchanges, as we have pre-shuffled.
Skew: bucketing helps here too, but less. Choosing the number of buckets: based on knowledge of the data or on experimentation - no easy answers to simple-looking questions. Knowing whether a table is bucketed: there is a query for that.
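A conceptual sketch of why pre-bucketed tables join without an exchange: both sides assign each row to bucket hash(key) % numBuckets at write time, so matching keys are guaranteed to land in the same bucket index on both sides. Plain Python here, not Spark internals (Spark actually uses Murmur3 hashing):

```python
NUM_BUCKETS = 4

def bucketize(rows, key):
    # Assign each row to a bucket by hashing the join key (CPython's
    # hash() is a stand-in for Spark's Murmur3).
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[hash(row[key]) % NUM_BUCKETS].append(row)
    return buckets

large = bucketize([{"id": i, "v": i * i} for i in range(100)], "id")
small = bucketize([{"id": i, "name": f"n{i}"} for i in range(10)], "id")

# Join bucket-by-bucket: no row ever needs to cross buckets (no exchange).
joined = []
for lb, sb in zip(large, small):
    names = {r["id"]: r["name"] for r in sb}
    joined += [{**l, "name": names[l["id"]]} for l in lb if l["id"] in names]

assert len(joined) == 10
```

Since both tables used the same hash and the same bucket count, the shuffle that SortMergeJoin would normally need has already happened at write time.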
Intro similar to keynote explanation.
mlflow.log_param("layers", layers) mlflow.log_metric("standard deviation", std)
Example Airbnb Pricing
model = ElasticNet(...)  # linear model
mlflow.log_param("alpha", alpha)
model.fit(data, ...)
mlflow.log_metric("mae", mae)
mlflow.set_tag("..")
- params: model configuration
- metrics: model performance
- tags and notes
- artifacts: associated files
- source: git revision id
# ls conda.yaml main.py model.py dependencies.txt MLProject
MLProject is similar to a Dockerfile. It defines an entrypoint and which parameters can be passed to the model.
Thanks to the MLProject file, MLflow can rerun an older model with different input parameters or on different data.
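A hypothetical MLproject file along these lines (the project name, parameters, and paths are illustrative, not from the talk):

```yaml
# Illustrative MLproject file
name: airbnb_pricing
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.1}
      data_path: {type: str, default: "data/listings.parquet"}
    command: "python main.py --alpha {alpha} --data {data_path}"
```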
MLFlow Model Packaging Format
Allows using different frameworks to implement the model and different target infrastructures like Docker, Spark, and others.
$ ls MLmodel estimator save_model.pb variables.pb
MLmodel can be the source for generating several flavours: Spark, MLeap, and pyfunc. Generic flavours provide decoupling …
A pyfunc object is saved as a directory where a single instance needs to implement the predict function.
mlflow pyfunc serve -m mlruns/0/XXXX/artifacts/keras_model -p 50050
curl localhost... # or directly predict
mlflow pyfunc predict -m ... -i ./test_data.csv
Generate Spark UDF
Github advanced examples:
- Hyperparameter tuning
Social Media Influencers Detection, Analysis and Recommendation (Social Bakers)
- Badly structured data.
- Large dataset.
- Need to scale.
Influencers on Spark, developed in around a year on Databricks Spark. A search and recommendation engine over a database of public profiles, providing performance metrics and attributes of influencers. Predicting:
- image recognition: age, gender
- looking in text for location names
- calculating confidence
- calculating relevance score and attribute score
- Stores and
- base predictor, ensembling predictions, merging analyses
- Mongo transform, Elasticsearch transform
- the UI queries this backend
- successful scaling speedup
- data engineering vs data science
- using %run for notebook inclusion
- AWS Athena helps with debugging
- overloading Mongo - had to start batching to avoid too many parallel requests on the DB
- versioning and CI/CD for notebooks are very much needed