Vaclav Kosar's face photo

Vaclav Kosar

Vaclav Kosar is a senior software developer. This blog is focused on interesting technology, tips and news.

Spark and AI Summit

03 Oct 2018

ML Hydrogen

History of Spark origins. Strucutred Streaming is already used extensively.

ML Framework of spark

Two challenges communitcation between algos-spark and marring ML gang schedule vs Spark parallel.

Execution Models

Tasks are independent and completely paralel with no comms. Distib ML needs task intercommunication - tasks are not completely parallel.

Unification by project Hydrogen

Zaharia Talk ML Flow

ML Livecyc:

Facebook, Uber, Google have custom standardization to hte livecycles, but they are limited to a few algos which where hand integrated and tied to companies infrasturct. Spark thus attempts to establish public standard allowing to integrate publically.

Introducing ML Flow:

Example without ML Flow:

Unified Analytics


D-Bricks Delta

Schema enforcement, ACID transactions, query perf,

3# DBrics fruntime for ML

Dsci n Deng siloes

eng givs platform to dsci. dsci codes model scrip deng deploys

Example Dataflicks

uses mlflow, delta, ml runtime

issues iwth prediction mlflow to analyze past perf revert to older data find non normalized data was loaded review trasactions and rerun previous job but with normalized job

Deep Dive Spark SQL

SQL: Structured Query Language Schema: name, type, nullability Structure importance: storage optimization, calc optimization tungsten Basics

Comon Stat Pitfalls In Dsci

Car stopping distance:

Treatment A vs B:

Carboraters and axle ratio:

Resolution of both is due to causation:

Car stopping: speed causes stopping distance Kidney stone size -> Treatment -> outcome -> outcome

Blood pH:


Recommended book: The Book Of Why:

Smoking causal module:


SimilarWeb Interaction-Based Feature Extraction

know unique user traffic have website when we know the user gender and website when we don’t know due to too many websites and users the problem is large PCA or matrix factorization did not work selected feature representing a bucket of gender distribution projected the problem into that space and was able to resolve the problem attempting to look at user level prediction was not possible due to large noise


Integrates Spark with Avro. Supports structures streaming, schema management, schema retention, Confluent Kafka Avro schemas,

OASIS—Collaborative Data Analysis Platform Using Apache Spark

improves over zeppelin: security and scale extracted data can be vised and shared wihtin team will be OSSed soon

Streaming Random Forrest

Stationary data: data distribution unknown but time static Non-stat: unknown and time dynamic => features are dynamic

Labeling can be available early, delayed, categorize concept drift


StreamingDecisionTree: standard, intuitively splits seeing even small amount of data Online bagging:

StreamingLogisticRegressionWithSGD has similar API Training:

Indexing and Ranking via Spark

hard to understand for me personally

click through data relevance - search term denoting ranking of the search results

Idea of the method in the talk:

rank datasets on Elasticsearch using LTR extract relevance features boostrap on dataset load time the search …

Spatio Temporal Modeling

factor analysis (looking for latent variables) - prelimirary regresison time - auto correlation regression spatial correlation time - auto correlation regression factor analysis (looking for latent variables) - prelimirary regresison

this single iteration across the domains gives good enough results

Neo4J Morpheus

Morpheus version of Neo4J.

Graph dbs have growing usage in the industry. Nodes have name, properties and releationships. Relationships have type, properties. Neo4J on some backend db like Hadoop, Hive, Spark. Attempt to create GQL



Graphs and tables can be loaded to Cypher queried and stored back to graph or table. Thus we can compose queries into other quieries. e.g. match socialNetwork sn match person p where =


RDMS vs Graph: nodes may have no props or relationships.


Cypher: sql for graphs and start of gql Cypher for spark: runs on spark SQL (dataframe)

Jacek’s Spark SQL Bucketing

Bucketing is similar to partitioning with providing num of buckets. It pre-schuffles the tables for future join. More joins more gains. Enabled by default. For file based data source: DataFrameWriter#bucketBy(numBuckets, colName, colNames). In SQL one can use distributedBy.


bucketed = large.write.bucketBy(4, "id").mode(SaveMode.Overwrite).saveAsTable()

Thanks to bucketing SrotMergeJoin will exchanges as we pre-shuffled.


Skewing: Will help the same but less. Choosing num buckets: based on knowledge of data or experimentation. No easy answers to simple looking questions. Known when bucketed: there is a query on that.

ML Flow

Intro similar to keynote explanation.


mlflow.log_param("layers", layers)
mlflow.log_metric("standard deviation", std) 

Example Airbnb Pricing

model = ElasticNet(...) # linear model
mlflow.lgon_param("aplhap", alpha), ...)
mlflow.log_metric("mae", mae)


MLProject Structure

# ls

MLProject is similar to a Dockerfile. Defines a entrypoint and defines which params can be passed to a model.

Reusing models

Thanks to MLProject file MLFlow allows to reload older model with different input params or different data.

MLFlow Model Packaging Format

Allows using different frameworks to implement model and different target infrastructures like Docker, Spark and others.

$ ls

MLModel can be source for generation of several flavours. Spark, Mleap and pyfunc Generic flavours provide decoupling …

Pyfunc object is saved as a directory were single instance need to implement predict funciton.

mlflow pyfunc serve -m mlruns/0/XXXX/artifacts/keras_model -p 50050
curl localhost...
# or directly predict
mlflow pyfunc predict -m ... -i ./test_data.csv

Generate Spark UDF



Github advanced examples:

Social Media Influencers Detection, Analysis and Recommendation (Social Bakers)

Influencers on Spark developed in around a year on Databricks Spark. Search and recommendation engine over datbs of pub profiles providing perf metrics and attributes of influencers. Predicting:

Predictor example:



Subscribe: Twitter , Facebook , RSS
Share on: Twitter , Facebook , Google+ , LinkedIn , Reddit .

Report any trackers to [email protected].