Extending Piwik At R7.com
Extending Piwik at r7.com
Phase 1 – Collecting data
Adding some cloud and modern scalability to a traditional LAMP stack
Leonardo Lorieri, r7.com system architect, 'lorieri at gmail.com', Feb/2012
Why Piwik ?
- Open Source = flexible, understandable, free!
- Great interface
- Mobile app
- REST API
- Developers know the market's needs
- Efficient on small machines
- Lots of possible improvements
- Lots of improvements already in the roadmap
- Great and supportive community (Thank you all!)
Our Plan, goals and trade-offs
- Don't change the original code - reduces development and maintenance costs
- Count only visits and page views - to stay fast and focused (even though you can still use the .js tracker, it is easy to get lost in the UI's beauty and all its functionality)
- Handle odd, unexpected traffic peaks - from TV announcements
- Count more than just websites - media delivery, internal searches, debugging
- At least 99% accuracy
- Have numbers to compare with other analytics tools
- We've lost P3P for now
Our big problem - The TV Effect
from Gaiser's presentation at http://www.slideshare.net/rgaiser/r7-no-aws-qcon-sp-2011
Traffic peak during a TV Show
Regular Piwik Setup
based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes
- Apache/Nginx - Php - Mysql
Bigger Piwik Setup
based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes
- Apache/Nginx - Php
- MySql
Regular Php Scaling Piwik Setup
based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes
- Apache/Nginx - Php
- MySql Replication (slave for backup only, piwik is not "slave ready")
Load balancer/Nginx
Two problems, one easy solution
Problem: data collection for the TV Effect. Easy solution: make it asynchronous.
Problem: data processing. Hard solution: huge ($$$) servers and complex tunings.
Asynchronous Piwik Setup
based on Rodrigo Campos presentation http://www.slideshare.net/xinu/capacity-planning-for-linux-systes
- Nginx - NOT even Php
- MySql Master - Apache+Php for Admin UI - Archive cron
Load balancer/Nginx
- MySql Slave - Perl/Python worker to process logs
(manages user cookies)
(user cookie)
- accesses logs
Visits
REST API
<img src=> request
Admin/Reports
Nginx (more details later)
- Small virtual machines can handle thousands of requests per second
- Visits divided into logs by virtual host
- HttpUserIdModule - automatically creates and handles user id cookies
- HttpLogModule - formats the log as NCSA combined (logging cookies and referrers)
- HttpEmptyGifModule - responds with an empty gif
- HttpHeadersModule - expires -1;
(all modules available in Ubuntu's nginx-extras package)
- Logrotate - unix tool to rotate logs - every 5 minutes in our case
Log processing "Worker" (more details later)
- Copy and uncompress the available logs
- Format them as REST API requests:
  - force the visit date
  - force the client ip
  - force the idVisitor
  - force the User Agent
  - use the log's referrer as the URL
  - use the referrer as the page title (useful to log multiple hostnames)
- Send the requests to the Piwik server in parallel - we are using 270 concurrent requests, which gives us 1300 requests per second
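As a rough sketch of this formatting step (shown in modern Python; the field layout follows the nginx log_format used later in this deck, but the sample line and the parameter choices are illustrative, not our production code):

```python
import re
from urllib.parse import urlencode

# Illustrative sketch: parse one NCSA combined log line (with the nginx uid
# cookie as the last quoted field) and build Piwik tracking-API parameters.
LOG_RE = re.compile(
    r'(?P<ip>\S+) - \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" \d+ \d+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)" '
    r'"(?P<uid>[^"]*)"'
)

def log_line_to_params(line):
    m = LOG_RE.match(line)
    if m is None:
        return None
    return {
        'rec': '1',
        'cip': m.group('ip'),        # force the client ip
        'cdt': m.group('date'),      # force the visit date
        'url': m.group('referrer'),  # use the log's referrer as the URL
        'ua': m.group('agent'),      # force the User Agent
        '_id': m.group('uid'),       # force the idVisitor (nginx uid cookie)
    }

sample = ('10.0.0.1 - - [01/Feb/2012:12:00:00 +0000] "GET /x.gif HTTP/1.1" '
          '200 43 "http://www.r7.com/news" "Mozilla/5.0" "CAFEBABE12345678"')
params = log_line_to_params(sample)
print(urlencode(params))
```

Each resulting query string becomes one piwik.php request, which is what the worker then fires in parallel.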
Mysql Master (more details later)
- The only machine that has to be huge, saving money
- The Piwik admin and reports interface is here - it could be somewhere else, but the machine is huge anyway
- Mysql tuning
- Raid tuning
- Linux networking tuning - the same on all machines, to handle many concurrent tcp connections
Php tuning (more details later)
- Max execution time
- Max input time <--- a php bug reports it as max execution time
- Max memory limit
- Apc
- Apc shm_size
Some problems:
- Consider Apache: it is slower than nginx, but more stable, much easier to debug, and it is easier to control concurrency
- MySQLi is more stable and has better debugging than mysql_pdo
- mod_php is more stable and easier to debug than fastcgi
Piwik tuning
Follow the rules: http://piwik.org/faq/new-to-piwik/#faq_137
- disable unused plugins
- since the cookies come from nginx, you can set in config.ini:

[Tracker]
trust_visitors_cookies = 1
Handling the TV Effect
nginx requests/second > maximum requests on traffic peaks - autoscaling guarantees it - autoscaling provides scheduled capacity changes
total requests in a day < (rest api requests/second) * (all seconds in a day) - even though peak requests vary by 1000% in a short time, the total amount of traffic is easily handled when it comes in a queue with a fixed requests/second rate; it will only take some more time to catch up
maximum apache concurrent requests > maximum concurrent worker connections - the program that processes the logs cannot make more requests than apache can handle; we configure apache for 1000 concurrent requests and configure the worker to send 260 concurrent requests, so apache has some free slots for other admin tasks
mysql max connections > apache concurrent requests - otherwise you will get "too many connections"
archive.php performance > rest api input rate - you can't input more than archive.php will be able to handle, otherwise you will end up with logs that you will never be able to process
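These rules are plain inequalities, so they can be sanity-checked with a few lines of arithmetic. A minimal sketch, using the rates quoted in this deck where given (the daily total and the mysql limit are hypothetical example values):

```python
# Sanity-check the capacity rules above. api_rate and the apache/worker
# limits are figures quoted in this deck; the other two are hypothetical.
api_rate = 1300               # REST API requests/second the worker sustains
worker_concurrency = 260      # concurrent requests the worker sends
apache_max_clients = 1000     # apache concurrent request limit
mysql_max_connections = 1100  # hypothetical; must exceed apache's limit
total_requests_per_day = 80_000_000  # hypothetical daily total

seconds_per_day = 24 * 60 * 60

# total requests in a day < (rest api requests/second) * (seconds in a day)
assert total_requests_per_day < api_rate * seconds_per_day

# maximum apache concurrent requests > maximum concurrent worker connections
assert apache_max_clients > worker_concurrency

# mysql max connections > apache concurrent requests
assert mysql_max_connections > apache_max_clients

# how long the fixed-rate queue takes to drain a day of traffic
catch_up_hours = total_requests_per_day / (api_rate * 3600.0)
print("a day of traffic drains in %.1f hours" % catch_up_hours)
```

As long as the first inequality holds, any peak only delays the catch-up; it never loses data.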
Our real setup, how we deployed it
AWS autoscaling for the Nginx machines - easy high availability; increases and decreases the number of collecting machines automatically, saving money - logrotate runs when a machine is "terminated", to make sure no requests are lost
AWS SNS - Easily notifies when a new log file is ready to use, making it easy to synchronize the file processing - Notifies multiple queues - Having multiple queues, we can use the same logs for multiple analysis tools. We use one for web analysis and another for flash player debugging
AWS SQS - Easy queue service, so we don't need expensive and complex high availability setups for it
AWS S3 - Cheap and virtually unlimited storage - Very easy access to files - Durable, Amazon guarantees better durability than regular data centers
Nginx - embedded perl script to get the real IP on Amazon (the perl module is also included in Ubuntu's package)
Logrotate - Added an s3cmd command (package also available on Ubuntu) to upload the log to a S3 bucket, and an AWS CLI command to send a notification to SNS once it is finished.
Our setup diagram
Visits -> ELB (Elastic Load Balancer) -> nginx autoscaling pool
nginx autoscaling pool -> S3 bucket (one file per virtual host, per machine, every 5 minutes)
S3 bucket -> SNS notifications (one notification per S3 file) -> SQS queues
SQS queues -> worker (mysql slave, apache, piwik api, python-boto, python-twisted), plus other workers/processors for other projects
worker -> mysql connection -> BigAssMySql (mysql master, piwik) <- Piwik Users
(the worker and the MySQLs run in our datacenter)
Our Worker - Part 1
Our choice of the REST API was based on the same PHP scaling philosophy: small standalone processes that are easy to multiply. Also, as with MySQL replication, it is easier and healthier to process lots of small pieces than to freeze the servers with huge processes.
To input the requests in parallel we used python twisted, as shown in this blog post: http://oubiwann.blogspot.com/2008/06/async-batching-with-twisted-walkthrough.html
We installed apache and piwik on the mysql slave machine (of course php connects to the master's mysql), then we tuned apache, mysql and the tcp connections (as shown before). We access the REST API using http://127.0.0.1/.
From the twisted blog post mentioned earlier, we changed maxRun to 260, added some logging and error handling (we check whether a gif was returned and its size; otherwise we log the failed request to be reprocessed later), and we implemented the callLater mentioned in the blog's comments with 0.03 seconds.
To get messages and files from Amazon, we are using python-boto
Our Worker - Part 2
Work flow:
- check for new messages in the AWS SQS queue
- if there is a message, it means a new file is available; the message contains a s3 file path
- with the s3 file path, download the file and uncompress it
- transform the NCSA log into REST API request URLs
- put the URLs in an array
- delete the message from the queue
- run the twisted reactor over that array, making requests to the Piwik server in parallel
- if a request fails, log it to be reprocessed later and alarm it in the monitoring system (we use zabbix btw; for more information: http://lorieri.github.com/zabbix/)
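The deck uses Twisted for the parallel requests; as a hedged illustration of the same idea (bounded concurrency plus a failed-request list for reprocessing), here is a thread-pool stand-in with a fake fetch function. This is not the Twisted code we actually run; the URLs and the fetcher are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_parallel(urls, fetch, max_run=260):
    """Send requests with bounded concurrency; collect failures to retry."""
    failed = []
    def one(url):
        try:
            body = fetch(url)
            if not body:          # e.g. no gif came back
                failed.append(url)
        except Exception:
            failed.append(url)
    with ThreadPoolExecutor(max_workers=max_run) as pool:
        list(pool.map(one, urls))
    return failed

# Example with a fake fetcher that fails on exactly one URL:
urls = ['http://127.0.0.1/piwik.php?rec=1&idsite=%d' % i for i in range(5)]
def fake_fetch(url):
    return b'' if url.endswith('3') else b'GIF89a'

failed = process_in_parallel(urls, fake_fetch, max_run=4)
print(failed)
```

The failed list is what gets logged for later reprocessing and alarmed in monitoring.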
Note: it is good to have one SNS topic and one SQS queue per virtual host if you have many of them. Python details later.
Better costs management
- Contributing to Piwik and sharing our ideas brings more ideas and more improvements, and one consequence is reduced costs
- CPU on Amazon is cheap and you pay as you use by time
- Traffic on Amazon is cheap, and you pay as you use, no long term contracts
- By dividing the work we can better manage resources, like having only one or two huge machines for the MySQLs and lots of small virtual nginx machines in an autoscaling setup. It is also easy to decouple the worker processing to other machines
- Not changing Piwik's code reduces maintenance and development costs
- High Availability on Amazon is easy and cheap
- Storage durability on Amazon is automatic and cheap
- Storage retrieval and management on Amazon is very easy and fast
- Distribution control on Amazon is easy and cheap
- Having an easy way to access the logs makes it simple to replay traffic, so you can run tests as much as you need and test as many tools as you want, improving resource usage and reducing costs
- Amazon reduces their prices and improves their services all the time
Not only web analytics
We are also using Piwik to log video plays
Once a user hits the play button in the flash player, it triggers a GET request similar to this:
http://player.mysite.com/CATEGORY/VIDEONAME
And we use the video name as the action's page title.
It will appear in Piwik's Actions interface divided by category, and by video name in the actions' page titles.
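A tiny sketch of that mapping (the hostname and path are the illustrative ones above, not real URLs):

```python
from urllib.parse import urlparse

# Split the player request path into the Piwik category and page title,
# following the http://player.mysite.com/CATEGORY/VIDEONAME convention.
def video_action(url):
    path = urlparse(url).path.strip('/')
    category, _, video = path.partition('/')
    return {'category': category, 'action_name': video}

print(video_action('http://player.mysite.com/news/breaking-story'))
```

The action_name is what shows up as the page title in the Actions report, grouped under its category.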
Real Numbers
Sorry, we can't provide real numbers, but we can do tests and show how far we can go.
- Collecting data - Nginx: as many requests per second as we need; it is just a matter of adding more cheap nginx virtual machines
- REST API - Running outside the master machine we've got 1500 requests/s; our Mysql Master has 2 quad core cpus, 64GB of memory and Raid 10
- Download of logs - If you run inside Amazon, the traffic is free, the bandwidth is huge and the latency is small. We download the logs outside Amazon and it is not our bottleneck yet
- Distribution tasks control - SNS and SQS do it for us; not a bottleneck yet
- We are still testing how much data we can archive over a month or two; it is already possible to archive one hour of 1000 requests/s in 30 minutes (considering S3 download and uncompressing), enough to log 50 million a day. But the tests are at too early a stage.
CODE OR GTFO!
It is hard to show all the code. Most of the tools are regular tools from Ubuntu and Amazon, and some others are relevant only to us. But some code and a few links can help a lot.
Unfortunately I can't teach everything, including how to use all the Amazon tools, but the key points will be shown, like how to get the real user ip address on Amazon and some of the linux and mysql tuning.
MySql tuning details - Raid
Raid: http://hwraid.le-vert.net/
Our Raid: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
Our commands:
Check the battery status:
/usr/sbin/megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -e '^isSOHGood' | grep ': Yes'

Check that the write cache (WriteBack) is enabled:
/usr/sbin/megacli -LDInfo -LAll -aAll | tee /tmp/chefraidstatus | grep 'Default Cache Policy: WriteBack'

Turn on the cache:
/usr/sbin/megacli -LDSetProp Cached -LALL -aALL

Turn off the cache in case the battery is not good:
/usr/sbin/megacli -LDSetProp Direct -LALL -aALL

Turn on the HDD cache:
/usr/sbin/megacli -LDSetProp EnDskCache -LAll -aAll

Turn on the adaptive read-ahead cache:
/usr/sbin/megacli -LDSetProp ADRA -LALL -aALL

DO NOT FORGET TO MONITOR THE RAID: there are tools for it on the website above.
MySql tuning details - InnoDB
You can find all the mysql tunings here: http://www.slideshare.net/matsunobu/linux-and-hw-optimizations-for-mysql-7614520
Our tuning:

table_cache = 1024
tmp_table_size = 6G
max_heap_table_size = 6G
thread_cache = 16
query_cache_size = 1G
query_cache_limit = 4M
default-storage-engine = InnoDB
expire_logs_days = 5
ignore-builtin-innodb
plugin-load = innodb=ha_innodb_plugin.so
max_binlog_size = 1024M
skip-name-resolve
innodb_flush_log_at_trx_commit = 2
innodb_thread_concurrency = 32 # we have 16 cpu threads
innodb_buffer_pool_size = 40G # we have 64G of memory
innodb_flush_method = O_DIRECT
innodb_additional_mem_pool_size = 100M
innodb_log_buffer_size = 18M
innodb_log_file_size = 300M
interactive_timeout = 999999
wait_timeout = 999999
Linux tuning
/etc/sysctl.conf:

vm.swappiness = 0
net.core.somaxconn = 1024
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_tw_reuse = 1
net.core.netdev_max_backlog = 5000
net.ipv4.ip_local_port_range = 2000 65535
fs.file-max = 999999

/etc/security/limits.conf: # max open files
* - nofile 999999

/etc/default/nginx:
ULIMIT="-n 999999"
Nginx confs - Getting Real User IP on AWS ELB
# apt-get install nginx-extras
To get the real user IP in an Elastic Load Balancer setup, add these lines inside the http context in /etc/nginx/nginx.conf:
perl_set $ip 'sub {
    my $r = shift;
    local $_ = $r->header_in("X-Forwarded-For");
    # XXX only works well because we know the AWS network uses 10.x.x.x ip addresses
    # Thanks Zed9h
    my $ip0 = m{.*\b( (?: \d| 1[1-9]| [2-9]\d| [12]\d{2} )\.\d+\.\d+\.\d+ )\b}xo && $1;
    # $ip0 ne $ip1 && "$ip0 ne $ip1\t\t$_"; # debug
    $ip0 || $r->remote_addr
}';
(Thanks Zed for the Perl script)
Nginx confs - Adding a virtual host (1/2)
create a file on /etc/nginx/sites-available/VHOST.conf
server {
    listen 80; ## listen for ipv4; this line is default and implied
    server_name VHOST.MYSITE.com;
    root /usr/share/nginx/www;
    index index.html index.htm;

    userid on;
    userid_name uid;
    userid_domain MYSITE.com;
    userid_expires max;

    set $myid $uid_got;

    location = /crossdomain.xml {
        echo "<?xml version=\"1.0\"?><!DOCTYPE cross-domain-policy SYSTEM \"http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd\"><cross-domain-policy><allow-access-from domain=\"*\" /></cross-domain-policy>";
        expires modified +24h;
        access_log off;
        error_log /var/log/nginx/error.log;
    }

    location / {
        if ($uid_got = "") {
            set $myid $uid_set;
        }
        expires -1;
        # return 204; # use this if you want an empty response
        empty_gif;    # use this if you want an empty gif response
    }

    location /healthcheck {
        try_files $uri $uri $uri =404;
        access_log off;
        error_log /var/log/nginx/error.log;
    }

    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        deny all;
        access_log off;
        error_log /var/log/nginx/error.log;
    }

    # !!!!!!!!!!!!!!!!!!!!
    # the log format is for Amazon AWS only; if you have the real IP, change
    # the ip variable to $remote_addr
    log_format VHOST '$ip - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" '
                     '"$myid"';

    # /mnt is the AWS's fastest partition
    access_log /mnt/log/nginx/VHOST.access.log VHOST;
    error_log /mnt/log/nginx/VHOST.error.log;
}
Nginx confs - Adding a virtual host (2/2)
Testing Nginx
$ curl localhost/
the result must be a gif

$ curl -I localhost/
the cookie must be set
Php tuning details
# apt-get install php-apc

Create a file /etc/php5/conf.d/piwik.ini:

memory_limit = 15G
max_execution_time = 0
max_input_time = 0
apc.shm_size = 64
* Piwik tuning on previous slides
AWS SNS
It is out of scope to teach how to create an Autoscaling group, a SNS topic, a S3 bucket and a SQS queue. We will only show how we use them.
Create a SNS topic on Amazon, "MYTOPIC", and attach a SQS queue, "MYQUEUE", to it.
Install the SNS client: download the file SimpleNotificationServiceCli-2010-03-31.zip from Amazon and unzip it somewhere, let's say /usr/local/bin.
Install the JDK:
# apt-get install openjdk-6-jdk
Create a .conf file with a key and secret to access the SNS, let's say /usr/local/sns.conf:

AWSAccessKeyId=XXXXXXXXXXXX
AWSSecretKey=XXXXXXXXX

Create a source file /usr/local/sns_env.source:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/bin/
export AWS_SNS_HOME=/usr/local/bin/SimpleNotificationServiceCli-1.0.2.3/
export EC2_REGION=us-east-1
export AWS_CREDENTIAL_FILE=/usr/local/sns.conf
AWS S3 and s3cmd
Create a S3 bucket on Amazon
$ apt-get install s3cmd
$ s3cmd --configure
$ cp ~/.s3cmd.cfg /usr/local/s3cmd.cfg
Log rotate (1/3)
The script was ripped from Ubuntu's init and logrotate scripts and put in a cronjob. You can use logrotate though.

In the crontab:

*/5 * * * * nice /bin/bash /usr/local/bin/VHOST.sendS3.sh >> /mnt/log/VHOST.send.log 2>&1

sendS3.sh:

#!/bin/bash
date # print the date to the log
DEBUGS3=`mktemp`
atexit() {
    rm -f $DEBUGS3
}
trap atexit 0

BUCKET="MYBUCKET"
PROJECT="MYVHOST"
ARCHIVEDIR="/mnt/MYVHOST/"
S3CMD_CONF="/usr/local/s3cmd.cfg"
ORIGINPATH="/mnt/log/nginx/VHOST.access.log"
SNS_ENV="/usr/local/sns_env.source"
SNS_TOPIC="MYTOPIC"
HOST=`hostname` # we use the instance-id on amazon
DATE=$(date --utc +%Y%m%d_%H%M%S)
DATEDIR=$(date --utc +%Y/%m/%d)

POSTPATH="$PROJECT/$DATEDIR/$PROJECT-$DATE-$HOST.log"
LOCALPATH="$ARCHIVEDIR/$POSTPATH"
GZLOCALPATH="$LOCALPATH.gz"
REMOTEPATH="s3://$BUCKET/$POSTPATH.gz"
Log rotate (2/3)
echo "->Trying file: $REMOTEPATH"
LOCALDIR="$(dirname "$LOCALPATH")"
# sleep 1 recommended by nginx's wiki
mkdir -p "$LOCALDIR" &&
mv "$ORIGINPATH" "$LOCALPATH" &&
{ [ ! -f /var/run/nginx.pid ] || kill -USR1 `cat /var/run/nginx.pid` ; } &&
sleep 1 &&
gzip "$LOCALPATH" &&
{ MD5=$(/usr/bin/md5sum "$GZLOCALPATH" | awk '{ print $1 }') ; }

# try 3 times
if [ -z "$MD5" ]
then
    echo "ERROR ON MD5"
    OK=1
else
    OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\""; echo $?)
    if [ "$OK" -eq "1" ]
    then
        OK=$(/usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\""; echo $?)
        if [ "$OK" -eq "1" ]
        then
            /usr/bin/s3cmd -d --no-progress -c "$S3CMD_CONF" put "$GZLOCALPATH" "$REMOTEPATH" 2>&1 | tee "$DEBUGS3"
            OK=$(grep -q "DEBUG: MD5 sums: computed=$MD5, received=\"$MD5\"" "$DEBUGS3"; echo $?)
        fi
    fi
fi
Log rotate (3/3)
# if ok, publish a message on SNS
if [ "$OK" = "0" ]
then
    source "$SNS_ENV"
    echo -n '-> Message: '
    sns-publish "$SNS_TOPIC" --message "$REMOTEPATH"
    OK=${PIPESTATUS[0]}
fi

echo "OK=$OK"

# for monitoring
#/usr/bin/zabbix_sender -s "$HOST" -z XXXXXX.com -k XXXXXX -o "$OK"
Rotating and uploading logs on reboot and shutdown
This is a protection for the Autoscaling group, where machines are created and terminated all the time.
Create a file /etc/init.d/VHOST.sendme.sh:

#!/bin/bash
/bin/echo TERMINATED `date --utc` >> /mnt/log/nginx/VHOST.access.log
/usr/bin/nice -20 /bin/bash /usr/local/bin/VHOST.sendS3.sh
Then execute:
# update-rc.d VHOST.sendme.sh stop 21 0 6 .
(the dot at the end of the line is required)
Worker details (1/3)
I'm not a developer; my worker python code is too ugly to be shown. It is very similar to the blog post mentioned earlier; the only additions are downloading the S3 files and reading the messages from SQS. The functions are similar to these:
Connect to S3 and SQS:

from boto.sqs.connection import SQSConnection
from boto.s3.connection import S3Connection
from boto.sqs.message import RawMessage # for SNS messages
import json

print "connecting to sqs"
logging.info("connecting to sqs")
connsqs = SQSConnection('xxxxxxxxxxxx', 'xxxxxxxxxxxxx')

print "connecting to s3"
logging.info("connecting to s3")
conns3 = S3Connection('xxxxxxxxxxxxxxxx', 'xxxxxxxxxxxxx')
Worker details (2/3)
Reading and deleting SQS messages, and putting results in an array:
print "getting queue"
logging.info("getting queue")
my_queue = connsqs.get_queue('MYQUEUE')
my_queue.set_message_class(RawMessage) # raw messages from SNS

maxmsgs = 10
msgs = []
msg = my_queue.read()
while msg:
    logging.info("getting message")
    msgsingle = json.loads(msg.get_body())['Message']
    logging.info(msgsingle)
    msgs.append(msgsingle)
    logging.info("deleting message")
    my_queue.delete_message(msg)
    if len(msgs) < maxmsgs:
        logging.info("getting more messages")
        msg = my_queue.read()
    else:
        msg = False
Worker details (3/3)
Getting files from S3 and putting the lines in an array:

lines = []
filename = '/tmp/tmppiwikpy.%s.txt' % os.getpid()
for msg_data in msgs:
    llog = "trying file " + msg_data
    logging.info(llog)
    if "s3://MYBUCKET/" in msg_data:
        s3obj = msg_data.replace("s3://MYBUCKET/", "")
        llog = "downloading " + msg_data
        logging.info(llog)
        key = conns3.get_bucket('MYBUCKET').get_key(s3obj)
        key.get_contents_to_filename(filename)
        llog = "decompressing file " + msg_data
        logging.info(llog)
        fgz = gzip.open(filename, 'r')
        line = fgz.readline()
        while line:
            lines.append(line)
            line = fgz.readline()
        llog = "closing and deleting temporary file"
        logging.info(llog)
        fgz.close()
        os.remove(filename)
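The decompress-to-lines step above can be exercised locally without S3. A small self-contained check (modern Python, with a made-up two-line log):

```python
import gzip
import os
import tempfile

# Write a tiny gzipped "log", then read it back line by line into an
# array, the same way the worker does after downloading from S3.
fd, filename = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(filename, 'wt') as fgz:
    fgz.write('line one\nline two\n')

lines = []
with gzip.open(filename, 'rt') as fgz:
    line = fgz.readline()
    while line:
        lines.append(line)
        line = fgz.readline()

os.remove(filename)
print(lines)
```

Each element keeps its trailing newline, which is what the URL-building step has to strip.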
Piwik REST API
Check it here:http://piwik.org/docs/tracking-api/#toc-tracking-api-rest-documentation
Your worker script has to create a url like this:

http://127.0.0.1/piwik.php?action_name=NAME&idsite=XX&rand=RANDOMNUMBER&rec=1&url=URL&cip=USER_IP&token_auth=YOUR_PIWIK_ADMIN_TOKEN&_id=COOKIE_FROM_NGINX&cdt=DATE_OF_VISIT

The cookie value sent from Nginx is the first 16 characters of its md5 sum, the same as Piwik uses internally. The date of visit must be in UTC.
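The _id derivation described above (first 16 hex characters of the cookie's md5) looks like this in Python; the cookie value here is a made-up example:

```python
import hashlib

# _id sent to piwik.php: the first 16 hex characters of the md5 of the
# nginx uid cookie, matching what the deck says Piwik does internally.
def visitor_id(nginx_uid_cookie):
    return hashlib.md5(nginx_uid_cookie.encode('ascii')).hexdigest()[:16]

print(visitor_id('CgAAAk9TlE5IOMelAwMZAg=='))
```

The same cookie always yields the same _id, so repeated visits from one browser stay stitched to one visitor.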
Others / Next steps
- If you deploy this outside Amazon, it makes sense to send the log lines to a queue and have lots of small workers reading and replicating them into Piwik. It is easier to handle, skip or reprocess a failed line than an entire log file
- We still have room to improve: we are not using SSD cards, and we haven't done any partitioning or sharding
- Visit logs and Action logs will need to be changed in order to make the database cheaper and more scalable.
- Our next step will be trying to improve the archiving, probably our next bottleneck.
What is missing on Piwik
- split read and write connections to the mysql database, so we can have the benefits of mysql replication, like running a dedicated slave for the archive.php selects and a dedicated slave for non-admin users
- create a database per website. It is easier to maintain and reduces the mysql index sizes so they fit in memory. You can partition the tables by idsite, which helps.
- people reading this presentation and sending feedback :) please use Piwik's forums for this
- feature request: optionally have a Mysql connection per website, or be able to configure Piwik's interface to import data from other Piwik installations, keeping all websites in a single place. That way we can have smaller databases for each website. (Zabbix has a similar feature)
- feature request: have optional mysql connection profiles, so we can set smaller buffers for smaller tasks, improving memory usage
Thanks Piwik !
Now we have modern analytics for old problems
and a modern scaling setup for a traditional LAMP stack
Thanks Zed (aka Carlo) for all the programming support, Gaiser for all the Amazon tips, and Matt for all the Piwik tips.
And thanks to R7 Managers Denis and Vechiato for believing in this and providing the time and resources to make it happen, and to R7 Director Brandi for reviewing this and allowing us to share.