K 10
svn:author
V 7
trociny
K 8
svn:date
V 27
2011-03-29T20:58:25.229759Z
K 7
svn:log
V 5774
MFC r219351, r219354, r219369, r219370, r219371, r219372, r219373, r219385, r219482, r219620, r219669, r219721, r219813, r219814, r219815, r219816, r219817, r219818, r219821, r219830, r219831, r219832, r219833, r219837, r219844, r219864, r219873, r219879, r219882, r219884, r219887, r219900:

r219351 (pjd):

Allow on-the-wire data to be checksummed using either CRC32 or SHA256.

r219354 (pjd):

Allow on-the-wire data to be compressed using one of two algorithms:

- HOLE - it simply turns all-zero blocks into a header of a few bytes; it is extremely fast, so it is turned on by default; it is mostly intended to speed up the initial synchronization, where we expect many zeros;
- LZF - a very fast algorithm by Marc Alexander Lehmann, which shows a very decent compression ratio and has a BSD license.

r219369 (pjd):

Provide three states for pjdlog_initialized, so we can also tell that this is the first initialization ever.

r219370 (pjd), r219385 (pjd):

- Turn on printf extensions.
- Load support for %T for printing time.
- Add support for %N for printing numbers in human-readable form.
- Add support for %S for printing a sockaddr structure (currently only the AF_INET family is supported, as this is all we need in HAST).
- Disable gcc compile-time format checking, as it will no longer work.

r219371 (pjd):

Use %S to print the IP address and port number.

r219372 (pjd):

- Log the size of data to synchronize in human-readable form (using %N).
- Log the synchronization time (using %T).
- Log the synchronization speed in human-readable form (using %N).

r219373 (pjd):

Print some of the numbers in human-readable form (using %N).

r219482:

Make workers inherit the debug level from the main process.

r219620 (pjd):

In command-line options, allow sizes to be specified using k/M/G/T suffixes.

r219669 (pjd):

Remove #include needed for debugging.
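The HOLE algorithm from r219354 boils down to: if a block is entirely zero, send only its length and recreate the zeros on the receiving side. The following is a minimal sketch of that idea, not HAST's code. It assumes a hypothetical 4-byte size header in host byte order; the real HAST wire format is not described in this log. The helper names is_all_zero(), hole_compress() and hole_decompress() are likewise hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: true if every byte of buf is zero. */
static bool
is_all_zero(const unsigned char *buf, size_t size)
{
	for (size_t i = 0; i < size; i++) {
		if (buf[i] != 0)
			return (false);
	}
	return (true);
}

/*
 * Hypothetical HOLE-style encoder: if the block is all zeros, write only
 * its original size (4 bytes, host byte order, an assumption) to out and
 * return the encoded length; otherwise return 0 so the caller sends the
 * block uncompressed.
 */
static size_t
hole_compress(const unsigned char *buf, size_t size, unsigned char *out)
{
	uint32_t origsize;

	if (size > UINT32_MAX || !is_all_zero(buf, size))
		return (0);
	origsize = (uint32_t)size;
	memcpy(out, &origsize, sizeof(origsize));
	return (sizeof(origsize));
}

/*
 * Hypothetical decoder: read the size header and recreate the zero block.
 * Returns the number of bytes produced, or 0 if out is too small.
 */
static size_t
hole_decompress(const unsigned char *in, unsigned char *out, size_t outsize)
{
	uint32_t origsize;

	memcpy(&origsize, in, sizeof(origsize));
	if (origsize > outsize)
		return (0);
	memset(out, 0, origsize);
	return (origsize);
}
```

Because is_all_zero() stops at the first non-zero byte, blocks with real data are rejected almost immediately. This is consistent with the log's point that HOLE is cheap enough to enable by default.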
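The log for r219620 says only that command-line sizes may carry k/M/G/T suffixes. The following standalone C sketch shows one way to parse such values and does not claim to match what hastctl/hastd do. On FreeBSD, libutil's expand_number(3) serves a similar purpose. The function name parse_size() is hypothetical, and the sketch assumes suffixes are powers of 1024 and case-insensitive.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "512", "64k", "10M", "1G", "2T" into bytes.
 * Suffixes are assumed to be case-insensitive powers of 1024.
 * Returns 0 on success, -1 on malformed input or overflow.
 */
static int
parse_size(const char *str, uint64_t *sizep)
{
	unsigned long long num;
	unsigned int shift;
	char *end;

	/* Reject empty strings, signs and leading whitespace. */
	if (str[0] < '0' || str[0] > '9')
		return (-1);
	errno = 0;
	num = strtoull(str, &end, 10);
	if (errno != 0)
		return (-1);
	switch (*end) {
	case '\0':
		shift = 0;
		break;
	case 'k': case 'K':
		shift = 10;
		break;
	case 'm': case 'M':
		shift = 20;
		break;
	case 'g': case 'G':
		shift = 30;
		break;
	case 't': case 'T':
		shift = 40;
		break;
	default:
		return (-1);
	}
	/* At most one suffix character may follow the digits. */
	if (*end != '\0' && end[1] != '\0')
		return (-1);
	/* Make sure the shifted value still fits in 64 bits. */
	if (num > (UINT64_MAX >> shift))
		return (-1);
	*sizep = (uint64_t)num << shift;
	return (0);
}
```

For example, parse_size("64k", &v) stores 65536 in v. Inputs such as "1kb" or "-1" are rejected rather than silently truncated or wrapped.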
r219721:

For secondary, set a timeout of 2 * HAST_KEEPALIVE seconds for incoming connections, so the worker will exit if it does not receive packets from the primary during this interval.

Reported by:	Christian Vogt
Tested by:	Christian Vogt

r219813 (pjd):

If there was any traffic on one of our descriptors, we were not checking for long-running hooks. Fix it by not using the select(2) timeout to decide whether we want to check hooks or not.

r219814 (pjd):

When creating a connection on behalf of the primary worker, set the pjdlog prefix to the resource name and role, so that any logs related to it can be properly identified.

r219815 (pjd):

Add snprlcat() and vsnprlcat() - the functions I'm always missing. They work as a combination of snprintf(3) and strlcat(3): the caller can append a string built from the given format.

r219816 (pjd):

Use snprlcat() instead of two strlcat(3)s.

r219817 (pjd):

Log when we start checking hooks and when we execute a hook.

r219818 (pjd), r219821 (pjd):

In hast.conf we define the other node's address in the 'remote' variable. This way we know how to connect to the secondary node when we are primary. The same variable is used by the secondary node - it only accepts connections from the address stored in the 'remote' variable.

In cluster configurations it is common that each node has its individual IP address and there is one additional shared IP address which is assigned to the primary node. It seems that if the shared IP address is from the same network as the individual IP address, it might be chosen by the kernel as the source address for the connection with the secondary node. Such a connection will be rejected by the secondary, as it doesn't come from the primary node's individual IP.

Add a 'source' variable that allows specifying the source IP address we want to bind to before connecting to the secondary node.

r219821 (pjd):

Forgot to commit this as a part of r219818.

r219830 (pjd):

Detect the situation where the resource's internal identifier differs.
This means that both nodes have separately managed resources that don't have the same data.

r219831 (pjd):

Be pedantic and free nvout before exiting.

r219832 (pjd):

Increase the debug level of the "Checking hooks." message.

r219833 (pjd):

Remove a stale comment. Yes, it is valid to set the role back to init.

r219837 (pjd):

Before handling any events on descriptors, check signals so we can update our info about worker processes if any of them was terminated in the meantime.

This fixes the problem with 'hastctl status' running from a hook called on split-brain:

1. Secondary calls a hook and terminates.
2. Hook asks for the resource status via 'hastctl status'.
3. The main hastd handles the status request by sending it to the secondary worker, which is already dead, but because signals weren't checked yet it doesn't know that and we get EPIPE.

r219843 (pjd):

Fix typo.

r219844 (pjd):

Initialize localcnt on first write. This fixes an assertion failure when we create a resource, set its role to primary, do no writes, then set it to secondary and accept a connection from the primary.

r219864 (pjd):

Whitespace cleanups.

r219873 (pjd), r219873 (pjd):

The proto API is a general-purpose API, so don't use 'hast' in structure or function names. It can now be used outside of HAST.

r219879:

For requests that are sent only to the remote component, use the error from the remote.

r219882:

After synchronization is complete we should make the primary counters equal to the secondary counters:

	primary_localcnt = secondary_remotecnt
	primary_remotecnt = secondary_localcnt

Previously it was done wrong and split-brain was observed after the primary had synchronized up-to-date data from the secondary.

r219887 (pjd):

Add pjd copyright.

r219900 (pjd):

Don't create a socketpair for connection forwarding between the parent and the secondary. The secondary doesn't need to connect anywhere.

Approved by:	pjd (mentor)
END
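r219815 describes snprlcat() and vsnprlcat() only as a combination of snprintf(3) and strlcat(3). The following minimal C sketch illustrates that behavior. It assumes the signature int snprlcat(char *str, size_t size, const char *fmt, ...) and a strlcat(3)-style return value (the length of the string it tried to create), so a result >= size signals truncation. The actual HAST implementation may differ.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of vsnprlcat(): format into the space remaining after the
 * existing string, as if appending vsnprintf(3) output with strlcat(3).
 * Precondition (assumed): str is NUL-terminated within size bytes.
 * Returns strlen(str) plus the formatted length; a value >= size means
 * the output was truncated. A negative value reports a formatting error.
 */
static int
vsnprlcat(char *str, size_t size, const char *fmt, va_list ap)
{
	size_t len;
	int ret;

	len = strlen(str);
	ret = vsnprintf(str + len, size - len, fmt, ap);
	if (ret < 0)
		return (ret);
	return ((int)len + ret);
}

/* Variadic wrapper with the same semantics as vsnprlcat(). */
static int
snprlcat(char *str, size_t size, const char *fmt, ...)
{
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = vsnprlcat(str, size, fmt, ap);
	va_end(ap);
	return (ret);
}
```

With this helper, the pair of strlcat(3) calls replaced in r219816 collapses into one call that both formats and appends. Truncation can then be detected exactly as with strlcat(3).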
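The counter rule from r219882 is a crossed assignment: each node's "local" count corresponds to the other node's "remote" count. The following C sketch spells that rule out. The structure hast_counters and the functions sync_complete_counters() and counters_mirrored() are hypothetical names chosen for illustration; only the two assignments come from the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node counters (names assumed for illustration). */
struct hast_counters {
	uint64_t localcnt;	/* this node's count for its own data */
	uint64_t remotecnt;	/* this node's count for the peer's data */
};

/*
 * After synchronization completes, copy the secondary's counters into
 * the primary crosswise, as stated in r219882:
 *   primary_localcnt  = secondary_remotecnt
 *   primary_remotecnt = secondary_localcnt
 */
static void
sync_complete_counters(struct hast_counters *primary,
    const struct hast_counters *secondary)
{
	primary->localcnt = secondary->remotecnt;
	primary->remotecnt = secondary->localcnt;
}

/* True if the two nodes' counters mirror each other. */
static bool
counters_mirrored(const struct hast_counters *primary,
    const struct hast_counters *secondary)
{
	return (primary->localcnt == secondary->remotecnt &&
	    primary->remotecnt == secondary->localcnt);
}
```

Copying the counters straight across (localcnt to localcnt) would leave the nodes disagreeing about each other's state. According to the log, the previous wrong assignment led to split-brain being reported after a successful synchronization.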